Bug 825197 - ping_pong hangs on nfs mount
Summary: ping_pong hangs on nfs mount
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: GlusterFS
Classification: Community
Component: nfs
Version: 3.3-beta
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Vinayaga Raman
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-05-25 10:48 UTC by Shwetha Panduranga
Modified: 2014-03-31 01:29 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-07-30 07:09:00 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Description Shwetha Panduranga 2012-05-25 10:48:42 UTC
Version-Release number of selected component (if applicable):
------------------------------------------------------------
3.3.0qa43

How reproducible:
-----------------
often

Steps to Reproduce:
-------------------
1. Create a replicate volume with 3 bricks.
2. Create 6 NFS mounts.
3. Start executing "ping_pong file1 7" on each NFS mount.
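
For readers unfamiliar with the tool, here is a minimal Python sketch of the kind of lock chain that step 3 exercises. It is not the actual ping_pong source (which is a C program); the function name ping_pong_sketch and its iterations parameter are made up for illustration. It assumes the usual pattern of taking a blocking one-byte lock on the next offset before releasing the current one. On an NFSv3 mount that has not been mounted with nolock, blocking POSIX lock requests like these are forwarded by the client to the server over NLM.

```python
import fcntl
import os


def ping_pong_sketch(path, num_locks, iterations):
    """Walk a chain of one-byte POSIX locks (illustrative only).

    Returns the offset of the lock held when the loop ends.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Start by taking a blocking exclusive lock on byte 0.
        fcntl.lockf(fd, fcntl.LOCK_EX, 1, 0, os.SEEK_SET)
        held = 0
        for _ in range(iterations):
            nxt = (held + 1) % num_locks
            # Blocking request: waits here while another process holds
            # this byte. This is the call that can hang if the lock is
            # never granted.
            fcntl.lockf(fd, fcntl.LOCK_EX, 1, nxt, os.SEEK_SET)
            # Only release the previous byte once the next one is held.
            fcntl.lockf(fd, fcntl.LOCK_UN, 1, held, os.SEEK_SET)
            held = nxt
        return held
    finally:
        # Closing the descriptor drops any locks this process still holds.
        os.close(fd)
```

Calling ping_pong_sketch("file1", 7, 1000) from several mounts at once approximates the contention pattern in the report, though the real tool loops indefinitely and prints a lock rate.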
  
Actual results:
---------------
ping_pong hangs on every mount as soon as it is started on the mounts.

Expected results:
-----------------
ping_pong should run successfully.

Comment 1 Krishna Srinivas 2012-05-26 14:35:57 UTC
There seems to be a memory leak in NLM. The nfs process got killed after a while. Was the nfs process still alive in your setup? Did you check? Is this hang reproducible in your setup without replicate?

Comment 2 Shwetha Panduranga 2012-05-28 05:50:53 UTC
ping_pong on a file also hangs on a plain distribute volume.

Valgrind logs:
--------------
==7014==    Use --log-fd=<number> to select an alternative log fd.
==7014== Warning: invalid file descriptor 1017 in syscall close()
==7014== Warning: invalid file descriptor 1018 in syscall close()
==7006== Warning: invalid file descriptor -1 in syscall close()
==7006== Warning: invalid file descriptor -1 in syscall close()
==7006== Warning: invalid file descriptor -1 in syscall close()
==7006== Thread 7:
==7006== Syscall param write(buf) points to uninitialised byte(s)
==7006==    at 0x36386D846D: ??? (in /lib64/libc-2.12.so)
==7006==    by 0x363870EF0A: writetcp (in /lib64/libc-2.12.so)
==7006==    by 0x363871592D: xdrrec_endofrecord (in /lib64/libc-2.12.so)
==7006==    by 0x363870ECF3: clnttcp_call (in /lib64/libc-2.12.so)
==7006==    by 0x981DF2D: nsm_monitor (nlm4.c:551)
==7006==    by 0x3638A077F0: start_thread (in /lib64/libpthread-2.12.so)
==7006==    by 0xCA266FF: ???
==7006==  Address 0x671acd8 is 88 bytes inside a block of size 8,004 alloc'd
==7006==    at 0x4A05FDE: malloc (vg_replace_malloc.c:236)
==7006==    by 0x36387151CD: xdrrec_create (in /lib64/libc-2.12.so)
==7006==    by 0x363870EA42: clnttcp_create (in /lib64/libc-2.12.so)
==7006==    by 0x363870D953: clnt_create (in /lib64/libc-2.12.so)
==7006==    by 0x981DE6F: nsm_monitor (nlm4.c:543)
==7006==    by 0x3638A077F0: start_thread (in /lib64/libpthread-2.12.so)
==7006==    by 0xCA266FF: ???
==7006==

Comment 3 Krishna Srinivas 2012-05-28 08:00:00 UTC
In your setup, was the nfs process still alive when ping_pong hung?

Comment 4 Rajesh 2012-05-28 10:33:08 UTC
Yes, the nfs process as well as the brick(s) are alive and listening (gdb bt showed them at epoll_wait). Wireshark on one of the clients showed NLM_BLOCKED as the last reply from the server.

I tried the same with 6 mounts on a personal VM, with the local machine as the server, and it worked fine. I suspect a network issue, but ping_pong on fuse mounts contradicts that. This needs further investigation.
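
One way to narrow down a hang like this is to compare a blocking lock request with a non-blocking one. The sketch below is a suggested diagnostic, not something that was run for this bug: try_lock_byte is a hypothetical helper that asks for a one-byte lock with LOCK_NB, so the call returns immediately instead of waiting. The assumption behind it is that a non-blocking NLM request is answered directly (granted or denied), while a request answered with NLM_BLOCKED is only completed by a later callback from the server to the client.

```python
import errno
import fcntl
import os


def try_lock_byte(path, offset):
    """Try to lock one byte without waiting (illustrative helper).

    Returns True if the lock was obtained, False if another process
    holds it. The lock is dropped again when the file is closed.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, offset,
                        os.SEEK_SET)
        except OSError as exc:
            # EACCES or EAGAIN means the byte is locked by someone else.
            if exc.errno in (errno.EACCES, errno.EAGAIN):
                return False
            raise
        return True
    finally:
        os.close(fd)
```

If try_lock_byte returns promptly on the NFS mount while a blocking request for the same byte never completes, the granted-lock callback path from server to client would be a reasonable place to look next.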

Comment 5 Krishna Srinivas 2012-07-30 07:09:00 UTC
ping_pong was being run on a client machine that was behind NAT. For locking to work, the client machine's NLM service needs to be reachable from the server machine's NLM service.
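
A quick way to test this kind of reachability is to attempt a TCP connection from the server to the client. The sketch below is only an illustration: tcp_reachable is a hypothetical helper, and the host name and port in the usage note are placeholders rather than values from this setup.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and connects, or raises
        # an OSError subclass on refusal, timeout, or lookup failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run on the server, a call such as tcp_reachable("client.example.com", 111) checks the client's rpcbind port (111 is the standard rpcbind port). The port the client's NLM service registered can be listed with "rpcinfo -p <client>" and should be checked the same way, since a reachable rpcbind does not guarantee a reachable NLM port. A client behind NAT would be expected to fail this check unless port forwarding is in place.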

