Version-Release number of selected component (if applicable):
------------------------------------------------------------
3.3.0qa43

How reproducible:
-----------------
often

Steps to Reproduce:
-------------------
1. Create a replicate volume with 3 bricks.
2. Create 6 NFS mounts.
3. Start executing "ping_pong file1 7" on each NFS mount.

Actual results:
---------------
ping_pong hangs on each mount when we start executing ping_pong on the mounts.

Expected results:
-----------------
ping_pong should run successfully.
There seems to be a memory leak in NLM. The NFS process got killed after a while. Was the NFS process still alive in your setup? Did you check? Is this hang reproducible in your setup without replicate?
ping_pong on a file also hangs on a plain distribute volume.

Valgrind logs:-
-------------
==7014== Use --log-fd=<number> to select an alternative log fd.
==7014== Warning: invalid file descriptor 1017 in syscall close()
==7014== Warning: invalid file descriptor 1018 in syscall close()
==7006== Warning: invalid file descriptor -1 in syscall close()
==7006== Warning: invalid file descriptor -1 in syscall close()
==7006== Warning: invalid file descriptor -1 in syscall close()
==7006== Thread 7:
==7006== Syscall param write(buf) points to uninitialised byte(s)
==7006==    at 0x36386D846D: ??? (in /lib64/libc-2.12.so)
==7006==    by 0x363870EF0A: writetcp (in /lib64/libc-2.12.so)
==7006==    by 0x363871592D: xdrrec_endofrecord (in /lib64/libc-2.12.so)
==7006==    by 0x363870ECF3: clnttcp_call (in /lib64/libc-2.12.so)
==7006==    by 0x981DF2D: nsm_monitor (nlm4.c:551)
==7006==    by 0x3638A077F0: start_thread (in /lib64/libpthread-2.12.so)
==7006==    by 0xCA266FF: ???
==7006==  Address 0x671acd8 is 88 bytes inside a block of size 8,004 alloc'd
==7006==    at 0x4A05FDE: malloc (vg_replace_malloc.c:236)
==7006==    by 0x36387151CD: xdrrec_create (in /lib64/libc-2.12.so)
==7006==    by 0x363870EA42: clnttcp_create (in /lib64/libc-2.12.so)
==7006==    by 0x363870D953: clnt_create (in /lib64/libc-2.12.so)
==7006==    by 0x981DE6F: nsm_monitor (nlm4.c:543)
==7006==    by 0x3638A077F0: start_thread (in /lib64/libpthread-2.12.so)
==7006==    by 0xCA266FF: ???
==7006==
Was the NFS process still alive in your setup when ping_pong hung?
Yes, the NFS process as well as the brick(s) are alive and listening (gdb bt showed them at epoll_wait). Wireshark on one of the clients showed NLM_BLOCKED as the last reply from the server. I tried the same with 6 mounts on a personal VM, with the local machine as the server, and it worked fine. I suspect a network issue, but ping_pong on FUSE mounts contradicts that. Needs further investigation.
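ping_pong was being run on a client machine that was behind NAT. For locking to work, the client machine's NLM service needs to be reachable from the server machine's NLM service. When the server cannot call back to the client to grant a blocked lock, the client keeps waiting, which would explain NLM_BLOCKED being the last reply seen in the capture.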
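A quick way to check for this kind of setup problem is to test, from the server machine, whether TCP connections to the client can be opened at all. The following is only a minimal sketch, not something taken from this report: the helper name tcp_reachable is hypothetical, and it assumes the client runs rpcbind on the well-known port 111. The NLM port itself is normally assigned dynamically and registered with rpcbind, so a successful connection to port 111 is a necessary condition, not proof that NLM callbacks will get through.

```python
import socket

# Well-known rpcbind/portmapper port; the client's NLM port is usually
# dynamic and has to be looked up through rpcbind (not done here).
RPCBIND_PORT = 111


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout.

    This only tests reachability at the TCP level; it does not speak the
    RPC or NLM protocols.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Run on the server, tcp_reachable(client_address, RPCBIND_PORT) returning False for a client behind NAT would be consistent with the hang described above; client_address stands for whatever address the server sees for that client.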