Bug 810944 - glusterfs nfs mounts hang when second node is down
Summary: glusterfs nfs mounts hang when second node is down
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: nfs
Version: 3.2.5
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: Vivek Agarwal
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-04-09 17:22 UTC by Kevin Fenzi
Modified: 2016-02-18 00:02 UTC (History)
3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-08-28 10:55:59 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)
glusterdump file from nfs server when other node is down. (103.71 KB, text/plain)
2012-04-09 17:25 UTC, Kevin Fenzi
logs from first node (1.27 MB, application/gzip)
2012-04-09 19:16 UTC, seth vidal
logs from second node (1.01 MB, application/gzip)
2012-04-09 19:16 UTC, seth vidal

Description Kevin Fenzi 2012-04-09 17:22:30 UTC
Two node setup with replication. 
Both nodes nfs mount the volume and work as expected. 
However, when one node is down, the other node's nfs mount hangs and is unusable. ;( 


# gluster volume info hosted

Volume Name: hosted
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: hosted01.fedoraproject.org:/glusterfs/hosted
Brick2: hosted02.fedoraproject.org:/glusterfs/hosted
Options Reconfigured:
nfs.rpc-auth-allow: 127.0.0.1
nfs.disable: off
performance.io-thread-count: 32


localhost:/hosted /srv nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,nolock,proto=tcp,port=38467,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=38465,mountproto=tcp,local_lock=all,addr=127.0.0.1 0 0
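For reference, the persistent form of that mount would be an /etc/fstab entry along the lines of the sketch below. This is an illustration reconstructed from the mtab line above, not copied from the reporter's systems; the 38465/38467 ports are the gluster 3.2.x NFS defaults.

# /etc/fstab sketch (options mirror the mtab line above; adjust as needed)
localhost:/hosted  /srv  nfs  vers=3,proto=tcp,port=38467,mountport=38465,mountproto=tcp,nolock,hard,timeo=600,retrans=2  0 0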


root      2678  0.1  0.1  43260 11488 ?        Ssl  16:56   0:01 /usr/sbin/glusterd -p /var/run/glusterd.pid
root      2730  0.1  0.1  60132 12176 ?        Ssl  16:56   0:01 /usr/sbin/glusterfsd --xlator-option hosted-server.listen-port=24009 -s localhost --volfile-id hosted.hosted02.fedoraproject.org.glusterfs-hosted -p /etc/glusterd/vols/hosted/run/hosted02.fedoraproject.org-glusterfs-hosted.pid -S /tmp/b940ce67b4f7f4925e9d979a03f2d34a.socket --brick-name /glusterfs/hosted --brick-port 24009 -l /var/log/glusterfs/bricks/glusterfs-hosted.log
root      2734  0.2  0.4  62088 29884 ?        Ssl  16:56   0:02 /usr/sbin/glusterfs -f /etc/glusterd/nfs/nfs-server.vol -p /etc/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log

I'll attach a statedump (kill -USR1) from the nfs server, taken while the second node is down, in case that shows anything. 

Happy to gather more info, try things, etc.

Comment 1 Kevin Fenzi 2012-04-09 17:25:13 UTC
Created attachment 576265 [details]
glusterdump file from nfs server when other node is down.

Comment 2 seth vidal 2012-04-09 19:16:11 UTC
Created attachment 576288 [details]
logs from first node

Comment 3 seth vidal 2012-04-09 19:16:39 UTC
Created attachment 576289 [details]
logs from second node

Comment 4 Kevin Fenzi 2012-04-09 21:44:18 UTC
Additional info: 

Stopping all gluster* processes on one node works fine and results in the other node working as expected.

Taking down the second node completely (or while it's rebooting) results in the nfs hang we are seeing. 

If one node is powered off, the other node does eventually start working again, but only after about 16 minutes: 

[2012-04-09 21:08:59.239152] I [client-handshake.c:913:client_setvolume_cbk] 0-hosted-client-1: Connected to 66.135.62.187:24009, attached to remote volume '/glusterfs/hosted'.
[2012-04-09 21:08:59.239183] I [client-handshake.c:779:client_post_handshake] 0-hosted-client-1: 1 fds open - Delaying child_up until they are re-opened
[2012-04-09 21:08:59.316107] I [client-lk.c:617:decrement_reopen_fd_count] 0-hosted-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP
<node was powered off here and no logs until:>
[2012-04-09 21:26:43.834287] W [socket.c:204:__socket_rwv] 0-hosted-client-1: readv failed (No route to host)

When one node is powered off, you can see on the other one: 
tcp        0   6204 192.168.122.2:1022          66.135.62.187:24009         ESTABLISHED 

When one node is powered off, ping from the other: 
PING hosted02.fedoraproject.org (66.135.62.187) 56(84) bytes of data.
From hosted02.fedoraproject.org (66.135.62.187) icmp_seq=1 Destination Host Unreachable

Some info about the crazy network setup: 

node1: 192.168.122.2 (libvirt private network guest, with all 66.135.62.201 traffic DNAT and SNATed into it)
node2: 192.168.122.3 (libvirt private network guest with all 66.135.62.187 traffic DNAT and SNATed into it). 

I suspect the issue here is that something in the nat setup is not sending back what's expected, so things have to wait for the normal tcp timeout on the established connection before it gives up, but I would expect glusterd to be able to bypass that and call the connection down when it sees the host unreachable...
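The ~16 minute recovery window points at the kernel's TCP retransmission give-up time for an ESTABLISHED connection rather than anything gluster-specific. A minimal way to check and experimentally shorten that window is sketched below; net.ipv4.tcp_retries2 is a standard Linux sysctl, but the lowered value is only an illustration, not a recommendation.

# The kernel retries an ESTABLISHED connection net.ipv4.tcp_retries2 times
# before giving up; the default of 15 works out to roughly 15-16 minutes,
# which lines up with the ~16 minute hang seen above.
sysctl net.ipv4.tcp_retries2
# Experiment: lower the retry count so a dead peer is detected in minutes
# (value chosen purely for illustration).
sysctl -w net.ipv4.tcp_retries2=8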

Comment 5 Krishna Srinivas 2012-04-11 08:13:51 UTC
I see messages like these in the log files:

[2012-04-08 17:31:53.849090] C [client-handshake.c:121:rpc_client_ping_timer_expired] 0-hosted-client-1: server 66.135.62.187:24009 has not responded in the last 42 seconds, disconnecting.

which means it has detected that the peer is no longer active and drops the connection as expected. This should make sure that the nfs server no longer hangs and continues. We need to see why it is hanging for you for more than 42 seconds. Can you tell us the time when it hangs so that we can check the corresponding log messages?


[2012-04-09 21:08:59.316107] I [client-lk.c:617:decrement_reopen_fd_count]
0-hosted-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP
<node was powered off here and no logs until:>
[2012-04-09 21:26:43.834287] W [socket.c:204:__socket_rwv] 0-hosted-client-1:
readv failed (No route to host)


The client_ping_timer is only active if the connection is actively being used. Since this connection is probably inactive (no pending frames), the client_ping_timer does not come into the picture.

Comment 6 Kevin Fenzi 2012-04-11 16:04:46 UTC
ok, here are some fresh logs.

- Both nodes up and connected and working properly. 
- One node is halted at: 
Wed Apr 11 15:45:11 UTC 2012
- Other node logs: 
[2012-04-11 15:45:26.17611] W [socket.c:1494:__socket_proto_state_machine] 0-management: reading from socket failed. Error (Transport endpoint is not connected), peer (66.135.62.187:24007)
[2012-04-11 15:45:26.18372] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (66.135.62.187:1023)
[2012-04-11 15:45:48.307313] E [socket.c:1685:socket_connect_finish] 0-management: connection to 66.135.62.187:24007 failed (Connection timed out)

- df and other access of nfs mount now hangs on that node. 

netstat shows: 

tcp        0    468 192.168.122.2:1022          66.135.62.187:24009         ESTABLISHED 

(connection still established to down node). 

- 42 seconds later, nothing changes. No log additions or access to the nfs mount. 
- 15 minutes and 38 seconds later there is a flurry of log messages: 

[2012-04-11 16:01:26.305] W [socket.c:204:__socket_rwv] 0-hosted-client-1: readv failed (No route to host)
[2012-04-11 16:01:26.609] W [socket.c:1494:__socket_proto_state_machine] 0-hosted-client-1: reading from socket failed. Error (No route to host), peer (66.135.62.187:24009)
[2012-04-11 16:01:26.44905] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x86) [0xe6a6c6] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x88) [0xe69d98] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0x27) [0xe69cf7]))) 0-hosted-client-1: forced unwinding frame type(GlusterFS 3.1) op(STATFS(14)) called at 2012-04-11 15:45:33.885394
[2012-04-11 16:01:26.45037] I [client3_1-fops.c:637:client3_1_statfs_cbk] 0-hosted-client-1: remote operation failed: Transport endpoint is not connected
[2012-04-11 16:01:26.45619] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x86) [0xe6a6c6] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x88) [0xe69d98] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0x27) [0xe69cf7]))) 0-hosted-client-1: forced unwinding frame type(GlusterFS 3.1) op(STATFS(14)) called at 2012-04-11 15:46:10.89206
[2012-04-11 16:01:26.45705] I [client3_1-fops.c:637:client3_1_statfs_cbk] 0-hosted-client-1: remote operation failed: Transport endpoint is not connected
[2012-04-11 16:01:26.45910] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x86) [0xe6a6c6] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x88) [0xe69d98] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0x27) [0xe69cf7]))) 0-hosted-client-1: forced unwinding frame type(GlusterFS 3.1) op(STATFS(14)) called at 2012-04-11 15:46:33.884480
[2012-04-11 16:01:26.45983] I [client3_1-fops.c:637:client3_1_statfs_cbk] 0-hosted-client-1: remote operation failed: Transport endpoint is not connected
[2012-04-11 16:01:26.46181] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x86) [0xe6a6c6] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x88) [0xe69d98] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0x27) [0xe69cf7]))) 0-hosted-client-1: forced unwinding frame type(GlusterFS 3.1) op(STATFS(14)) called at 2012-04-11 15:48:33.884500
[2012-04-11 16:01:26.46490] I [client3_1-fops.c:637:client3_1_statfs_cbk] 0-hosted-client-1: remote operation failed: Transport endpoint is not connected
[2012-04-11 16:01:26.46753] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x86) [0xe6a6c6] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x88) [0xe69d98] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0x27) [0xe69cf7]))) 0-hosted-client-1: forced unwinding frame type(GlusterFS 3.1) op(STATFS(14)) called at 2012-04-11 15:49:33.884493
[2012-04-11 16:01:26.46826] I [client3_1-fops.c:637:client3_1_statfs_cbk] 0-hosted-client-1: remote operation failed: Transport endpoint is not connected
[2012-04-11 16:01:26.47023] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x86) [0xe6a6c6] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x88) [0xe69d98] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0x27) [0xe69cf7]))) 0-hosted-client-1: forced unwinding frame type(GlusterFS 3.1) op(STATFS(14)) called at 2012-04-11 15:51:33.884496
[2012-04-11 16:01:26.47095] I [client3_1-fops.c:637:client3_1_statfs_cbk] 0-hosted-client-1: remote operation failed: Transport endpoint is not connected
[2012-04-11 16:01:26.47477] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x86) [0xe6a6c6] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x88) [0xe69d98] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0x27) [0xe69cf7]))) 0-hosted-client-1: forced unwinding frame type(GlusterFS 3.1) op(STATFS(14)) called at 2012-04-11 15:52:33.884513
[2012-04-11 16:01:26.47553] I [client3_1-fops.c:637:client3_1_statfs_cbk] 0-hosted-client-1: remote operation failed: Transport endpoint is not connected
[2012-04-11 16:01:26.47757] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x86) [0xe6a6c6] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x88) [0xe69d98] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0x27) [0xe69cf7]))) 0-hosted-client-1: forced unwinding frame type(GlusterFS 3.1) op(STATFS(14)) called at 2012-04-11 15:54:33.884525
[2012-04-11 16:01:26.47830] I [client3_1-fops.c:637:client3_1_statfs_cbk] 0-hosted-client-1: remote operation failed: Transport endpoint is not connected
[2012-04-11 16:01:26.48025] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x86) [0xe6a6c6] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x88) [0xe69d98] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0x27) [0xe69cf7]))) 0-hosted-client-1: forced unwinding frame type(GlusterFS 3.1) op(STATFS(14)) called at 2012-04-11 15:55:33.884508
[2012-04-11 16:01:26.48095] I [client3_1-fops.c:637:client3_1_statfs_cbk] 0-hosted-client-1: remote operation failed: Transport endpoint is not connected
[2012-04-11 16:01:26.48502] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x86) [0xe6a6c6] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x88) [0xe69d98] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0x27) [0xe69cf7]))) 0-hosted-client-1: forced unwinding frame type(GlusterFS 3.1) op(STATFS(14)) called at 2012-04-11 15:57:33.884504
[2012-04-11 16:01:26.48578] I [client3_1-fops.c:637:client3_1_statfs_cbk] 0-hosted-client-1: remote operation failed: Transport endpoint is not connected
[2012-04-11 16:01:26.48781] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x86) [0xe6a6c6] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x88) [0xe69d98] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0x27) [0xe69cf7]))) 0-hosted-client-1: forced unwinding frame type(GlusterFS 3.1) op(STATFS(14)) called at 2012-04-11 15:58:33.884497
[2012-04-11 16:01:26.48852] I [client3_1-fops.c:637:client3_1_statfs_cbk] 0-hosted-client-1: remote operation failed: Transport endpoint is not connected
[2012-04-11 16:01:26.49045] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x86) [0xe6a6c6] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x88) [0xe69d98] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0x27) [0xe69cf7]))) 0-hosted-client-1: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2012-04-11 16:00:01.948006
[2012-04-11 16:01:26.49141] E [client3_1-fops.c:2228:client3_1_lookup_cbk] 0-hosted-client-1: remote operation failed: Transport endpoint is not connected
[2012-04-11 16:01:26.49811] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x86) [0xe6a6c6] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x88) [0xe69d98] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0x27) [0xe69cf7]))) 0-hosted-client-1: forced unwinding frame type(GlusterFS 3.1) op(STATFS(14)) called at 2012-04-11 16:00:33.884507
[2012-04-11 16:01:26.49887] I [client3_1-fops.c:637:client3_1_statfs_cbk] 0-hosted-client-1: remote operation failed: Transport endpoint is not connected
[2012-04-11 16:01:26.50085] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x86) [0xe6a6c6] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x88) [0xe69d98] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0x27) [0xe69cf7]))) 0-hosted-client-1: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2012-04-11 16:01:01.949335
[2012-04-11 16:01:26.50157] E [client3_1-fops.c:2228:client3_1_lookup_cbk] 0-hosted-client-1: remote operation failed: Transport endpoint is not connected
[2012-04-11 16:01:26.50589] I [client.c:1883:client_rpc_notify] 0-hosted-client-1: disconnected
[2012-04-11 16:01:40.530324] E [socket.c:1685:socket_connect_finish] 0-hosted-client-1: connection to 66.135.62.187:24009 failed (No route to host)

After that, the nfs mount starts working as normal again. 

- netstat no longer shows the ESTABLISHED entry. 

Happy to gather more debugging.

Comment 7 Jeff Darcy 2012-04-12 13:13:45 UTC
This seems almost like what would happen if (a) we were doing a non-blocking read on the socket and (b) we weren't getting the ICMP Host Unreachable for ~10 minutes.  It's possible that (b) might be related to the funky VM setup, but it shouldn't matter because (a) shouldn't be the case anyway.  AFAICT we do try to set NBIO by default, so if we're not or if it's failing then we should probably look into that.  Another possibility would be to try the keepalive options, and see if they cause the connection to fail sooner.

Comment 8 Kevin Fenzi 2012-04-12 14:42:17 UTC
How can we try the keepalive options? ;)
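For reference, a minimal sketch of the knobs being discussed, assuming 3.2.x-era option names. network.ping-timeout is a documented volume option; the socket keepalive settings shown as volfile options are an assumption and their exact names may differ between releases, so check the documentation for your version.

# Lower the RPC ping timeout (default 42 seconds) so a dead brick is
# declared down sooner:
gluster volume set hosted network.ping-timeout 10

# TCP keepalive tuning lives at the transport (volfile) level; the option
# names below are assumptions based on the socket transport:
#   option transport.socket.keepalive-time 10
#   option transport.socket.keepalive-interval 2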

Comment 9 Kaleb KEITHLEY 2012-11-21 19:00:25 UTC
Seems to work okay for me with 3.2.7 and 3.3.1.

Approximately 42 seconds after knocking node two offline (qemu/kvm, forced off), writes on the nfs-mounted volume resumed.

Comment 10 Vivek Agarwal 2013-08-28 10:55:59 UTC
Please reopen if this is still a problem.

