Description of problem:
When the network between the bricks and clients of an otherwise healthy Gluster setup becomes unreliable (repeated disconnects and reconnects), the connection from a client to a single brick sometimes stalls. The client is then no longer able to connect to port 24007 on that brick.

Version-Release number of selected component (if applicable):
Ubuntu 12.04, kernel 3.2.0-55-generic, iptables 1.4.12

How reproducible:
Not always.

Steps to Reproduce:
1. Have a working two-brick setup (client 10.243.72.0, brick1 10.243.0.23 and brick2 10.243.0.24).
2. Make the network temporarily unreliable (disconnect and reconnect the network between client and bricks).
3. Look at the connection table and find connections hanging in SYN_SENT (client) and SYN_RECV (brick1).

Note that in our environment only Gluster suffers from a temporary network outage. Other services recover painlessly.

Actual results:
$ sudo lsof -n | grep gluster | grep TCP
glusterfs 31120 root 6u IPv4 3723188 0t0 TCP 10.243.72.0:1023->10.243.0.23:24007 (SYN_SENT)
glusterfs 31120 root 9u IPv4 590750 0t0 TCP 10.243.72.0:1019->10.243.0.24:49153 (ESTABLISHED)
glusterfs 31120 root 10u IPv4 590752 0t0 TCP 10.243.72.0:1018->10.243.0.23:49153 (ESTABLISHED)
glusterfs 31161 root 5u IPv4 3723912 0t0 TCP 10.243.72.0:1022->10.243.0.23:24007 (SYN_SENT)
glusterfs 31161 root 6u IPv4 3723800 0t0 TCP 10.243.72.0:1021->10.243.0.23:24007 (SYN_SENT)
glusterfs 31161 root 9u IPv4 590792 0t0 TCP 10.243.72.0:1011->10.243.0.24:49152 (ESTABLISHED)

Expected results:
$ sudo lsof -n | grep gluster | grep TCP
glusterfs 31120 root 6u IPv4 3773211 0t0 TCP 10.243.72.0:1023->10.243.0.23:24007 (ESTABLISHED)
glusterfs 31120 root 9u IPv4 590750 0t0 TCP 10.243.72.0:1019->10.243.0.24:49153 (ESTABLISHED)
glusterfs 31120 root 10u IPv4 590752 0t0 TCP 10.243.72.0:1018->10.243.0.23:49153 (ESTABLISHED)
glusterfs 31161 root 5u IPv4 3773262 0t0 TCP 10.243.72.0:1022->10.243.0.23:24007 (ESTABLISHED)
glusterfs 31161 root 6u IPv4 3773227 0t0 TCP 10.243.72.0:1020->10.243.0.23:49152 (ESTABLISHED)
glusterfs 31161 root 9u IPv4 590792 0t0 TCP 10.243.72.0:1011->10.243.0.24:49152 (ESTABLISHED)

Additional info:

tcpdump client:
22:45:46.824406 IP 10.243.0.72.1021 > 10.243.0.23.24007: Flags [S], seq 3579779243, win 14600, options [mss 1460,sackOK,TS val 105351456 ecr 0,nop,wscale 6], length 0
22:45:46.826581 IP 10.243.0.23.24007 > 10.243.0.72.1021: Flags [S.], seq 4035724036, ack 3579779244, win 14480, options [mss 1460,sackOK,TS val 2949342443 ecr 105351456,nop,wscale 6], length 0
22:45:47.820857 IP 10.243.0.72.1021 > 10.243.0.23.24007: Flags [S], seq 3579779243, win 14600, options [mss 1460,sackOK,TS val 105351706 ecr 0,nop,wscale 6], length 0
22:45:47.821407 IP 10.243.0.23.24007 > 10.243.0.72.1021: Flags [S.], seq 4035724036, ack 3579779244, win 14480, options [mss 1460,sackOK,TS val 2949342692 ecr 105351456,nop,wscale 6], length 0
22:45:47.856419 IP 10.243.0.23.24007 > 10.243.0.72.1021: Flags [S.], seq 4035724036, ack 3579779244, win 14480, options [mss 1460,sackOK,TS val 2949342701 ecr 105351456,nop,wscale 6], length 0
22:45:49.824862 IP 10.243.0.72.1021 > 10.243.0.23.24007: Flags [S], seq 3579779243, win 14600, options [mss 1460,sackOK,TS val 105352207 ecr 0,nop,wscale 6], length 0
22:45:49.825382 IP 10.243.0.23.24007 > 10.243.0.72.1021: Flags [S.], seq 4035724036, ack 3579779244, win 14480, options [mss 1460,sackOK,TS val 2949343193 ecr 105351456,nop,wscale 6], length 0
22:45:49.857252 IP 10.243.0.23.24007 > 10.243.0.72.1021: Flags [S.], seq 4035724036, ack 3579779244, win 14480, options [mss 1460,sackOK,TS val 2949343201 ecr 105351456,nop,wscale 6], length 0
22:45:53.828828 IP 10.243.0.72.1021 > 10.243.0.23.24007: Flags [S], seq 3579779243, win 14600, options [mss 1460,sackOK,TS val 105353208 ecr 0,nop,wscale 6], length 0
22:45:53.834738 IP 10.243.0.23.24007 > 10.243.0.72.1021: Flags [S.], seq 4035724036, ack 3579779244, win 14480, options [mss 1460,sackOK,TS val 2949344195 ecr 105351456,nop,wscale 6], length 0
22:45:54.056148 IP 10.243.0.23.24007 > 10.243.0.72.1021: Flags [S.], seq 4035724036, ack 3579779244, win 14480, options [mss 1460,sackOK,TS val 2949344251 ecr 105351456,nop,wscale 6], length 0

tcpdump brick1:
22:45:46.826170 IP 10.243.0.72.1021 > 10.243.0.23.24007: Flags [S], seq 3579779243, win 14600, options [mss 1460,sackOK,TS val 105351456 ecr 0,nop,wscale 6], length 0
22:45:46.826234 IP 10.243.0.23.24007 > 10.243.0.72.1021: Flags [S.], seq 4035724036, ack 3579779244, win 14480, options [mss 1460,sackOK,TS val 2949342443 ecr 105351456,nop,wscale 6], length 0
22:45:47.821010 IP 10.243.0.72.1021 > 10.243.0.23.24007: Flags [S], seq 3579779243, win 14600, options [mss 1460,sackOK,TS val 105351706 ecr 0,nop,wscale 6], length 0
22:45:47.821068 IP 10.243.0.23.24007 > 10.243.0.72.1021: Flags [S.], seq 4035724036, ack 3579779244, win 14480, options [mss 1460,sackOK,TS val 2949342692 ecr 105351456,nop,wscale 6], length 0
22:45:47.855840 IP 10.243.0.23.24007 > 10.243.0.72.1021: Flags [S.], seq 4035724036, ack 3579779244, win 14480, options [mss 1460,sackOK,TS val 2949342701 ecr 105351456,nop,wscale 6], length 0
22:45:49.825002 IP 10.243.0.72.1021 > 10.243.0.23.24007: Flags [S], seq 3579779243, win 14600, options [mss 1460,sackOK,TS val 105352207 ecr 0,nop,wscale 6], length 0
22:45:49.825063 IP 10.243.0.23.24007 > 10.243.0.72.1021: Flags [S.], seq 4035724036, ack 3579779244, win 14480, options [mss 1460,sackOK,TS val 2949343193 ecr 105351456,nop,wscale 6], length 0
22:45:49.855843 IP 10.243.0.23.24007 > 10.243.0.72.1021: Flags [S.], seq 4035724036, ack 3579779244, win 14480, options [mss 1460,sackOK,TS val 2949343201 ecr 105351456,nop,wscale 6], length 0

After setting /proc/sys/net/netfilter/nf_conntrack_log_invalid to 255, I get the following on the client:
Nov 28 23:06:48 app12 kernel: [422967.072664] [INPUT] dropped IN=eth0 OUT= MAC=aa:01:60:00:87:d5:aa:01:60:00:78:82:08:00 SRC=10.243.0.23 DST=10.243.0.72 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=1021 WINDOW=14480 RES=0x00 ACK SYN URGP=0
Nov 28 23:06:48 app12 kernel: [422967.471720] [INPUT] dropped IN=eth0 OUT= MAC=aa:01:60:00:87:d5:aa:01:60:00:78:82:08:00 SRC=10.243.0.23 DST=10.243.0.72 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=1021 WINDOW=14480 RES=0x00 ACK SYN URGP=0
Nov 28 23:06:55 app12 kernel: [422974.528586] nf_ct_tcp: SEQ is over the upper bound (over the window of the receiver) IN= OUT= SRC=10.243.0.23 DST=10.243.0.72 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=1023 SEQ=2730434121 ACK=2316443699 WINDOW=14480 RES=0x00 ACK SYN URGP=0 OPT (020405B40402080AAFD03BE2064C516A01030306)
Nov 28 23:06:55 app12 kernel: [422974.528602] [INPUT] dropped IN=eth0 OUT= MAC=aa:01:60:00:87:d5:aa:01:60:00:78:82:08:00 SRC=10.243.0.23 DST=10.243.0.72 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=1023 WINDOW=14480 RES=0x00 ACK SYN URGP=0
Nov 28 23:06:55 app12 kernel: [422974.872535] nf_ct_tcp: SEQ is over the upper bound (over the window of the receiver) IN= OUT= SRC=10.243.0.23 DST=10.243.0.72 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=1023 SEQ=2730434121 ACK=2316443699 WINDOW=14480 RES=0x00 ACK SYN URGP=0 OPT (020405B40402080AAFD03C38064C516A01030306)
Nov 28 23:06:55 app12 kernel: [422974.872551] [INPUT] dropped IN=eth0 OUT= MAC=aa:01:60:00:87:d5:aa:01:60:00:78:82:08:00 SRC=10.243.0.23 DST=10.243.0.72 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=1023 WINDOW=14480 RES=0x00 ACK SYN URGP=0
Nov 28 23:06:56 app12 kernel: [422975.088646] nf_ct_tcp: SEQ is over the upper bound (over the window of the receiver) IN= OUT= SRC=10.243.0.23 DST=10.243.0.72 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=1021 SEQ=2158185757 ACK=1702267848 WINDOW=14480 RES=0x00 ACK SYN URGP=0 OPT (020405B40402080AAFD03C6E064C51F601030306)
Nov 28 23:06:56 app12 kernel: [422975.088660] [INPUT] dropped IN=eth0 OUT= MAC=aa:01:60:00:87:d5:aa:01:60:00:78:82:08:00 SRC=10.243.0.23 DST=10.243.0.72 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=1021 WINDOW=14480 RES=0x00 ACK SYN URGP=0
Nov 28 23:06:56 app12 kernel: [422975.471702] nf_ct_tcp: SEQ is over the upper bound (over the window of the receiver) IN= OUT= SRC=10.243.0.23 DST=10.243.0.72 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=1021 SEQ=2158185757 ACK=1702267848 WINDOW=14480 RES=0x00 ACK SYN URGP=0 OPT (020405B40402080AAFD03CCE064C51F601030306)
Nov 28 23:06:56 app12 kernel: [422975.471718] [INPUT] dropped IN=eth0 OUT= MAC=aa:01:60:00:87:d5:aa:01:60:00:78:82:08:00 SRC=10.243.0.23 DST=10.243.0.72 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=1021 WINDOW=14480 RES=0x00 ACK SYN URGP=0

Note the "SEQ is over the upper bound" warning: connection tracking on the client classifies the brick's SYN-ACK as invalid, and the INPUT chain then drops it, so the handshake never completes.

Workarounds:
- unmount the problematic share and remount it (not feasible in a production environment), or
- set net.ipv4.netfilter.ip_conntrack_tcp_be_liberal to 1

Suggested improvements:
- Document the need to set net.ipv4.netfilter.ip_conntrack_tcp_be_liberal in some cases, or
- Let the Gluster client retry connections from a different source port (currently always 1020-1024?); this might help.
- ?
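
The check in step 3 of the reproduction steps can be scripted. The following is only a minimal sketch, not part of Gluster: the function name `find_stalled` and the `SAMPLE` constant are made up for illustration, and it assumes input in the same column layout as the `lsof -n` output quoted in this report. It lists glusterfs sockets that are stuck in SYN_SENT towards port 24007.

```python
import re

# Matches the TCP part of an `lsof -n` row, for example:
#   TCP 10.243.72.0:1023->10.243.0.23:24007 (SYN_SENT)
TCP_ROW = re.compile(
    r"TCP\s+(?P<local>\S+?):(?P<lport>\d+)->"
    r"(?P<remote>\S+?):(?P<rport>\d+)\s+\((?P<state>\w+)\)"
)

# Three rows copied from the "Actual results" section of this report.
SAMPLE = """\
glusterfs 31120 root 6u IPv4 3723188 0t0 TCP 10.243.72.0:1023->10.243.0.23:24007 (SYN_SENT)
glusterfs 31120 root 9u IPv4 590750 0t0 TCP 10.243.72.0:1019->10.243.0.24:49153 (ESTABLISHED)
glusterfs 31161 root 5u IPv4 3723912 0t0 TCP 10.243.72.0:1022->10.243.0.23:24007 (SYN_SENT)
"""


def find_stalled(lsof_output, port=24007, state="SYN_SENT"):
    """Return (pid, local, remote) for glusterfs sockets in `state` to `port`."""
    stalled = []
    for line in lsof_output.splitlines():
        fields = line.split()
        # Only consider rows whose command column is exactly "glusterfs".
        if len(fields) < 2 or fields[0] != "glusterfs":
            continue
        m = TCP_ROW.search(line)
        if m and m.group("state") == state and int(m.group("rport")) == port:
            local = f"{m.group('local')}:{m.group('lport')}"
            remote = f"{m.group('remote')}:{m.group('rport')}"
            stalled.append((int(fields[1]), local, remote))
    return stalled


if __name__ == "__main__":
    for pid, local, remote in find_stalled(SAMPLE):
        print(f"pid {pid}: {local} -> {remote} is stuck in SYN_SENT")
```

On a live client the same function could be fed the text produced by the `sudo lsof -n` command shown above. A single SYN_SENT row can also appear briefly during a normal connect, so a row is only suspicious if it persists across several runs.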
GlusterFS 3.7.0 has been released (http://www.gluster.org/pipermail/gluster-users/2015-May/021901.html), and the Gluster project maintains N-2 supported releases. The last two releases before 3.7 are still maintained; at the moment these are 3.6 and 3.5. This bug has been filed against the 3.4 release and will not be fixed in a 3.4 version any more. Please verify whether newer versions are affected by the reported problem. If that is the case, update the bug with a note, and update the version if you can. In case updating the version is not possible, leave a comment in this bug report with the version you tested, and set the "Need additional information the selected bugs from" below the comment box to "bugs". If there is no response by the end of the month, this bug will be closed automatically.
GlusterFS 3.4.x has reached end-of-life. If this bug still exists in a later release please reopen this and change the version or open a new bug.