Description of problem:
After gluster "heals" a file, the client gets a (POSIX?) I/O error.

Version-Release number of selected component (if applicable):
3.3.1

How reproducible:
Not sure

Steps to Reproduce:
1. Set up 2 nodes in a replicate scenario.
2. Connect a client to the gluster filesystem & start rsyncing.
3. Stop the gluster service on one node; the client continues rsyncing.
4. Wait 10 minutes, re-start the service & immediately stop the gluster service on the 2nd server.
5. Repeat Step #4 several times.
6. Stop the client rsync & run heal and then full heal.

Actual results:

======== evprodglx01 - server ========

root@evprodglx01:/mnt/gluster/data/bricks/1/data/home/alisa# gluster volume heal data info
Gathering Heal info on volume data has been successful

Brick evprodglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

Brick drglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

root@evprodglx01:/mnt/gluster/data/bricks/1/data/home/alisa# gluster volume heal data info heal-failed
Gathering Heal info on volume data has been successful

Brick evprodglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

Brick drglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

At the brick level:

root@evprodglx01:/mnt/gluster/data/bricks/1/data/home/alisa# openssl md5 .bash_history
MD5(.bash_history)= f10869ab49cd3a76513d598a16129d23

root@evprodglx01:/mnt/gluster/data/bricks/1/data/home/alisa# stat .bash_history
  File: `.bash_history'
  Size: 3201      Blocks: 16         IO Block: 4096   regular file
Device: fc00h/64512d    Inode: 120848615   Links: 2
Access: (0600/-rw-------)  Uid: ( 1605/   alisa)   Gid: ( 1000/  magcap)
Access: 2012-11-12 14:27:33.779455251 -0600
Modify: 2012-08-17 14:42:55.000000000 -0500
Change: 2012-11-12 09:28:02.721846226 -0600
 Birth: -

======== drglx01 - server ========

root@drglx01:/mnt/gluster/data/bricks/1/data/home/alisa# gluster volume heal data info
Gathering Heal info on volume data has been successful

Brick evprodglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

Brick drglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

root@drglx01:/mnt/gluster/data/bricks/1/data/home/alisa# gluster volume heal data info heal-failed
Gathering Heal info on volume data has been successful

Brick evprodglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

Brick drglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

At the brick level:

root@drglx01:/mnt/gluster/data/bricks/1/data/home/alisa# openssl md5 ./.bash_history
MD5(./.bash_history)= f10869ab49cd3a76513d598a16129d23

root@drglx01:/mnt/gluster/data/bricks/1/data/home/alisa# stat .bash_history
  File: `.bash_history'
  Size: 3201      Blocks: 16         IO Block: 4096   regular file
Device: fc00h/64512d    Inode: 156109053   Links: 2
Access: (0600/-rw-------)  Uid: ( 1605/   alisa)   Gid: ( 1000/  magcap)
Access: 2012-11-12 14:27:33.747118350 -0600
Modify: 2012-08-17 14:42:55.000000000 -0500
Change: 2012-11-12 10:20:54.421711315 -0600
 Birth: -

=========== ev-henderolx01 (client) ===========

evprodglx01:/data on /mnt/gluster/data type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)

hendero@ev-henderolx01:/mnt/gluster/data/data/home/alisa$ stat .bash_history
stat: cannot stat `.bash_history': Input/output error

hendero@ev-henderolx01:/mnt/gluster/data/data/home/alisa$ ls -la .bash_history
ls: cannot access .bash_history: Input/output error

hendero@ev-henderolx01:/mnt/gluster/data/data/home/alisa$ cd /
hendero@ev-henderolx01:/$ sudo umount /mnt/gluster/data
[sudo] password for hendero:
hendero@ev-henderolx01:/$ sudo mount -t glusterfs evprodglx01:/data /mnt/gluster/data

hendero@ev-henderolx01:/$ stat /mnt/gluster/data/data/home/alisa/.bash_history
stat: cannot stat `/mnt/gluster/data/data/home/alisa/.bash_history': Input/output error

hendero@ev-henderolx01:/$ ls -la /mnt/gluster/data/data/home/alisa/
ls: cannot access /mnt/gluster/data/data/home/alisa/.bashrc: Input/output error
ls: cannot access /mnt/gluster/data/data/home/alisa/.bash_logout: Input/output error
ls: cannot access /mnt/gluster/data/data/home/alisa/.bash_profile: Input/output error
ls: cannot access /mnt/gluster/data/data/home/alisa/.viminfo: Input/output error
ls: cannot access /mnt/gluster/data/data/home/alisa/.bash_history: Input/output error
total 24
drwxr-xr-x   6 alisa magcap 4096 Aug  8 14:06 .
drwxr-xr-x 147 root  magcap 4096 Nov  9 08:56 ..
drwxrwxrwx   2 root  root   4096 Jun  8 09:23 allocClient
??????????   ? ?     ?         ?            ? .bash_history
??????????   ? ?     ?         ?            ? .bash_logout
??????????   ? ?     ?         ?            ? .bash_profile
??????????   ? ?     ?         ?            ? .bashrc
drwx------   2 alisa magcap 4096 Jun  7 14:26 .cache
drwxr-xr-x   2 alisa magcap 4096 Jun  7 15:28 dev
drwxr-xr-x   2 alisa magcap 4096 Jun  7 15:30 t0
??????????   ? ?     ?         ?            ? .viminfo

Expected results:
A heal with no errors should not return errors to the client(s).

Additional info:
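For context, a rough sketch of the volume setup implied by the output above (the rsync source path is hypothetical and service names vary by distro; only the hostnames, volume name, and brick/mount paths are taken from this report):

gluster volume create data replica 2 evprodglx01:/mnt/gluster/data/bricks/1 drglx01:/mnt/gluster/data/bricks/1
gluster volume start data

# on the client
mount -t glusterfs evprodglx01:/data /mnt/gluster/data
rsync -a /srv/source/ /mnt/gluster/data/data/        # hypothetical source directory

# steps 3-5: alternately stop and re-start the gluster service on each server while
# the rsync runs, e.g. "service glusterfs-server stop" on Debian/Ubuntu (distro-dependent)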
I strongly suspect that those files are in "split brain" - changes unique to both copies, impossible for us to reconcile automatically, requiring manual intervention. You can check this by looking at the xattrs on the copies, e.g.

getfattr -d -e hex -m . /mnt/gluster/data/bricks/1/data/home/alisa/.bashrc

If those both show non-zero values, then we're in split brain.

What do you think we should do? Fail the self-heal, which might have covered hundreds of files, leaving no indication of which file(s) we couldn't heal? Returning a single status for multiple operations is a well-known Hard Problem, and we do the best we can: we let the heal succeed, indicating the status of the scanning process, and then log the failures in various ways. "gluster heal ... info split-brain" should show which file(s) failed. I've also submitted patches to aid in manual reconciliation:

http://review.gluster.org/#change,4132

If you want to *prevent* split brain, which is actually better than trying to deal with it after artificially inducing it, you could try turning on quorum enforcement.

http://hekafs.org/index.php/2011/11/quorum-enforcement/
http://hekafs.org/index.php/2012/11/different-forms-of-quorum/ (future)

Lastly, you might want to track the status of bug 873962, wherein we're dealing with a very similar scenario and ways to deal with it more gracefully.
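To illustrate the xattr check above (hypothetical values, not from this system): a genuinely split-brained file shows non-zero pending counters in the AFR changelog xattrs on *both* copies, each brick blaming the other. Something like:

# on brick 0 (evprodglx01); the first 4 bytes count pending data operations against the other replica
trusted.afr.data-client-1=0x000000050000000000000000
# on brick 1 (drglx01)
trusted.afr.data-client-0=0x000000030000000000000000

All zeros on both copies means no pending changelog entries, i.e. no classic data/metadata split brain.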
Gluster says it's not split-brained....

=====================
root@evprodglx01:~# gluster volume heal data info split-brain
Gathering Heal info on volume data has been successful

Brick evprodglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

Brick drglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

root@drglx01:~# gluster volume heal data info split-brain
Gathering Heal info on volume data has been successful

Brick evprodglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

Brick drglx01:/mnt/gluster/data/bricks/1
Number of entries: 0
=====================

Here is the result of running the getfattr command:

root@evprodglx01:~# getfattr -d -e hex -m . /mnt/gluster/data/bricks/1/data/home/alisa/.bashrc
getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster/data/bricks/1/data/home/alisa/.bashrc
trusted.afr.data-client-0=0x000000000000000000000000
trusted.afr.data-client-1=0x000000000000000000000000
trusted.gfid=0x3ab32077b1c84edaae2303027ab24648

root@drglx01:~# getfattr -d -e hex -m . /mnt/gluster/data/bricks/1/data/home/alisa/.bashrc
getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster/data/bricks/1/data/home/alisa/.bashrc
trusted.afr.data-client-0=0x000000000000000000000000
trusted.afr.data-client-1=0x000000000000000000000000
trusted.gfid=0xff6a57c4bca2459b89c3e02249b33d16

I will read up on those other links. Thanks for the quick response!

Robert
Those links were very good, thank you. I'm not sure having a quorum would help in our test case (unless I'm misreading the 3rd link), because we have only 1 brick + 1 server on each side.

Thanks,
Robert
Hm. It's not classic split-brain, but a relative: GFID mismatch.

trusted.gfid=0x3ab32077b1c84edaae2303027ab24648
trusted.gfid=0xff6a57c4bca2459b89c3e02249b33d16

It seems like somehow rsync is retrying a create on one server that actually already succeeded on the other server, so we end up with two files instead of one. Need to think about that one a bit.
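To confirm a mismatch like this, it's enough to read just the gfid xattr of the suspect file on each brick and compare (a sketch using the brick path from this report; adjust for your layout):

# run on each server against its local brick copy
getfattr -n trusted.gfid -e hex /mnt/gluster/data/bricks/1/data/home/alisa/.bashrc

If the two hex values differ, the copies were created independently and AFR can't pair them up at lookup time, which is what surfaces as the Input/output error on the client.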
Re Comment 3: What I meant is that we have a pure replicate setup: 1 server (1 brick) -> 1 server (1 brick).
I think I know why this is happening. Imagine this scenario - an rsync client starts copying to a replicated setup:

a) Servers 1 + 2 get files 1-4.
b) Server 1 crashes while server 2 gets files 5-8.
c) rsync is stopped/cancelled.
d) Server 1 comes back up and receives file 5 from server 2.
e) Server 2 goes down.
f) rsync is continued... Server 1 now gets files 5-8 from the rsync client.
g) Server 2 comes back up.

Servers 1 & 2 now have the same files 5-8 with (most likely) different gfids. Rsync, when re-started, has copied (at least) files 6-8 anew because it doesn't see them existing on server 1, since the replication/healing didn't finish before the rsync was continued.
*** Bug 876222 has been marked as a duplicate of this bug. ***
(In reply to comment #6)
> Server 1 & 2 now have the same files 5-8 with (most likely) different gfids.
> Rsync, when re-started, has copied (at least) files 6-8 anew as it doesn't
> see them existing on Server1 since the replication/healing didn't finish
> before the rsync was continued.

As I examine this with fresh eyes, it looks like a pretty classic "split brain in time" scenario. In other words, the split brain happens not because of a network partition but because of alternating availability of the two servers. There's not really that much we can do about that, short of marking an entire previously-down brick as "bad" unless/until it completes a full self-heal cycle. That could be a long process, and in extreme cases might never complete, as updates continue to occur at the still-good brick faster than they can be propagated. Ultimately, availability could end up being worse than it is now, so that approach is not planned.

I'm going to mark this as WONTFIX, but with a suggestion that prevention is better than cure. If you enable either of the quorum-enforcement features available in newer versions, it becomes much harder to get into this situation.

There's client-side, replica-set-level quorum enforcement, controlled by the following options:

cluster.quorum-type
cluster.quorum-count

There's also server-side, cluster-level quorum enforcement, controlled by these options:

cluster.server-quorum-type
cluster.server-quorum-ratio

If http://review.gluster.org/#change,4363 is accepted, then server-side quorum will also be able to measure quorum across the servers for a volume instead of across all servers in the cluster. Note that 4363 also supports specification of an "arbiter" node that doesn't have any data for the volume but can break quorum ties.
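A hedged sketch of enabling these with the volume-set CLI (exact option availability depends on your GlusterFS version; "data" is the volume name from this report):

# client-side quorum for the replica set: writes are refused unless a majority
# (auto) or a fixed count of the replicas is reachable
gluster volume set data cluster.quorum-type auto
# or: gluster volume set data cluster.quorum-type fixed
#     gluster volume set data cluster.quorum-count 2

# server-side quorum: glusterd takes bricks offline when the cluster loses quorum
gluster volume set data cluster.server-quorum-type server
gluster volume set all cluster.server-quorum-ratio 51%

Note that with only two servers this trades availability for consistency: when one side is down, writes may be refused (client quorum) or bricks taken offline (server quorum) rather than allowing the two sides to diverge.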