Description of problem:
After gluster "heals" a file, the client gets a (POSIX?) I/O error.

Version-Release number of selected component (if applicable):
3.3.1

How reproducible:
Not sure

Steps to Reproduce:
1. Set up 2 nodes in a replicate scenario.
2. Connect a client to the gluster filesystem & start rsyncing.
3. Stop the gluster service on one node; the client continues rsyncing.
4. Wait 10 minutes, re-start the service & immediately stop the gluster service on the 2nd server.
5. Repeat Step #4 several times.
6. Stop the client rsync & run heal and then full heal.

Actual results:

======== evprodglx01 - server ========

root@evprodglx01:/mnt/gluster/data/bricks/1/data/home/alisa# gluster volume heal data info
Gathering Heal info on volume data has been successful

Brick evprodglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

Brick drglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

root@evprodglx01:/mnt/gluster/data/bricks/1/data/home/alisa# gluster volume heal data info heal-failed
Gathering Heal info on volume data has been successful

Brick evprodglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

Brick drglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

At the brick level:

root@evprodglx01:/mnt/gluster/data/bricks/1/data/home/alisa# openssl md5 .bash_history
MD5(.bash_history)= f10869ab49cd3a76513d598a16129d23

root@evprodglx01:/mnt/gluster/data/bricks/1/data/home/alisa# stat .bash_history
  File: `.bash_history'
  Size: 3201      Blocks: 16         IO Block: 4096   regular file
Device: fc00h/64512d    Inode: 120848615   Links: 2
Access: (0600/-rw-------)  Uid: ( 1605/   alisa)   Gid: ( 1000/  magcap)
Access: 2012-11-12 14:27:33.779455251 -0600
Modify: 2012-08-17 14:42:55.000000000 -0500
Change: 2012-11-12 09:28:02.721846226 -0600
 Birth: -

======== drglx01 - server ========

root@drglx01:/mnt/gluster/data/bricks/1/data/home/alisa# gluster volume heal data info
Gathering Heal info on volume data has been successful

Brick evprodglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

Brick drglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

root@drglx01:/mnt/gluster/data/bricks/1/data/home/alisa# gluster volume heal data info heal-failed
Gathering Heal info on volume data has been successful

Brick evprodglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

Brick drglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

At the brick level:

root@drglx01:/mnt/gluster/data/bricks/1/data/home/alisa# openssl md5 ./.bash_history
MD5(./.bash_history)= f10869ab49cd3a76513d598a16129d23

root@drglx01:/mnt/gluster/data/bricks/1/data/home/alisa# stat .bash_history
  File: `.bash_history'
  Size: 3201      Blocks: 16         IO Block: 4096   regular file
Device: fc00h/64512d    Inode: 156109053   Links: 2
Access: (0600/-rw-------)  Uid: ( 1605/   alisa)   Gid: ( 1000/  magcap)
Access: 2012-11-12 14:27:33.747118350 -0600
Modify: 2012-08-17 14:42:55.000000000 -0500
Change: 2012-11-12 10:20:54.421711315 -0600
 Birth: -

=========== ev-henderolx01 (client) ===========

evprodglx01:/data on /mnt/gluster/data type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)

hendero@ev-henderolx01:/mnt/gluster/data/data/home/alisa$ stat .bash_history
stat: cannot stat `.bash_history': Input/output error

hendero@ev-henderolx01:/mnt/gluster/data/data/home/alisa$ ls -la .bash_history
ls: cannot access .bash_history: Input/output error

hendero@ev-henderolx01:/mnt/gluster/data/data/home/alisa$ cd /
hendero@ev-henderolx01:/$ sudo umount /mnt/gluster/data
[sudo] password for hendero:
hendero@ev-henderolx01:/$ sudo mount -t glusterfs evprodglx01:/data /mnt/gluster/data

hendero@ev-henderolx01:/$ stat /mnt/gluster/data/data/home/alisa/.bash_history
stat: cannot stat `/mnt/gluster/data/data/home/alisa/.bash_history': Input/output error

hendero@ev-henderolx01:/$ ls -la /mnt/gluster/data/data/home/alisa/
ls: cannot access /mnt/gluster/data/data/home/alisa/.bashrc: Input/output error
ls: cannot access /mnt/gluster/data/data/home/alisa/.bash_logout: Input/output error
ls: cannot access /mnt/gluster/data/data/home/alisa/.bash_profile: Input/output error
ls: cannot access /mnt/gluster/data/data/home/alisa/.viminfo: Input/output error
ls: cannot access /mnt/gluster/data/data/home/alisa/.bash_history: Input/output error
total 24
drwxr-xr-x   6 alisa magcap 4096 Aug  8 14:06 .
drwxr-xr-x 147 root  magcap 4096 Nov  9 08:56 ..
drwxrwxrwx   2 root  root   4096 Jun  8 09:23 allocClient
??????????   ? ?     ?         ?            ? .bash_history
??????????   ? ?     ?         ?            ? .bash_logout
??????????   ? ?     ?         ?            ? .bash_profile
??????????   ? ?     ?         ?            ? .bashrc
drwx------   2 alisa magcap 4096 Jun  7 14:26 .cache
drwxr-xr-x   2 alisa magcap 4096 Jun  7 15:28 dev
drwxr-xr-x   2 alisa magcap 4096 Jun  7 15:30 t0
??????????   ? ?     ?         ?            ? .viminfo

Expected results:
A heal with no errors should not return errors to the client(s).

Additional info:
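For context, a rough sketch of the volume setup implied by the output above (the rsync source path is hypothetical and service names vary by distro; only the hostnames, volume name, and brick/mount paths are taken from this report):

gluster volume create data replica 2 evprodglx01:/mnt/gluster/data/bricks/1 drglx01:/mnt/gluster/data/bricks/1
gluster volume start data

# on the client
mount -t glusterfs evprodglx01:/data /mnt/gluster/data
rsync -a /srv/source/ /mnt/gluster/data/data/        # hypothetical source directory

# steps 3-5: alternately stop and re-start the gluster service on each server while
# the rsync runs, e.g. "service glusterfs-server stop" on Debian/Ubuntu (distro-dependent)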
I strongly suspect that those files are in "split brain" - changes unique to both copies, impossible for us to reconcile automatically, requiring manual intervention. You can check this by looking at the xattrs on the copies, e.g.

getfattr -d -e hex -m . /mnt/gluster/data/bricks/1/data/home/alisa/.bashrc

If those both show non-zero values, then we're in split brain.

What do you think we should do? Fail the self-heal, which might have covered hundreds of files, leaving no indication of which file(s) we couldn't heal? Returning a single status for multiple operations is a well-known Hard Problem, and we do the best we can: we let the heal succeed, indicating the status of the scanning process, and then log the failures in various ways. "gluster heal ... info split-brain" should show which file(s) failed. I've also submitted patches to aid in manual reconciliation:

http://review.gluster.org/#change,4132

If you want to *prevent* split brain, which is actually better than trying to deal with it after artificially inducing it, you could try turning on quorum enforcement.

http://hekafs.org/index.php/2011/11/quorum-enforcement/
http://hekafs.org/index.php/2012/11/different-forms-of-quorum/ (future)

Lastly, you might want to track the status of bug 873962, wherein we're dealing with a very similar scenario and ways to deal with it more gracefully.
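To illustrate the xattr check above (hypothetical values, not from this system): a genuinely split-brained file shows non-zero pending counters in the AFR changelog xattrs on *both* copies, each brick blaming the other. Something like:

# on brick 0 (evprodglx01); the first 4 bytes count pending data operations against the other replica
trusted.afr.data-client-1=0x000000050000000000000000
# on brick 1 (drglx01)
trusted.afr.data-client-0=0x000000030000000000000000

All zeros on both copies means no pending changelog entries, i.e. no classic data/metadata split brain.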
Gluster says it's not split-brained....

=====================
root@evprodglx01:~# gluster volume heal data info split-brain
Gathering Heal info on volume data has been successful

Brick evprodglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

Brick drglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

root@drglx01:~# gluster volume heal data info split-brain
Gathering Heal info on volume data has been successful

Brick evprodglx01:/mnt/gluster/data/bricks/1
Number of entries: 0

Brick drglx01:/mnt/gluster/data/bricks/1
Number of entries: 0
=====================

Here is the result of running the getfattr command:

root@evprodglx01:~# getfattr -d -e hex -m . /mnt/gluster/data/bricks/1/data/home/alisa/.bashrc
getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster/data/bricks/1/data/home/alisa/.bashrc
trusted.afr.data-client-0=0x000000000000000000000000
trusted.afr.data-client-1=0x000000000000000000000000
trusted.gfid=0x3ab32077b1c84edaae2303027ab24648

root@drglx01:~# getfattr -d -e hex -m . /mnt/gluster/data/bricks/1/data/home/alisa/.bashrc
getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster/data/bricks/1/data/home/alisa/.bashrc
trusted.afr.data-client-0=0x000000000000000000000000
trusted.afr.data-client-1=0x000000000000000000000000
trusted.gfid=0xff6a57c4bca2459b89c3e02249b33d16

I will read up on those other links. Thanks for the quick response!

Robert
Those links were very good, thank you. I'm not sure having a quorum would help in our test case (unless I'm misreading the 3rd link), because we have only 1 brick + 1 server on each side.

Thanks,
Robert
Hm. It's not classic split-brain, but a relative: GFID mismatch.

trusted.gfid=0x3ab32077b1c84edaae2303027ab24648
trusted.gfid=0xff6a57c4bca2459b89c3e02249b33d16

It seems like somehow rsync is retrying a create on one server that actually already succeeded on the other server, so we end up with two files instead of one. Need to think about that one a bit.
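To confirm a mismatch like this, it's enough to read just the gfid xattr of the suspect file on each brick and compare (a sketch using the brick path from this report; adjust for your layout):

# run on each server against its local brick copy
getfattr -n trusted.gfid -e hex /mnt/gluster/data/bricks/1/data/home/alisa/.bashrc

If the two hex values differ, the copies were created independently and AFR can't pair them up at lookup time, which is what surfaces as the Input/output error on the client.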
Re Comment 3: What I meant is that we have a pure replicate setup: 1 server (1 brick) -> 1 server (1 brick).
I think I know why this is happening. Imagine this scenario - an rsync client starts copying to a replicated setup:

a) Servers 1 + 2 get files 1-4.
b) Server 1 crashes while server 2 gets files 5-8.
c) rsync is stopped/cancelled.
d) Server 1 comes back up and receives file 5 from server 2.
e) Server 2 goes down.
f) rsync is continued... Server 1 now gets files 5-8 from the rsync client.
g) Server 2 comes back up.

Servers 1 & 2 now have the same files 5-8 with (most likely) different gfids. Rsync, when re-started, has copied (at least) files 6-8 anew because it doesn't see them existing on server 1, since the replication/healing didn't finish before the rsync was continued.
*** Bug 876222 has been marked as a duplicate of this bug. ***
(In reply to comment #6)
> Server 1 & 2 now have the same files 5-8 with (most likely) different gfids.
> Rsync, when re-started, has copied (at least) files 6-8 anew as it doesn't
> see them existing on Server1 since the replication/healing didn't finish
> before the rsync was continued.

As I examine this with fresh eyes, it looks like a pretty classic "split brain in time" scenario. In other words, the split brain happens not because of a network partition but because of alternating availability of the two servers. There's not really that much we can do about that, short of marking an entire previously-down brick as "bad" unless/until it completes a full self-heal cycle. That could be a long process, and in extreme cases might never complete, as updates continue to occur at the still-good brick faster than they can be propagated. Ultimately, availability could end up being worse than it is now, so that approach is not planned.

I'm going to mark this as WONTFIX, but with a suggestion that prevention is better than cure. If you enable either of the quorum-enforcement features available in newer versions, it becomes much harder to get into this situation.

There's client-side, replica-set-level quorum enforcement, controlled by the following options:

cluster.quorum-type
cluster.quorum-count

There's also server-side, cluster-level quorum enforcement, controlled by these options:

cluster.server-quorum-type
cluster.server-quorum-ratio

If http://review.gluster.org/#change,4363 is accepted, then server-side quorum will also be able to measure quorum across the servers for a volume instead of across all servers in the cluster. Note that 4363 also supports specification of an "arbiter" node that doesn't have any data for the volume but can break quorum ties.
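A hedged sketch of enabling these with the volume-set CLI (exact option availability depends on your GlusterFS version; "data" is the volume name from this report):

# client-side quorum for the replica set: writes are refused unless a majority
# (auto) or a fixed count of the replicas is reachable
gluster volume set data cluster.quorum-type auto
# or: gluster volume set data cluster.quorum-type fixed
#     gluster volume set data cluster.quorum-count 2

# server-side quorum: glusterd takes bricks offline when the cluster loses quorum
gluster volume set data cluster.server-quorum-type server
gluster volume set all cluster.server-quorum-ratio 51%

Note that with only two servers this trades availability for consistency: when one side is down, writes may be refused (client quorum) or bricks taken offline (server quorum) rather than allowing the two sides to diverge.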