Bug 984602 - [FEAT] Add explicit brick affinity
Summary: [FEAT] Add explicit brick affinity
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-07-15 14:13 UTC by Jeff Darcy
Modified: 2016-04-05 22:27 UTC
CC: 4 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-10-22 15:46:38 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Jeff Darcy 2013-07-15 14:13:25 UTC
Users need a way to specify which brick/subvolume they want a file to be on, e.g. to support VM failover.  This should be as user-friendly as possible, and survive subsequent rebalances.  Sometimes the necessary effect can be achieved by setting a custom layout on the directory, but other times it needs to be file rather than directory level.  A patch already exists as an RFC.

    http://review.gluster.org/#/c/5233/

Adding this bug so users who have already expressed an interest can weigh in.
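
For context, one existing way to see which brick a file currently lives on (independent of this patch) is the pathinfo virtual xattr; the mount point and path below are placeholders:

   getfattr -n trusted.glusterfs.pathinfo -e text /mnt/glusterfs/path/to/file.img

The proposed feature adds the complementary operation: telling DHT which brick the file *should* live on.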

Comment 1 pjameson 2013-07-15 17:58:48 UTC
I believe that this is actually perfect for our use case. We were looking to use a distribute/replica volume with NUFA for our VMs so that files land on a local subvolume when first created and reads don't end up being constrained by the network. However, if we change a VM's host, we want to be able to force the file over to a different replica pair. This feature seems to do exactly what we want.

I may be using the patched code incorrectly, but when I attempted to do a rebalance after adding the xattrs (added them to the fuse mounted volume, and verified that they showed up on each brick), no files were moved. I added a couple of debug lines, and it looks like trusted.affinity is not being pulled correctly in dht_handle_affinity.

Gluster volume setup (this is from a straight distribute setup, not the distribute/replica one):

gluster volume create test1  transport tcp node{1,2,3,4}:/mnt/raid/test1
# gluster volume info test1
 
Volume Name: test1
Type: Distribute
Volume ID: b8df5b92-b029-4552-867c-731340a58aaf
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: node1:/mnt/raid/test1
Brick2: node2:/mnt/raid/test1
Brick3: node3:/mnt/raid/test1
Brick4: node4:/mnt/raid/test1
Options Reconfigured:
cluster.nufa: 1

[root@node3 ~]# mount | grep test1
localhost:/test1 on /mnt/gluster1 type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)

[root@node3 ~]# getfattr  -d -m '.*' /mnt/gluster1/temp/
getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster1/temp/
trusted.affinity="test1-client-3"
trusted.distribute.migrate-data="test1-client-3"

- Ran the rebalance here

[root@node1 glusterfs]# gluster volume rebalance test1 start force
volume rebalance: test1: success: Starting rebalance on volume test1 has been successful.
ID: 3d0ab190-48f1-4284-b5c8-b24b79f5a52b
[root@node1 glusterfs]# gluster volume rebalance test1 status
                                    Node Rebalanced-files          size       scanned      failures         status run time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes             1             0      completed             0.00
                                   node3                0        0Bytes             1             0      completed             0.00
                                   node4                0        0Bytes             1             0      completed             0.00
                                   node2                0        0Bytes             1             0      completed             0.00
volume rebalance: test1: success: 

- Note here that the file didn't move after the rebalance:
$ for i in node{1..4}; do echo $i; ssh ${i} 'ls -lh /mnt/raid/test1/temp/test.img'; done
node1
ls: cannot access /mnt/raid/test1/temp/test.img: No such file or directory
node2
-rw-r--r-- 2 root root 100M Jul 15 12:37 /mnt/raid/test1/temp/test.img
node3
ls: cannot access /mnt/raid/test1/temp/test.img: No such file or directory
node4
ls: cannot access /mnt/raid/test1/temp/test.img: No such file or directory


Sorry if this is not the correct forum for discussing this specific patch; I wasn't sure whether to stick it here or in Gerrit.

Comment 2 Jeff Darcy 2013-07-16 17:37:53 UTC
The instructions in the commit message are both misleading and incomplete, which is my fault.  To set the affinity, do:

   setfattr -n system.affinity -v $brick_name $path

This gets namespace-flipped to trusted.affinity, so you can check with getfattr.  Then, to actually move the file where it should go:

   setfattr -n distribute.migrate-data -v force $path

Setting the value to "force" is actually important.  Without that, DHT might decide not to move the file after all, e.g. because the destination has less free space than the source.

I see that you did use "force" on your rebalance command.  That *should* work the same as doing the second setfattr above, but for some reason that doesn't seem to happen reliably (it does some of the time).  I'll look into that further, but the individual-file rebalance is probably more what you want anyway.
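
Putting the two steps from this comment together, a minimal end-to-end sequence would look roughly like the following (the file path and subvolume name are taken from the session above and are only illustrative):

   # pick the target subvolume and record it on the file
   setfattr -n system.affinity -v test1-client-3 /mnt/gluster1/temp/test.img

   # the value is stored as trusted.affinity, which getfattr can show
   getfattr -d -m '.*' /mnt/gluster1/temp/test.img

   # trigger the migration for this one file, ignoring free-space checks
   setfattr -n distribute.migrate-data -v force /mnt/gluster1/temp/test.img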

Comment 3 Anand Avati 2013-07-16 21:35:58 UTC
REVIEW: http://review.gluster.org/5233 (dht: add brick affinity) posted (#2) for review on master by Jeff Darcy (jdarcy)

Comment 4 pjameson 2013-07-17 13:58:04 UTC
Hello,

I tested this out yesterday on a pure distribute volume, and it seems to be working very well. There are a couple of things that would be really great to get, though. It seems right now that on volumes that have cluster.nufa == 1, the migrate-data xattr cannot be set:

# setfattr -n 'system.affinity' -v 'test1-client-1' /mnt/gluster1/temp/test.img && echo "Set affinity"; setfattr -n 'distribute.migrate-data' -v 'force' /mnt/gluster1/temp/test.img && echo "Set 'migrate-data' to 'force'"
Set affinity
setfattr: /mnt/gluster1/temp/test.img: Invalid argument

I'm not sure whether this just means that the dht/nufa translator needs to be fixed up, or what, but we had been intending to use NUFA so that newly created images would start out local, and we could use this patch to move them if necessary.

I did run a quick test, and it worked great with a distribute/replica volume, just not with nufa enabled.
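
For comparison, the distribute/replicate test volume could have been created with something like the following (the volume name, replica count, and brick paths are assumptions, not taken from the report):

   gluster volume create test2 replica 2 transport tcp node{1,2,3,4}:/mnt/raid/test2
   gluster volume start test2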

The second thing I saw was that when the migration was run while a VM was using the file, I'd get I/O errors from within the VM; the migration would finish, but none of the data was actually committed on the new node. That is, the VM had a sparse backing file, I started a dd inside it, and the file on the destination node still ended up with zero size.
I'm not sure whether this is a problem with libgfapi or with qemu. I did attempt something similar with a FUSE-mounted directory, and the migration went through without any I/O errors during the dd, but I don't recall whether the FUSE module had been ported to libgfapi yet.
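
For reference, the write test was roughly of this shape (the exact dd arguments, node, and path are assumptions; the point from the report is that writes issued inside the guest during migration failed and never reached the destination brick):

   # inside the guest, writing into the filesystem backed by the sparse image
   dd if=/dev/zero of=/root/ddtest.bin bs=1M count=100 oflag=direct

   # afterwards, on the destination brick: the file's size stayed at zero
   ssh node2 'ls -lsh /mnt/raid/test1/temp/test.img'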

Let me know if you need any more details.

Comment 5 Anand Avati 2013-07-17 18:36:14 UTC
REVIEW: http://review.gluster.org/5233 (dht: add brick affinity) posted (#3) for review on master by Jeff Darcy (jdarcy)

Comment 6 Anand Avati 2013-07-17 19:48:52 UTC
REVIEW: http://review.gluster.org/5233 (dht: add brick affinity) posted (#4) for review on master by Jeff Darcy (jdarcy)

Comment 7 Niels de Vos 2014-11-27 14:45:17 UTC
Feature requests make the most sense against the 'mainline' release; there is no ETA for an implementation, and requests might get forgotten when filed against a particular version.

Comment 8 Kaleb KEITHLEY 2015-10-22 15:46:38 UTC
The 'mainline' version is ambiguous and, because of the large number of bugs filed against it, is about to be removed as a choice.

If you believe this is still a bug, please change the status back to NEW and choose the appropriate, applicable version for it.

