Bug 862332 - migrated data with "remove-brick start" unavailable until commit
Summary: migrated data with "remove-brick start" unavailable until commit
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.3.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: unspecified
Target Milestone: ---
Assignee: shishir gowda
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 867351
 
Reported: 2012-10-02 16:37 UTC by Shawn Heisey
Modified: 2013-12-09 01:33 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 867351
Environment:
Last Closed: 2012-12-05 07:27:50 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
listing of parent directory for file with odd permissions (50.97 KB, text/plain)
2012-10-23 20:03 UTC, Shawn Heisey
rebalance log which covers all four passes of remove-brick (1.21 MB, application/octet-stream)
2012-10-23 20:05 UTC, Shawn Heisey

Description Shawn Heisey 2012-10-02 16:37:14 UTC
Description of problem:
In order to retire one or more bricks from a volume, you must do a 'remove-brick start' operation, followed by 'remove-brick commit' when the migration is complete.  When doing this, each file that gets migrated becomes unavailable to the clients.  Issuing the commit operation makes all migrated files available again.
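For reference, the retire-a-brick cycle being described looks roughly like the following (volume, server, and brick names are placeholders; the exact commands used in this test appear in comments 11 and 17):

# start migrating data off the bricks being retired
gluster volume remove-brick VOLNAME server1:/bricks/bN/VOLNAME server2:/bricks/bN/VOLNAME start

# poll until every node reports "completed", and note the failures column
gluster volume remove-brick VOLNAME server1:/bricks/bN/VOLNAME server2:/bricks/bN/VOLNAME status

# finalize the removal; per this report, migrated files only become accessible again at this point
gluster volume remove-brick VOLNAME server1:/bricks/bN/VOLNAME server2:/bricks/bN/VOLNAME commit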

Steps to Reproduce:
1. Begin migrating data off a brick with the remove-brick start command.
2. Check the $volume-rebalance log to find a file that has been migrated.
3. Try to access the file found in step 2; access will fail. (Steps 2 and 3 are sketched just after this list.)
4. Wait for the migration to complete.
5. If the status output reports failures, run the remove-brick start operation again and wait for that pass to complete.
6. Issue the remove-brick commit operation.
7. Try to access the file again.  It will succeed.
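A minimal sketch of steps 2 and 3, assuming the client FUSE mount is at /mnt/VOLNAME and the rebalance log is in the usual /var/log/glusterfs location (the paths and the exact wording of the log messages are assumptions):

# on the server running the migration: find files the rebalance log reports as migrated
# (exact message text may vary by version)
grep -i migrat /var/log/glusterfs/VOLNAME-rebalance.log | tail -n 20

# on a client: try to access one of those files; per this report it fails until commit
stat /mnt/VOLNAME/path/to/migrated-file
cat /mnt/VOLNAME/path/to/migrated-file > /dev/null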

Actual results:
Each migrated file is unavailable from the time it gets migrated until the commit operation is performed.

Expected results:
Each file should remain available after it is migrated.  The commit operation should not be required in order to keep accessing the data; it should simply finalize the removal, or (when explicitly requested) force removal, with data loss, if no migration has been done.


Additional info:

Bug 770346 is similar, though apparently with that bug, the data was completely lost even after the commit.

The migration seems to be prone to failures on individual files.  The only notification of such failures is a count in the 'status' output.  Failures are guaranteed when the available disk space on one or more bricks is less than the amount of used space on the brick being removed, even if the volume as a whole has plenty of space.  I will file a separate bug for that problem.
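A rough, purely illustrative pre-check for that free-space condition (brick paths match this test setup; adjust as needed):

# on the server holding the brick to be retired: how much data has to move off it?
df -k /bricks/b4

# on every server: how much room do the remaining bricks have to absorb it?
df -k /bricks/b1 /bricks/b2 /bricks/b3

# if any remaining brick's available space is smaller than the outgoing brick's
# used space, migration failures of the kind described above are to be expected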

I did my tests with a 4x2 distribute-replicate volume living on two nodes (each with 4 bricks), removing both replicas of the last brick.  It is likely that the same problem would happen on a pure distribute volume, but I have not tested it.

I expect to start off with 4TB drives, one brick per drive, and each brick will contain several million files.  Migrating the data off such a brick will take several hours.  We cannot afford to have that much data be unavailable for that much time.  Someday the servers with the 4TB drives will be ancient, ready for retirement.

Comment 1 Shawn Heisey 2012-10-02 17:10:23 UTC
If the volume starts out more than half full, you are likely to run into Bug 862347 at step 4.

Comment 2 shishir gowda 2012-10-03 05:42:29 UTC
Hi Shawn,

Please attach the client logs (mount process) where the lookup of such files fails. The remove-brick logs related to the files in question would also help.

Comment 3 Shawn Heisey 2012-10-03 19:02:03 UTC
As noted on Bug 862347, I completed one remove-brick run and did not run into this bug.  As of the end of that first run, all migrated files seem to be still accessible.  I will see what happens during subsequent remove-brick runs.

Comment 4 Shawn Heisey 2012-10-03 19:38:36 UTC
During the first round of testing for Bug 862347, I did not run into this bug at all.  I have no idea what's different between this run and the one where everything was unavailable.

I do plan to do another round of testing after completely deleting the volume and starting over.

Comment 5 Amar Tumballi 2012-10-23 14:17:58 UTC
It is possible that you hit bug 852361 in the earlier testing. Can you please share the output of 'gluster volume info' and 'gluster volume status <VOL> detail'?

If you are not able to hit this bug in another few runs, we would like to close it as WORKSFORME.

Comment 6 Shawn Heisey 2012-10-23 16:56:25 UTC
The files were definitely not owned by root, but as far as I know, nothing had them open at the time.  I haven't had time to get back to this testing, but I certainly hope to do so soon.

The output below is not from the volume I was testing with at the time, but it was on the same hardware and the setup is the same:

[root@testb1 ~]# gluster volume info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 182df850-96f3-4d69-95b9-18e9ea409dfb
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: testb1:/bricks/b1/testvol
Brick2: testb2:/bricks/b1/testvol
Brick3: testb1:/bricks/b2/testvol
Brick4: testb2:/bricks/b2/testvol
Brick5: testb1:/bricks/b3/testvol
Brick6: testb2:/bricks/b3/testvol
Brick7: testb1:/bricks/b4/testvol
Brick8: testb2:/bricks/b4/testvol

Volume Name: flubber
Type: Distributed-Replicate
Volume ID: f936fc99-cebc-4ff5-b52f-51ad57ba211a
Status: Created
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: testb1:/bricks/b1/flubber
Brick2: testb2:/bricks/b1/flubber
Brick3: testb1:/bricks/b2/flubber
Brick4: testb2:/bricks/b2/flubber
Brick5: testb1:/bricks/b3/flubber
Brick6: testb2:/bricks/b3/flubber
Brick7: testb1:/bricks/b4/flubber
Brick8: testb2:/bricks/b4/flubber
[root@testb1 ~]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root
                      49537840   3357692  43663772   8% /
tmpfs                  1914332         0   1914332   0% /dev/shm
/dev/md1               1032076    127836    851812  14% /boot
/dev/sda3            922833364   6370736 916462628   1% /bricks/b1
/dev/sdb3            922833364   6464812 916368552   1% /bricks/b2
/dev/sdc3            922833364   5977444 916855920   1% /bricks/b3
/dev/sdd3            922833364   6355516 916477848   1% /bricks/b4
[root@testb1 ~]# mount
/dev/mapper/vg_main-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/md1 on /boot type ext4 (rw)
/dev/sda3 on /bricks/b1 type xfs (rw,noatime,nodiratime,nobarrier,inode64)
/dev/sdb3 on /bricks/b2 type xfs (rw,noatime,nodiratime,nobarrier,inode64)
/dev/sdc3 on /bricks/b3 type xfs (rw,noatime,nodiratime,nobarrier,inode64)
/dev/sdd3 on /bricks/b4 type xfs (rw,noatime,nodiratime,nobarrier,inode64)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)

Comment 7 Shawn Heisey 2012-10-23 17:23:38 UTC
I realized that I did not include one of your requests.  Looks like df and mount give you most of this information, though.

[root@testb1 ~]# gluster volume status testvol detail
Status of volume: testvol
------------------------------------------------------------------------------
Brick                : Brick testb1:/bricks/b1/testvol
Port                 : 24009
Online               : Y
Pid                  : 1758
File System          : xfs
Device               : /dev/sda3
Mount Options        : rw,noatime,nodiratime,nobarrier,inode64
Inode Size           : 1024
Disk Space Free      : 874.0GB
Total Disk Space     : 880.1GB
Inode Count          : 230820992
Free Inodes          : 230813115
------------------------------------------------------------------------------
Brick                : Brick testb2:/bricks/b1/testvol
Port                 : 24009
Online               : Y
Pid                  : 1730
File System          : xfs
Device               : /dev/sda3
Mount Options        : rw,noatime,nodiratime,nobarrier,inode64
Inode Size           : 1024
Disk Space Free      : 874.0GB
Total Disk Space     : 880.1GB
Inode Count          : 230820992
Free Inodes          : 230813115
------------------------------------------------------------------------------
Brick                : Brick testb1:/bricks/b2/testvol
Port                 : 24010
Online               : Y
Pid                  : 1763
File System          : xfs
Device               : /dev/sdb3
Mount Options        : rw,noatime,nodiratime,nobarrier,inode64
Inode Size           : 1024
Disk Space Free      : 873.9GB
Total Disk Space     : 880.1GB
Inode Count          : 230820992
Free Inodes          : 230813132
------------------------------------------------------------------------------
Brick                : Brick testb2:/bricks/b2/testvol
Port                 : 24010
Online               : Y
Pid                  : 1735
File System          : xfs
Device               : /dev/sdb3
Mount Options        : rw,noatime,nodiratime,nobarrier,inode64
Inode Size           : 1024
Disk Space Free      : 873.9GB
Total Disk Space     : 880.1GB
Inode Count          : 230820992
Free Inodes          : 230813132
------------------------------------------------------------------------------
Brick                : Brick testb1:/bricks/b3/testvol
Port                 : 24011
Online               : Y
Pid                  : 1769
File System          : xfs
Device               : /dev/sdc3
Mount Options        : rw,noatime,nodiratime,nobarrier,inode64
Inode Size           : 1024
Disk Space Free      : 874.4GB
Total Disk Space     : 880.1GB
Inode Count          : 230820992
Free Inodes          : 230813028
------------------------------------------------------------------------------
Brick                : Brick testb2:/bricks/b3/testvol
Port                 : 24011
Online               : Y
Pid                  : 1742
File System          : xfs
Device               : /dev/sdc3
Mount Options        : rw,noatime,nodiratime,nobarrier,inode64
Inode Size           : 1024
Disk Space Free      : 874.4GB
Total Disk Space     : 880.1GB
Inode Count          : 230820992
Free Inodes          : 230813028
------------------------------------------------------------------------------
Brick                : Brick testb1:/bricks/b4/testvol
Port                 : 24012
Online               : Y
Pid                  : 1776
File System          : xfs
Device               : /dev/sdd3
Mount Options        : rw,noatime,nodiratime,nobarrier,inode64
Inode Size           : 1024
Disk Space Free      : 874.0GB
Total Disk Space     : 880.1GB
Inode Count          : 230820992
Free Inodes          : 230813177
------------------------------------------------------------------------------
Brick                : Brick testb2:/bricks/b4/testvol
Port                 : 24012
Online               : Y
Pid                  : 1747
File System          : xfs
Device               : /dev/sdd3
Mount Options        : rw,noatime,nodiratime,nobarrier,inode64
Inode Size           : 1024
Disk Space Free      : 874.0GB
Total Disk Space     : 880.1GB
Inode Count          : 230820992
Free Inodes          : 230813177

Comment 8 Shawn Heisey 2012-10-23 19:29:38 UTC
I have set up a new test with 4GiB bricks.  This will simultaneously serve as a test of bug 862347.  I limited the size of the XFS filesystems with mkfs.xfs '-d size=4g' (a sketch of the likely brick setup appears at the end of this comment).  Server info gathered after I filled the volume to slightly over half full:

[root@testb1 ~]# gluster volume status testvol detail
Status of volume: testvol
------------------------------------------------------------------------------
Brick                : Brick testb1:/bricks/b1/testvol
Port                 : 24013
Online               : Y
Pid                  : 9751
File System          : xfs
Device               : /dev/sda3
Mount Options        : rw,noatime,nodiratime,nobarrier,inode64
Inode Size           : 1024
Disk Space Free      : 1.8GB
Total Disk Space     : 4.0GB
Inode Count          : 1048576
Free Inodes          : 1047782
------------------------------------------------------------------------------
Brick                : Brick testb2:/bricks/b1/testvol
Port                 : 24013
Online               : Y
Pid                  : 9550
File System          : xfs
Device               : /dev/sda3
Mount Options        : rw,noatime,nodiratime,nobarrier,inode64
Inode Size           : 1024
Disk Space Free      : 1.8GB
Total Disk Space     : 4.0GB
Inode Count          : 1048576
Free Inodes          : 1047782
------------------------------------------------------------------------------
Brick                : Brick testb1:/bricks/b2/testvol
Port                 : 24014
Online               : Y
Pid                  : 9756
File System          : xfs
Device               : /dev/sdb3
Mount Options        : rw,noatime,nodiratime,nobarrier,inode64
Inode Size           : 1024
Disk Space Free      : 1.6GB
Total Disk Space     : 4.0GB
Inode Count          : 1048576
Free Inodes          : 1047748
------------------------------------------------------------------------------
Brick                : Brick testb2:/bricks/b2/testvol
Port                 : 24014
Online               : Y
Pid                  : 9556
File System          : xfs
Device               : /dev/sdb3
Mount Options        : rw,noatime,nodiratime,nobarrier,inode64
Inode Size           : 1024
Disk Space Free      : 1.6GB
Total Disk Space     : 4.0GB
Inode Count          : 1048576
Free Inodes          : 1047748
------------------------------------------------------------------------------
Brick                : Brick testb1:/bricks/b3/testvol
Port                 : 24015
Online               : Y
Pid                  : 9762
File System          : xfs
Device               : /dev/sdc3
Mount Options        : rw,noatime,nodiratime,nobarrier,inode64
Inode Size           : 1024
Disk Space Free      : 1.7GB
Total Disk Space     : 4.0GB
Inode Count          : 1048576
Free Inodes          : 1047747
------------------------------------------------------------------------------
Brick                : Brick testb2:/bricks/b3/testvol
Port                 : 24015
Online               : Y
Pid                  : 9561
File System          : xfs
Device               : /dev/sdc3
Mount Options        : rw,noatime,nodiratime,nobarrier,inode64
Inode Size           : 1024
Disk Space Free      : 1.7GB
Total Disk Space     : 4.0GB
Inode Count          : 1048576
Free Inodes          : 1047747
------------------------------------------------------------------------------
Brick                : Brick testb1:/bricks/b4/testvol
Port                 : 24016
Online               : Y
Pid                  : 9768
File System          : xfs
Device               : /dev/sdd3
Mount Options        : rw,noatime,nodiratime,nobarrier,inode64
Inode Size           : 1024
Disk Space Free      : 1.7GB
Total Disk Space     : 4.0GB
Inode Count          : 1048576
Free Inodes          : 1047741
------------------------------------------------------------------------------
Brick                : Brick testb2:/bricks/b4/testvol
Port                 : 24016
Online               : Y
Pid                  : 9567
File System          : xfs
Device               : /dev/sdd3
Mount Options        : rw,noatime,nodiratime,nobarrier,inode64
Inode Size           : 1024
Disk Space Free      : 1.7GB
Total Disk Space     : 4.0GB
Inode Count          : 1048576
Free Inodes          : 1047741


[root@testb1 ~]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root
                      49537840   3357664  43663800   8% /
tmpfs                  1914332         0   1914332   0% /dev/shm
/dev/md1               1032076    127836    851812  14% /boot
/dev/sda3              4184064   2282448   1901616  55% /bricks/b1
/dev/sdb3              4184064   2489576   1694488  60% /bricks/b2
/dev/sdc3              4184064   2394880   1789184  58% /bricks/b3
/dev/sdd3              4184064   2442244   1741820  59% /bricks/b4

[root@testb2 ~]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root
                      49537840   3436492  43584972   8% /
tmpfs                  1914332         0   1914332   0% /dev/shm
/dev/md1               1032076    122412    857236  13% /boot
/dev/sda3              4184064   2282448   1901616  55% /bricks/b1
/dev/sdb3              4184064   2489576   1694488  60% /bricks/b2
/dev/sdc3              4184064   2394880   1789184  58% /bricks/b3
/dev/sdd3              4184064   2442244   1741820  59% /bricks/b4
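The report does not show the exact filesystem-creation commands, but based on the '-d size=4g' mentioned above and the inode size and mount options visible in the status output, the bricks were presumably set up along these lines (device names per this test; treat the details as an assumption):

# create a 4GiB XFS filesystem on each brick partition (inode size 1024 matches the status output)
mkfs.xfs -f -i size=1024 -d size=4g /dev/sdd3

# mount with the options shown in the 'mount' output earlier in this bug
mount -o noatime,nodiratime,nobarrier,inode64 /dev/sdd3 /bricks/b4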

Comment 9 Shawn Heisey 2012-10-23 19:31:39 UTC
Info from the client, gathered concurrently with the server info above.  All files were created by a non-root user on the client.  The client happens to be a gluster-swift UFO server, hence the mount point.  Files were created with random sizes between 16KiB and 16MiB (a hypothetical sketch of how such data could be generated appears at the end of this comment).

[elyograg@testb3 foo]$ pwd
/mnt/gluster-object/AUTH_testvol/foo

[elyograg@testb3 foo]$ df -k .
Filesystem        1K-blocks    Used Available Use% Mounted on
localhost:testvol  16736256 9609216   7127040  58% /mnt/gluster-object/AUTH_testvol

[elyograg@testb3 foo]$ du
6707257 ./g1lpxY/E81phT/GPPjeB/tW8tvq/iWbIBK/N.R7Fw/JqI9N2
6707257 ./g1lpxY/E81phT/GPPjeB/tW8tvq/iWbIBK/N.R7Fw
6707257 ./g1lpxY/E81phT/GPPjeB/tW8tvq/iWbIBK
6707257 ./g1lpxY/E81phT/GPPjeB/tW8tvq
6707257 ./g1lpxY/E81phT/GPPjeB
6707257 ./g1lpxY/E81phT
6707257 ./g1lpxY
1430245 ./sC.CIW/aie4AJ/QzJ2W2/Jcx0hG/tjF-.K/NOGlG9
1430245 ./sC.CIW/aie4AJ/QzJ2W2/Jcx0hG/tjF-.K
1430245 ./sC.CIW/aie4AJ/QzJ2W2/Jcx0hG
1430245 ./sC.CIW/aie4AJ/QzJ2W2
1430245 ./sC.CIW/aie4AJ
1430245 ./sC.CIW
1334417 ./25hHAg/FktfVS/pWETA-
1334417 ./25hHAg/FktfVS
1334417 ./25hHAg
9471919 .

[elyograg@testb3 foo]$ find . -type f | wc -l
1165
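The script that generated this data set is not included in the report; a hypothetical sketch that would produce a similar layout (random file sizes between 16 KiB and 16 MiB, written by a non-root user under the client mount) might look like this:

#!/bin/bash
# hypothetical test-data generator; not taken from the actual test setup
DEST=/mnt/gluster-object/AUTH_testvol/foo
mkdir -p "$DEST"
for i in $(seq 1 1200); do
    size_kb=$(( (RANDOM % 16369) + 16 ))      # 16 KiB up to 16 MiB, in 1 KiB units
    dir="$DEST/dir$(( i / 100 ))"             # spread files across a few subdirectories
    mkdir -p "$dir"
    dd if=/dev/urandom of="$dir/file$i" bs=1024 count=$size_kb 2>/dev/null
done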

Comment 10 Shawn Heisey 2012-10-23 19:36:03 UTC
Before beginning the migration, I deleted everything in /var/log/glusterfs on the first server and restarted glusterd so I would have clean logfiles.
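A hedged sketch of that cleanup on an EL6 system (glusterd is the init-script name from the glusterfs-server package; the exact commands used are not shown in the report):

rm -rf /var/log/glusterfs/*       # clear old logs on the first server
service glusterd restart          # restart the management daemon so fresh logs are written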

[root@testb1 glusterfs]# rpm -qa | grep gluster
glusterfs-server-3.3.1-1.el6.x86_64
glusterfs-fuse-3.3.1-1.el6.x86_64
glusterfs-geo-replication-3.3.1-1.el6.x86_64
glusterfs-3.3.1-1.el6.x86_64

Comment 11 Shawn Heisey 2012-10-23 19:36:59 UTC
Command entered to begin migration:
gluster volume remove-brick testvol testb1:/bricks/b4/testvol testb2:/bricks/b4/testvol start

Comment 12 Shawn Heisey 2012-10-23 19:48:04 UTC
I can see that there were some permission problems after the first pass of the rebalance completed, with 9 failures.  This is on the client:

[elyograg@testb3 AUTH_testvol]$ ls -al foo/g1lpxY/E81phT/GPPjeB/tW8tvq/iWbIBK/N.R7Fw/JqI9N2/x39Oei34
---------T 1 root root 0 Oct 23 13:40 foo/g1lpxY/E81phT/GPPjeB/tW8tvq/iWbIBK/N.R7Fw/JqI9N2/x39Oei34

Comment 13 Shawn Heisey 2012-10-23 19:52:55 UTC
Later, the ownership of that entry changed, but the permissions did not update:

[elyograg@testb3 AUTH_testvol]$ ls -al foo/g1lpxY/E81phT/GPPjeB/tW8tvq/iWbIBK/N.R7Fw/JqI9N2/x39Oei34
---------T 1 elyograg elyograg 0 Oct 23 13:40 foo/g1lpxY/E81phT/GPPjeB/tW8tvq/iWbIBK/N.R7Fw/JqI9N2/x39Oei34

I will attach the full listing of the parent directory of this item so you can see that there are other entries in that directory with the odd permissions.

Comment 14 Shawn Heisey 2012-10-23 19:59:08 UTC
When I checked that same file during the fourth pass of remove-brick, it had corrected itself completely:

[elyograg@testb3 AUTH_testvol]$ ls -al foo/g1lpxY/E81phT/GPPjeB/tW8tvq/iWbIBK/N.R7Fw/JqI9N2/x39Oei34
-rw-rw-r-- 1 elyograg elyograg 11468800 Oct 23 13:18 foo/g1lpxY/E81phT/GPPjeB/tW8tvq/iWbIBK/N.R7Fw/JqI9N2/x39Oei34

Comment 15 Shawn Heisey 2012-10-23 20:03:53 UTC
Created attachment 632367 [details]
listing of parent directory for file with odd permissions

Here is a directory listing showing the odd permissions and zero-byte sizes on some files.  By the time I had completed all the remove-brick passes, these errors had corrected themselves and no odd permissions remained.  Except for a few files, I never did run into the major unavailability problem that led me to file this bug.

Comment 16 Shawn Heisey 2012-10-23 20:05:24 UTC
Created attachment 632368 [details]
rebalance log which covers all four passes of remove-brick

Comment 17 Shawn Heisey 2012-10-23 20:08:10 UTC
It is important to know that during every single remove-brick pass, one (sometimes more than one) of the brick filesystems reached 100% capacity.

After the first remove-brick pass:

[root@testb1 glusterfs]# gluster volume remove-brick testvol testb1:/bricks/b4/testvol testb2:/bricks/b4/testvol status
                                    Node Rebalanced-files          size       scanned      failures         status
                               ---------      -----------   -----------   -----------   -----------   ------------
                               localhost              804         6.1GB          1248             9      completed
                                  testb4                0        0Bytes             0             0    not started
                                  testb3                0        0Bytes             0             0    not started
                                  testb2                0        0Bytes          1168             0      completed

After the second pass:

[root@testb1 glusterfs]# gluster volume remove-brick testvol testb1:/bricks/b4/testvol testb2:/bricks/b4/testvol status
                                    Node Rebalanced-files          size       scanned      failures         status
                               ---------      -----------   -----------   -----------   -----------   ------------
                               localhost              367         2.9GB          1337           467      completed
                                  testb4                0        0Bytes             0             0    not started
                                  testb3                0        0Bytes             0             0    not started
                                  testb2                0        0Bytes          1169             0      completed

After the third pass:

[root@testb1 glusterfs]# gluster volume remove-brick testvol testb1:/bricks/b4/testvol testb2:/bricks/b4/testvol status
                                    Node Rebalanced-files          size       scanned      failures         status
                               ---------      -----------   -----------   -----------   -----------   ------------
                               localhost              345         2.7GB          1413           122      completed
                                  testb4                0        0Bytes             0             0    not started
                                  testb3                0        0Bytes             0             0    not started
                                  testb2                0        0Bytes          1168             0      completed

After the fourth pass there were finally no failures, and no files left on brick 4:

[root@testb1 glusterfs]# gluster volume remove-brick testvol testb1:/bricks/b4/testvol testb2:/bricks/b4/testvol status
                                    Node Rebalanced-files          size       scanned      failures         status
                               ---------      -----------   -----------   -----------   -----------   ------------
                               localhost              122       940.3MB          1287             0      completed
                                  testb4                0        0Bytes             0             0    not started
                                  testb3                0        0Bytes             0             0    not started
                                  testb2                0        0Bytes          1166             0      completed

Final server-side df after all four passes and issuing remove-brick commit:

[root@testb2 ~]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root
                      49537840   3438192  43583272   8% /
tmpfs                  1914332         0   1914332   0% /dev/shm
/dev/md1               1032076    122412    857236  13% /boot
/dev/sda3              4184064   3146136   1037928  76% /bricks/b1
/dev/sdb3              4184064   3148236   1035828  76% /bricks/b2
/dev/sdc3              4184064   3283540    900524  79% /bricks/b3
/dev/sdd3              4184064     33880   4150184   1% /bricks/b4

Comment 18 Shawn Heisey 2012-10-23 20:14:53 UTC
Final note: it looks like the permission problems you mentioned did indeed occur, but eventually resolved themselves.  I cannot reproduce this bug, so the WORKSFORME close is probably the best option.

Comment 19 shishir gowda 2012-12-05 07:27:50 UTC
Thanks for the detailed report and follow-ups.
Feel free to reopen the bug if the issue is seen again.

