Description of problem:

Striped-replicate volumes on 3.4.0qa5 show extreme FINODELK latency: file opens take seconds to minutes, and a Ruby web application reading from the volume gets continuous timeout errors.

Version-Release number of selected component (if applicable):

CentOS 6.3, glusterfs 3.4.0qa5

How reproducible:

Type: Striped-Replicate
Volume ID: 23dbb8dd-5cb3-4c71-9702-7c16ee9a3b3b
Status: Started
Number of Bricks: 1 x 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.0.51.31:/gfs
Brick2: 10.0.51.32:/gfs
Brick3: 10.0.51.33:/gfs
Brick4: 10.0.51.34:/gfs
Options Reconfigured:
performance.quick-read: on
performance.io-thread-count: 32
performance.cache-max-file-size: 128MB
performance.cache-size: 256MB
performance.io-cache: on
cluster.stripe-block-size: 2MB
nfs.disable: on

Steps to Reproduce:
1. Create a file to edit
2. Opening it takes anywhere from a few seconds (about 3 s for new files) to minutes for GB-sized files
3. Sometimes when saving, you get "file has been modified" warnings

A. Run a web application running on Ruby against the volume
B. Reading the files continuously produces timeout errors

Actual results:

I started profiling and found one node with absurd latency figures. I stopped the node and the problem moved to another brick:

 %-latency   Avg-latency    Min-Latency   Max-Latency    No. of calls   Fop
 ---------   -----------    -----------   -----------    ------------   ----
     99.94   551292.41 us      10.00 us   1996709.00 us           361   FINODELK

Expected results:

In 3.3.0 the problem does not appear and the latency figures are normal.

Additional info:

Version 3.3.1 does not provide striped-replicate volumes (https://bugzilla.redhat.com/show_bug.cgi?id=861423)
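For reference, per-FOP latency tables like the one above come from gluster's built-in profiler. A sketch of the workflow (the volume name `myvol` is a placeholder; this requires a running cluster, so it is shown as a transcript only):

```
# start collecting per-brick FOP statistics for the volume
gluster volume profile myvol start

# reproduce the slow operation, then dump the cumulative statistics
gluster volume profile myvol info
```

The `info` output includes per-FOP %-latency, average/min/max latency, and call counts per brick, which is how a single misbehaving FOP such as FINODELK can be singled out.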
> Additional info:
> Version 3.3.1 does not provide striped-replicate volumes
> (https://bugzilla.redhat.com/show_bug.cgi?id=861423)

Samuel, I wanted to check whether you tried it. We couldn't reproduce the issue in-house with regular operations; the particular issue came up only when we ran a rebalance operation on stripe-replicate volumes. Regardless, we will take a look at this bug.
I reproduced a similar performance issue by creating a stripe-replicate volume and running ./configure in a glusterfs source tree. I see high max FINODELK latency, and the configure run is noticeably slow. I narrowed the volume options down, and the problem persisted with a simple 2x replicated (repx2) volume. The problem didn't occur with 3.3.0, so I ran a bisect and narrowed the problematic commit to:

    afr: make flush non-transactional

... which was submitted recently to help avoid VM guest hangs during self-heals.
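A bisect like the one described can be driven automatically with `git bisect run`. Here is a self-contained toy demo of the technique; the throwaway repo, the `regression` marker file, and the commit messages are all illustrative, not the actual glusterfs history (where the real "test" would be timing a ./configure run):

```shell
# Build a throwaway repo of 5 commits where commit 4 introduces a
# "regression" (modelled here as a marker file), then let git find it.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

for i in 1 2 3 4 5; do
    echo "change $i" > file.txt
    if [ "$i" = 4 ]; then touch regression; fi   # the bad change
    git add -A
    git commit -qm "commit $i"
done

# HEAD (commit 5) is known bad, HEAD~4 (commit 1) is known good; the
# test command exits nonzero exactly when the regression is present.
git bisect start HEAD HEAD~4 >/dev/null
git bisect run sh -c '! test -f regression' >/dev/null 2>&1

# refs/bisect/bad now points at the first bad commit
git log -1 --format=%s refs/bisect/bad   # prints: commit 4
```

With roughly log2(N) test runs, this converges even when each run is an expensive build-and-benchmark cycle.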
Out of curiosity, I ran the same test against:

http://review.gluster.org/#change,4280

... which addresses the subsequent issue of non-transactional flush not pushing out the delayed changelogs. With that patch the issue went away and performance was back to normal. I don't have the specific cause worked out here, but I see a significantly larger number of blocked locks when performance is poor, so I suspect they are the source of the delay. I subsequently ran the test with non-transactional flush and post-op-delay-secs set to 0, and that also addressed the issue (i.e., the delayed post-op is likely why the locks hang around).

Samuel,

It would be valuable to know whether setting the post-op-delay-secs afr option to 0 has any effect on your observed latency. It doesn't look like the gluster CLI allows setting this option, so you would have to edit the volfile to include:

    option post-op-delay-secs 0

... in all of the cluster/replicate stanzas.
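For illustration, a replicate stanza in the client volfile with the option added might look like the fragment below; the `testvol-*` volume and subvolume names are placeholders, so use the names from your own generated volfile:

```
volume testvol-replicate-0
    type cluster/replicate
    option post-op-delay-secs 0
    subvolumes testvol-client-0 testvol-client-1
end-volume
```

Note that hand-edited volfiles can be regenerated (and the edit lost) the next time glusterd rewrites them, so this is suitable for a quick experiment rather than a permanent configuration.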
Brian,

Thanks a lot for this information :-). I think this is the reason, but I still need to confirm it.

While I was testing http://review.gluster.org/#change,4280, I found that the changelog_wakeup in flush, executed by the fuse thread, races with the delayed_post_op of the previous write fop just before the flush. When flush does changelog_wakeup it sees that fd_ctx->delay_frame is NULL, so the fuse thread goes ahead and winds the flush fop. Right after that, the epoll thread executes the delayed_post_op for the previous write fop, placing the frame in fd_ctx->delay_frame. The unlock for this frame then happens only after the post-op delay, i.e. 1 second. Here are the code snippets that race:

Fuse thread:
afr_flush:
2554         afr_delayed_changelog_wake_up (this, fd); <<---

Epoll thread:
afr_writev_wind_cbk:
 137         local->transaction.unwind (frame, this);
 138         local->transaction.resume (frame, this); <<---

As soon as that unwind happens in the epoll thread, the flush call can come in on the fuse thread.

I am incorporating the changes suggested by Avati into my previous patch. I will make sure this also gets addressed before sending the patch. What I need to check is why the following code is not addressing this problem -- whether it is also racing:

afr_transaction:
1476         if (_does_transaction_conflict_with_delayed_post_op (frame) &&
1477             local->loc.inode) {
1478                 fd = fd_lookup (local->loc.inode, frame->root->pid);
1479                 if (fd) {
1480                         afr_delayed_changelog_wake_up (this, fd);
1481                         fd_unref (fd);
1482                 }
1483         }

This is the second time such a regression has happened :-(. I think we need to automate this case. One way to do that is to execute the same test case on both 3.3.0 and HEAD and make sure HEAD's performance is not drastically worse. Would love to hear your ideas :-).
We've got a lot of pressure to provide a storage solution as soon as possible, and we've decided to stay with the old stable version 3.3.0 rather than use a qa version. Because of that, I will unfortunately not be able to test "option post-op-delay-secs 0" or the afr_transaction patch. I'm sure these improvements will be included in 3.4, and we'll be able to use them once the stable version is released in the coming year.

Thanks a lot for the amazingly fast and efficient response.
Samuel, Thanks a lot for raising the issue. Pranith.
(In reply to comment #6)
> Brian,
> Thanks a lot for this information :-).

No problem! Thanks for the intuition on the problem; it makes sense. I plugged the trace xlator into my reproducing case and identified that repeated writes to config.log lead to the delays in this particular case.

...

> I am incorporating changes suggested by Avati to my prev patch. I will make
> sure this also gets addressed before sending the patch. What I need to check
> is why is the following code not addressing this problem? Is that also
> racing is something I need to check.
>
> afr_transaction:
> 1476         if (_does_transaction_conflict_with_delayed_post_op (frame) &&
> 1477             local->loc.inode) {
> 1478                 fd = fd_lookup (local->loc.inode, frame->root->pid);
> 1479                 if (fd) {
> 1480                         afr_delayed_changelog_wake_up (this, fd);
> 1481                         fd_unref (fd);
> 1482                 }
> 1483         }

I'm not sure how this is involved if flush is not a transaction, but perhaps that has more to do with your pending changes. Either way, it seems a race is possible as long as we do not explicitly synchronize flush and the post-op.

> This is the second time such a regression happened :-(. I think we need to
> automate this case. One way to do that is to execute same test case in both
> 3.3.0 and HEAD and make sure the performance is not very low compared to
> that. Would love to hear your ideas :-).

Performance will be hard to test in an automated fashion. Given the above, I was able to narrow my reproducer down into a test script (attached). It is not exactly an ironclad test, but it detects the issue for me and is better than nothing, I suppose. Feel free to take it as is, enhance it, or use it as inspiration for better ideas. ;)

The ./configure performance situation is an obvious reproducer to me. I would suggest we consider that as a non-automated test for replicated volumes.
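One way a script can approximate such a check is a simple wall-clock guard: run the workload and fail if it exceeds a budget derived from a known-good run. A minimal sketch; the `time_check` helper and the 300-second budget are made up for illustration and are not the attached test script:

```shell
# time_check LIMIT CMD...
# Fail (nonzero exit) if CMD does not complete within LIMIT seconds,
# or if CMD itself fails.
time_check() {
    limit=$1; shift
    start=$(date +%s)
    "$@" || return 1
    end=$(date +%s)
    [ $((end - start)) -le "$limit" ]
}

# e.g., in a regression test run from a replicated gluster mount:
#   time_check 300 ./configure || echo "FAIL: configure too slow"
```

A fixed threshold is fragile across hardware, which is presumably why comparing against a 3.3.0 baseline on the same machine, as suggested above, is more robust than an absolute limit.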
In fact, I think QE could already be running compile tests; we just might not have sufficiently called out that completing them in a timely fashion should be considered a requirement.
Created attachment 666105 [details] Potential regression test script
CHANGE: http://review.gluster.org/4428 (cluster/afr: Change order of unwind, resume for writev) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.org/4446 (cluster/afr: wakeup delayed post op on fsync) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.org/4447 (Tests: Test functionality of flush/fsync with afr) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.org/4497 (Tests: Explicitly set eager-lock on) merged in master by Anand Avati (avati)