Description of problem:

Striped-replicate volumes on 3.4.0qa5 show extreme FINODELK latency: file opens take seconds to minutes, and a Ruby web application reading from the volume gets continuous timeout errors.

Version-Release number of selected component (if applicable):

CentOS 6.3, glusterfs 3.4.0qa5

How reproducible:

Type: Striped-Replicate
Volume ID: 23dbb8dd-5cb3-4c71-9702-7c16ee9a3b3b
Status: Started
Number of Bricks: 1 x 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.0.51.31:/gfs
Brick2: 10.0.51.32:/gfs
Brick3: 10.0.51.33:/gfs
Brick4: 10.0.51.34:/gfs
Options Reconfigured:
performance.quick-read: on
performance.io-thread-count: 32
performance.cache-max-file-size: 128MB
performance.cache-size: 256MB
performance.io-cache: on
cluster.stripe-block-size: 2MB
nfs.disable: on

Steps to Reproduce:
1. Create a file to edit
2. Opening it takes anywhere from a few seconds (about 3 s for new files) to minutes for GB-sized files
3. Sometimes when saving, you get "file has been modified" warnings

A. Run a web application running on Ruby against the volume
B. Reading the files continuously produces timeout errors

Actual results:

I started profiling and found one node with absurd latency figures. I stopped the node and the problem moved to another brick:

 %-latency   Avg-latency    Min-Latency   Max-Latency    No. of calls   Fop
 ---------   -----------    -----------   -----------    ------------   ----
     99.94   551292.41 us      10.00 us   1996709.00 us           361   FINODELK

Expected results:

In 3.3.0 the problem does not appear and the latency figures are normal.

Additional info:

Version 3.3.1 does not provide striped-replicate volumes (https://bugzilla.redhat.com/show_bug.cgi?id=861423)
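For reference, per-FOP latency tables like the one above come from gluster's built-in profiler. A sketch of the workflow (the volume name `myvol` is a placeholder; this requires a running cluster, so it is shown as a transcript only):

```
# start collecting per-brick FOP statistics for the volume
gluster volume profile myvol start

# reproduce the slow operation, then dump the cumulative statistics
gluster volume profile myvol info
```

The `info` output includes per-FOP %-latency, average/min/max latency, and call counts per brick, which is how a single misbehaving FOP such as FINODELK can be singled out.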
> Additional info:
> Version 3.3.1 does not provide striped-replicate volumes
> (https://bugzilla.redhat.com/show_bug.cgi?id=861423)

Samuel, I wanted to check whether you tried it. We couldn't reproduce the issue in-house with regular operations; the particular issue came up only when we ran a rebalance operation on stripe-replicate volumes. Regardless, we will take a look at this bug.
I reproduced a similar performance issue by creating a stripe-replicate volume and running ./configure in a glusterfs source tree. I see high max FINODELK latency, and the configure run is noticeably slow. I narrowed the volume options down, and the problem persisted with a simple 2x replicated (repx2) volume. The problem didn't occur with 3.3.0, so I ran a bisect and narrowed the problematic commit to:

    afr: make flush non-transactional

... which was submitted recently to help avoid VM guest hangs during self-heals.
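A bisect like the one described can be driven automatically with `git bisect run`. Here is a self-contained toy demo of the technique; the throwaway repo, the `regression` marker file, and the commit messages are all illustrative, not the actual glusterfs history (where the real "test" would be timing a ./configure run):

```shell
# Build a throwaway repo of 5 commits where commit 4 introduces a
# "regression" (modelled here as a marker file), then let git find it.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

for i in 1 2 3 4 5; do
    echo "change $i" > file.txt
    if [ "$i" = 4 ]; then touch regression; fi   # the bad change
    git add -A
    git commit -qm "commit $i"
done

# HEAD (commit 5) is known bad, HEAD~4 (commit 1) is known good; the
# test command exits nonzero exactly when the regression is present.
git bisect start HEAD HEAD~4 >/dev/null
git bisect run sh -c '! test -f regression' >/dev/null 2>&1

# refs/bisect/bad now points at the first bad commit
git log -1 --format=%s refs/bisect/bad   # prints: commit 4
```

With roughly log2(N) test runs, this converges even when each run is an expensive build-and-benchmark cycle.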
Out of curiosity, I ran the same test against:

http://review.gluster.org/#change,4280

... which addresses the subsequent issue of non-transactional flush not pushing out the delayed changelogs. With that patch the issue went away and performance was back to normal. I don't have the specific cause worked out here, but I see a significantly larger number of blocked locks when performance is poor, so I suspect they are the source of the delay. I subsequently ran the test with non-transactional flush and post-op-delay-secs set to 0, and that also addressed the issue (i.e., the delayed post-op is likely why the locks hang around).

Samuel,

It would be valuable to know whether setting the post-op-delay-secs afr option to 0 has any effect on your observed latency. It doesn't look like the gluster CLI allows setting this option, so you would have to edit the volfile to include:

    option post-op-delay-secs 0

... in all of the cluster/replicate stanzas.
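For illustration, a replicate stanza in the client volfile with the option added might look like the fragment below; the `testvol-*` volume and subvolume names are placeholders, so use the names from your own generated volfile:

```
volume testvol-replicate-0
    type cluster/replicate
    option post-op-delay-secs 0
    subvolumes testvol-client-0 testvol-client-1
end-volume
```

Note that hand-edited volfiles can be regenerated (and the edit lost) the next time glusterd rewrites them, so this is suitable for a quick experiment rather than a permanent configuration.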
Brian,

Thanks a lot for this information :-). I think this is the reason, but I still need to confirm it.

While I was testing http://review.gluster.org/#change,4280, I found that the changelog_wakeup in flush, executed by the fuse thread, races with the delayed_post_op of the previous write fop just before the flush. When flush does changelog_wakeup it sees that fd_ctx->delay_frame is NULL, so the fuse thread goes ahead and winds the flush fop. Right after that, the epoll thread executes the delayed_post_op for the previous write fop, placing the frame in fd_ctx->delay_frame. The unlock for this frame then happens only after the post-op delay, i.e. 1 second. Here are the code snippets that race:

Fuse thread:
afr_flush:
2554         afr_delayed_changelog_wake_up (this, fd); <<---

Epoll thread:
afr_writev_wind_cbk:
 137         local->transaction.unwind (frame, this);
 138         local->transaction.resume (frame, this); <<---

As soon as that unwind happens in the epoll thread, the flush call can come in on the fuse thread.

I am incorporating the changes suggested by Avati into my previous patch. I will make sure this also gets addressed before sending the patch. What I need to check is why the following code is not addressing this problem -- whether it is also racing:

afr_transaction:
1476         if (_does_transaction_conflict_with_delayed_post_op (frame) &&
1477             local->loc.inode) {
1478                 fd = fd_lookup (local->loc.inode, frame->root->pid);
1479                 if (fd) {
1480                         afr_delayed_changelog_wake_up (this, fd);
1481                         fd_unref (fd);
1482                 }
1483         }

This is the second time such a regression has happened :-(. I think we need to automate this case. One way to do that is to execute the same test case on both 3.3.0 and HEAD and make sure HEAD's performance is not drastically worse. Would love to hear your ideas :-).
We've got a lot of pressure to provide a storage solution as soon as possible, and we've decided to stay with the old stable version 3.3.0 rather than use a qa version. Because of that, I will unfortunately not be able to test "option post-op-delay-secs 0" or the afr_transaction patch. I'm sure these improvements will be included in 3.4, and we'll be able to use them once the stable version is released in the coming year.

Thanks a lot for the amazingly fast and efficient response.
Samuel, Thanks a lot for raising the issue. Pranith.
(In reply to comment #6)
> Brian,
> Thanks a lot for this information :-).

No problem! Thanks for the intuition on the problem; it makes sense. I plugged the trace xlator into my reproducing case and identified that repeated writes to config.log lead to the delays in this particular case.

...

> I am incorporating changes suggested by Avati to my prev patch. I will make
> sure this also gets addressed before sending the patch. What I need to check
> is why is the following code not addressing this problem? Is that also
> racing is something I need to check.
>
> afr_transaction:
> 1476         if (_does_transaction_conflict_with_delayed_post_op (frame) &&
> 1477             local->loc.inode) {
> 1478                 fd = fd_lookup (local->loc.inode, frame->root->pid);
> 1479                 if (fd) {
> 1480                         afr_delayed_changelog_wake_up (this, fd);
> 1481                         fd_unref (fd);
> 1482                 }
> 1483         }

I'm not sure how this is involved if flush is not a transaction, but perhaps that has more to do with your pending changes. Either way, it seems a race is possible as long as we do not explicitly synchronize flush and the post-op.

> This is the second time such a regression happened :-(. I think we need to
> automate this case. One way to do that is to execute same test case in both
> 3.3.0 and HEAD and make sure the performance is not very low compared to
> that. Would love to hear your ideas :-).

Performance will be hard to test in an automated fashion. Given the above, I was able to narrow my reproducer down into a test script (attached). It is not exactly an ironclad test, but it detects the issue for me and is better than nothing, I suppose. Feel free to take it as is, enhance it, or use it as inspiration for better ideas. ;)

The ./configure performance situation is an obvious reproducer to me. I would suggest we consider that as a non-automated test for replicated volumes.
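One way a script can approximate such a check is a simple wall-clock guard: run the workload and fail if it exceeds a budget derived from a known-good run. A minimal sketch; the `time_check` helper and the 300-second budget are made up for illustration and are not the attached test script:

```shell
# time_check LIMIT CMD...
# Fail (nonzero exit) if CMD does not complete within LIMIT seconds,
# or if CMD itself fails.
time_check() {
    limit=$1; shift
    start=$(date +%s)
    "$@" || return 1
    end=$(date +%s)
    [ $((end - start)) -le "$limit" ]
}

# e.g., in a regression test run from a replicated gluster mount:
#   time_check 300 ./configure || echo "FAIL: configure too slow"
```

A fixed threshold is fragile across hardware, which is presumably why comparing against a 3.3.0 baseline on the same machine, as suggested above, is more robust than an absolute limit.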
In fact, I think QE could already be running compile tests; we just might not have sufficiently called out that completing them in a timely fashion should be considered a requirement.
Created attachment 666105 [details] Potential regression test script
CHANGE: http://review.gluster.org/4428 (cluster/afr: Change order of unwind, resume for writev) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.org/4446 (cluster/afr: wakeup delayed post op on fsync) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.org/4447 (Tests: Test functionality of flush/fsync with afr) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.org/4497 (Tests: Explicitly set eager-lock on) merged in master by Anand Avati (avati)