Bug 799716 - Glusterd crashed while performing geo-replication start operation.
Summary: Glusterd crashed while performing geo-replication start operation.
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.2.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Venky Shankar
QA Contact:
URL:
Whiteboard:
Duplicates: 801692 (view as bug list)
Depends On:
Blocks: 852308
 
Reported: 2012-03-04 15:10 UTC by Vijaykumar Koppad
Modified: 2014-08-25 00:49 UTC (History)
3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 852308 (view as bug list)
Environment:
Last Closed: 2013-02-28 08:54:44 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Vijaykumar Koppad 2012-03-04 15:10:20 UTC
Description of problem:
I was working on the USC issue of geo-replication. While starting a geo-replication session, I got this crash.

Version-Release number of selected component (if applicable): release 3.2.6, master


How reproducible: Not sure

Additional info:

This is the backtrace (bt) of the core:
################################################################################

#0  0x00002aab1ceb99ea in _dict_lookup (this=0x31e2950, key=0x2aaaaab241a1 "gsync-count") at dict.c:209
#1  0x00002aab1ceb9b36 in dict_get_with_ref (this=0x31e2950, key=0x2aaaaab241a1 "gsync-count", data=0x7ffff93072d0) at dict.c:1299
#2  0x00002aab1cebbdc3 in dict_get_int32 (this=0x2aaaaab241a9, key=0xb <Address 0xb out of bounds>, val=0x7ffff930732c) at dict.c:1649
#3  0x00002aaaaaaea91b in glusterd_read_status_file (master=0x2f2e560 "vol10", slave=0x3038155 "gluster://10.1.11.86:vol10", dict=0x31e2950) at glusterd-op-sm.c:4124
#4  0x00002aaaaaaeacac in glusterd_get_gsync_status_mst_slv (volinfo=0x2f2e560, slave=0x3038155 "gluster://10.1.11.86:vol10", rsp_dict=0x31e2950)
    at glusterd-op-sm.c:4295
#5  0x00002aab1ceb8eb6 in dict_foreach (dict=0x2aaaaab241a9, fn=0x2aaaaaaeae00 <_get_status_mst_slv>, data=0x7ffff930c430) at dict.c:1198
#6  0x00002aaaaaae0859 in glusterd_get_gsync_status_mst (volinfo=0x2f2e560, rsp_dict=0x31e2950) at glusterd-op-sm.c:4310
#7  0x00002aaaaaaef4ca in glusterd_get_gsync_status (dict=<value optimized out>, op_errstr=0x7ffff930dd28, rsp_dict=<value optimized out>) at glusterd-op-sm.c:4329
#8  glusterd_op_gsync_set (dict=<value optimized out>, op_errstr=0x7ffff930dd28, rsp_dict=<value optimized out>) at glusterd-op-sm.c:4768
#9  0x00002aaaaaaf24a7 in glusterd_op_commit_perform (op=<value optimized out>, dict=0x3032d20, op_errstr=0x7ffff930dd28, rsp_dict=0x9a8cae9f) at glusterd-op-sm.c:7646
#10 0x00002aaaaaaf35a3 in glusterd_op_ac_commit_op (event=<value optimized out>, ctx=0x31e2470) at glusterd-op-sm.c:7441
#11 0x00002aaaaaae04cf in glusterd_op_sm () at glusterd-op-sm.c:8458
#12 0x00002aaaaaac78c7 in glusterd_handle_commit_op (req=<value optimized out>) at glusterd-handler.c:601
#13 0x00002aab1d1121e1 in rpcsvc_handle_rpc_call (svc=0x2eefb00, trans=<value optimized out>, msg=0x30411c0) at rpcsvc.c:480
#14 0x00002aab1d1123ec in rpcsvc_notify (trans=0x2ef32d0, mydata=0x2aaaaab241a9, event=<value optimized out>, data=0x30411c0) at rpcsvc.c:576
#15 0x00002aab1d113317 in rpc_transport_notify (this=0x2aaaaab241a9, event=RPC_TRANSPORT_ACCEPT, data=0x9a8cae9f) at rpc-transport.c:919
#16 0x00002aaaaadec5ef in socket_event_poll_in (this=0x2ef32d0) at socket.c:1647
#17 0x00002aaaaadec798 in socket_event_handler (fd=<value optimized out>, idx=1, data=0x2ef32d0, poll_in=1, poll_out=0, poll_err=0) at socket.c:1762
#18 0x00002aab1cee7631 in event_dispatch_epoll_handler (event_pool=0x2ee7370) at event.c:794
#19 event_dispatch_epoll (event_pool=0x2ee7370) at event.c:856
#20 0x000000000040566e in main (argc=1, argv=0x7ffff930e638) at glusterfsd.c:1509
(gdb) f 0
#0  0x00002aab1ceb99ea in _dict_lookup (this=0x31e2950, key=0x2aaaaab241a1 "gsync-count") at dict.c:209
209	        for (pair = this->members[hashval]; pair != NULL; pair = pair->hash_next) {
(gdb) f 1 
#1  0x00002aab1ceb9b36 in dict_get_with_ref (this=0x31e2950, key=0x2aaaaab241a1 "gsync-count", data=0x7ffff93072d0) at dict.c:1299
1299	                pair = _dict_lookup (this, key);
(gdb) f 2 
#2  0x00002aab1cebbdc3 in dict_get_int32 (this=0x2aaaaab241a9, key=0xb <Address 0xb out of bounds>, val=0x7ffff930732c) at dict.c:1649
1649	        ret = dict_get_with_ref (this, key, &data);
(gdb) f 3
#3  0x00002aaaaaaea91b in glusterd_read_status_file (master=0x2f2e560 "vol10", slave=0x3038155 "gluster://10.1.11.86:vol10", dict=0x31e2950) at glusterd-op-sm.c:4124
4124	        ret = dict_get_int32 (dict, "gsync-count", &gsync_count);
(gdb) 
##############################################################################
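Note on the backtrace: frames 0-3 show the crash inside _dict_lookup() while glusterd_read_status_file() fetches "gsync-count", and frame 2 already reports an out-of-bounds key pointer, which suggests a stale or corrupted dict handle rather than just a missing key. The sketch below is only an illustration of the defensive caller pattern (validate the handle and the return code before using the value); status_dict_get_int32 and read_status_count are hypothetical stand-ins, not the glusterd dict API and not the actual fix.

/* Illustrative sketch only -- hypothetical stand-ins, not glusterd source.
 * Shows the defensive pattern of validating the dictionary handle and the
 * lookup result before using "gsync-count", mirroring the call at
 * glusterd-op-sm.c:4124 in the backtrace above. */
#include <stdio.h>
#include <string.h>

/* Minimal stand-in for a dict entry; the real dict_t is hash-based. */
struct kv {
        const char *key;
        int         value;
};

struct status_dict {
        struct kv entries[8];
        int       count;
};

/* Hypothetical lookup helper: returns 0 on success, -1 on failure. */
static int
status_dict_get_int32 (struct status_dict *dict, const char *key, int *val)
{
        int i;

        if (!dict || !key || !val)      /* reject a NULL/stale handle early */
                return -1;

        for (i = 0; i < dict->count; i++) {
                if (strcmp (dict->entries[i].key, key) == 0) {
                        *val = dict->entries[i].value;
                        return 0;
                }
        }
        return -1;                      /* key absent: caller must not assume 0 */
}

/* Caller pattern analogous to glusterd_read_status_file(): treat a missing
 * "gsync-count" as zero sessions instead of dereferencing bad state. */
static int
read_status_count (struct status_dict *dict)
{
        int gsync_count = 0;

        if (status_dict_get_int32 (dict, "gsync-count", &gsync_count) != 0)
                gsync_count = 0;

        return gsync_count;
}

int
main (void)
{
        struct status_dict d = { .entries = { { "gsync-count", 1 } }, .count = 1 };

        printf ("gsync-count = %d\n", read_status_count (&d));
        printf ("gsync-count (NULL dict) = %d\n", read_status_count (NULL));
        return 0;
}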

Logs:
, peer (10.1.11.84:792)
[2012-03-04 13:27:44.374352] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected)
, peer (10.1.11.84:795)
[2012-03-04 13:27:44.374918] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected)
, peer (10.1.11.84:800)
[2012-03-04 13:27:44.375451] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected)
, peer (10.1.11.84:812)
[2012-03-04 13:27:44.375900] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected)
, peer (10.1.11.84:823)
[2012-03-04 13:27:44.376266] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected)
, peer (10.1.11.84:891)
[2012-03-04 13:27:44.376813] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.11.84:894)
[2012-03-04 13:27:44.377199] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.11.84:897)
[2012-03-04 13:27:44.377678] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.11.84:900)
[2012-03-04 13:27:46.792311] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.11.84:584)
[2012-03-04 13:27:46.804335] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.11.84:581)

peer (127.0.0.1:366)
[2012-03-04 13:36:24.911260] I [glusterd-handler.c:1729:glusterd_handle_gsync_set] 0-: master not found, while handlinggeo-replication options
[2012-03-04 13:36:24.911302] I [glusterd-handler.c:1736:glusterd_handle_gsync_set] 0-: slave not not found, whilehandling geo-replication options
[2012-03-04 13:36:24.911367] I [glusterd-utils.c:243:glusterd_lock] 0-glusterd: Cluster lock held by 97f387d3-9c0f-4a6f-8cdb-d26921070844
[2012-03-04 13:36:24.911386] I [glusterd-handler.c:420:glusterd_op_txn_begin] 0-glusterd: Acquired local lock
[2012-03-04 13:36:24.912240] I [glusterd-rpc-ops.c:758:glusterd3_1_cluster_lock_cbk] 0-glusterd: Received ACC from uuid: fb3f15ba-7c4d-4b79-96f9-92bf5bec535c
[2012-03-04 13:36:24.912426] I [glusterd-rpc-ops.c:758:glusterd3_1_cluster_lock_cbk] 0-glusterd: Received ACC from uuid: 13d67a6b-566e-42a0-89b1-463a4ebe89b3
[2012-03-04 13:36:24.912606] I [glusterd-op-sm.c:6737:glusterd_op_ac_send_stage_op] 0-glusterd: Sent op req to 2 peers
[2012-03-04 13:36:24.912976] I [glusterd-rpc-ops.c:1056:glusterd3_1_stage_op_cbk] 0-glusterd: Received ACC from uuid: fb3f15ba-7c4d-4b79-96f9-92bf5bec535c
[2012-03-04 13:36:24.913038] I [glusterd-rpc-ops.c:1056:glusterd3_1_stage_op_cbk] 0-glusterd: Received ACC from uuid: 13d67a6b-566e-42a0-89b1-463a4ebe89b3
[2012-03-04 13:36:27.291071] I [glusterd-op-sm.c:6854:glusterd_op_ac_send_commit_op] 0-glusterd: Sent op req to 2 peers
[2012-03-04 13:36:28.677611] I [glusterd-rpc-ops.c:1242:glusterd3_1_commit_op_cbk] 0-glusterd: Received ACC from uuid: 13d67a6b-566e-42a0-89b1-463a4ebe89b3
[2012-03-04 13:36:30.761185] I [glusterd-rpc-ops.c:1242:glusterd3_1_commit_op_cbk] 0-glusterd: Received ACC from uuid: fb3f15ba-7c4d-4b79-96f9-92bf5bec535c
[2012-03-04 13:36:30.762003] I [glusterd-rpc-ops.c:817:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: 13d67a6b-566e-42a0-89b1-463a4ebe89b3
[2012-03-04 13:36:30.762120] I [glusterd-rpc-ops.c:817:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: fb3f15ba-7c4d-4b79-96f9-92bf5bec535c
[2012-03-04 13:36:30.762159] I [glusterd-op-sm.c:7250:glusterd_op_txn_complete] 0-glusterd: Cleared local lock
[2012-03-04 13:36:30.765011] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:942)

Comment 1 Amar Tumballi 2012-03-12 09:46:28 UTC
Please update these bugs with respect to 3.3.0qa27; this needs to be worked on as per the target milestone set.

Comment 2 Venky Shankar 2012-03-15 16:49:19 UTC
Similar to bug #801692. For the 3.2qa* release.

Comment 3 Vijay Bellur 2012-04-04 07:54:50 UTC
*** Bug 801692 has been marked as a duplicate of this bug. ***

Comment 4 Venky Shankar 2012-10-19 05:19:55 UTC
Vijaykumar,

Can you try the steps again to see if this is reproducible? I can't hit this case.

Comment 5 Venky Shankar 2013-02-26 12:16:17 UTC
Vijaykumar,

Please test this out. If it's not reproducible, please close it.

