pvestatd : WARNING: ipcc_send_rec failed: Connection refused

bread-baker

Node cannot rejoin cluster after network outage

See my replies, as the first post was not the first issue. I looked earlier in the logs, so I deleted a lot from this post.
 
OK, this is from earlier, before we moved the node to the rack.

The network was down for a short time, and here is what seems to be the first unusual activity in syslog:
Code:
Dec 20 12:24:37 fbc208 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 20 12:24:37 fbc208 kernel: vmbr0: port 1(eth0) entering forwarding state
Dec 20 12:24:38 fbc208 corosync[2003]:   [CLM   ] CLM CONFIGURATION CHANGE
Dec 20 12:24:38 fbc208 corosync[2003]:   [CLM   ] New Configuration:
Dec 20 12:24:38 fbc208 corosync[2003]:   [CLM   ] #011r(0) ip(10.100.100.208) 
Dec 20 12:24:38 fbc208 corosync[2003]:   [CLM   ] Members Left:
Dec 20 12:24:38 fbc208 corosync[2003]:   [CLM   ] Members Joined:
Dec 20 12:24:38 fbc208 corosync[2003]:   [CLM   ] CLM CONFIGURATION CHANGE
Dec 20 12:24:38 fbc208 corosync[2003]:   [CLM   ] New Configuration:
Dec 20 12:24:38 fbc208 corosync[2003]:   [CLM   ] #011r(0) ip(10.100.100.1) 
Dec 20 12:24:38 fbc208 corosync[2003]:   [CLM   ] #011r(0) ip(10.100.100.206) 
Dec 20 12:24:38 fbc208 corosync[2003]:   [CLM   ] #011r(0) ip(10.100.100.207) 
Dec 20 12:24:38 fbc208 corosync[2003]:   [CLM   ] #011r(0) ip(10.100.100.208) 
Dec 20 12:24:38 fbc208 corosync[2003]:   [CLM   ] Members Left:
Dec 20 12:24:38 fbc208 corosync[2003]:   [CLM   ] Members Joined:
Dec 20 12:24:38 fbc208 corosync[2003]:   [CLM   ] #011r(0) ip(10.100.100.1) 
Dec 20 12:24:38 fbc208 corosync[2003]:   [CLM   ] #011r(0) ip(10.100.100.206) 
Dec 20 12:24:38 fbc208 corosync[2003]:   [CLM   ] #011r(0) ip(10.100.100.207) 
Dec 20 12:24:38 fbc208 corosync[2003]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 20 12:24:38 fbc208 corosync[2003]:   [QUORUM] Members[2]: 2 4
Dec 20 12:24:38 fbc208 pmxcfs[1816]: [status] node has quorum
Dec 20 12:24:38 fbc208 corosync[2003]:   [QUORUM] Members[2]: 2 4
Dec 20 12:24:38 fbc208 corosync[2003]:   [CMAN  ] quorum regained, resuming activity
Dec 20 12:24:38 fbc208 corosync[2003]:   [QUORUM] This node is within the primary component and will provide service.
Dec 20 12:24:38 fbc208 corosync[2003]:   [QUORUM] Members[3]: 1 2 4
Dec 20 12:24:38 fbc208 corosync[2003]:   [QUORUM] Members[3]: 1 2 4
Dec 20 12:24:38 fbc208 corosync[2003]:   [QUORUM] Members[4]: 1 2 3 4
Dec 20 12:24:38 fbc208 corosync[2003]:   [QUORUM] Members[4]: 1 2 3 4
Dec 20 12:24:38 fbc208 pmxcfs[1816]: [dcdb] members: 1/1850, 4/1816
Dec 20 12:24:38 fbc208 pmxcfs[1816]: [dcdb] starting data syncronisation
Dec 20 12:24:38 fbc208 pmxcfs[1816]: [status] members: 1/1850, 4/1816
Dec 20 12:24:38 fbc208 pmxcfs[1816]: [status] starting data syncronisation
Dec 20 12:24:38 fbc208 pmxcfs[1816]: [dcdb] members: 1/1850, 3/1640, 4/1816
Dec 20 12:24:38 fbc208 pmxcfs[1816]: [status] members: 1/1850, 3/1640, 4/1816
Dec 20 12:24:38 fbc208 pmxcfs[1816]: [dcdb] members: 1/1850, 2/1288, 3/1640, 4/1816
Dec 20 12:24:38 fbc208 corosync[2003]:   [CPG   ] chosen downlist: sender r(0) ip(10.100.100.207) ; members(old:3 left:0)
Dec 20 12:24:38 fbc208 pmxcfs[1816]: [status] members: 1/1850, 2/1288, 3/1640, 4/1816
Dec 20 12:24:38 fbc208 corosync[2003]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec 20 12:24:38 fbc208 corosync[2003]: cman killed by node 2 because we were killed by cman_tool or other application
Dec 20 12:24:38 fbc208 pmxcfs[1816]: [quorum] quorum_dispatch failed: 2 (quorum.c:174:service_quorum_dispatch)
Dec 20 12:24:38 fbc208 pmxcfs[1816]: epoll_ctl(del): Bad file descriptor (9) (loop_poll.c:652:)
Dec 20 12:24:38 fbc208 pmxcfs[1816]: [confdb] confdb_dispatch failed: 2 (confdb.c:426:service_confdb_dispatch)
Dec 20 12:24:38 fbc208 dlm_controld[2254]: daemon cpg_dispatch error 2
Dec 20 12:24:38 fbc208 dlm_controld[2254]: cluster is down, exiting
Dec 20 12:24:38 fbc208 dlm_controld[2254]: daemon cpg_dispatch error 2
Dec 20 12:24:38 fbc208 fenced[2235]: receive_start 2:4 add node with started_count 3
Dec 20 12:24:38 fbc208 fenced[2235]: cluster is down, exiting
Dec 20 12:24:38 fbc208 fenced[2235]: daemon cpg_dispatch error 2
Dec 20 12:24:38 fbc208 fenced[2235]: cpg_dispatch error 2
Dec 20 12:24:40 fbc208 pmxcfs[1816]: epoll_ctl(del): Bad file descriptor (9) (loop_poll.c:652:)
Dec 20 12:24:40 fbc208 pmxcfs[1816]: [dcdb] cpg_dispatch failed: 2 (dfsm.c:1355:dfsm_dispatch)
Dec 20 12:24:40 fbc208 kernel: dlm: closing connection to node 3
Dec 20 12:24:40 fbc208 kernel: dlm: closing connection to node 2
Dec 20 12:24:40 fbc208 kernel: dlm: closing connection to node 1
Dec 20 12:24:40 fbc208 kernel: dlm: closing connection to node 4
Dec 20 12:24:42 fbc208 pmxcfs[1816]: [dcdb] cpg_leave failed: 2 (dfsm.c:1466:dfsm_leave)
Dec 20 12:24:43 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 2 (dfsm.c:225:dfsm_send_message_full)
Dec 20 12:24:43 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 2 (dfsm.c:329:dfsm_send_message_sync)
Dec 20 12:24:44 fbc208 pmxcfs[1816]: epoll_ctl(del): Bad file descriptor (9) (loop_poll.c:652:)
Dec 20 12:24:44 fbc208 pmxcfs[1816]: [status] cpg_dispatch failed: 2 (dfsm.c:1355:dfsm_dispatch)
Dec 20 12:24:45 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 2 (dfsm.c:225:dfsm_send_message_full)
Dec 20 12:24:45 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 2 (dfsm.c:329:dfsm_send_message_sync)
Dec 20 12:24:47 fbc208 pmxcfs[1816]: [status] cpg_leave failed: 2 (dfsm.c:1466:dfsm_leave)
Dec 20 12:24:49 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 2 (dfsm.c:225:dfsm_send_message_full)
Dec 20 12:24:49 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 2 (dfsm.c:329:dfsm_send_message_sync)
Dec 20 12:24:51 fbc208 pmxcfs[1816]: epoll_ctl(del): Bad file descriptor (9) (loop_poll.c:652:)
Dec 20 12:24:51 fbc208 pmxcfs[1816]: [quorum] quorum_initialize failed: 6 (quorum.c:110:service_quorum_initialize)
Dec 20 12:24:51 fbc208 pmxcfs[1816]: [quorum] can't initialize service (loop.c:266:service_start_job)
Dec 20 12:24:51 fbc208 pmxcfs[1816]: [confdb] confdb_initialize failed: 6 (confdb.c:354:service_confdb_initialize)
Dec 20 12:24:51 fbc208 pmxcfs[1816]: [confdb] can't initialize service (loop.c:266:service_start_job)
Dec 20 12:24:51 fbc208 pmxcfs[1816]: [dcdb] start cluster connection
Dec 20 12:24:51 fbc208 pmxcfs[1816]: [dcdb] cpg_initialize failed: 6 (dfsm.c:1381:dfsm_initialize)
Dec 20 12:24:51 fbc208 pmxcfs[1816]: [dcdb] can't initialize service (loop.c:266:service_start_job)
Dec 20 12:24:53 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 2 (dfsm.c:225:dfsm_send_message_full)
Dec 20 12:24:53 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 2 (dfsm.c:329:dfsm_send_message_sync)
Dec 20 12:24:53 fbc208 pmxcfs[1816]: [status] start cluster connection
Dec 20 12:24:53 fbc208 pmxcfs[1816]: [status] cpg_initialize failed: 6 (dfsm.c:1381:dfsm_initialize)
Dec 20 12:24:53 fbc208 pmxcfs[1816]: [status] can't initialize service (loop.c:266:service_start_job)
Dec 20 12:24:53 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 9 (dfsm.c:225:dfsm_send_message_full)
Dec 20 12:24:53 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 9 (dfsm.c:329:dfsm_send_message_sync)
Dec 20 12:24:53 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 9 (dfsm.c:225:dfsm_send_message_full)
Dec 20 12:24:53 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 9 (dfsm.c:329:dfsm_send_message_sync)
...
Dec 20 12:24:53 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 9 (dfsm.c:329:dfsm_send_message_sync)
Dec 20 12:24:53 fbc208 pmxcfs[1816]: [status] RRDC update error /var/lib/rrdcached/db/pve2-storage/fbc208/fbc208-storage: -1
Dec 20 12:24:53 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 9 (dfsm.c:225:dfsm_send_message_full)
Dec 20 12:24:53 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 9 (dfsm.c:329:dfsm_send_message_sync)
Dec 20 12:24:53 fbc208 pmxcfs[1816]: [status] RRDC update error /var/lib/rrdcached/db/pve2-storage/fbc208/local: -1
Dec 20 12:24:53 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 9 (dfsm.c:225:dfsm_send_message_full)
Dec 20 12:24:53 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 9 (dfsm.c:329:dfsm_send_message_sync)
Dec 20 12:24:57 fbc208 pmxcfs[1816]: [quorum] quorum_initialize failed: 6 (quorum.c:110:service_quorum_initialize)
Dec 20 12:24:57 fbc208 pmxcfs[1816]: [confdb] confdb_initialize failed: 6 (confdb.c:354:service_confdb_initialize)
Dec 20 12:24:57 fbc208 pmxcfs[1816]: [dcdb] cpg_initialize failed: 6 (dfsm.c:1381:dfsm_initialize)
Dec 20 12:24:57 fbc208 pmxcfs[1816]: [status] cpg_initialize failed: 6 (dfsm.c:1381:dfsm_initialize)
Dec 20 12:25:03 fbc208 pmxcfs[1816]: [quorum] quorum_initialize failed: 6 (quorum.c:110:service_quorum_initialize)

Dec 20 12:25:03 fbc208 pmxcfs[1816]: [confdb] confdb_initialize failed: 6 (confdb.c:354:service_confdb_initialize)
Dec 20 12:25:03 fbc208 pmxcfs[1816]: [dcdb] cpg_initialize failed: 6 (dfsm.c:1381:dfsm_initialize)
Dec 20 12:25:03 fbc208 pmxcfs[1816]: [status] cpg_initialize failed: 6 (dfsm.c:1381:dfsm_initialize)
Dec 20 12:25:03 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 9 (dfsm.c:225:dfsm_send_message_full)
Dec 20 12:25:03 fbc208 pmxcfs[1816]: [status] cpg_send_message failed: 9 (dfsm.c:329:dfsm_send_message_sync)

There is more in the log; hopefully I've posted enough for someone to suggest how to get node fbc208 back into the cluster.
 
It looks like the link was down from:
Dec 20 11:27:53 fbc208 kernel: e1000e: eth0 NIC Link is Down
Dec 20 11:27:53 fbc208 kernel: vmbr0: port 1(eth0) entering disabled state

to

Dec 20 12:24:37 fbc208 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 20 12:24:37 fbc208 kernel: vmbr0: port 1(eth0) entering forwarding state
Dec 20 12:24:38 fbc208 corosync[2003]: [CLM ] CLM CONFIGURATION CHANGE

So the issue started before the server was moved; the cause was the node being disconnected from the network for about an hour.
 
Now this is being written to /var/log/syslog nonstop:
Code:
Dec 20 21:01:02 fbc208 pvestatd[2033]: WARNING: ipcc_send_rec failed: Connection refused
Dec 20 21:01:02 fbc208 pvestatd[2033]: WARNING: ipcc_send_rec failed: Connection refused
Dec 20 21:01:02 fbc208 pvestatd[2033]: WARNING: ipcc_send_rec failed: Connection refused
Dec 20 21:01:02 fbc208 pvestatd[2033]: WARNING: ipcc_send_rec failed: Connection refused
Dec 20 21:01:02 fbc208 pvestatd[2033]: WARNING: ipcc_send_rec failed: Connection refused
Dec 20 21:01:02 fbc208 pvestatd[2033]: WARNING: ipcc_send_rec failed: Connection refused
Dec 20 21:01:12 fbc208 pvestatd[2033]: WARNING: ipcc_send_rec failed: Connection refused
Dec 20 21:01:12 fbc208 pvestatd[2033]: WARNING: ipcc_send_rec failed: Connection refused
Dec 20 21:01:12 fbc208 pvestatd[2033]: WARNING: ipcc_send_rec failed: Connection refused
Dec 20 21:01:12 fbc208 pvestatd[2033]: WARNING: ipcc_send_rec failed: Connection refused
Dec 20 21:01:12 fbc208 pvestatd[2033]: WARNING: ipcc_send_rec failed: Connection refused
Dec 20 21:01:12 fbc208 pvestatd[2033]: WARNING: ipcc_send_rec failed: Connection refused
 
Here is some more info, from fbc207:
Code:
pvecm status
Version: 6.2.0
Config Version: 4
Cluster Name: fbc10
Cluster Id: 2958
Cluster Member: Yes
Cluster Generation: 84
Membership state: Cluster-Member
Nodes: 4
Expected votes: 4
Total votes: 4
Node votes: 1
Quorum: 3  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: fbc207
Node ID: 1
Multicast addresses: 239.192.11.153 
Node addresses: 10.100.100.207
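(With 4 expected votes and 1 vote per node, cman computes quorum as floor(4/2) + 1 = 3, which is why the remaining three nodes stayed quorate while fbc208 was out.)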

Code:
root@fbc207 ~ # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M      4   2011-12-17 17:11:04  fbc207
   2   M     28   2011-12-17 19:28:21  fbc206
   3   M     12   2011-12-17 18:01:24  fbc1
   4   M     84   2011-12-20 14:56:46  fbc208

Note that the join date for fbc208 is strange, as I set fbc208 up on 12/17.

But from fbc208:
Code:
root@fbc208 ~ # pvecm nodes
pve configuration filesystem not mounted
root@fbc208 ~ # pvecm status
pve configuration filesystem not mounted
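For reference, when a node shows "pve configuration filesystem not mounted", the usual first attempt is to restart the cluster stack on that node. A minimal sketch, assuming the stock PVE 2.0 init scripts (untested against this exact state):
Code:
# on fbc208: restart the cluster manager and the pve cluster filesystem
/etc/init.d/cman restart
/etc/init.d/pve-cluster restart
# then verify membership and that /etc/pve is mounted again
pvecm status
mount | grep /etc/pve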
 
Here is info from fbc207's syslog:

Code:
Dec 20 09:44:16 fbc207 corosync[2033]:   [TOTEM ] A processor failed, forming new configuration.
Dec 20 09:44:18 fbc207 corosync[2033]:   [CLM   ] CLM CONFIGURATION CHANGE
Dec 20 09:44:18 fbc207 corosync[2033]:   [CLM   ] New Configuration:
Dec 20 09:44:18 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.1) 
Dec 20 09:44:18 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.206) 
Dec 20 09:44:18 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.207) 
Dec 20 09:44:18 fbc207 corosync[2033]:   [CLM   ] Members Left:
Dec 20 09:44:18 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.208) 
Dec 20 09:44:18 fbc207 corosync[2033]:   [CLM   ] Members Joined:
Dec 20 09:44:18 fbc207 corosync[2033]:   [QUORUM] Members[3]: 1 2 3
Dec 20 09:44:18 fbc207 corosync[2033]:   [CLM   ] CLM CONFIGURATION CHANGE
Dec 20 09:44:18 fbc207 corosync[2033]:   [CLM   ] New Configuration:
Dec 20 09:44:18 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.1) 
Dec 20 09:44:18 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.206) 
Dec 20 09:44:18 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.207) 
Dec 20 09:44:18 fbc207 corosync[2033]:   [CLM   ] Members Left:
Dec 20 09:44:18 fbc207 corosync[2033]:   [CLM   ] Members Joined:
Dec 20 09:44:18 fbc207 corosync[2033]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 20 09:44:18 fbc207 corosync[2033]:   [QUORUM] Members[3]: 1 2 3
Dec 20 09:44:18 fbc207 corosync[2033]:   [CPG   ] chosen downlist: sender r(0) ip(10.100.100.207) ; members(old:4 left:1)
Dec 20 09:44:18 fbc207 pmxcfs[1850]: [dcdb] members: 1/1850, 2/1288, 3/1640
Dec 20 09:44:18 fbc207 pmxcfs[1850]: [dcdb] starting data syncronisation
Dec 20 09:44:18 fbc207 corosync[2033]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec 20 09:44:18 fbc207 kernel: dlm: closing connection to node 4
Dec 20 09:44:18 fbc207 fenced[2087]: fencing node fbc208
Dec 20 09:44:18 fbc207 fenced[2087]: fence fbc208 dev 0.0 agent none result: error no method
Dec 20 09:44:18 fbc207 fenced[2087]: fence fbc208 failed
Dec 20 09:44:18 fbc207 pmxcfs[1850]: [dcdb] cpg_send_message retried 1 times
Dec 20 09:44:18 fbc207 pmxcfs[1850]: [dcdb] received sync request (epoch 1/1850/00000009)
Dec 20 09:44:18 fbc207 pmxcfs[1850]: [status] members: 1/1850, 2/1288, 3/1640
Dec 20 09:44:18 fbc207 pmxcfs[1850]: [status] starting data syncronisation
Dec 20 09:44:18 fbc207 pmxcfs[1850]: [dcdb] received all states
Dec 20 09:44:18 fbc207 pmxcfs[1850]: [dcdb] leader is 1/1850
Dec 20 09:44:18 fbc207 pmxcfs[1850]: [dcdb] synced members: 1/1850, 2/1288, 3/1640
Dec 20 09:44:18 fbc207 pmxcfs[1850]: [dcdb] start sending inode updates
Dec 20 09:44:18 fbc207 pmxcfs[1850]: [dcdb] sent all (0) updates
Dec 20 09:44:18 fbc207 pmxcfs[1850]: [dcdb] all data is up to date
Dec 20 09:44:18 fbc207 pmxcfs[1850]: [status] received sync request (epoch 1/1850/00000009)
Dec 20 09:44:18 fbc207 pmxcfs[1850]: [status] received all states
Dec 20 09:44:18 fbc207 pmxcfs[1850]: [status] all data is up to date
Dec 20 09:44:21 fbc207 fenced[2087]: fencing node fbc208
Dec 20 09:44:21 fbc207 fenced[2087]: fence fbc208 dev 0.0 agent none result: error no method
Dec 20 09:44:21 fbc207 fenced[2087]: fence fbc208 failed
Dec 20 09:44:24 fbc207 fenced[2087]: fencing node fbc208
Dec 20 09:44:24 fbc207 fenced[2087]: fence fbc208 dev 0.0 agent none result: error no method
Dec 20 09:44:24 fbc207 fenced[2087]: fence fbc208 failed

Dec 20 10:14:29 fbc207 fenced[2087]: fencing node fbc208 still retrying

Dec 20 10:44:40 fbc207 fenced[2087]: fencing node fbc208 still retrying

Dec 20 12:15:12 fbc207 fenced[2087]: fencing node fbc208 still retrying

Dec 20 12:24:38 fbc207 corosync[2033]:   [CLM   ] CLM CONFIGURATION CHANGE
Dec 20 12:24:38 fbc207 corosync[2033]:   [CLM   ] New Configuration:
Dec 20 12:24:38 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.1) 
Dec 20 12:24:38 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.206) 
Dec 20 12:24:38 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.207) 
Dec 20 12:24:38 fbc207 corosync[2033]:   [CLM   ] Members Left:
Dec 20 12:24:38 fbc207 corosync[2033]:   [CLM   ] Members Joined:
Dec 20 12:24:38 fbc207 corosync[2033]:   [CLM   ] CLM CONFIGURATION CHANGE
Dec 20 12:24:38 fbc207 corosync[2033]:   [CLM   ] New Configuration:
Dec 20 12:24:38 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.1) 
Dec 20 12:24:38 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.206) 
Dec 20 12:24:38 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.207) 
Dec 20 12:24:38 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.208) 
Dec 20 12:24:38 fbc207 corosync[2033]:   [CLM   ] Members Left:
Dec 20 12:24:38 fbc207 corosync[2033]:   [CLM   ] Members Joined:
Dec 20 12:24:38 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.208) 

Dec 20 12:24:38 fbc207 corosync[2033]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 20 12:24:38 fbc207 corosync[2033]:   [QUORUM] Members[4]: 1 2 3 4
Dec 20 12:24:38 fbc207 corosync[2033]:   [QUORUM] Members[4]: 1 2 3 4
Dec 20 12:24:38 fbc207 pmxcfs[1850]: [dcdb] members: 1/1850, 2/1288, 3/1640, 4/1816
Dec 20 12:24:38 fbc207 pmxcfs[1850]: [dcdb] starting data syncronisation
Dec 20 12:24:38 fbc207 corosync[2033]:   [CPG   ] chosen downlist: sender r(0) ip(10.100.100.207) ; members(old:3 left:0)
Dec 20 12:24:38 fbc207 corosync[2033]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec 20 12:24:38 fbc207 pmxcfs[1850]: [dcdb] cpg_send_message retried 1 times
Dec 20 12:24:38 fbc207 pmxcfs[1850]: [status] members: 1/1850, 2/1288, 3/1640, 4/1816
Dec 20 12:24:38 fbc207 pmxcfs[1850]: [status] starting data syncronisation
Dec 20 12:24:39 fbc207 fenced[2087]: telling cman to remove nodeid 4 from cluster
Dec 20 12:24:48 fbc207 corosync[2033]:   [TOTEM ] A processor failed, forming new configuration.
Dec 20 12:24:50 fbc207 corosync[2033]:   [CLM   ] CLM CONFIGURATION CHANGE
Dec 20 12:24:50 fbc207 corosync[2033]:   [CLM   ] New Configuration:
Dec 20 12:24:50 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.1) 
Dec 20 12:24:50 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.206) 
Dec 20 12:24:50 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.207) 
Dec 20 12:24:50 fbc207 corosync[2033]:   [CLM   ] Members Left:
Dec 20 12:24:50 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.208) 
Dec 20 12:24:50 fbc207 corosync[2033]:   [CLM   ] Members Joined:
Dec 20 12:24:50 fbc207 corosync[2033]:   [QUORUM] Members[3]: 1 2 3
Dec 20 12:24:50 fbc207 corosync[2033]:   [CLM   ] CLM CONFIGURATION CHANGE
Dec 20 12:24:50 fbc207 corosync[2033]:   [CLM   ] New Configuration:
Dec 20 12:24:50 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.1) 
Dec 20 12:24:50 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.206) 
Dec 20 12:24:50 fbc207 corosync[2033]:   [CLM   ] #011r(0) ip(10.100.100.207) 
Dec 20 12:24:50 fbc207 corosync[2033]:   [CLM   ] Members Left:
Dec 20 12:24:50 fbc207 corosync[2033]:   [CLM   ] Members Joined:
Dec 20 12:24:50 fbc207 corosync[2033]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 20 12:24:50 fbc207 kernel: dlm: closing connection to node 4

Dec 20 12:24:50 fbc207 corosync[2033]:   [CPG   ] chosen downlist: sender r(0) ip(10.100.100.207) ; members(old:4 left:1)
Dec 20 12:24:50 fbc207 corosync[2033]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec 20 12:24:51 fbc207 pmxcfs[1850]: [dcdb] received sync request (epoch 1/1850/0000000A)
Dec 20 12:24:51 fbc207 pmxcfs[1850]: [dcdb] members: 1/1850, 2/1288, 3/1640
Dec 20 12:24:51 fbc207 pmxcfs[1850]: [status] received sync request (epoch 1/1850/0000000A)
Dec 20 12:24:51 fbc207 pmxcfs[1850]: [dcdb] received sync request (epoch 1/1850/0000000B)
Dec 20 12:24:51 fbc207 pmxcfs[1850]: [status] members: 1/1850, 2/1288, 3/1640
Dec 20 12:24:51 fbc207 pmxcfs[1850]: [status] received sync request (epoch 1/1850/0000000B)
Dec 20 12:24:51 fbc207 pmxcfs[1850]: [dcdb] received all states
Dec 20 12:24:51 fbc207 pmxcfs[1850]: [dcdb] leader is 1/1850
Dec 20 12:24:51 fbc207 pmxcfs[1850]: [dcdb] synced members: 1/1850, 2/1288, 3/1640
Dec 20 12:24:51 fbc207 pmxcfs[1850]: [dcdb] start sending inode updates
Dec 20 12:24:51 fbc207 pmxcfs[1850]: [dcdb] sent all (0) updates
Dec 20 12:24:51 fbc207 pmxcfs[1850]: [dcdb] all data is up to date
Dec 20 12:24:51 fbc207 pmxcfs[1850]: [status] received all states
Dec 20 12:24:51 fbc207 pmxcfs[1850]: [status] all data is up to date
Dec 20 12:24:51 fbc207 pmxcfs[1850]: [status] dfsm_deliver_queue: queue length 32

Dec 20 13:15:33 fbc207 fenced[2087]: fencing node fbc208 still retrying



 
I guess the easiest way to solve that is to reboot fbc208.

Seems fencing is active, but you did not configure any fencing devices. Do you run clvmd or rgmanager (I wonder why fencing is triggered)?
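For reference, fence devices and per-node methods are declared in cluster.conf; a minimal sketch of what a configured setup would look like (the fence_ipmilan agent, address, and credentials below are hypothetical placeholders, not this cluster's values):
Code:
<fencedevices>
  <fencedevice agent="fence_ipmilan" name="ipmi208" ipaddr="10.100.100.218" login="admin" passwd="secret"/>
</fencedevices>

<clusternodes>
  <clusternode name="fbc208" nodeid="4" votes="1">
    <fence>
      <method name="1">
        <device name="ipmi208"/>
      </method>
    </fence>
  </clusternode>
</clusternodes>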
 
I had already rebooted a few times.

I did not set up or install clvmd or rgmanager.

This morning another node, fbc207, has this:
Code:
df: `/etc/pve': Transport endpoint is not connected

and the last part of syslog:
Code:
Dec 21 05:37:24 fbc207 pvestatd[470390]: WARNING: ipcc_send_rec failed: Connection refused
Dec 21 05:37:24 fbc207 pvestatd[470390]: WARNING: ipcc_send_rec failed: Connection refused
Dec 21 05:37:34 fbc207 pvestatd[470390]: WARNING: ipcc_send_rec failed: Connection refused
Dec 21 05:37:34 fbc207 pvestatd[470390]: WARNING: ipcc_send_rec failed: Connection refused
Dec 21 05:37:34 fbc207 pvestatd[470390]: WARNING: ipcc_send_rec failed: Connection refused
Dec 21 05:37:34 fbc207 pvestatd[470390]: WARNING: ipcc_send_rec failed: Connection refused
Dec 21 05:37:34 fbc207 pvestatd[470390]: WARNING: ipcc_send_rec failed: Connection refused
Dec 21 05:37:34 fbc207 pvestatd[470390]: WARNING: ipcc_send_rec failed: Connection refused

I'll check earlier in the log.

Before I went to sleep, this node was running fine.
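When df reports "Transport endpoint is not connected" for /etc/pve, the FUSE mount is stale. The usual way to clear it before restarting pmxcfs is something like this (a sketch, not verified on this node):
Code:
fusermount -u -z /etc/pve       # lazy-unmount the dead pmxcfs mount
/etc/init.d/pve-cluster restart
df /etc/pve                     # should list a fuse filesystem again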
 
The issue started when rsnapshot was rsyncing a lot of data over the network.
Code:
Dec 21 04:07:10 fbc207 rrdcached[1485]: flushing old values
Dec 21 04:07:10 fbc207 rrdcached[1485]: rotating journals
Dec 21 04:07:10 fbc207 rrdcached[1485]: started new journal /var/lib/rrdcached/journal//rrd.journal.1324458430.342788
Dec 21 04:07:10 fbc207 rrdcached[1485]: removing old journal /var/lib/rrdcached/journal//rrd.journal.1324451230.342545
Dec 21 04:08:13 fbc207 pvedaemon[890612]: <root@pam> successful auth for user 'root@pam'
Dec 21 04:11:02 fbc207 pmxcfs[1850]: [dcdb] data verification successful
Dec 21 04:15:01 fbc207 /USR/SBIN/CRON[917536]: (root) CMD (nice /usr/sbin/logrotate /fbc/etc/logrotate-fbc.conf)
Dec 21 04:17:01 fbc207 /USR/SBIN/CRON[917936]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Dec 21 04:23:13 fbc207 pvedaemon[890612]: <root@pam> successful auth for user 'root@pam'
Dec 21 04:30:01 fbc207 /USR/SBIN/CRON[920569]: (root) CMD (ionice -c 3 /usr/bin/rsnapshot -c /fbc/etc/rsnapshot-prox.conf  daily)
Dec 21 04:35:49 fbc207 rsnapshot[921789]: /usr/bin/rsnapshot -c /fbc/etc/rsnapshot-prox.conf daily: ERROR: /usr/bin/rsync returned 255
 while processing root@fbc208:/data/fbc208-storage/dump/
Dec 21 04:35:49 fbc207 rsnapshot[921790]: WARNING: root@fbc208:/etc/ skipped due to rollback plan
Dec 21 04:35:50 fbc207 rsnapshot[921793]: /usr/bin/rsnapshot -c /fbc/etc/rsnapshot-prox.conf daily: ERROR: /usr/bin/rsync returned 255
 while processing root@fbc208:/etc/pve/
Dec 21 04:37:10 fbc207 pvestatd[470390]: status update time (6.265 seconds)
Dec 21 04:37:23 fbc207 pvestatd[470390]: status update time (8.528 seconds)
Dec 21 04:37:41 fbc207 pvestatd[470390]: status update time (6.960 seconds)
Dec 21 04:37:51 fbc207 pvestatd[470390]: status update time (7.957 seconds)
Dec 21 04:38:09 fbc207 pvestatd[470390]: status update time (5.016 seconds)
Dec 21 04:38:13 fbc207 pvedaemon[884747]: <root@pam> successful auth for user 'root@pam'
Dec 21 04:38:20 fbc207 pvestatd[470390]: status update time (5.467 seconds)
Dec 21 04:39:50 fbc207 pvestatd[470390]: status update time (5.381 seconds)
Dec 21 04:40:00 fbc207 pvestatd[470390]: status update time (6.032 seconds)
Dec 21 04:40:12 fbc207 pvestatd[470390]: status update time (8.492 seconds)
Dec 21 04:40:23 fbc207 pvestatd[470390]: status update time (8.520 seconds)
Dec 21 04:40:29 fbc207 pvestatd[470390]: status update time (5.247 seconds)
Dec 21 04:41:34 fbc207 pvestatd[470390]: status update time (9.898 seconds)
Dec 21 04:41:50 fbc207 pvestatd[470390]: status update time (5.574 seconds)

Dec 21 04:42:02 fbc207 rsnapshot[923155]: WARNING: Some files and/or directories in root@fbc207:/etc/ only transferred partially during rsync operation
Dec 21 04:42:02 fbc207 rsnapshot[923160]: /usr/bin/rsnapshot -c /fbc/etc/rsnapshot-prox.conf daily: ERROR: /usr/bin/rsync returned 255 while processing root@fbc207:/etc/pve/
Dec 21 04:42:02 fbc207 rsnapshot[923161]: WARNING: Rolling back "fbc208/"
Dec 21 04:42:03 fbc207 rsnapshot[923164]: WARNING: Rolling back "fbc208/pve/"
Dec 21 04:42:03 fbc207 rsnapshot[923167]: WARNING: Rolling back "fbc207/pve/"
Dec 21 04:42:03 fbc207 rsnapshot[923169]: /usr/bin/rsnapshot -c /fbc/etc/rsnapshot-prox.conf daily: ERROR: /usr/bin/rsnapshot -c /fbc/etc/rsnapshot-prox.conf daily: completed, but with some errors
Dec 21 04:42:03 fbc207 /USR/SBIN/CRON[920567]: (CRON) error (grandchild #920569 failed with exit status 1)
Dec 21 04:42:04 fbc207 postfix/pickup[912348]: 02057CA14D: uid=0 from=<root>
Dec 21 04:42:04 fbc207 postfix/cleanup[923172]: 02057CA14D: message-id=<20111221094204.02057CA14D@fbc207.fantinibakery.com>
Dec 21 04:42:04 fbc207 postfix/qmgr[458304]: 02057CA14D: from=<root@fantinibakery.com>, size=4943, nrcpt=1 (queue active)
Dec 21 04:42:04 fbc207 pvestatd[470390]: WARNING: ipcc_send_rec failed: Transport endpoint is not connected
Dec 21 04:42:04 fbc207 pvestatd[470390]: WARNING: ipcc_send_rec failed: Connection refused
Dec 21 04:42:04 fbc207 pvestatd[470390]: WARNING: ipcc_send_rec failed: Connection refused
Dec 21 04:42:04 fbc207 pvestatd[470390]: WARNING: ipcc_send_rec failed: Connection refused
Dec 21 04:42:04 fbc207 pvestatd[470390]: WARNING: ipcc_send_rec failed: Connection refused
Dec 21 04:42:04 fbc207 pvestatd[470390]: WARNING: ipcc_send_rec failed: Connection refused
Dec 21 04:42:04 fbc207 postfix/smtp[923174]: 02057CA14D: to=<root@fantinibakery.com>, orig_to=<root>, relay=smtp.fantinibakery.com[10.100.100.14]:25, delay=0.3, delays=0.09/0.05/0.07/0.08, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 28E363C1)
Dec 21 04:42:04 fbc207 postfix/qmgr[458304]: 02057CA14D: removed
Dec 21 04:42:09 fbc207 pvedaemon[884747]: WARNING: ipcc_send_rec failed: Transport endpoint is not connected
Dec 21 04:42:09 fbc207 pvedaemon[884747]: WARNING: ipcc_send_rec failed: Connection refused
Dec 21 04:42:09 fbc207 pvedaemon[884747]: WARNING: ipcc_send_rec failed: Connection refused
Dec 21 04:42:10 fbc207 pvedaemon[882576]: WARNING: ipcc_send_rec failed: Transport endpoint is not connected
Dec 21 04:42:10 fbc207 pvedaemon[882576]: WARNING: ipcc_send_rec failed: Connection refused
 
So a disconnected network wire, or a network flooded during a backup, can break the cluster.

The question now is: can it be fixed?

rsync can use --bwlimit=KBPS, which I use for off-site backups.
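In rsnapshot that limit can be passed via rsync_long_args; a sketch (the 10000 KB/s cap is only illustrative, and rsnapshot.conf fields must be separated by tabs):
Code:
# rsnapshot-prox.conf: stock long args plus a bandwidth cap for cluster-friendly rsyncs
rsync_long_args	--delete --numeric-ids --relative --delete-excluded --bwlimit=10000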

For the cluster network, it seems that, as with drbd/heartbeat, dedicated network hardware should be used.

After a disconnected network wire or switch, the node should not stay permanently fenced.
 
Dec 21 05:37:34 fbc207 pvestatd[470390]: WARNING: ipcc_send_rec failed: Connection refused

That is because cman is not running on that node?
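A quick way to confirm that on the node (sketch):
Code:
/etc/init.d/cman status              # is the cluster manager running?
ps -C corosync,pmxcfs -o pid,args    # are corosync and pmxcfs alive?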

It is really impossible to debug, because you always post new issues.
 
rsnapshot is a Debian package that uses rsync to back up data; see aptitude show rsnapshot.
Using it I have daily/weekly/monthly backups on a server with extra space.

I know it is hard to debug this, but we are in different time zones, and I post issues as they come up and then try to fix things. I use my limited sysadmin skills, search the forum, Google, etc., but as this is beta software and my setup differs from your lab (for instance the managed-switch issue from 2-3 weeks ago), there are no easy answers, since no one else has run into these issues.


Some of these may be bugs or not; the data from the logs may be useful or not.

PS:
As you know, backups are very important. rsnapshot is to backups as beer is to a mug, or a mountain is to hiking boots.
 
The thing I need to figure out is how to recover a node that has been kicked out of the cluster by an extended network outage.

I'll set up another cluster after the holidays.
 
To add more info:

On node fbc1 I set up a workstation and did the following. It could be that something I installed interfered with the cluster:
Code:
history|grep install
  415  Tuesday  2011-12-20  [09:00:07 -0500] aptitude install gnome-core gdm3 libcurl3 xdg-utils
  463  Tuesday  2011-12-20  [12:13:20 -0500] aptitude install sun-java6-plugin
  469  Tuesday  2011-12-20  [14:35:49 -0500] aptitude install sun-java6-plugin
  524  Tuesday  2011-12-20  [17:31:49 -0500] aptitude -t squeeze-backports install iceweasel
  527  Tuesday  2011-12-20  [17:57:57 -0500] aptitude install alpine
  541  Tuesday  2011-12-20  [17:01:02 -0500] aptitude -t squeeze-backports install  icedove
and
Code:
aptitude install alarm-clock anacron apt-file \
                 cups-pdf detox dnsmasq dstat \
                 elinks firefox flashplugin-nonfree gpm  gajim \
                 gpointing-device-settings iptraf \
                 klipper konsole heirloom-mailx icedove \
                 libsasl2-modules \
                 pidgin-libnotify localepurge ntpdate nscd \
                 putty remmina shutter  \
                 sshfs sun-java6-plugin  \
                 thunderbird unison ubuntu-restricted-extras \
                 vim  vlc whois unzip zip

Then I set it up as an LDAP client:
Code:
aptitude  install libnss-ldap libpam-ldap nscd

I am not sure whether any of that interfered with the cluster.
 
I know it is hard to debug this, but we are in different time zones, and I post issues as they come up and then try to fix things.

Sure, I know ;-)

But in order to fix bugs, I need to be able to reproduce them here. So what exactly do you back up with rsync: only /etc, or the whole server?
 
I'll put my rsnapshot setup on the wiki later if you want.

What I back up is:

/etc

/etc/pve

dumps

The backups go to a 500 GB or greater partition on each node.

This is what the backups look like:
Code:
ls /bkup/rsnapshot-pve
daily.0  daily.1  daily.2  daily.3  daily.4  daily.5  daily.6  weekly.0



ls /bkup/rsnapshot-pve/*

root@fbc1 /bkup # ls /bkup/rsnapshot-pve/*
/bkup/rsnapshot-pve/daily.0:
fbc1  fbc207  fbc208


/bkup/rsnapshot-pve/daily.1:
fbc1  fbc207  fbc208


/bkup/rsnapshot-pve/daily.2:
fbc1  fbc207  fbc208


/bkup/rsnapshot-pve/daily.3:
fbc1  fbc207  fbc208


/bkup/rsnapshot-pve/daily.4:
fbc1  fbc207  fbc208


/bkup/rsnapshot-pve/daily.5:
fbc1  fbc186  fbc197


/bkup/rsnapshot-pve/daily.6:
fbc1  fbc186  fbc197


/bkup/rsnapshot-pve/weekly.0:
fbc1  fbc197

root@fbc1 /bkup # ls  /bkup/rsnapshot-pve/daily.1/fbc208/data/fbc208-storage/dump
vzdump-openvz-10001-2011_12_17-02_00_02.tgz  vzdump-openvz-8001-2011_12_20-02_39_27.tgz
vzdump-openvz-10002-2011_12_17-02_14_03.tgz  vzdump-openvz-8003-2011_12_17-02_36_23.tgz
vzdump-openvz-10004-2011_12_20-02_00_01.tgz  vzdump-openvz-8004-2011_12_20-02_41_28.tgz
vzdump-openvz-102-2011_12_20-02_01_32.tgz    vzdump-openvz-8007-2011_12_20-02_59_48.tgz
vzdump-openvz-128-2011_12_20-02_12_17.tgz    vzdump-openvz-8165-2011_12_16-11_45_03.tgz
vzdump-openvz-14101-2011_12_20-02_35_16.tgz  vzdump-qemu-16140-2011_12_17-03_25_49.tgz
vzdump-openvz-205-2011_12_20-02_37_00.tgz    vzdump-qemu-27159-2011_12_17-03_51_06.tgz
vzdump-openvz-7162-2011_12_20-02_38_28.tgz   vzdump-qemu-8016-2011_12_17-04_21_09.tgz

On PVE I back up daily the things that change a lot, and weekly the rest.

rsnapshot does this daily:
Code:
###############################
### BACKUP POINTS / SCRIPTS ###
###############################


backup  root@fbc1:/data/fbc1-storage/dump/      fbc1/
backup  root@fbc1:/etc/                         fbc1/
backup  root@fbc1:/etc/pve/                     fbc1/pve/


backup  root@fbc208:/data/fbc208-storage/dump/  fbc208/
backup  root@fbc208:/etc/                       fbc208/
backup  root@fbc208:/etc/pve/                   fbc208/pve/


# do not do here, as we already send to fbc1 at /data/fbc-bkup/ and doing both uses all /data 
# instead have fbc6 send directly to all at /data/fbc-bkup/ using fbc6-deal-with-bkup-dirs
#
#backup root@fbc6:/bkup/fbc-bkup/pvebkp2/       fbc6/
#backup root@fbc6:/etc/                         fbc6/


backup  root@fbc207:/data/fbc207-storage/dump/  fbc207/
backup  root@fbc207:/etc/                       fbc207/etc/
backup  root@fbc207:/etc/pve/                   fbc207/pve/

Our main data, like orders and billing, is on Proxmox 1.9, where I use drbd and heartbeat; that data changes every minute.

I have 10 CTs and some KVMs on Proxmox 2.0.

In addition, I use a container on each Proxmox system to do 4x-per-day backups of data that might change hourly from most systems. The disk usage is small; memory and CPU usage are not big.

So rsnapshot runs in a few places, and I coordinate the times so they do not run at the same time (see the cron sketch below).
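The staggering is just offset cron entries; an illustrative /etc/cron.d-style sketch (times are hypothetical except the daily one, which matches the cron line visible in the syslog above):
Code:
# offset start times per host so the rsnapshot runs never overlap
30 4 * * *  root  ionice -c 3 /usr/bin/rsnapshot -c /fbc/etc/rsnapshot-prox.conf daily
30 5 * * 6  root  ionice -c 3 /usr/bin/rsnapshot -c /fbc/etc/rsnapshot-prox.conf weekly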
 
