Fence victims differ per node

Sakis

Active Member
Aug 14, 2013
I was in the middle of updating my Proxmox nodes and had updated 3 of my 5 nodes. Today the cluster suddenly started shutting down all HA KVMs. I tried to start them, but the HA start procedure couldn't complete, so I removed them from cluster.conf and was then able to start them. Examining the logs, I saw plenty of totem retransmit lists, and the cluster lost quorum. No fencing was deferred to any node.
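
For reference, what I removed were the <pvevm> entries inside the <rm> section of cluster.conf, roughly like this (the vmid is just an example; editing goes via the usual cluster.conf.new + config_version bump + activate route):

Code:
# sketch only, not my exact file
grep -A 3 '<rm>' /etc/pve/cluster.conf
#  <rm>
#    <pvevm autostart="1" vmid="101"/>   <- one line per HA-managed VM (example vmid)
#  </rm>
# edit a copy as /etc/pve/cluster.conf.new, increase config_version,
# then activate it; after that the VMs can be started manually again
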
The result now is that the rgmanager service is stuck on all my nodes and I can't stop it (roughly what I tried is sketched after the output below). Also, the "fence_tool ls" output differs between nodes regarding victims.

Code:
root@node2:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   2   M   1616   2015-03-13 13:12:36  node2
   3   M   1664   2015-03-16 12:05:35  node3
   4   M   1652   2015-03-16 12:04:30  node4
   5   M   1652   2015-03-16 12:04:30  node8
   7   M   1620   2015-03-13 13:12:52  node11

node2
fence domain
member count  5
victim count  2
victim now    0
master nodeid 2
wait state    messages
members       2 3 4 5 7 

node3
fence domain
member count  5
victim count  4
victim now    0
master nodeid 2
wait state    messages
members       2 3 4 5 7 

node4
fence domain
member count  5
victim count  2
victim now    0
master nodeid 2
wait state    messages
members       2 3 4 5 7 

node8
fence domain
member count  5
victim count  4
victim now    0
master nodeid 2
wait state    messages
members       2 3 4 5 7 

node11
fence domain
member count  5
victim count  2
victim now    0
master nodeid 2
wait state    messages
members       2 3 4 5 7
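
This is roughly what I tried against the stuck rgmanager on each node (a sketch; exact output omitted):

Code:
clustat                      # rgmanager's cluster/service status view
service rgmanager status     # init-script status
service rgmanager stop       # this is what never completes for me
ps aux | grep [r]gmanager    # the daemon is still running afterwards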

How can I regain a stable state now without rebooting the nodes and messing with the KVMs?
 
Since yesterday, when the issue I described happened, I have lost two members from my cluster.

Code:
node2
pvecm status
cman_tool: Cannot open connection to cman, is it running ?

node3
pvecm status
cman_tool: Cannot open connection to cman, is it running ?

node4
Version: 6.2.0
Config Version: 581
Cluster Name: cluster
Cluster Id: 13364
Cluster Member: Yes
Cluster Generation: 1720
Membership state: Cluster-Member
Nodes: 3
Expected votes: 5
Total votes: 3
Node votes: 1
Quorum: 3  
Active subsystems: 6
Flags: 
Ports Bound: 0 177  
Node name: node4
Node ID: 4
Multicast addresses: 239.192.52.104 
Node addresses: 10.0.0.4 

node8
Version: 6.2.0
Config Version: 581
Cluster Name: cluster
Cluster Id: 13364
Cluster Member: Yes
Cluster Generation: 1720
Membership state: Cluster-Member
Nodes: 3
Expected votes: 5
Total votes: 3
Node votes: 1
Quorum: 3  
Active subsystems: 6
Flags: 
Ports Bound: 0 177  
Node name: node8
Node ID: 5
Multicast addresses: 239.192.52.104 
Node addresses: 10.0.0.8 

node11
Version: 6.2.0
Config Version: 581
Cluster Name: cluster
Cluster Id: 13364
Cluster Member: Yes
Cluster Generation: 1720
Membership state: Cluster-Member
Nodes: 3
Expected votes: 5
Total votes: 3
Node votes: 1
Quorum: 3  
Active subsystems: 6
Flags: 
Ports Bound: 0 177  
Node name: node11
Node ID: 7
Multicast addresses: 239.192.52.104 
Node addresses: 10.0.0.11

"fence_tool ls" output:

Code:
node2: no output (I also can't join because rgmanager is still running)
node3: no output (I also can't join because rgmanager is still running)

node4
fence domain
member count  3
victim count  3
victim now    0
master nodeid 2
wait state    messages
members       2 3 4 5 7 

node8
fence domain
member count  3
victim count  4
victim now    0
master nodeid 2
wait state    messages
members       2 3 4 5 7 

node11
fence domain
member count  3
victim count  3
victim now    0
master nodeid 2
wait state    messages
members       2 3 4 5 7

I tried to restart cman on node2 and node3 and got the following error:

Code:
Stopping cluster: 
   Leaving fence domain... [  OK  ]
   Stopping dlm_controld... [  OK  ]
   Stopping fenced... [  OK  ]
   Stopping cman... [  OK  ]
   Unloading kernel modules... [  OK  ]
   Unmounting configfs... [  OK  ]
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... [  OK  ]
   Starting fenced... [  OK  ]
   Starting dlm_controld... [  OK  ]
   Tuning DLM kernel config... [  OK  ]
   Unfencing self... fence_node: cannot connect to cman
[FAILED]
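
The restart above is just the stock cman init script; a minimal way to confirm the state afterwards (just a sketch) would be:

Code:
pgrep -l corosync    # is the corosync/cman process actually up?
cman_tool status     # if not, this gives the same "Cannot open connection to cman" error as pvecm status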

Any help?
 
This is the syslog from a healthy node (node11) when I restart cman on node3:

Code:
Mar 17 09:48:14 node11 corosync[5460]:   [CLM   ] CLM CONFIGURATION CHANGE
Mar 17 09:48:14 node11 corosync[5460]:   [CLM   ] New Configuration:
Mar 17 09:48:14 node11 corosync[5460]:   [CLM   ] #011r(0) ip(10.0.0.4) 
Mar 17 09:48:14 node11 corosync[5460]:   [CLM   ] #011r(0) ip(10.0.0.8) 
Mar 17 09:48:14 node11 corosync[5460]:   [CLM   ] #011r(0) ip(10.0.0.11) 
Mar 17 09:48:14 node11 corosync[5460]:   [CLM   ] Members Left:
Mar 17 09:48:14 node11 corosync[5460]:   [CLM   ] Members Joined:
Mar 17 09:48:14 node11 corosync[5460]:   [CLM   ] CLM CONFIGURATION CHANGE
Mar 17 09:48:14 node11 corosync[5460]:   [CLM   ] New Configuration:
Mar 17 09:48:14 node11 corosync[5460]:   [CLM   ] #011r(0) ip(10.0.0.3) 
Mar 17 09:48:14 node11 corosync[5460]:   [CLM   ] #011r(0) ip(10.0.0.4) 
Mar 17 09:48:14 node11 corosync[5460]:   [CLM   ] #011r(0) ip(10.0.0.8) 
Mar 17 09:48:14 node11 corosync[5460]:   [CLM   ] #011r(0) ip(10.0.0.11) 
Mar 17 09:48:14 node11 corosync[5460]:   [CLM   ] Members Left:
Mar 17 09:48:14 node11 corosync[5460]:   [CLM   ] Members Joined:
Mar 17 09:48:14 node11 corosync[5460]:   [CLM   ] #011r(0) ip(10.0.0.3) 
Mar 17 09:48:14 node11 corosync[5460]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 17 09:48:14 node11 corosync[5460]:   [QUORUM] Members[4]: 3 4 5 7
Mar 17 09:48:14 node11 corosync[5460]:   [QUORUM] Members[4]: 3 4 5 7
Mar 17 09:48:14 node11 corosync[5460]:   [CPG   ] chosen downlist: sender r(0) ip(10.0.0.4) ; members(old:3 left:0)
Mar 17 09:48:14 node11 corosync[5460]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 17 09:48:16 node11 pmxcfs[3827]: [dcdb] notice: members: 3/115302, 4/4026, 5/4042, 7/3827
Mar 17 09:48:16 node11 pmxcfs[3827]: [dcdb] notice: starting data syncronisation
Mar 17 09:48:16 node11 pmxcfs[3827]: [dcdb] notice: members: 3/115302, 4/4026, 5/4042, 7/3827
Mar 17 09:48:16 node11 pmxcfs[3827]: [dcdb] notice: starting data syncronisation
Mar 17 09:48:16 node11 pmxcfs[3827]: [dcdb] notice: received sync request (epoch 3/115302/00000001)
Mar 17 09:48:16 node11 pmxcfs[3827]: [dcdb] notice: received sync request (epoch 3/115302/00000001)
Mar 17 09:48:16 node11 pmxcfs[3827]: [dcdb] notice: received all states
Mar 17 09:48:16 node11 pmxcfs[3827]: [dcdb] notice: leader is 4/4026
Mar 17 09:48:16 node11 pmxcfs[3827]: [dcdb] notice: synced members: 4/4026, 5/4042, 7/3827
Mar 17 09:48:16 node11 pmxcfs[3827]: [dcdb] notice: all data is up to date
Mar 17 09:48:16 node11 pmxcfs[3827]: [dcdb] notice: received all states
Mar 17 09:48:16 node11 pmxcfs[3827]: [dcdb] notice: all data is up to date
Mar 17 09:48:20 node11 pvestatd[6532]: status update time (9.244 seconds)
Mar 17 09:48:26 node11 pvestatd[6532]: status update time (6.225 seconds)
Mar 17 09:48:36 node11 pvestatd[6532]: status update time (6.240 seconds)
Mar 17 09:48:56 node11 pvestatd[6532]: status update time (6.252 seconds)
Mar 17 09:49:12 node11 corosync[5460]:   [TOTEM ] A processor failed, forming new configuration.
Mar 17 09:49:12 node11 pvestatd[6532]: status update time (12.239 seconds)
Mar 17 09:49:22 node11 pvestatd[6532]: status update time (9.231 seconds)
Mar 17 09:49:38 node11 pvestatd[6532]: status update time (6.238 seconds)
Mar 17 09:49:48 node11 pvestatd[6532]: status update time (6.228 seconds)
Mar 17 09:49:59 node11 pvestatd[6532]: status update time (6.245 seconds)
Mar 17 09:50:08 node11 corosync[5460]:   [CLM   ] CLM CONFIGURATION CHANGE
Mar 17 09:50:08 node11 corosync[5460]:   [CLM   ] New Configuration:
Mar 17 09:50:08 node11 corosync[5460]:   [CLM   ] #011r(0) ip(10.0.0.4) 
Mar 17 09:50:08 node11 corosync[5460]:   [CLM   ] #011r(0) ip(10.0.0.8) 
Mar 17 09:50:08 node11 corosync[5460]:   [CLM   ] #011r(0) ip(10.0.0.11) 
Mar 17 09:50:08 node11 corosync[5460]:   [CLM   ] Members Left:
Mar 17 09:50:08 node11 corosync[5460]:   [CLM   ] #011r(0) ip(10.0.0.3) 
Mar 17 09:50:08 node11 corosync[5460]:   [CLM   ] Members Joined:
Mar 17 09:50:08 node11 corosync[5460]:   [QUORUM] Members[3]: 4 5 7
Mar 17 09:50:08 node11 corosync[5460]:   [CLM   ] CLM CONFIGURATION CHANGE
Mar 17 09:50:08 node11 corosync[5460]:   [CLM   ] New Configuration:
Mar 17 09:50:08 node11 corosync[5460]:   [CLM   ] #011r(0) ip(10.0.0.4) 
Mar 17 09:50:08 node11 corosync[5460]:   [CLM   ] #011r(0) ip(10.0.0.8) 
Mar 17 09:50:08 node11 corosync[5460]:   [CLM   ] #011r(0) ip(10.0.0.11) 
Mar 17 09:50:08 node11 corosync[5460]:   [CLM   ] Members Left:
Mar 17 09:50:08 node11 corosync[5460]:   [CLM   ] Members Joined:
Mar 17 09:50:08 node11 corosync[5460]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 17 09:50:08 node11 kernel: dlm: closing connection to node 3
Mar 17 09:50:08 node11 corosync[5460]:   [CPG   ] chosen downlist: sender r(0) ip(10.0.0.4) ; members(old:4 left:1)
Mar 17 09:50:08 node11 pmxcfs[3827]: [dcdb] notice: members: 4/4026, 5/4042, 7/3827
Mar 17 09:50:08 node11 pmxcfs[3827]: [dcdb] notice: starting data syncronisation
Mar 17 09:50:08 node11 pmxcfs[3827]: [dcdb] notice: members: 4/4026, 5/4042, 7/3827
Mar 17 09:50:08 node11 pmxcfs[3827]: [dcdb] notice: starting data syncronisation
Mar 17 09:50:08 node11 corosync[5460]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 17 09:50:08 node11 pmxcfs[3827]: [dcdb] notice: received sync request (epoch 4/4026/00000029)
Mar 17 09:50:08 node11 pmxcfs[3827]: [dcdb] notice: received sync request (epoch 4/4026/00000029)
Mar 17 09:50:08 node11 pmxcfs[3827]: [dcdb] notice: received all states
Mar 17 09:50:08 node11 pmxcfs[3827]: [dcdb] notice: leader is 4/4026
Mar 17 09:50:08 node11 pmxcfs[3827]: [dcdb] notice: synced members: 4/4026, 5/4042, 7/3827
Mar 17 09:50:08 node11 pmxcfs[3827]: [dcdb] notice: all data is up to date
Mar 17 09:50:08 node11 pmxcfs[3827]: [dcdb] notice: received all states
Mar 17 09:50:08 node11 pmxcfs[3827]: [dcdb] notice: all data is up to date
Mar 17 09:50:08 node11 pmxcfs[3827]: [status] notice: dfsm_deliver_queue: queue length 1193
Mar 17 09:50:38 node11 pvestatd[6532]: status update time (6.247 seconds)
Mar 17 09:50:58 node11 pvestatd[6532]: status update time (6.237 seconds)
Mar 17 09:51:08 node11 pvestatd[6532]: status update time (6.248 seconds)
Mar 17 09:51:18 node11 pvestatd[6532]: status update time (6.232 seconds)
Mar 17 09:51:32 node11 pvestatd[6532]: status update time (9.250 seconds)
Mar 17 09:51:47 node11 pvestatd[6532]: status update time (15.232 seconds)
Mar 17 09:51:56 node11 pvestatd[6532]: status update time (9.236 seconds)
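
The "A processor failed, forming new configuration" right after node3 rejoins looks like the same totem retransmit problem as in my first post, so I want to rule out multicast between the nodes. Something along these lines (the usual omping test, run on all nodes at the same time) should show whether multicast traffic is being lost:

Code:
# run simultaneously on every node
omping -c 600 -i 1 -q node2 node3 node4 node8 node11
# longer/faster stress variant:
# omping -c 10000 -i 0.001 -F -q node2 node3 node4 node8 node11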
 
I decided to run "fence_node -vv node2", and the same for node3.
After boot, the two faulty nodes are in the cluster again.

"fence_tool ls"
node2 and node3 think the fence master is node4;
node4, node8 and node11 think the fence master is node2;
the victim counts also vary.
Code:
root@node2:~# for i in 2 3 4 8 11; do ssh node$i fence_tool ls; done
fence domain
member count  5
victim count  0
victim now    0
master nodeid 4
wait state    none
members       2 3 4 5 7 

fence domain
member count  5
victim count  1
victim now    0
master nodeid 4
wait state    none
members       2 3 4 5 7 

fence domain
member count  5
victim count  3
victim now    0
master nodeid 2
wait state    messages
members       2 3 4 5 7 

fence domain
member count  5
victim count  4
victim now    0
master nodeid 2
wait state    messages
members       2 3 4 5 7 

fence domain
member count  5
victim count  3
victim now    0
master nodeid 2
wait state    messages
members       2 3 4 5 7

All nodes are joined

Code:
root@node2:~# for i in 2 3 4 8 11; do ssh node$i cman_tool nodes; done
Node  Sts   Inc   Joined               Name
   2   M   1704   2015-03-17 15:58:44  node2
   3   M   1744   2015-03-17 15:58:44  node3
   4   M   1744   2015-03-17 15:58:44  node4
   5   M   1744   2015-03-17 15:58:44  node8
   7   M   1744   2015-03-17 15:58:44  node11
Node  Sts   Inc   Joined               Name
   2   M   1744   2015-03-17 15:58:45  node2
   3   M   1736   2015-03-17 15:19:35  node3
   4   M   1740   2015-03-17 15:19:37  node4
   5   M   1740   2015-03-17 15:19:37  node8
   7   M   1740   2015-03-17 15:19:37  node11
Node  Sts   Inc   Joined               Name
   2   M   1744   2015-03-17 15:58:45  node2
   3   M   1740   2015-03-17 15:19:37  node3
   4   M   1564   2015-02-02 22:00:51  node4
   5   M   1688   2015-03-16 17:17:45  node8
   7   M   1652   2015-03-16 12:04:30  node11
Node  Sts   Inc   Joined               Name
   2   M   1744   2015-03-17 15:58:45  node2
   3   M   1740   2015-03-17 15:19:37  node3
   4   M   1688   2015-03-16 17:17:45  node4
   5   M   1632   2015-03-16 10:24:38  node8
   7   M   1688   2015-03-16 17:17:45  node11
Node  Sts   Inc   Joined               Name
   2   M   1744   2015-03-17 15:58:45  node2
   3   M   1740   2015-03-17 15:19:37  node3
   4   M   1652   2015-03-16 12:04:30  node4
   5   M   1688   2015-03-16 17:17:45  node8
   7   M   1580   2015-03-03 13:35:40  node11


I notice that node2 and node3 have only "Ports Bound: 0" and are missing 177. What is this port?
Code:
root@node2:~# for i in 2 3 4 8 11; do ssh node$i cman_tool status; done
Version: 6.2.0
Config Version: 581
Cluster Name: cluster
Cluster Id: 13364
Cluster Member: Yes
Cluster Generation: 1744
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Node votes: 1
Quorum: 3  
Active subsystems: 6
Flags: 
Ports Bound: 0  
Node name: node2
Node ID: 2
Multicast addresses: 239.192.52.104 
Node addresses: 10.0.0.2 
Version: 6.2.0
Config Version: 581
Cluster Name: cluster
Cluster Id: 13364
Cluster Member: Yes
Cluster Generation: 1744
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Node votes: 1
Quorum: 3  
Active subsystems: 6
Flags: 
Ports Bound: 0  
Node name: node3
Node ID: 3
Multicast addresses: 239.192.52.104 
Node addresses: 10.0.0.3 
Version: 6.2.0
Config Version: 581
Cluster Name: cluster
Cluster Id: 13364
Cluster Member: Yes
Cluster Generation: 1744
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Node votes: 1
Quorum: 3  
Active subsystems: 6
Flags: 
Ports Bound: 0 177  
Node name: node4
Node ID: 4
Multicast addresses: 239.192.52.104 
Node addresses: 10.0.0.4 
Version: 6.2.0
Config Version: 581
Cluster Name: cluster
Cluster Id: 13364
Cluster Member: Yes
Cluster Generation: 1744
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Node votes: 1
Quorum: 3  
Active subsystems: 6
Flags: 
Ports Bound: 0 177  
Node name: node8
Node ID: 5
Multicast addresses: 239.192.52.104 
Node addresses: 10.0.0.8 
Version: 6.2.0
Config Version: 581
Cluster Name: cluster
Cluster Id: 13364
Cluster Member: Yes
Cluster Generation: 1744
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Node votes: 1
Quorum: 3  
Active subsystems: 6
Flags: 
Ports Bound: 0 177  
Node name: node11
Node ID: 7
Multicast addresses: 239.192.52.104 
Node addresses: 10.0.0.11
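
Since the missing port 177 presumably corresponds to a cluster daemon that node2 and node3 have not bound, a quick per-node comparison of the running cluster daemons (rough sketch; the pattern list is just my guess at the relevant processes) might narrow it down:

Code:
for i in 2 3 4 8 11; do
    echo "== node$i =="
    ssh node$i "pgrep -l 'corosync|fenced|dlm_controld|rgmanager'"
done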

Any help?
 
The cluster is very unstable. From hour to hour it loses or regains quorum. Reboots do not solve the problem. After a while the nodes get out of sync one by one.
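
If it helps with debugging, a small watch loop like this (just a sketch; the one-minute interval is arbitrary) can log when quorum flips on a node:

Code:
# append a timestamped quorum/vote snapshot once a minute
while true; do
    echo "=== $(date) ==="
    cman_tool status | grep -E 'Nodes|Total votes|Quorum'
    sleep 60
done >> /root/quorum-watch.log
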
Any help?