Added node then entire cluster went offline

josh89

New Member
Feb 24, 2014
So I went to join my 11th node to my cluster, got an "unable to copy ssh-id" error, and then all the other nodes in the cluster went red / offline. The corosync log shows a bunch of retransmits, then eventually an error about a processor failing and a new configuration forming, after which I can see all the other nodes leaving the cluster. Any ideas? I've tried rebooting a few nodes and restarting cman, pvestatd, etc., with no luck.

corosync.log
Code:
Feb 23 11:42:00 corosync [TOTEM ] Retransmit List: 110e 110f 1110 1111 1112 1113 1114 1115 1103 1104 1105 1106 1107 1108 1109 110a 110b 111b 10fb 10fc 10fd 110c 110d 1116 1117 1118 1119 111a 10fe 10ff
Feb 23 11:42:00 corosync [TOTEM ] Retransmit List: 110e 110f 1110 1111 1112 1113 1114 1115 1103 1104 1105 1106 1107 1108 1109 110a 110b 111b 10fb 10fc 10fd 110c 110d 1116 1117 1118 1119 111a 10fe 10ff
Feb 23 11:42:00 corosync [TOTEM ] Retransmit List: 110e 110f 1110 1111 1112 1113 1114 1115 1103 1104 1105 1106 1107 1108 1109 110a 110b 111b 10fb 10fc 10fd 110c 110d 1116 1117 1118 1119 111a 10fe 10ff
Feb 23 11:42:00 corosync [TOTEM ] Retransmit List: 110e 110f 1110 1111 1112 1113 1114 1115 1103 1104 1105 1106 1107 1108 1109 110a 110b 111b 10fb 10fc 10fd 110c 110d 1116 1117 1118 1119 111a 10fe 10ff
Feb 23 11:42:00 corosync [TOTEM ] Retransmit List: 110e 110f 1110 1111 1112 1113 1114 1115 1103 1104 1105 1106 1107 1108 1109 110a 110b 111b 10fb 10fc 10fd 110c 110d 1116 1117 1118 1119 111a 10fe 10ff
Feb 23 11:42:00 corosync [TOTEM ] Retransmit List: 110e 110f 1110 1111 1112 1113 1114 1115 1103 1104 1105 1106 1107 1108 1109 110a 110b 111b 10fb 10fc 10fd 110c 110d 1116 1117 1118 1119 111a 10fe 10ff
Feb 23 11:42:10 corosync [TOTEM ] A processor failed, forming new configuration.
Feb 23 11:42:12 corosync [CLM   ] CLM CONFIGURATION CHANGE
Feb 23 11:42:12 corosync [CLM   ] New Configuration:
Feb 23 11:42:12 corosync [CLM   ] 	r(0) ip(10.18.200.2)
Feb 23 11:42:12 corosync [CLM   ] Members Left:
Feb 23 11:42:12 corosync [CLM   ] 	r(0) ip(10.18.200.3)
Feb 23 11:42:12 corosync [CLM   ] 	r(0) ip(10.18.200.4)
Feb 23 11:42:12 corosync [CLM   ] 	r(0) ip(10.18.200.6)
Feb 23 11:42:12 corosync [CLM   ] 	r(0) ip(10.18.200.7)
Feb 23 11:42:12 corosync [CLM   ] 	r(0) ip(10.18.200.9)
Feb 23 11:42:12 corosync [CLM   ] 	r(0) ip(10.18.200.10)
Feb 23 11:42:12 corosync [CLM   ] 	r(0) ip(10.18.200.11)
Feb 23 11:42:12 corosync [CLM   ] 	r(0) ip(10.18.200.12)
Feb 23 11:42:12 corosync [CLM   ] 	r(0) ip(10.18.200.13)
Feb 23 11:42:12 corosync [CLM   ] Members Joined:
Feb 23 11:42:12 corosync [QUORUM] Members[9]: 1 3 4 5 6 7 8 9 10
Feb 23 11:42:12 corosync [QUORUM] Members[8]: 1 4 5 6 7 8 9 10
Feb 23 11:42:12 corosync [QUORUM] Members[7]: 1 5 6 7 8 9 10
Feb 23 11:42:12 corosync [QUORUM] Members[6]: 1 6 7 8 9 10
Feb 23 11:42:12 corosync [CMAN  ] quorum lost, blocking activity
Feb 23 11:42:12 corosync [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 23 11:42:12 corosync [QUORUM] Members[5]: 1 6 7 8 9
Feb 23 11:42:12 corosync [QUORUM] Members[4]: 1 6 7 8
Feb 23 11:42:12 corosync [QUORUM] Members[3]: 1 7 8
Feb 23 11:42:12 corosync [QUORUM] Members[2]: 1 8
Feb 23 11:42:12 corosync [QUORUM] Members[1]: 1
Feb 23 11:42:12 corosync [CLM   ] CLM CONFIGURATION CHANGE
Feb 23 11:42:12 corosync [CLM   ] New Configuration:
Feb 23 11:42:12 corosync [CLM   ] 	r(0) ip(10.18.200.2)
Feb 23 11:42:12 corosync [CLM   ] Members Left:
Feb 23 11:42:12 corosync [CLM   ] Members Joined:
Feb 23 11:42:12 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 23 11:42:12 corosync [CPG   ] chosen downlist: sender r(0) ip(10.18.200.2) ; members(old:10 left:9)
Feb 23 11:42:12 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Feb 23 11:47:24 corosync [SERV  ] Unloading all Corosync service engines.
Feb 23 11:47:24 corosync [SERV  ] Service engine unloaded: corosync extended virtual synchrony service
Feb 23 11:47:24 corosync [SERV  ] Service engine unloaded: corosync configuration service
Feb 23 11:47:24 corosync [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Feb 23 11:47:24 corosync [SERV  ] Service engine unloaded: corosync cluster config database access v1.01
Feb 23 11:47:24 corosync [SERV  ] Service engine unloaded: corosync profile loading service
Feb 23 11:47:24 corosync [SERV  ] Service engine unloaded: openais cluster membership service B.01.01
Feb 23 11:47:24 corosync [SERV  ] Service engine unloaded: openais checkpoint service B.01.01
Feb 23 11:47:24 corosync [SERV  ] Service engine unloaded: openais event service B.01.01
Feb 23 11:47:24 corosync [SERV  ] Service engine unloaded: openais distributed locking service B.03.01
Feb 23 11:47:24 corosync [SERV  ] Service engine unloaded: openais message service B.03.01
Feb 23 11:47:24 corosync [SERV  ] Service engine unloaded: corosync CMAN membership service 2.90
Feb 23 11:47:24 corosync [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Feb 23 11:47:24 corosync [SERV  ] Service engine unloaded: openais timer service A.01.01
Feb 23 11:47:24 corosync [MAIN  ] Corosync Cluster Engine exiting with status 0 at main.c:1893.
Feb 23 11:49:33 corosync [MAIN  ] Corosync Cluster Engine ('1.4.5'): started and ready to provide service.
Feb 23 11:49:33 corosync [MAIN  ] Corosync built-in features: nss
Feb 23 11:49:33 corosync [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
Feb 23 11:49:33 corosync [MAIN  ] Successfully parsed cman config
Feb 23 11:49:33 corosync [MAIN  ] Successfully configured openais services to load
Feb 23 11:49:33 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
Feb 23 11:49:33 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).

My syslog is filled with tons of these:
Code:
Feb 24 04:46:25 proxmox1 pmxcfs[2971]: [status] crit: cpg_send_message failed: 9

Here is the status from the node that is still 'online':
Code:
Version: 6.2.0
Config Version: 11
Cluster Name: test-cluster
Cluster Id: 20404
Cluster Member: Yes
Cluster Generation: 88
Membership state: Cluster-Member
Nodes: 1
Expected votes: 11
Total votes: 1
Node votes: 1
Quorum: 6 Activity blocked
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: proxmox1
Node ID: 1
Multicast addresses: 239.192.79.4
Node addresses: 10.18.200.2

Here is the status from one of the others that is offline:
Code:
Version: 6.2.0
Config Version: 10
Cluster Name: test-cluster
Cluster Id: 20404
Cluster Member: Yes
Cluster Generation: 88
Membership state: Cluster-Member
Nodes: 1
Expected votes: 10
Total votes: 1
Node votes: 1
Quorum: 6 Activity blocked
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: proxmox2
Node ID: 2
Multicast addresses: 239.192.79.4
Node addresses: 10.18.200.3
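One thing that stands out when comparing the two dumps: the online node reports Config Version: 11 while the offline one is still on 10, so the updated cluster.conf from the failed join apparently never reached the other nodes. A throwaway check for that kind of drift (the helper is hypothetical, not a Proxmox tool):

```python
# Hypothetical helper to pull "Config Version" out of cman_tool status output;
# not part of Proxmox, just a quick way to spot cluster.conf drift between nodes.

def config_version(status_text: str) -> int:
    for line in status_text.splitlines():
        if line.startswith("Config Version:"):
            return int(line.split(":", 1)[1])
    raise ValueError("no 'Config Version' line found")

node1 = "Cluster Name: test-cluster\nConfig Version: 11\n"
node2 = "Cluster Name: test-cluster\nConfig Version: 10\n"

if config_version(node1) != config_version(node2):
    print("cluster.conf is out of sync between nodes")
```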

PVE version is: 3.1-21/93bf03d4

Let me know if there is any other information I can provide.
 
Quorum is definitely lost (?!?)
Not sure if it helps, but you can temporarily set "pvecm expected 1" to regain quorum.

How exactly did you do the 11th join?

Marco
 
Yeah, I've tried that as well and it doesn't do anything to bring them back online (not sure if I need to run it on all nodes or just one or ...?)
 
You might need to do it on all nodes, since cluster-wide communication is not reliable when quorum is lost.
But how could adding one node make the cluster lose quorum? Maybe your settings for detecting quorum need an overhaul.
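For what it's worth, cman derives the quorum threshold as a simple majority of the expected votes: floor(expected_votes / 2) + 1. A small sketch of that arithmetic (plain Python, helper name is mine; the results match the "Quorum: 6" lines in the status dumps above):

```python
# Sketch of cman's quorum threshold: a strict majority of the expected votes.
def quorum_threshold(expected_votes: int) -> int:
    """More than half of the expected votes are required for quorum."""
    return expected_votes // 2 + 1

# 11 expected votes -> threshold 6, matching "Quorum: 6" on the online node
print(quorum_threshold(11))  # 6
# 10 expected votes also gives 6, so the offline node agrees on the threshold
print(quorum_threshold(10))  # 6
# "pvecm expected 1" drops the threshold to 1, letting a lone node stay quorate
print(quorum_threshold(1))   # 1
```

With each node holding a single vote against a threshold of 6, a node that only sees itself stays blocked, which matches the "Activity blocked" lines in both status dumps.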
 
I tried setting it to expected 1 on all nodes and that didn't do anything. I left everything at the defaults when I was adding nodes, so I haven't adjusted anything quorum-specific.
 
So I think I found the issue(s)...

I had installed a 3.10 kernel out-of-band of the Proxmox updates because I needed some hardware support (the system would become unstable with the 2.6 kernel, and process clock times were abnormal). Updating the kernel fixed those issues, but I guess something else happened with corosync and maybe the DLM kernel module?

Anyways, I ran this on all nodes, and rebooted:

Code:
echo "deb http://download.proxmox.com/debian wheezy pvetest" >> /etc/apt/sources.list
aptitude update
aptitude upgrade -y
apt-get install pve-kernel-3.10.0-1-pve -y
apt-get remove --purge linux-image-3.10.5 -y
update-grub

and now my cluster is coming back online; all the nodes are up again. Let me know if I can provide any more information to help debug this.