Cluster constantly losing quorum after node removal

tuxis

Hi,

Yesterday I removed a node from my cluster, shrinking it from six to five nodes. First I shut down the node to be removed (it had no VMs configured on it), then I ran 'pvecm delnode proxmox1-2'.

Since I removed that node, my cluster keeps falling apart. Restarting all cluster-related services works for a while, but after a few minutes one node loses quorum, and the rest follow. There is no pattern in which node loses quorum first.
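
For reference, here is the vote arithmetic as I understand it. This is only a sketch, assuming every node carries exactly one vote and quorum is a strict majority of the expected votes; the function names are mine, not anything from pvecm or corosync.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest strict majority of the expected votes (assumed rule)."""
    return total_votes // 2 + 1


def has_quorum(visible_votes: int, total_votes: int) -> bool:
    """True if the nodes that can see each other form a majority."""
    return visible_votes >= votes_needed(total_votes)


# Before and after the node removal, assuming one vote per node.
for nodes in (6, 5):
    print(f"{nodes} nodes: {votes_needed(nodes)} votes needed")
```

Under that assumption a five-node cluster should survive two nodes dropping out, so a single node leaving should not by itself take the rest down.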

Nothing changed network-wise, so multicast issues seem unlikely. Not all nodes are running the exact same version of Proxmox, but that wasn't an issue before the node removal, so I don't expect it to be one now.

What I do see, and what looks odd, are the following messages:
Jul 25 12:50:39 proxmox1-99 rrdcached[26823]: queue_thread_main: rrd_update_r (/var/lib/rrdcached/db/pve2-storage/proxmox1-99/zstore-proxmox1) failed with status -1. (/var/lib/rrdcached/db/pve2-storage/proxmox1-99/zstore-proxmox1: illegal attempt to update using time 1406285127 when last update time is 1406285227 (minimum one second step))

Those messages show up on all nodes, for different RRDs. Restarting rrdcached doesn't help.
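
To see how far apart the two timestamps in that message are, a small sketch like the one below pulls them out of the error text. The regular expression and the function name are my own, written against the one message quoted above; whether the gap points at clocks drifting between nodes is only a guess on my part.

```python
import re
from typing import Optional

# Error text copied from the rrdcached message quoted above.
MESSAGE = (
    "illegal attempt to update using time 1406285127 "
    "when last update time is 1406285227 (minimum one second step)"
)

# Hypothetical pattern, matched against this one message format only.
PATTERN = re.compile(r"update using time (\d+) when last update time is (\d+)")


def update_gap_seconds(message: str) -> Optional[int]:
    """Return how many seconds the rejected update lags the last stored one."""
    match = PATTERN.search(message)
    if match is None:
        return None
    attempted, last = (int(group) for group in match.groups())
    return last - attempted


print(update_gap_seconds(MESSAGE))  # 100
```

So the rejected update is stamped 100 seconds earlier than the last update already stored in that RRD.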

I'm kinda lost, so any hit with the cluebat is much appreciated.
 
I do see these messages:
proxmox1-4.dmz.tuxis.net/local4.log:Jul 25 14:28:20 proxmox1-4.dmz.tuxis.net corosync[32454]: [TOTEM ] Process pause detected for 5144 ms, flushing membership messages.
proxmox1-4.dmz.tuxis.net/local4.log:Jul 25 14:28:20 proxmox1-4.dmz.tuxis.net corosync[32454]: [TOTEM ] Process pause detected for 5144 ms, flushing membership messages.
proxmox1-4.dmz.tuxis.net/local4.log:Jul 25 14:28:20 proxmox1-4.dmz.tuxis.net corosync[32454]: [TOTEM ] Process pause detected for 5213 ms, flushing membership messages.
proxmox1-4.dmz.tuxis.net/local4.log:Jul 25 14:28:20 proxmox1-4.dmz.tuxis.net corosync[32454]: [TOTEM ] Process pause detected for 5213 ms, flushing membership messages.
proxmox1-4.dmz.tuxis.net/local4.log:Jul 25 14:28:20 proxmox1-4.dmz.tuxis.net corosync[32454]: [TOTEM ] Process pause detected for 5282 ms, flushing membership messages.


These show up on all boxes. I've read elsewhere on the forum that this might have something to do with how busy the servers are, or with network latency. But the boxes are loaded at roughly 10-15% CPU. I've pinged around a lot, and that shows these results:
--- 10.10.0.3 ping statistics ---
3628 packets transmitted, 3628 received, 0% packet loss, time 3627058ms
rtt min/avg/max/mdev = 0.113/0.169/0.338/0.013 ms

--- 10.10.0.4 ping statistics ---
3614 packets transmitted, 3614 received, 0% packet loss, time 3613081ms
rtt min/avg/max/mdev = 0.084/0.136/0.654/0.041 ms

--- 10.10.0.5 ping statistics ---
3600 packets transmitted, 3600 received, 0% packet loss, time 3598999ms
rtt min/avg/max/mdev = 0.048/0.100/0.523/0.069 ms

--- 10.10.0.99 ping statistics ---
3592 packets transmitted, 3592 received, 0% packet loss, time 3591001ms
rtt min/avg/max/mdev = 0.065/0.228/8.142/0.189 ms

--- 10.10.0.1 ping statistics ---
3585 packets transmitted, 3585 received, 0% packet loss, time 3584198ms
rtt min/avg/max/mdev = 0.108/0.190/2.908/0.071 ms


Which doesn't show a network issue, if you ask me.
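
To put the two sets of numbers side by side, here is a rough sketch that extracts the pause durations from the corosync lines and the maximum round-trip time from the ping summaries. The sample lines are shortened copies of the output above, and the patterns and function names are my own assumptions about those two formats, nothing official.

```python
import re

# Shortened copies of lines quoted in this post.
COROSYNC_LINES = [
    "corosync[32454]: [TOTEM ] Process pause detected for 5144 ms, flushing membership messages.",
    "corosync[32454]: [TOTEM ] Process pause detected for 5213 ms, flushing membership messages.",
    "corosync[32454]: [TOTEM ] Process pause detected for 5282 ms, flushing membership messages.",
]
PING_LINES = [
    "rtt min/avg/max/mdev = 0.113/0.169/0.338/0.013 ms",
    "rtt min/avg/max/mdev = 0.084/0.136/0.654/0.041 ms",
    "rtt min/avg/max/mdev = 0.048/0.100/0.523/0.069 ms",
    "rtt min/avg/max/mdev = 0.065/0.228/8.142/0.189 ms",
    "rtt min/avg/max/mdev = 0.108/0.190/2.908/0.071 ms",
]

PAUSE_RE = re.compile(r"Process pause detected for (\d+) ms")
RTT_RE = re.compile(r"rtt min/avg/max/mdev = [\d.]+/[\d.]+/([\d.]+)/[\d.]+ ms")


def pause_durations_ms(lines):
    """Collect every reported process pause, in milliseconds."""
    return [int(m.group(1)) for line in lines if (m := PAUSE_RE.search(line))]


def max_rtt_ms(lines):
    """Largest 'max' round-trip time across all ping summaries."""
    return max(float(m.group(1)) for line in lines if (m := RTT_RE.search(line)))


print("longest pause (ms):", max(pause_durations_ms(COROSYNC_LINES)))
print("worst ping rtt (ms):", max_rtt_ms(PING_LINES))
```

The reported pauses are a bit over five seconds each, while the worst ping round trip is about 8 ms, so ping latency alone does not seem to account for them.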
 
