Cluster constantly losing quorum after node removal

tuxis · Jul 25, 2014

Hi,

Yesterday I removed a node from my cluster, shrinking it from six to five nodes. First I shut down the node to be removed (it had no VMs configured on it), then I ran 'pvecm delnode proxmox1-2'.

Since I removed that node, my cluster keeps falling apart. Restarting all cluster-related services works for a while, but after a few minutes one node loses quorum and the rest follow. There is no pattern in which node loses quorum first.

Nothing changed network-wise, so multicast issues seem unlikely. Not all nodes are running the exact same version of Proxmox, but that wasn't an issue before the node removal, so I don't expect it to be one now.

What I do see that is odd are the following messages:
Jul 25 12:50:39 proxmox1-99 rrdcached[26823]: queue_thread_main: rrd_update_r (/var/lib/rrdcached/db/pve2-storage/proxmox1-99/zstore-proxmox1) failed with status -1. (/var/lib/rrdcached/db/pve2-storage/proxmox1-99/zstore-proxmox1: illegal attempt to update using time 1406285127 when last update time is 1406285227 (minimum one second step))

Those messages show up on all nodes, for different RRDs. Restarting rrdcached doesn't help.
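To put a number on the message above: the update being rejected carries a timestamp older than the last one already stored. A minimal sketch for pulling both timestamps out of such a line and printing the difference follows. The function name rrd_update_lag is made up for illustration and is not part of rrdcached or Proxmox.

```shell
#!/bin/sh
# Hypothetical helper (not an rrdcached tool): extract both timestamps from an
# "illegal attempt to update" line and print how many seconds the rejected
# update lies behind the last stored update.
rrd_update_lag() {
    line=$1
    attempted=$(printf '%s\n' "$line" |
        sed -n 's/.*update using time \([0-9][0-9]*\) when.*/\1/p')
    last=$(printf '%s\n' "$line" |
        sed -n 's/.*last update time is \([0-9][0-9]*\).*/\1/p')
    # Give up if the line does not have the expected shape.
    if [ -z "$attempted" ] || [ -z "$last" ]; then
        return 1
    fi
    echo $((last - attempted))
}

# The message quoted in the post: the rejected update is 100 seconds old.
rrd_update_lag 'illegal attempt to update using time 1406285127 when last update time is 1406285227 (minimum one second step)'
```

For the line quoted above this prints 100, i.e. the update arrived 100 seconds behind data that had already been written.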

I'm kinda lost.. So any hit with the cluebat is much appreciated..

tuxis · Jul 25, 2014

Also, dlm_controld and fenced do not always die after losing quorum...

dietmar · Jul 25, 2014

Please check that the time is correct on all nodes.
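One way to do that check, as a rough sketch: collect the epoch time from every node and look at the spread. The function name clock_spread and the node names in the comment are placeholders, and collecting the values over sequential ssh calls adds a little skew of its own, so only a spread of several seconds would be meaningful.

```shell
#!/bin/sh
# Hypothetical helper: given epoch timestamps (one per node), print the
# difference in seconds between the newest and the oldest.
clock_spread() {
    min=$1
    max=$1
    for t in "$@"; do
        if [ "$t" -lt "$min" ]; then min=$t; fi
        if [ "$t" -gt "$max" ]; then max=$t; fi
    done
    echo $((max - min))
}

# Collecting real values could look like this (node names are placeholders):
#   clock_spread $(for n in node1 node2 node3; do ssh "$n" date +%s; done)

# Illustrative values only, not taken from the cluster in this thread:
clock_spread 1406285200 1406285201 1406285203
```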

tuxis · Jul 25, 2014

It is..

tuxis · Jul 25, 2014

I do see these messages:
proxmox1-4.dmz.tuxis.net/local4.log:Jul 25 14:28:20 proxmox1-4.dmz.tuxis.net corosync[32454]: [TOTEM ] Process pause detected for 5144 ms, flushing membership messages.
proxmox1-4.dmz.tuxis.net/local4.log:Jul 25 14:28:20 proxmox1-4.dmz.tuxis.net corosync[32454]: [TOTEM ] Process pause detected for 5144 ms, flushing membership messages.
proxmox1-4.dmz.tuxis.net/local4.log:Jul 25 14:28:20 proxmox1-4.dmz.tuxis.net corosync[32454]: [TOTEM ] Process pause detected for 5213 ms, flushing membership messages.
proxmox1-4.dmz.tuxis.net/local4.log:Jul 25 14:28:20 proxmox1-4.dmz.tuxis.net corosync[32454]: [TOTEM ] Process pause detected for 5213 ms, flushing membership messages.
proxmox1-4.dmz.tuxis.net/local4.log:Jul 25 14:28:20 proxmox1-4.dmz.tuxis.net corosync[32454]: [TOTEM ] Process pause detected for 5282 ms, flushing membership messages.

On all boxes. I've read elsewhere on the forum that this might have something to do with how busy the servers are, or with network latency. But the boxes are loaded at ~10-15% CPU. I've pinged around a lot, with these results:
--- 10.10.0.3 ping statistics ---
3628 packets transmitted, 3628 received, 0% packet loss, time 3627058ms
rtt min/avg/max/mdev = 0.113/0.169/0.338/0.013 ms

--- 10.10.0.4 ping statistics ---
3614 packets transmitted, 3614 received, 0% packet loss, time 3613081ms
rtt min/avg/max/mdev = 0.084/0.136/0.654/0.041 ms

--- 10.10.0.5 ping statistics ---
3600 packets transmitted, 3600 received, 0% packet loss, time 3598999ms
rtt min/avg/max/mdev = 0.048/0.100/0.523/0.069 ms

--- 10.10.0.99 ping statistics ---
3592 packets transmitted, 3592 received, 0% packet loss, time 3591001ms
rtt min/avg/max/mdev = 0.065/0.228/8.142/0.189 ms

--- 10.10.0.1 ping statistics ---
3585 packets transmitted, 3585 received, 0% packet loss, time 3584198ms
rtt min/avg/max/mdev = 0.108/0.190/2.908/0.071 ms

Which doesn't point to a network issue, if you ask me.
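Worth noting: plain ping is unicast ICMP, so clean results like these say nothing about whether multicast traffic gets through. A sketch of a multicast check with omping follows, assuming the omping package is installed on every node. The wrapper function omping_cmd is made up for illustration; it only prints the command, which then has to be started on all listed nodes at roughly the same time.

```shell
#!/bin/sh
# Hypothetical wrapper: print an omping invocation for the given node
# addresses. -c 600 -i 1 sends one probe per second for about ten minutes,
# long enough to cover the "few minutes" after which quorum is lost.
# -q limits the output to the final summary.
omping_cmd() {
    echo "omping -c 600 -i 1 -q $*"
}

# Addresses taken from the ping output in the post.
omping_cmd 10.10.0.1 10.10.0.3 10.10.0.4 10.10.0.5 10.10.0.99
```

If the multicast loss in the summary climbs after a few minutes while unicast stays clean, that points at the multicast path rather than at load or latency.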

tuxis · Aug 2, 2014

Ok, so I've been struggling with this all week. Here's what fixed it:
echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping

On each node. What prompted me to try this was this thread: http://pve.proxmox.com/pipermail/pve-devel/2013-June/008031.html
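For anyone applying the same fix, a small sketch for checking the current setting on every bridge of a node. The function name snooping_status is made up. It only reads the sysfs files and changes nothing.

```shell
#!/bin/sh
# Hypothetical helper: print "<bridge> <value>" for every bridge that exposes
# a multicast_snooping file. 1 means snooping is on, 0 means it is off.
# The sysfs root is a parameter only so the function can be tried against a
# scratch directory; by default it reads the real path used in the fix above.
snooping_status() {
    root=${1:-/sys/devices/virtual/net}
    for f in "$root"/*/bridge/multicast_snooping; do
        [ -e "$f" ] || continue      # glob did not match: no bridges here
        br=${f#"$root"/}             # strip the root prefix
        br=${br%%/*}                 # keep only the bridge name
        echo "$br $(cat "$f")"
    done
}

snooping_status
```

Keep in mind that a value written to sysfs does not survive a reboot. One common approach, stated here as an assumption and not something from this thread, is to repeat the echo in a post-up line for the bridge in /etc/network/interfaces.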
