Cluster node disconnected randomly

Adnan

Renowned Member
Oct 4, 2012
Paris, France
Hi, I have 4 servers across 2 physical sites, named PVE1, PVE2, PVE3 (on site 1) and PVE4 (on site 2).

Sites 1 and 2 are physically across the street, currently connected together via VPN over WAN and we are installing a radio PtP link between the sites. Anyway, the issue is something else.

Sometimes, PVE4 appears disconnected in the cluster, whether I’m connected to PVE1, PVE2, or PVE3, BUT PVE4 is still online: it can be pinged, can be accessed via SSH or the web UI, and it sees the other 3 as disconnected.

PVE4 only has test VMs, so usually we simply reboot it and it’s back in the cluster; rarely, we need to reboot all of them to “reconnect” them all together. I know that it’s not a recommended way to connect nodes in a cluster… now I’m looking for ideas or solutions.

Is it possible to reconnect the nodes together without rebooting them? Maybe a corosync recheck, or something equivalent, please? (Something we could trigger automatically from Zabbix, for instance.)

Thanks
 
I do things like this. For fun. Not in Prod.
You clearly know the following bit already.

https://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network

The Proxmox VE cluster stack requires a reliable network with latencies under 5 milliseconds (LAN performance) between all nodes to operate stably. While on setups with a small node count a network with higher latencies may work, this is not guaranteed and gets rather unlikely with more than three nodes and latencies above around 10 ms.

I'd try
systemctl stop pve-cluster
systemctl stop corosync
and then start them again in reverse order (corosync first, then pve-cluster).
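A minimal sketch of that restart on a single node, wrapped in a function for convenience. The order matters: pve-cluster depends on corosync, so stop pve-cluster first and start it last.

```shell
# Sketch only: restart the Proxmox VE cluster stack on one node.
# pve-cluster sits on top of corosync, so it goes down first and up last.
restart_cluster_stack() {
    systemctl stop pve-cluster
    systemctl stop corosync
    systemctl start corosync
    systemctl start pve-cluster
}
```

Call `restart_cluster_stack` as root on the affected node; if /etc/pve comes back writable, quorum has been re-established.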

(I don't mean to sound snide. I wish this did work. I would love to use storage replication to do DR at a remote site. But it doesn't work. And they are pretty darn clear that it won't.)
 
Sites 1 and 2 are physically across the street, currently connected together via VPN over WAN and we are installing a radio PtP link between the sites. Anyway, the issue is something else.

Actually, this is your issue. :)

Sometimes, PVE4 appears disconnected in the cluster, whether I’m connected to PVE1, PVE2, or PVE3, BUT PVE4 is still online: it can be pinged, can be accessed via SSH or the web UI, and it sees the other 3 as disconnected.

I would be very skeptical of the GUI; it might be lagging. What may actually be happening is that you are losing quorum on and off in quick succession, something the GUI does not reflect. I would definitely first check how often this happens with journalctl -u corosync on the odd node out on "the other side".
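One quick way to see how often this happens is to count new corosync membership formations in the journal. A sketch, assuming a systemd-based node (the `-24h` window is an arbitrary choice):

```shell
# Count corosync membership changes in the last 24 hours.
# A high count suggests the node is flapping in and out of the cluster.
count_membership_changes() {
    journalctl -u corosync --since "-24h" | grep -c "membership"
}
```

Run `count_membership_changes` on PVE4 and on one of the site-1 nodes and compare; anything beyond a handful per day points at the link, not the node.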

PVE4 only has test VMs so usually we simply reboot it and it’s back in the cluster, and rarely we need to reboot all of them to “reconnect” them all together.

It's a bigger issue than you think. Every time a node leaves or re-appears, it disrupts the entire cluster: the remaining nodes have to form a new "membership" to keep exchanging messages (this is the toll of quorum, as opposed to a master-slave system). While they can't exchange messages, they can't update files in /etc/pve (it appears read-only), so the cluster isn't really capable of doing anything. This is all despite 3 nodes being enough for quorum.

I know that it’s not a recommended or advised way to connect nodes in a cluster… now I’m looking for ideas or solutions.

Well, as @tcabernoch pitched (I don't even want to repost it ;) ... I mean for this particular case), restarting the said services is really what matters (it's basically just that, during a reboot, that gets you back up). But as you discovered, you sometimes need to do it on all of them. In that case you have to do it on them all together: stop the services on all nodes first, then start them again one by one, basically coaxing them to catch up with one another. Start with the 3 on the same site, then add the 4th.
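The stop-everywhere-then-start-one-by-one dance above can be sketched from a workstation with root SSH keys on every node. The node names are assumptions; adjust them, and note the list is ordered so the three site-1 nodes come back before PVE4:

```shell
#!/bin/bash
# Sketch only: cluster-wide restart of the Proxmox VE cluster stack.
# Hypothetical node names; pve4 is listed last so it rejoins an
# already-formed quorum of the three site-1 nodes.
rolling_restart() {
    local nodes="pve1 pve2 pve3 pve4"
    local n
    # Stop the stack on every node first, so no node tries to sync
    # with a peer that is still running a stale membership.
    for n in $nodes; do
        ssh "root@$n" "systemctl stop pve-cluster; systemctl stop corosync"
    done
    # Then bring nodes back one by one, corosync before pve-cluster.
    for n in $nodes; do
        ssh "root@$n" "systemctl start corosync && systemctl start pve-cluster"
    done
}
```

Run `rolling_restart` and watch `pvecm status` on one of the site-1 nodes; quorum should appear after the second or third start.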

Is it possible to reconnect the nodes together without rebooting them? Maybe a corosync recheck, or something equivalent please ? (Something we can program on a zabbix to trigger automatically for instance).

Yeah so ... you basically asked for something I have yet to add to my "tutorial" [1]. I haven't gotten to it yet because in the process I went through some interesting situations and do not want to be taken apart for posting it (just yet). :) You basically have to have /var/lib/pve-cluster/config.db in a consistent state across all nodes and launch them from that state. For that you need a sort of "control" node (which I happen to have in that Ansible scenario), but this goes against the Proxmox ideology of no masters, so ...
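As for automating it from Zabbix: a very rough sketch of a local watchdog is below. It checks the quorum flag from `pvecm status` and restarts the stack only when quorum is lost. The function names are made up; treat this as an idea, not a finished remediation (a flapping WAN link could make it restart services in a loop).

```shell
#!/bin/bash
# Hypothetical watchdog sketch, e.g. as a Zabbix remote action.

# Succeeds (exit 0) when the local node reports "Quorate: Yes".
check_quorum() {
    pvecm status 2>/dev/null | grep -q "Quorate:.*Yes"
}

# Restart the cluster stack only when quorum is lost.
heal_if_needed() {
    if ! check_quorum; then
        systemctl stop pve-cluster
        systemctl stop corosync
        systemctl start corosync
        systemctl start pve-cluster
    fi
}
```

You would still want rate-limiting on the Zabbix side, otherwise a lossy link just turns into an endless restart loop.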


The official support reply for this scenario would always be to take node 4 out of the cluster and find another way.

Post the corosync log if you want to have an idea what you are setting yourself up for.

[1] https://forum.proxmox.com/threads/dhcp-cluster-deployment.154780/#post-706594
 
