[SOLVED] Cluster Fails after one Day - PVE 6.0.4

Two-node cluster (a new R740 and an R340) over one switch keeps losing quorum within a day or so. The first time, restarting corosync.service on one node fixed it; today I had to restart it on both.

Do you just have one switch for all the network communication?

If that is really the case I'd suggest adding a second switch for cluster communication only. It does not have to be anything fancy, even a 100 Mbps one would do, as long as it's dedicated to corosync/kronosnet. I'd then give that link higher priority so that it only fails over to the other one if there are problems.

If it's only 2 - 3 nodes you could also omit the switch and just connect the nodes directly (ideally full mesh) for the second link.
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_adding_redundant_links_to_an_existing_cluster
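For reference, a redundant second link in corosync 3 looks roughly like this in /etc/pve/corosync.conf. This is only a sketch: the 10.10.10.x addresses for the dedicated link are made-up examples, and with the default passive link mode the link with the higher knet_link_priority value is the preferred one:

```
nodelist {
  node {
    name: vmhost02
    nodeid: 1
    quorum_votes: 1
    # existing shared network (example address)
    ring0_addr: 192.168.1.2
    # dedicated corosync link (example subnet)
    ring1_addr: 10.10.10.2
  }
  node {
    name: vmhost03
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.3
    ring1_addr: 10.10.10.3
  }
}

totem {
  interface {
    linknumber: 0
    knet_link_priority: 5
  }
  interface {
    # higher value = preferred link in passive mode
    linknumber: 1
    knet_link_priority: 10
  }
}
```

Remember to bump config_version in the totem section when editing the file, as described in the linked docs.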

Because I think that the load from replications will interfere with kronosnet/corosync packets too much.
 
We updated both nodes yesterday, and this morning we have the same separation.

I've attached what looks like the start of the issue on node 1 (called vmhost02; node 2 is called vmhost03), at about 02:00 I guess.

We do a backup at 02:00 every day to our tape system. Node 1 also replicates VMs to node 2 every 15 minutes. The backup on node 2 completes, but the two nodes seem to diverge.

Are you sure that your network is not overloaded during the backup? (Or is the load average on the node high?)

BTW, you really need a minimum of 3 nodes for your cluster, or you'll lose quorum each time a node is disconnected.
 

Thanks for the hints, I'll try that. The network is most probably saturated. We have separate switches for the servers and our office. I didn't expect the network to be so saturated that the small amount of cluster communication would not get through, but it seems this backup affects, or is a cause of, what happens.
 
The thing is that while the cluster communication is only a small amount of traffic, it's very latency sensitive, as the consensus algorithm it is based on makes some maximal timing assumptions. It should thus not share a network with storage, backup, or other high-bandwidth traffic patterns (even if they are not constant but happen just in bursts of a few seconds to minutes).
 
But if the network has latency issues for a while, the cluster should come back again once the issue is mitigated. What happens now is that the failure state persists until I restart corosync; it does not self-heal. How is that?

BTW, I've increased totem.token in corosync.conf from the 1000 ms default to 5000 ms. Do you think this will help a bit?
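For context, a quick sketch of how that setting plays out, assuming the formula documented in the corosync.conf(5) man page: the runtime token timeout is token + (number_of_nodes - 2) * token_coefficient, with a 650 ms default coefficient.

```python
def effective_token_timeout(token_ms=1000, nodes=2, token_coefficient_ms=650):
    """Runtime token timeout per corosync.conf(5):
    token + (number_of_nodes - 2) * token_coefficient."""
    return token_ms + max(0, nodes - 2) * token_coefficient_ms

# On a two-node cluster the coefficient does not kick in,
# so the configured value is used as-is.
print(effective_token_timeout(token_ms=1000, nodes=2))  # -> 1000
print(effective_token_timeout(token_ms=5000, nodes=2))  # -> 5000
```

So on two nodes the raise from 1000 ms to 5000 ms directly gives corosync five times as long to tolerate a latency spike before declaring the token lost; it can paper over short bursts, but it won't fix sustained saturation.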
 
