[SOLVED] Cluster Fails after one Day - PVE 6.0.4

Two-node cluster (new R740 and R340) over one switch, losing quorum within a day or so. At first, restarting corosync.service on one node fixed it; today I had to restart it on both.

Do you just have one switch for all the network communication?

If that is really the case, I'd suggest adding a second switch for cluster communication only. It does not have to be anything fancy, even a 100 Mbps one would do, as long as it's dedicated to corosync/kronosnet. I'd then give that link higher priority so that it only fails over to the other one if there are problems.

If it's only 2 - 3 nodes you could also omit the switch and just connect the nodes directly (ideally full mesh) for the second link.
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_adding_redundant_links_to_an_existing_cluster
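As a rough sketch of what that could look like (all addresses, subnets, and the version number here are made up for illustration), the relevant parts of /etc/pve/corosync.conf with a second, dedicated link might be:

```
totem {
  # ... existing settings ...
  config_version: 5        # must be incremented on every change
  interface {
    linknumber: 0
    knet_link_priority: 20 # dedicated corosync link, preferred
  }
  interface {
    linknumber: 1
    knet_link_priority: 10 # shared/production network, fallback
  }
}

nodelist {
  node {
    name: vmhost02
    nodeid: 1
    ring0_addr: 10.10.10.2  # dedicated link (example subnet)
    ring1_addr: 192.168.1.2 # existing network (example subnet)
  }
  node {
    name: vmhost03
    nodeid: 2
    ring0_addr: 10.10.10.3
    ring1_addr: 192.168.1.3
  }
}
```

With kronosnet in its default passive link mode, the link with the higher knet_link_priority value is used as long as it is healthy.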

I suggest this because I think the load from replication will interfere with the kronosnet/corosync packets too much.
 
We updated both nodes yesterday, and this morning we have the same separation.

Attached is what I believe is the start of the issue on node 1 (called vmhost02; node 2 is called vmhost03), at about 02:00 I guess.

We do a backup at 02:00 every day to our tape system. Node 1 also replicates VMs to node 2 every 15 minutes. The backup on node 2 completes, but the two nodes seem to diverge.
Are you sure that your network is not overloaded during the backup? (Or is the load average on the node high?)
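To check this, you could measure latency and load while the backup runs; a rough sketch (adjust the peer name and timestamps to your setup):

```
# from vmhost02, during the backup window
ping -c 100 -i 0.2 vmhost03      # watch for latency spikes or packet loss
uptime                           # load average on the node
journalctl -u corosync --since "02:00"   # look for retransmit / link down messages
```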

BTW, you really need 3 nodes minimum for your cluster, or you'll lose quorum each time a node is disconnected.
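If a full third node is not an option, a QDevice on any small external machine can provide the third vote for a two-node cluster. A sketch (the IP is a placeholder for your external qnetd host, which needs corosync-qnetd installed):

```
# on both cluster nodes
apt install corosync-qdevice

# on one node, pointing at the external qnetd host
pvecm qdevice setup 192.0.2.10
```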
 
Thanks for the hints, I'll try that. The network is most probably saturated. We have separate switches for the servers and our office. I didn't expect the network to be so saturated that the small amount of cluster communication traffic would not get through, but it seems this backup affects, or is a cause of, what happens.
 
The thing is that while cluster communication only needs a small amount of bandwidth, it is very latency sensitive, as the consensus algorithm it is based on makes some maximal timing assumptions. It should therefore not be placed on networks carrying storage, backup, or other high-bandwidth traffic patterns (even if they are not constant but happen just in bursts of a few seconds to minutes).
 
But if the network has latency issues for a while, the cluster should come back once the issue is mitigated. What happens now is that the failure state persists until I restart corosync; it does not self-heal. Why is that?
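When it is stuck like this, the link and quorum state can be inspected before restarting anything (a sketch; the exact output shape varies):

```
pvecm status                 # quorum information as PVE sees it
corosync-quorumtool -s       # quorate yes/no, vote counts
corosync-cfgtool -s          # per-link status of the knet links
journalctl -u corosync -e    # recent corosync/knet log messages
```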

BTW, I've increased totem.token in corosync.conf from the 1000 ms default to 5000 ms. Do you think this will help a bit?
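For reference, a sketch of that change in /etc/pve/corosync.conf (the version number is illustrative; it must be incremented or the change is not applied):

```
totem {
  # ... existing settings ...
  config_version: 6   # increment on every edit
  token: 5000         # token timeout in milliseconds (default 1000)
}
```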