Two-node cluster (new R740 and R340) over one switch, losing quorum within a day or so. Once I restarted corosync.service on one node; today I had to restart it on both.
Are you sure that your network is not overloaded during the backup? (Or is the load average on the node high?)
We updated both nodes yesterday, and this morning we have the same separation.
I have attached the possible start of the issue on node 1 (called vmhost02; node 2 is called vmhost03), at about 02:00 I guess.
We back up to our tape system at 02:00 every day. Node 1 also replicates VMs to node 2 every 15 minutes. The backup on node 2 completes, but the two nodes seem to diverge.
BTW, you really need 3 nodes minimum for your cluster, or you'll lose quorum each time a node is disconnected.
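To illustrate why two nodes are not enough, here is a minimal sketch of the majority arithmetic. It assumes one vote per node and plain majority voting; the function names are made up for this example, and special setups such as corosync's two_node option or an external QDevice are not modelled.

```python
def votes_needed(total_votes: int) -> int:
    """Strict majority: more than half of all expected votes."""
    return total_votes // 2 + 1


def is_quorate(total_votes: int, reachable_votes: int) -> bool:
    """True if the reachable nodes still hold a strict majority."""
    return reachable_votes >= votes_needed(total_votes)


# Compare a 2-node and a 3-node cluster after losing contact with one node.
for nodes in (2, 3):
    survivors = nodes - 1
    print(f"{nodes} nodes: need {votes_needed(nodes)} votes, "
          f"{survivors} reachable, quorate={is_quorate(nodes, survivors)}")
```

With two nodes the majority is 2 votes, so a single surviving node is never quorate. With three nodes the majority is still 2, so the cluster tolerates one missing node.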
Do you just have one switch for all the network communication?
If that is really the case, I'd suggest adding a second switch for cluster communication only. It does not have to be anything fancy; even 100 Mbps would do, as long as it's dedicated to corosync/kronosnet. I'd then give that link higher priority so that it only fails over to the other one if there are problems.
If it's only 2 - 3 nodes you could also omit the switch and just connect the nodes directly (ideally full mesh) for the second link.
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_adding_redundant_links_to_an_existing_cluster
I suggest this because I think the load from replication will interfere too much with kronosnet/corosync packets.
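As a rough sketch only, a dedicated second link with higher priority could look something like the fragment below in /etc/pve/corosync.conf. The IP addresses and node IDs are placeholders I made up, not values from this cluster, and all other existing settings are left out; the linked documentation describes the actual procedure, including increasing config_version on every edit.

```
totem {
  # existing shared network, used only as fallback
  interface {
    linknumber: 0
    knet_link_priority: 10
  }
  # new dedicated corosync link; the higher value is preferred
  interface {
    linknumber: 1
    knet_link_priority: 20
  }
}

nodelist {
  node {
    name: vmhost02
    nodeid: 1
    quorum_votes: 1
    # placeholder addresses, replace with your own
    ring0_addr: 192.0.2.2
    ring1_addr: 198.51.100.2
  }
  node {
    name: vmhost03
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.0.2.3
    ring1_addr: 198.51.100.3
  }
}
```

With kronosnet in its default passive link mode, traffic should stay on link 1 while it is healthy and move to link 0 only if link 1 fails.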