IMO, some things to look at regarding your cluster communications:
- How fast are your physical network interfaces (100 Mbit, 1 Gbit, 10 Gbit, or something faster)?
Your interfaces might be busy moving I/O traffic to and from your VMs, leaving no spare network capacity for cluster communication.
- Are you performing backups when the cluster(s) drop?
- Do you have interface errors and/or packet drops on your physical and virtual Ethernet interfaces (and on your external switches)?
- What is your I/O bandwidth rate at the moment a node drops out of your cluster?
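
If it helps, here is a rough sketch of how you could check the throughput, error, and drop questions above on a Linux node. It assumes the standard /proc/net/dev counter layout; the function names (parse_net_dev, rates_mbit, watch) are just ones I made up for illustration, and it is not a Proxmox tool.

```python
import time


def parse_net_dev(text):
    """Parse the contents of /proc/net/dev into {interface: counters}."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.rsplit(":", 1)
        fields = [int(x) for x in data.split()]
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        # Columns 0-7 are receive counters, columns 8-15 are transmit counters.
        stats[name.strip()] = {
            "rx_bytes": fields[0], "rx_errs": fields[2], "rx_drop": fields[3],
            "tx_bytes": fields[8], "tx_errs": fields[10], "tx_drop": fields[11],
        }
    return stats


def rates_mbit(before, after, seconds):
    """Compare two samples: throughput in Mbit/s plus new errors and drops."""
    out = {}
    for name, b in before.items():
        a = after.get(name)
        if a is None:
            continue  # interface disappeared between samples
        out[name] = {
            "rx_mbit": (a["rx_bytes"] - b["rx_bytes"]) * 8 / seconds / 1e6,
            "tx_mbit": (a["tx_bytes"] - b["tx_bytes"]) * 8 / seconds / 1e6,
            "new_errs": (a["rx_errs"] - b["rx_errs"]) + (a["tx_errs"] - b["tx_errs"]),
            "new_drops": (a["rx_drop"] - b["rx_drop"]) + (a["tx_drop"] - b["tx_drop"]),
        }
    return out


def watch(interval=5, path="/proc/net/dev"):
    """Take two samples `interval` seconds apart and print one line per interface."""
    with open(path) as fh:
        first = parse_net_dev(fh.read())
    time.sleep(interval)
    with open(path) as fh:
        second = parse_net_dev(fh.read())
    for name, r in sorted(rates_mbit(first, second, interval).items()):
        print(f"{name}: rx {r['rx_mbit']:.1f} Mbit/s, tx {r['tx_mbit']:.1f} Mbit/s, "
              f"new errors {r['new_errs']}, new drops {r['new_drops']}")
```

Call watch() on a node while a backup is running and compare the numbers with the link speed of the interface that carries your cluster traffic. Errors or drops that keep climbing, or throughput sitting close to line rate, would point at the network rather than at the cluster software.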
I have a Proxmox network with 14 clusters and 6 external NFS systems for my VM hard disk storage. I have 10 and 40 Gig network cards. My cluster IPs and my NFS IPs do not share any IP address space with my VMs: the VMs, the cluster IPs, and the NFS IPs are each on their own IP network. I do this to keep unneeded network chatter to a minimum. I have never had a node drop out of the cluster, even when pushing 20+ Gig on multiple nodes at the same time while all nodes are running backups.
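
A quick way to sanity-check that kind of separation is a few lines of Python with the standard ipaddress module. This is only a sketch: the subnets below are made-up placeholders, not my real addressing, and find_overlaps is a hypothetical helper name.

```python
import ipaddress
from itertools import combinations


def find_overlaps(networks):
    """Return pairs of network names whose address ranges overlap."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(parsed.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical example subnets; substitute your own.
networks = {
    "vm": "10.10.0.0/16",
    "cluster": "10.20.0.0/24",
    "nfs": "10.30.0.0/24",
}

overlaps = find_overlaps(networks)
if overlaps:
    print("Overlapping networks:", overlaps)
else:
    print("VM, cluster, and NFS networks are separate")
```

An empty result only tells you the address ranges are distinct. Whether the traffic is also separated on the wire depends on your VLANs, bridges, and physical ports.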
North Idaho Tom Jones