PVE cluster nodes frequently go offline

liyk

New Member
Aug 28, 2024
The cluster has only 2 nodes, and recently one of the nodes has frequently gone offline. After restarting the pve-cluster and corosync services, it recovers briefly, but after a while it goes offline again.
1724814446674.jpeg

1724814471379.jpeg
 
It would be helpful to have more information about what happens during those downtimes. As far as I can see from the screenshots, the first one is from the second node, P-proxmox2, and the second one is from the first node, P-proxmox1. Can you ping the second node from the first node and vice versa without any packet loss? Could you post the output of journalctl -u pve-cluster -u pvestatd from when that happens?
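
To make the packet-loss check less of an eyeball job, here is a minimal Python sketch that runs ping and pulls the loss percentage out of its summary line. It assumes the Linux iputils ping output format ("N% packet loss"); the function names are made up for illustration, and the address you pass in should be the other node's cluster address.

```python
import re
import subprocess


def packet_loss_percent(ping_output: str) -> float:
    """Extract the packet-loss percentage from ping's summary line."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found in ping output")
    return float(match.group(1))


def ping_loss(host: str, count: int = 100) -> float:
    """Run ping against host and return the reported loss percentage."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
        check=False,  # ping exits non-zero on loss; we still want the summary
    )
    return packet_loss_percent(result.stdout)
```

Run ping_loss("<address of the other node>") from each node in turn. Anything above 0% on the cluster network is worth investigating before looking further.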

FYI, it is not a good idea to have a cluster with only two nodes, as each node loses quorum as soon as it can no longer reach the other one. You should set up a QDevice if you're not planning to expand your cluster with a third node anytime soon. See here [1] for more information.
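
The arithmetic behind that advice, as a small Python sketch. It assumes one vote per node (plus one for the QDevice) and a plain majority rule, i.e. no special votequorum options such as two_node; the function names are only for illustration.

```python
def quorum_threshold(total_votes: int) -> int:
    """Votes needed for a strict majority of the expected votes."""
    return total_votes // 2 + 1


def is_quorate(total_votes: int, votes_present: int) -> bool:
    """True if the partition holding votes_present still has quorum."""
    return votes_present >= quorum_threshold(total_votes)


# Two nodes, no QDevice: 2 expected votes, majority is 2.
# A node that cannot see its peer holds 1 vote and is not quorate.
print(is_quorate(total_votes=2, votes_present=1))

# Two nodes plus a QDevice: 3 expected votes, majority is 2.
# A node that still reaches the QDevice holds 2 votes and stays quorate.
print(is_quorate(total_votes=3, votes_present=2))
```

With only two votes, any interruption between the nodes costs both of them quorum, which matches the symptom of a node repeatedly showing as offline.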

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_external_vote_support
 
IMO, some things to look at regarding your cluster communications:
- How fast are your physical network interfaces (100 Mbit, 1 Gbit, 10 Gbit, something faster)?
There is a possibility your interfaces are busy moving I/O traffic to/from your VMs, and you don't have the additional network capacity for the cluster nodes to communicate with each other.
- Are you performing backups when the node drops?
- Do you have interface errors and/or packet drops on your physical and virtual Ethernet interfaces (and on your external switch(es))?
- What is your current I/O bandwidth rate when a node drops out of your cluster?
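
For the interface error and drop question, a rough Python sketch that reads the per-interface counters Linux exposes in /proc/net/dev and lists interfaces with non-zero errors or drops. The column positions follow the standard /proc/net/dev layout; the function names are hypothetical. It only covers the host side, so the switch counters still need to be checked separately.

```python
def parse_net_dev(text: str) -> dict:
    """Parse /proc/net/dev content into error/drop counters per interface."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(value) for value in data.split()]
        if len(fields) < 12:
            continue
        counters[name.strip()] = {
            "rx_errs": fields[2],   # receive errors
            "rx_drop": fields[3],   # receive drops
            "tx_errs": fields[10],  # transmit errors
            "tx_drop": fields[11],  # transmit drops
        }
    return counters


def interfaces_with_problems(counters: dict) -> list:
    """Names of interfaces where any error or drop counter is non-zero."""
    return sorted(name for name, c in counters.items() if any(c.values()))
```

On a node, something like interfaces_with_problems(parse_net_dev(open("/proc/net/dev").read())) gives the list. The counters are cumulative since boot, so comparing two readings taken around a node drop is more telling than a single snapshot.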

I have a Proxmox network with 14 clusters and 6 external NFS systems for my VM hard disk storage. I have 10 & 40 Gig network cards. My cluster IPs and my NFS IPs do not share any IP address space with my VMs: my VMs, my cluster IPs and my NFS IPs are each on their own IP network. I do this to keep unwanted and unneeded network chatter to a minimum. I have never had a node drop out of the cluster, even when pushing 20+ Gig on multiple nodes at the same time while all nodes are running backups.
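
A quick way to sanity-check that kind of separation on paper is to confirm that the planned subnets do not overlap. The sketch below uses Python's standard ipaddress module; the role names and CIDR ranges are invented examples, not my actual addressing.

```python
import ipaddress
from itertools import combinations


def overlapping_networks(networks: dict) -> list:
    """Return pairs of role names whose subnets overlap."""
    parsed = {role: ipaddress.ip_network(cidr) for role, cidr in networks.items()}
    return [
        (role_a, role_b)
        for (role_a, net_a), (role_b, net_b) in combinations(parsed.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical plan: one subnet each for VM, cluster and NFS traffic.
plan = {
    "vm": "10.0.10.0/24",
    "cluster": "10.0.20.0/24",
    "nfs": "10.0.30.0/24",
}
print(overlapping_networks(plan))  # an empty list means no overlap
```

Separate subnets alone do not isolate the traffic if they share the same physical link, so this only checks the addressing side of the design.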

North Idaho Tom Jones