Node going offline after setting up VMs to migrate

digipen79

Jan 30, 2025
I have finally gotten to a point where everything in my network is ready for me to set up clustering and HA. I have a cluster up and running with 2 nodes so far, as well as a QDevice. When I go to migrate VMs, however, the node I am moving the VMs to goes offline. Any thoughts on what might be causing this? I have left everything as-is to troubleshoot, but can reboot if necessary.
 
What do you mean by "goes offline"? Does it reboot?

How is the network set up? Do you have a dedicated physical network just for Corosync (PVE cluster communication), or does it share the network used for live migration? Unless a migration network is specified under Datacenter -> Options, the main management network (the host's IP) is used for live migration.
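To illustrate which address ends up carrying migration traffic, here is a minimal sketch in Python. It assumes the Datacenter -> Options setting is stored as the `migration:` line in `/etc/pve/datacenter.cfg` (e.g. `type=secure,network=10.10.10.0/24`); the function name and the example IPs are hypothetical, and raising an error when no host address matches is a choice of this sketch, not necessarily what PVE does.

```python
import ipaddress


def migration_address(host_ips, main_ip, datacenter_migration=None):
    """Return the address a host would use for live migration (simplified model).

    host_ips: all IPs configured on the host (hypothetical example values).
    main_ip: the host's main management IP.
    datacenter_migration: value of the `migration:` option, e.g.
        "type=secure,network=10.10.10.0/24" or "secure,network=10.10.10.0/24".
    """
    if not datacenter_migration:
        # No migration network configured: the main management IP is used.
        return main_ip

    # Parse comma-separated key=value pairs; a bare value is the migration type.
    opts = {}
    for part in datacenter_migration.split(","):
        key, sep, value = part.partition("=")
        if sep:
            opts[key.strip()] = value.strip()
        else:
            opts["type"] = key.strip()

    network = opts.get("network")
    if not network:
        return main_ip

    net = ipaddress.ip_network(network)
    for ip in host_ips:
        # Membership test is simply False for mismatched IPv4/IPv6 versions.
        if ipaddress.ip_address(ip) in net:
            return ip
    # Assumption of this sketch: treat a missing address in that network as an error.
    raise ValueError(f"no host address in migration network {network}")
```

For example, a node with `192.168.1.11` (main LAN) and `10.10.10.11` (node-only network) uses `192.168.1.11` with no migration network set, and `10.10.10.11` once `network=10.10.10.0/24` is configured. If Corosync also runs only over `192.168.1.11`, migration and cluster traffic share that one link.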
 
It was HA specifically that went offline. I have two network connections on each node, both 10G: one to my regular network, which is connected to the Internet, and one dedicated to traffic between the nodes themselves. I was able to get the node back online by rebooting it, and it has been rock solid ever since.
 
How many corosync networks do you have configured?
If it is only one, chances are high that it is the same network used for live migration. In that case, the migration most likely consumed all the available bandwidth, which in turn drove Corosync's latency up to the point where it deemed the connection unusable.
If there is no other connection to fall back to and the situation persists for more than a minute, the node with HA guests on it (Datacenter -> HA shows the LRM service for that node as "active") will fence itself (hard reset). This makes sure the HA guests are definitely off before the remaining nodes, which are hopefully still running, can recover them.
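The self-fencing rule described above can be sketched as a toy model in Python. Assumptions: the timeline is a list of `(seconds, quorate)` samples, losing the Corosync connection is treated as losing quorum, and the grace period is the roughly 60 seconds mentioned above. The actual implementation works differently (it relies on a watchdog that stops being refreshed), so this only illustrates the timing; `should_self_fence` is a hypothetical name.

```python
# Assumption: roughly the "more than a minute" window described in the post.
FENCE_AFTER_SECONDS = 60.0


def should_self_fence(samples, lrm_active, grace=FENCE_AFTER_SECONDS):
    """Return True if a node would fence itself under this simplified model.

    samples: list of (timestamp_seconds, quorate) tuples, sorted by time.
    lrm_active: whether the node's LRM is "active" (it runs HA guests).
    """
    if not lrm_active:
        # A node without active HA guests has nothing to protect, so no fencing.
        return False

    lost_since = None  # when the current quorum loss started, if any
    for ts, quorate in samples:
        if quorate:
            # Connection recovered in time: reset the timer.
            lost_since = None
        elif lost_since is None:
            lost_since = ts
        elif ts - lost_since > grace:
            # Quorum lost for longer than the grace period: hard reset.
            return True
    return False
```

A short hiccup that recovers within the window does not trigger a reset, while a migration that saturates the only Corosync link for more than a minute would, which matches the symptom in the first post.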

Best practice is to have one dedicated physical network just for Corosync, so that other services or tasks cannot congest it, and to have multiple Corosync networks in case one has issues. Corosync will switch between networks by itself.
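How Corosync picks a network can be sketched roughly like this in Python. It assumes knet's default passive link mode and the rule, as we understand the Proxmox docs, that only the usable link with the highest `knet_link_priority` carries traffic, with ties going to the lower link number. The `Link` class and `active_link` function are hypothetical and ignore knet's real health checks.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Link:
    number: int    # Corosync link number (0-7)
    priority: int  # knet_link_priority; a higher value wins
    up: bool       # whether the link is currently considered usable


def active_link(links: List[Link]) -> Optional[Link]:
    """Pick the link that carries Corosync traffic (simplified passive mode)."""
    candidates = [link for link in links if link.up]
    if not candidates:
        # No usable link left: the node loses contact with the cluster.
        return None
    # Highest priority first; on a tie, the lower link number wins.
    return min(candidates, key=lambda link: (-link.priority, link.number))
```

With a dedicated network as link 0 and the main LAN as link 1, traffic stays on link 0 while it is healthy and falls back to link 1 if link 0 is saturated or down, instead of the node losing the cluster entirely.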
You can configure up to 8 networks for Corosync to use.
See https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy on how you can add more networks to Corosync on an existing cluster.
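As a rough illustration of what the admin guide section linked above ends up changing, here is a Python sketch that renders the `nodelist` entries (one `ringN_addr` per link) and the matching `totem` `interface` blocks for `corosync.conf`. The node names and IPs are hypothetical. The real `totem` section also contains other keys (`cluster_name`, `config_version`, etc.) that must be kept, and `config_version` has to be incremented when editing; follow the guide for the actual procedure.

```python
MAX_LINKS = 8  # Corosync supports up to 8 links (link numbers 0-7)


def render_corosync_sections(nodes, num_links):
    """Render nodelist and totem interface blocks for a multi-link setup.

    nodes: list of dicts with 'name', 'nodeid' and 'addrs' (one IP per link).
    Only the interface blocks of 'totem' are rendered; other keys are omitted.
    """
    if not 1 <= num_links <= MAX_LINKS:
        raise ValueError(f"num_links must be between 1 and {MAX_LINKS}")

    lines = ["nodelist {"]
    for node in nodes:
        if len(node["addrs"]) != num_links:
            raise ValueError(f"node {node['name']} needs {num_links} addresses")
        lines += [
            "  node {",
            f"    name: {node['name']}",
            f"    nodeid: {node['nodeid']}",
            "    quorum_votes: 1",
        ]
        # ring0_addr is link 0, ring1_addr is link 1, and so on.
        for link, addr in enumerate(node["addrs"]):
            lines.append(f"    ring{link}_addr: {addr}")
        lines.append("  }")
    lines += ["}", "", "totem {"]
    # One interface block per link, matching the ringN_addr entries above.
    for link in range(num_links):
        lines += ["  interface {", f"    linknumber: {link}", "  }"]
    lines.append("}")
    return "\n".join(lines)


# Hypothetical example: link 0 on the dedicated node network, link 1 on the main LAN.
example = render_corosync_sections(
    [
        {"name": "pve1", "nodeid": 1, "addrs": ["10.10.10.1", "192.168.1.11"]},
        {"name": "pve2", "nodeid": 2, "addrs": ["10.10.10.2", "192.168.1.12"]},
    ],
    num_links=2,
)
print(example)
```

Using the main LAN as the second link costs nothing extra and gives Corosync somewhere to go when the dedicated network is busy or down.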