Server going offline, but VMs not migrating

millenium7

Apr 21, 2025
I have a 3-node Proxmox cluster set up with HA using ZFS replication. Everything is automatically backed up, replication is set up for all VMs, and everything works as expected... most of the time. For example, ripping the power cord out of one node will eventually migrate its VMs, and if I shut one down, everything migrates.
However, one node in particular (let's call it NodeB) is a little less reliable and occasionally hard locks. The other 2 nodes will mark it as either offline or unknown, yet all VMs that were running on NodeB stay there, so they are completely offline and unreachable.

NodeB does have Intel vPro, so I can log in to it (often the screen is just hard locked and completely unresponsive) and can also power cycle it, which brings it back online. But that defeats the whole 'High Availability' aspect if the VMs are unavailable until I do so.

What else can I do here to improve reliability? If the Proxmox service is unresponsive, I want the node marked as down and everything migrated. I can deal with the node going offline once every few months; I can't deal with having to wait until someone reports a VM offline before I realize and have to intervene manually.
What mechanisms does Proxmox use to tell whether a node is actually dead? Could the physical interface status (which would stay up for vPro to work) be causing it to get stuck and not consider the node truly down? Does it poll the Proxmox service? Is it a simple ping?
 