VM & LXC availability through a dirty host shutdown?

Nicoloks

Apr 26, 2024
Hi All,

How long should I expect a PVE cluster to take to bring up VM and LXC instances on other hosts when the host they are currently on has a dirty shutdown (power cord pulled, etc.)?

I've set up a 3-node cluster with Ceph, using identical hardware on each node, for testing. The PVE cluster network is only 1Gbps, but the Ceph network is 10Gbps. I have an HA group set up with the HA shutdown policy set to migrate. All running VM and LXC instances are HA members of this group with a requested state of running.

When I gracefully shut down or reboot a host, all the VM and LXC instances hosted on it are seamlessly migrated without dropping a single packet from a continuous ping run from outside the cluster. When I test a host outage by pulling the power cord, it seems to take a long time (as in minutes, though I've yet to time it exactly) for the VM and LXC instances to be brought back online.
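Since the outage has not been timed exactly yet, here is a minimal sketch of one way to measure it. It assumes you already have a list of probe results as (timestamp in seconds, success) pairs, for example from one ping per second run outside the cluster. The function name longest_outage and the sample format are made up for illustration and are not part of any Proxmox tool.

```python
from typing import Iterable, Optional, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest outage in seconds.

    An outage is measured from the last successful probe before a run of
    failures to the first successful probe after it. A run of failures that
    never recovers (or that starts before any success) is not counted.
    """
    longest = 0.0
    last_ok: Optional[float] = None  # timestamp of the most recent success
    in_outage = False
    for timestamp, ok in samples:
        if ok:
            if in_outage and last_ok is not None:
                longest = max(longest, timestamp - last_ok)
            last_ok = timestamp
            in_outage = False
        else:
            in_outage = True
    return longest


if __name__ == "__main__":
    # Hypothetical probe log: the guest stops answering at t=2 and is back at t=4.
    probes = [(0.0, True), (1.0, True), (2.0, False), (3.0, False), (4.0, True)]
    print(f"longest outage: {longest_outage(probes):.1f} s")
```

The resolution is limited by the probe interval, so with one probe per second the result is only accurate to about a second.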

Is this to be expected from this sort of host outage, or are there configuration steps I need to be taking to ensure this downtime is minimised?
 
It seems like that is roughly normal, depending on the hardware it is being read from; if you have it on NVMe it might be longer than usual. But after a power cut, anything that comes right back on in a timely manner is always a good thing. Personally I have never timed mine; they were always up by the time I finished booting.

The worst I have had after a power cut or other issues was Proxmox taking hours to come back online, but that was a configuration issue with drives that for some reason made it take 3-4 hours to boot after an unclean shutdown unless I restarted the externals, etc.

You could look in your BIOS for settings that shorten boot times and maybe cut a few seconds off. You can also check systemd-analyze critical-chain and systemd-analyze blame for anything causing your system to take longer than usual. Proxmox is usually pretty good in general.

Of course, you can also make general optimizations to the LXCs to reduce their boot time. (I usually set static IPs on my LXCs and VMs, which seems to help in general.)

There's also the thread "Slow boot times with lxc", where someone states their LXCs take minutes to boot and there's a fix that brings it down to about 10 seconds. It turns out it's a network issue, in those cases at least.
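To pick out the slowest units from systemd-analyze blame, a small sketch like the one below could help. It assumes the output has been saved as text and that each line has the usual "time tokens, then unit name" layout (for example "1min 30.012s apt-daily.service"). The helper names parse_blame_line and slowest_units are invented for this example, and the sample lines in the demo are made up.

```python
import re
from typing import Iterable, List, Tuple

# Seconds per time suffix; covers the suffixes this sketch expects to see.
_UNIT_SECONDS = {"h": 3600.0, "min": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6}
_TOKEN = re.compile(r"^(\d+(?:\.\d+)?)(h|min|ms|us|s)$")


def parse_blame_line(line: str) -> Tuple[float, str]:
    """Split one blame line into (startup time in seconds, unit name)."""
    parts = line.split()
    total = 0.0
    i = 0
    # Consume leading time tokens; the final token is always kept as the name.
    while i < len(parts) - 1:
        match = _TOKEN.match(parts[i])
        if match is None:
            break
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        i += 1
    return total, " ".join(parts[i:])


def slowest_units(lines: Iterable[str], count: int = 5) -> List[Tuple[float, str]]:
    """Return the `count` slowest units, slowest first."""
    parsed = [parse_blame_line(line) for line in lines if line.strip()]
    return sorted(parsed, key=lambda item: item[0], reverse=True)[:count]


if __name__ == "__main__":
    # Made-up sample lines in the assumed format.
    sample = ["1min 2s c.service", "5.432s b.service", "  523ms a.service"]
    for seconds, unit in slowest_units(sample, 2):
        print(f"{seconds:8.3f} s  {unit}")
```

Keep in mind that a unit with a long startup time is not necessarily delaying the boot, since units start in parallel; systemd-analyze critical-chain is the better view for that.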
 
Is this to be expected from this sort of host outage,
Yes. The remaining hosts wait a minute for the lost node to come back - perhaps it was just a glitch on a network connection and that node is actually still running. Who knows?

After (up to) ~two minutes HA-VMs should start on another host. There is a detailed description by a staff member somewhere in this forum...

I've setup a 3 node cluster with ceph
This is the absolute minimum and you are degraded (and stay degraded) as soon as a single node dies.

I've dropped Ceph, but some of my notes are here: https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/
 
Like UdoB wrote, a migration as part of a regular shutdown and an HA failover are two very different things. HA failover needs to wait for the watchdog to expire (this takes roughly 60 s) before the failed node is considered fenced, and then it takes a little while to handle the recovery, so yes, something like 2 minutes is not unexpected.
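As a rough back-of-the-envelope check, the timing described above can be written down as a sum. The sketch below takes the ~60 s fencing wait from this thread; the recovery overhead range of 10 to 60 s is purely a placeholder assumption chosen so that the upper bound matches the "about 2 minutes" mentioned here, not a documented Proxmox value. The function names are invented for illustration.

```python
from typing import Tuple


def expected_failover_window(
    fence_wait_s: float = 60.0,                               # from the thread: roughly 60 s
    recovery_overhead_s: Tuple[float, float] = (10.0, 60.0),  # assumed placeholder range
) -> Tuple[float, float]:
    """Return a (low, high) estimate in seconds for HA failover downtime."""
    low = fence_wait_s + recovery_overhead_s[0]
    high = fence_wait_s + recovery_overhead_s[1]
    return low, high


def within_expected(measured_s: float, window: Tuple[float, float]) -> bool:
    """True if a measured outage does not exceed the upper estimate."""
    return measured_s <= window[1]


if __name__ == "__main__":
    window = expected_failover_window()
    print(f"expected downtime: {window[0]:.0f} to {window[1]:.0f} s")
    print("95 s within expectation:", within_expected(95.0, window))
```

If a measured outage is well above the upper estimate, that would be a hint to look at guest boot time or storage recovery rather than at the HA timing itself.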