HA - howto's

asm0dai · Mar 14, 2023

Hello everyone.

I need some advice on this subject - I fear I don't fully understand how HA works in Proxmox.
What has been done:
1) I have 4 servers (nodes) configured as a cluster;
2) some SATA SSDs on them are configured as Ceph storage;
3) Ceph is running and monitoring the nodes;
4) there is one test VM (Windows-based) running on one of the nodes.

What I hope to achieve (algorithm): when the node running the VM goes down (shut down "normally" or powered off due to a power outage or anything else that could be described as an "alarm situation"), the VM should be migrated to another node while keeping its running state. Is that possible?

From my tests I can see that the VM is automatically migrated from the shut-down node to a working one, but the VM is reset/rebooted after the migration (not "kept alive", if you wish). "HA settings" in "Datacenter" is set to "shutdown_policy=migrate".

I suspect that this goal may be achieved by combining HA with replication of the VM in question to the node it migrates to, but I am not sure, and the documentation examples aren't very clear.

Could someone help me?

aaron · Mar 14, 2023

As you already discovered, the normal shutdown/reboot case works as you expect with the shutdown policy set to migrate.

In the case where a node fails, or loses the corosync connection to the rest of the cluster, the guest will be down for a short time. About 2 minutes after the failed node was last seen, the remaining nodes in the cluster will recover the HA guests that were on the failed node.
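To make the timing concrete, here is a rough Python sketch of that timeline. The 120-second delay is only the "about 2 minutes" mentioned above, not an exact Proxmox constant, and the function names and the guest boot time are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: "about 2 minutes" as described in the post, not an exact Proxmox value.
RECOVERY_DELAY = timedelta(seconds=120)


def earliest_recovery(last_seen: datetime, delay: timedelta = RECOVERY_DELAY) -> datetime:
    """Earliest time the remaining nodes would start recovering the HA guests."""
    return last_seen + delay


def estimated_downtime(delay: timedelta, guest_boot: timedelta) -> timedelta:
    """Rough downtime: the recovery delay plus the guest's own boot time."""
    return delay + guest_boot


if __name__ == "__main__":
    last_seen = datetime(2023, 3, 14, 12, 0, 0)  # hypothetical timestamp
    print("Recovery starts no earlier than:", earliest_recovery(last_seen))
    # Hypothetical 45 s boot time for the guest OS.
    print("Estimated downtime:", estimated_downtime(RECOVERY_DELAY, timedelta(seconds=45)))
```

The point is simply that the guest is started fresh on another node after the delay, so its downtime is the delay plus however long the guest takes to boot.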

If you need even more uptime, check whether you can achieve it at the application level.

B.Otto · Mar 14, 2023

In general, HA means that in a fault scenario the VM will get rebooted on another node in the cluster. This requires the VM to be on shared storage, like Ceph. This does mean a small downtime (2-3 minutes), and the VM will be rebooted, losing its running state.

Keeping the running state alive would require the contents of RAM to be mirrored in real time, and that is quite a hassle (just compare the latency of CPU<>RAM to the latency of basic networking).
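As a back-of-the-envelope illustration of that latency gap, here is a small Python sketch. Both latency figures are order-of-magnitude assumptions, not measurements, and the model (every mirrored write waits for one network round trip) is deliberately naive; real fault-tolerance implementations work differently.

```python
# Order-of-magnitude assumptions for illustration only; real values depend on hardware.
RAM_ACCESS_S = 100e-9    # assumed ~100 ns for a main-memory access
NETWORK_RTT_S = 100e-6   # assumed ~100 microseconds round trip on a fast local network


def slowdown_factor(local_latency_s: float, remote_latency_s: float) -> float:
    """How many times slower a write becomes if it must also wait for a remote copy."""
    return (local_latency_s + remote_latency_s) / local_latency_s


if __name__ == "__main__":
    factor = slowdown_factor(RAM_ACCESS_S, NETWORK_RTT_S)
    print(f"A synchronously mirrored write would be roughly {factor:.0f}x slower")
```

With these assumed numbers the network round trip dominates completely, which is the reason mirroring RAM in real time is so costly.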

Do note that this definition of HA is how it is commonly used in IT and is not Proxmox-exclusive. For example, what VMware calls 'High-Availability' is exactly this: rebooting the VM on failure. There is a VMware feature called 'Fault-Tolerance', but that is rarely used because of its downsides - and the expensive license needed for it.
