HA - howto's

asm0dai

New Member
Mar 14, 2023
Hello everyone.

I need some advice on the subject - I fear I don't fully understand how HA works in Proxmox.
What has been done:
1) I have 4 servers (nodes) configured as a cluster;
2) some SATA SSDs on them are configured as Ceph storage;
3) Ceph is running and monitoring nodes;
4) there is one test VM (Windows-based) running on one of the nodes.

What I hope to achieve (algorithm): when the node running the VM goes down (shut down "normally", or powered off due to a power outage or anything else that can be described as an "alarm situation"), the VM should be migrated to another node while keeping its running state. Is that possible?

From my tests I can see that the VM is automatically migrated from the shut-down node to a working one, but the VM is reset/rebooted after the migration (not "kept alive", if you wish). "HA Settings" in "Datacenter" is set to "shutdown_policy=migrate".

I suspect that this goal may be achieved by combining HA with replication of the VM in question to the node it migrates to, but I am not sure, and the documentation examples aren't very clear.

Could someone help me?
 
As you already discovered, the normal shutdown/reboot case works as you expect with the shutdown policy set to migrate.

In the case where a node fails or loses its corosync connection to the rest of the cluster, the guest will be down for a short time. About 2 minutes after the failed node was last seen, the remaining nodes in the cluster will recover the HA guests that were running on it.
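To make the timing above concrete, here is a minimal Python sketch of the failure timeline. The 2-minute delay is taken from the estimate in this reply and is an approximation, not an exact Proxmox constant; the function names and state labels are hypothetical and only illustrate the sequence.

```python
from datetime import datetime, timedelta

# Assumption: "about 2 minutes", as quoted in the reply above.
# This is an approximation, not a value read from Proxmox itself.
RECOVERY_DELAY = timedelta(minutes=2)


def earliest_recovery(last_seen: datetime,
                      delay: timedelta = RECOVERY_DELAY) -> datetime:
    """Earliest time the surviving nodes would start recovering HA guests."""
    return last_seen + delay


def guest_state(now: datetime, last_seen: datetime,
                delay: timedelta = RECOVERY_DELAY) -> str:
    """Rough, illustrative state of an HA guest that ran on the failed node."""
    if now <= last_seen:
        return "running"
    if now < earliest_recovery(last_seen, delay):
        return "down, waiting for recovery"
    # The guest is started from scratch elsewhere; its RAM contents are lost.
    return "recovery started (fresh boot on another node)"
```

For a node last seen at 12:00:00, `earliest_recovery` returns 12:02:00, and a check at 12:01:00 reports the guest as down. The point of the sketch is that there is always a window in which the guest is not running.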

If you need even more uptime, check whether you can achieve it at the application level.
 
In general, HA means that in a fault scenario the VM will get rebooted on another node in the cluster. This requires the VM to be on shared storage, like Ceph. It does mean a small downtime (2-3 minutes), and the VM will be rebooted, losing its running state.

Keeping the running state alive would require the contents of the RAM to be mirrored to another node in real time, and that is quite a hassle (just compare the latency of CPU<>RAM to the latency of basic networking).
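A rough back-of-the-envelope calculation shows why. The figures below are order-of-magnitude assumptions chosen for illustration (about 100 ns for a RAM access, about 0.1 ms for a LAN round trip), not measurements from any particular cluster:

```python
# Illustrative order-of-magnitude figures (assumptions, not measurements).
RAM_LATENCY_NS = 100        # roughly one DRAM access
NETWORK_RTT_NS = 100_000    # roughly one round trip on a fast LAN (0.1 ms)


def slowdown_factor(local_ns: float, remote_ns: float) -> float:
    """How many times slower a remote round trip is than a local access."""
    return remote_ns / local_ns


def max_sync_updates_per_second(rtt_ns: float) -> float:
    """Upper bound on strictly serialized updates if each one
    has to wait for a full network round trip before continuing."""
    return 1_000_000_000 / rtt_ns


print(slowdown_factor(RAM_LATENCY_NS, NETWORK_RTT_NS))   # 1000.0
print(max_sync_updates_per_second(NETWORK_RTT_NS))       # 10000.0
```

With these assumed numbers, every memory update that has to be confirmed by a second node costs about a thousand times more than the local write, which is why real-time RAM mirroring carries such a heavy performance penalty.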

Do note that this definition of HA is how the term is commonly used in IT and is not Proxmox-exclusive. For example, what VMware calls 'High Availability' is exactly this: rebooting the VM on failure. There is a VMware feature called 'Fault Tolerance', but it is rarely used because of its downsides and the expensive license it requires.
 
