[SOLVED] Understanding HA

dura-zell

New Member
Jan 24, 2024
Hi

I'm trying to understand HA and if / how it fits my needs. To learn more, I set up a testbed consisting of 3 nodes. I don't have any real servers lying around, so for testing some SFF desktops need to suffice (an HP ProDesk 400 G3 and a Dell OptiPlex with identical specs, each with two drives: one for Proxmox, one for Ceph). Unfortunately I can't fit additional network cards into these devices, but for testing the single 1 Gbit interface should work (at least I think so).

I successfully configured Ceph for shared storage and installed some VMs. Networking is done via SDN and works as expected. One of the VMs is an OPNsense for connectivity to the outside world, the others are a typical small Windows domain (2 DCs, 1 generic server, 1 client). I did not test any Linux VMs yet, but I don't expect any issues / differences there.

I can migrate VMs without issue. Also, when I command a node to shut down, it migrates all VMs to other nodes as expected.

What I don't understand:
I simulated a node failing by unplugging the network cable. It quickly showed as offline in the web GUI.
Now either another node should simply take over (if RAM is synced over the net), or at least the VMs should have been restarted on another node.
But in fact: after several minutes, nothing happened. I replugged the node and everything went back to normal. I could tolerate the VMs rebooting somewhere else, but several minutes of downtime and only coming back after manual intervention (= fixing the fault) is a bit too long for me.

I think I did something wrong, but neither the docs nor Google gave me a hint. Did I overlook some config, did I misunderstand Proxmox HA entirely, or is it something else?

Not sure which info about the system might help - just let me know, I'll happily provide anything needed.

Greetings,
Dura
 
Did you configure some guests as HA? They are not automatically HA guests.

HA works roughly like this:
Nodes with HA guests running on them report their presence to the PVE cluster every 10 seconds.
If a node dies or its network connection is broken, the remaining cluster waits for some time (~2 minutes) in case the node comes back (e.g. a broken network).

If the node does not come back, the HA guests that used to run on the failed node will be recovered (started) on the remaining cluster. For that, all resources for the guest need to be available - mainly disk images, but passed-through devices also need to be considered if it is supposed to work.
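To make the first point above concrete: guests only become HA resources after you add them to the HA stack, either via Datacenter -> HA in the GUI or on the CLI. A minimal sketch (the VMIDs are examples):

```shell
# Register guests as HA resources; "--state started" means the HA stack
# should keep them running and recover them after a node failure.
ha-manager add vm:100 --state started
ha-manager add vm:101 --state started

# Show the manager/LRM state and the configured resources
ha-manager status
```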

To get the disks available on all nodes, you have several options. Shared storage on the network (e.g. NFS, Samba) or Ceph are good options, but local ZFS + replication can also be used. In the latter case, since the replication is asynchronous, you might see some data loss depending on how long ago the last successful replication was.
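As a hedged sketch of the ZFS + replication variant (the VMID, target node name and schedule are made-up examples), a replication job can be set up with pvesr:

```shell
# Replicate the local ZFS disks of VM 100 to node "pve2" every 15 minutes.
# Anything written after the last successful run is lost on failover.
pvesr create-local-job 100-0 pve2 --schedule '*/15'

# Check job state and the time of the last successful sync
pvesr status
```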


Once you have HA guests running on a node (Datacenter -> HA shows the node as "active"), a stable corosync connection is important!

Therefore, it is best practice to have at least one dedicated network just for Corosync alone. It is used for the PVE cluster communication and is involved in the nodes reporting that they are alive. You can configure additional Corosync links initially or later. Corosync can switch between networks and will do so if one becomes unusable.
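For illustration, a node entry with two Corosync links could look like this in /etc/pve/corosync.conf (the addresses are made up; link 0 on a dedicated Corosync NIC, link 1 on the shared LAN as fallback):

```
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1    # dedicated Corosync network
    ring1_addr: 192.168.1.11  # fallback over the shared network
  }
}
```

Keep in mind that when editing this file by hand, the config_version in the totem section has to be increased for the change to be applied.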

Be prepared for some unexpected reboots of your nodes if you set the guests to HA while everything goes over a single Gbit wire. The storage (Ceph, backup target, ...) can easily congest that connection, and the latency for Corosync will go up - maybe to the point where it deems the connection unusable.

If that happens, the node will wait for about one minute to see if it can reconnect to the quorate part of the cluster. If so, everything is fine. If not, it will fence itself. That is basically the same as pushing the reset button on the machine.
It does so to make sure that the HA guests are definitely powered off before the (hopefully) remaining rest of the cluster recovers them.

If all (or the majority of) the nodes in the cluster experience the same problem, it will look like the cluster rebooted out of the blue. A typical situation is when Corosync has only one network configured and shares the physical network with other services that can take up a lot of bandwidth and congest it - anything storage- or backup-related, for example.

Now either another node should simply take over (if RAM is synced over the net)
Running a VM in lockstep on two different nodes is currently not possible. The feature in KVM is called COLO and is not yet production-ready. And once it is, using it will most likely come with some hard requirements for a fast, low-latency network connection between the two nodes, so as not to slow down the VM too much, as a lot of state needs to be synced all the time.


I hope that explains it better :)
 
Hi Aaron

Thanks for this extensive explanation :-)
My VMs are configured as HA. I defined an HA group, added all nodes and then configured each VM as HA. The HA policy is set to "migrate", which is fine for planned shutdowns/reboots. Every status page in the UI and some commands in the shell are showing "green".

I know about the requirements for a fast and reliable network. Actually, I was very surprised how smoothly the VMs worked despite having only a single 1G link for "everything". As said - this is only for testing, and I can't fit any more NICs into these computers. For the "real thing" I'm planning on proper servers (probably DL380 Gen10 or equivalent from Dell). I'm also monitoring the switch - most of the time the interfaces are below the 200 Mbit mark, with peaks up to 600 Mbit during OS installation or while migrating.

I will test this again tomorrow - maybe I wasn't patient enough. And at least I have confirmation that RAM is not synced. Is there a "knob" to decrease the time the cluster waits before starting the VM on a working node?
Also: without a fencing device (something I didn't even know existed before playing around with Proxmox): how does the cluster react when the node comes back?

Greetings,
Dura
 
maybe I wasn't patient enough.
Seconds become long when you wait for them to pass ;)

Is there a "knob" to decrease the time the cluster waits before starting the VM on a working node?
No, these values are hard-coded.

Also: Without a fencing device (something I didn't even know existed before playing around with proxmox): How does the cluster react when the node comes back?
The node will rejoin the cluster, and the VMs might be migrated back to it. The HA group's "restricted" and "nofailback" options are how you can manually control that behavior.

Just for completeness' sake: HA groups are not necessary, but they are very useful if the behavior needs to be controlled in more detail, for example when you have resources for a VM that are only available on select nodes.

HA groups with node selection and the "restricted" checkbox can also be used to make sure that two VMs never run on the same node. Useful if you do HA on the application layer.
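A hedged CLI sketch of such a group (the group and node names are examples):

```shell
# Group restricted to pve1/pve2; pve1 has the higher priority.
# "nofailback" keeps a recovered guest where it is instead of
# migrating it back once the preferred node returns.
ha-manager groupadd prefer-pve1 --nodes "pve1:2,pve2:1" --restricted 1 --nofailback 1

# Assign an existing HA resource to the group
ha-manager set vm:100 --group prefer-pve1
```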
 
Hi all

thanks again for the explanation - really appreciated.

I did test again and HA behaved as expected - either I was not patient enough (yes: seconds can become minutes when waiting for something like this :D ) or something I tried after my initial tests made it work. Either way, it works and I like it very much.
About the HA groups: I tried it with them, as the "real" setup would involve 2 clusters with 3 nodes each (to have two completely separate environments). Planning is not complete, but I'd rather manage this via a single datacenter than via two of them.

tempacc346235: Thanks - I have not yet had time to get into the guts of everything. I think I will play a bit with my testbed, but on a production system I'd rather not modify the code, as it is likely to be overwritten during updates (or to break after updates).

Not sure if there is a button for this but I think this thread can be marked as solved :-)

Greetings,
 
Not sure if there is a button for this but I think this thread can be marked as solved :)
You can edit the first post and set the title prefix :)
 