[SOLVED] Optimum Method to Test Proxmox HA

Nov 11, 2020
Good morning all,
I have set up my Proxmox 6.2-12 nodes and enabled high availability for my VMs and containers, and now I would like to test the HA functionality. I tried rebooting a node but quickly found that this does not trigger a failover: because the shutdown is graceful, the VMs remain in a frozen state. Short of pulling the power to the servers, is there a more elegant and controlled way of testing HA?

Ideally I'd like to trigger the HA functionality and watch Proxmox move the VMs to another node, and then, when I trigger a reverse failover, see them move back to the original node.

Best wishes,
Sean
 
Right, so it seems to me that this behaviour can be controlled in /etc/pve/datacenter.cfg. Adding
Code:
ha: shutdown_policy=failover
should cause the desired behaviour. I will test this and mark the thread solved if this is indeed the case.
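For anyone who wants to check which policy a node would currently use, here is a minimal Python sketch that reads the `ha:` line out of datacenter.cfg-style text. The function name `read_shutdown_policy` is made up for illustration, and the fallback to `conditional` is an assumption based on the default mentioned later in this thread; it is not an official Proxmox tool.

```python
def read_shutdown_policy(cfg_text: str, default: str = "conditional") -> str:
    """Return the HA shutdown policy found in datacenter.cfg-style text.

    Hypothetical helper: falls back to `default` when no `ha:` line sets
    shutdown_policy (assumed default: "conditional").
    """
    for raw_line in cfg_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        if not sep or key.strip() != "ha":
            continue  # not the HA settings line
        # The value is treated as a comma-separated list of key=value properties.
        for prop in value.split(","):
            name, eq, setting = prop.partition("=")
            if eq and name.strip() == "shutdown_policy":
                return setting.strip()
    return default


if __name__ == "__main__":
    sample = "keyboard: en-us\nha: shutdown_policy=failover\n"
    print(read_shutdown_policy(sample))
```

On a real node you would pass in the contents of /etc/pve/datacenter.cfg instead of the sample string.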
 
You could also just pull the network cable that is used for the cluster communication (Corosync). That way you should also see the node fence (hard reset) itself after about 2 minutes.
 
Nice idea, Aaron. Is there a way of setting that watchdog timer lower?
No, the timers are hard coded. The HA process works roughly like this:
  1. a node loses contact with the quorate part of the corosync cluster
  2. if it cannot reconnect within 2 minutes, it will fence itself if it has (or has had since the last boot) HA guests running on it
  3. after another minute (~3 minutes after the node lost contact with the quorate part of the cluster) the HA guests will be started on the remaining nodes
This is also why it is important to have a dedicated low-latency network for corosync if you use HA. If, for example, corosync shares the same physical network with backup traffic, the backups can congest the network and push the corosync latency up to the point where corosync considers the connection lost. The worst-case scenario is that this happens on all nodes in the cluster at once and the whole cluster restarts.

If you need to run services that cannot be down for a few minutes, you need to configure HA at the application/service level.
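The timing in the numbered steps above can be sketched as a small Python model. This is only an illustration of the approximate figures quoted in this thread (about 2 minutes until fencing, about 1 more minute until recovery); the names `FENCE_DELAY_S`, `RECOVERY_DELAY_S`, and `estimate_timeline` are hypothetical and are not taken from the Proxmox code.

```python
from dataclasses import dataclass
from typing import Optional

# Approximate delays as described in this thread (assumptions, in seconds).
FENCE_DELAY_S = 120     # node fences itself if it cannot rejoin within ~2 minutes
RECOVERY_DELAY_S = 60   # HA guests are started elsewhere ~1 minute after that


@dataclass
class FailoverTimeline:
    fence_at: Optional[int]      # seconds; None if the node does not fence
    recovery_at: Optional[int]   # seconds; None if no recovery is triggered


def estimate_timeline(lost_quorum_at: int,
                      has_ha_guests: bool,
                      reconnect_after: Optional[int] = None) -> FailoverTimeline:
    """Estimate when a node fences and when its HA guests are recovered."""
    # A node without HA guests (now or since the last boot) does not fence itself.
    if not has_ha_guests:
        return FailoverTimeline(None, None)
    # Reconnecting before the fence delay expires avoids fencing.
    if reconnect_after is not None and reconnect_after < FENCE_DELAY_S:
        return FailoverTimeline(None, None)
    fence_at = lost_quorum_at + FENCE_DELAY_S
    return FailoverTimeline(fence_at, fence_at + RECOVERY_DELAY_S)


if __name__ == "__main__":
    print(estimate_timeline(lost_quorum_at=0, has_ha_guests=True))
```

With the cable pulled at t=0, this model puts the fence at roughly t=120 s and the guests starting on other nodes at roughly t=180 s, which matches the "~3 minutes" figure above.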
 
Thanks Aaron, that is very clear. I have a separate fast network for storage/backup traffic so the corosync/data network is never congested. To summarize this for posterity (please correct me if I am wrong):
1. A graceful reboot does not automatically fail over HA guests, as the ha shutdown policy in /etc/pve/datacenter.cfg is set to conditional by default
2. Setting the ha shutdown policy in /etc/pve/datacenter.cfg to failover will cause HA guests to be restarted on another node in a poweroff event
3. Loss of the corosync network connection will cause the node to fence itself by rebooting, and Proxmox will start the HA services on another node
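The three points above can also be written down as a small lookup, which may be handy as a checklist when running the tests. This is a sketch of the behaviour as summarized in this thread only; the function `expected_outcome` and the event names are made up for illustration, and shutdown policies other than `conditional` and `failover` are deliberately not covered.

```python
def expected_outcome(event: str, shutdown_policy: str = "conditional") -> str:
    """Return the behaviour this thread describes for a given test event.

    Hypothetical helper; `event` is "graceful_reboot" or "corosync_loss".
    """
    if event == "corosync_loss":
        # Independent of the shutdown policy: the node fences itself.
        return "node fences itself; HA guests start on another node"
    if event == "graceful_reboot":
        if shutdown_policy == "conditional":
            return "HA guests stay frozen until the node is back"
        if shutdown_policy == "failover":
            return "HA guests are recovered on another node"
        raise ValueError(f"policy not covered in this thread: {shutdown_policy}")
    raise ValueError(f"unknown event: {event}")


if __name__ == "__main__":
    for policy in ("conditional", "failover"):
        print(policy, "->", expected_outcome("graceful_reboot", policy))
    print("cable pulled ->", expected_outcome("corosync_loss"))
```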

Best wishes,
Sean
 
I have a separate fast network for storage/backup traffic so the corosync/data network is never congested.
Good. Just to be clear, our official recommendation is to have at least one physical network dedicated only to corosync, to avoid any surprises if one of the other services takes up the whole bandwidth for some reason. Without HA this might be a nuisance, but with HA enabled it can become problematic very quickly.


Your summary is accurate AFAICT. You can set the different shutdown policies [0] in the GUI in the Datacenter->Options panel in the "HA settings" entry.


[0] https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#ha_manager_shutdown_policy
 
