Node down, HA VMs stuck in Freeze state.

iccbroadcast · Jul 22, 2021

Hi forum.

I think I have run into a problem (I wouldn't call it a bug): I have a cluster (6.4) hosted at the OVH service provider. So far, so good.
Today they are doing maintenance on one of the nodes. No problem, the cluster has proven to me that it can handle that... well, no, there is a problem after all: the cluster itself is OK, but the HA VMs on the affected node do not recover on another node. Instead, they go into 'Freeze' state...

It is OK that rebooting a node does not move VMs around, since the node is going to be back online soon. I LIKE this feature, it is useful, but... what if something goes wrong during the reboot?
In this case, OVH reboots the hardware servers into a 'Maintenance' image for the duration of the maintenance (an electrical maintenance this time), so the last thing the node does within the cluster is a reboot... but it never returns from it.

I mean... the feature is OK. I wouldn't want to be unable to reboot a node because of all the unnecessary VM movement that would trigger, but I guess it would also be OK to consider the node completely lost after a certain amount of time. It's not uncommon for a reboot to end with a system that doesn't come up again: a faulty PSU, some MBR corruption, etc. I think this feature, without finer-grained control, can lead to very, very nasty situations:
I would like to be able to state that a reboot taking longer than 10 minutes (for example) means the node can be considered 'down'.
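
To make the idea concrete, here is a rough sketch of the rule I have in mind. It is purely illustrative: it is not how the Proxmox HA manager works, and `REBOOT_GRACE` and `node_state` are made-up names, not real Proxmox settings or APIs.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: how long a rebooting node may stay silent
# before it is treated as down. Not a real Proxmox option.
REBOOT_GRACE = timedelta(minutes=10)


def node_state(last_seen: datetime, now: datetime,
               grace: timedelta = REBOOT_GRACE) -> str:
    """Classify a node that announced a reboot, based on its last timestamp."""
    if now - last_seen <= grace:
        return "rebooting"  # still inside the grace period: keep VMs frozen
    return "down"           # grace period expired: recover VMs elsewhere


now = datetime(2021, 7, 22, 12, 0)
print(node_state(now - timedelta(minutes=3), now))
print(node_state(now - timedelta(minutes=90), now))
```

With a 10 minute grace period, a node last seen 3 minutes ago would still count as rebooting, while one last seen an hour and a half ago (my case) would count as down.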

Is there any way to 'unfreeze' the VMs 'elegantly'? I mean, without some hacky maneuver that puts the cluster's integrity at risk?

UPDATE:
The cluster status reports the node status as: old timestamp - dead? with a timestamp almost an hour and a half old... why doesn't it realize the node IS dead? This time I'm starting to feel the HA/Cluster feature is not working as it should (and as it did in the past).

Thanks.
Regards.

aaron · Jul 22, 2021

How is the Failover policy configured in the cluster? (Datacenter -> Options -> HA Settings).

For your use case you might want to set it to Failover, see https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_node_maintenance

iccbroadcast · Jul 22, 2021

Hi!

The intervention on the node has ended, and it rejoined the cluster flawlessly. However, I'm still worried about this, and it is an opportunity to learn and do better:

Currently Datacenter -> Options -> HA Settings is set to 'default'.

I need to look into the link you pointed me to...

Thank you!

iccbroadcast · Jul 22, 2021

After reading that page, I still have some questions about the 'Failover policy':

- Do the migrated VMs eventually go back to the original node after the downtime?
- What does 'soon' mean in this context? My home-lab Proxmox install on an ITX mobo boots up fairly quickly, whereas our OVH dedicated servers take several minutes to complete a reboot.

Thanks!
