Node down, HA VMs stuck in Freeze state.

Hi forum.

I think I have run into a problem (I wouldn't call it a bug): I have a cluster (6.4) hosted at the OVH service provider. So far, so good.
Today they are doing maintenance on one of the nodes. No problem, I thought: the cluster has proven to me that it can handle that. Well, no, there is a problem after all: the cluster itself is OK, but the HA VMs on the affected node do not recover on another node; instead, they go into 'Freeze' mode.

It is OK that rebooting a node does not move VMs around, since the node is going to come back online soon. I LIKE this feature, it is useful, but what if something goes wrong during the reboot?
In this case, OVH reboots the hardware servers into a 'Maintenance' image for the duration of the maintenance (here, electrical maintenance). So the last thing the node does within the cluster is a reboot, but it never returns from it.

I mean, the feature is OK: I wouldn't want to be unable to reboot a node because of all the unnecessary VM movement that would trigger. But I think it would also be reasonable to consider the node completely lost after a certain amount of time. It's not uncommon for a reboot to leave a system that doesn't come up again: a faulty PSU, some MBR corruption, etc. I think this feature, without finer-grained control, can lead to very nasty situations.
I would like to be able to state that a reboot taking longer than 10 minutes (for example) means the node can be considered 'down'.
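To make the idea concrete, here is a minimal sketch in plain Python of the rule I have in mind. This is not Proxmox code and not how the HA manager actually decides; the names `REBOOT_GRACE` and `node_state` are made up for illustration, and the 10 minutes is just my example value.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: how long a rebooting node may stay silent
# before it should be treated as lost (10 minutes is only an example).
REBOOT_GRACE = timedelta(minutes=10)


def node_state(last_seen: datetime, now: datetime,
               grace: timedelta = REBOOT_GRACE) -> str:
    """Return 'frozen' while the node is within the grace period, else 'down'."""
    silent_for = now - last_seen
    return "frozen" if silent_for <= grace else "down"


now = datetime(2021, 1, 1, 12, 0)
print(node_state(now - timedelta(minutes=5), now))   # still within the grace period
print(node_state(now - timedelta(minutes=90), now))  # silent far longer than allowed
```

With a rule like this, a quick reboot would leave the VMs where they are, while a node that stays silent past the threshold would be handled like any other failed node.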

Is there any way to 'unfreeze' the VMs 'elegantly'? I mean, without some hacky maneuver that risks the cluster's integrity?


UPDATE:
The cluster status shows the node's status as 'old timestamp - dead?', with a timestamp almost an hour and a half old. Why doesn't it realize the node IS dead? I'm starting to feel that the HA/cluster feature is not working as it should (and as it did in the past).


Thanks.
Regards.
 
Hi!

The intervention on the node has ended, and it rejoined the cluster flawlessly. However, I'm still worried about this, and it is an opportunity to learn and do better.

Currently Datacenter -> Options -> HA Settings is set to 'default'.

I need to look into the link you pointed me to.

Thank you!
 
After reading that page, I still have some questions about the 'Failover policy':

- Do VMs that were migrated go back to the original node once the downtime is over?
- What does 'soon' mean in this context? My home-lab Proxmox install on an ITX motherboard boots fairly quickly, whereas our OVH dedicated servers take several minutes to complete a reboot.

Thanks!
 
