Hi forum.
I think I have run into a problem (I wouldn't call it a bug): I have a cluster (6.4) hosted at the OVH service provider. So far, so good.
Today they are doing maintenance on one of the nodes. No problem, I thought: the cluster has proven to me that it can handle that. Well, no, there is a problem after all. The cluster itself is OK, but the HA-managed VMs on the affected node are not recovered on another node; instead, they go into 'freeze' mode.
It is fine that rebooting a node does not move VMs around, since the node is expected to come back online soon. I LIKE this feature, it is useful, but... what if something goes wrong during the reboot?
In this case, OVH reboots the hardware servers into a 'Maintenance' image for the duration of the maintenance (electrical maintenance this time), so the last thing the node does within the cluster is a reboot... but it never returns from it.
I mean, the feature is OK: I wouldn't want every node reboot to trigger a lot of unnecessary VM movement. But I think it would also be reasonable to consider the node completely lost after a certain amount of time. It's not uncommon for a reboot to leave a system that doesn't come up again: a faulty PSU, some MBR corruption, etc. I think this feature, without finer-grained control, can lead to very nasty situations.
I would like to be able to state that a reboot taking longer than 10 minutes (for example) means the node can be considered 'down'.
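To make the idea concrete, here is a rough sketch of the rule I have in mind. This is only an illustration of the logic, not an existing cluster setting or API as far as I know; the names `REBOOT_GRACE` and `node_considered_down` are made up for this example, and the 10 minutes is just my example value.

```python
from datetime import datetime, timedelta

# Hypothetical grace period: how long a rebooting node may stay silent
# before it is treated as down (10 minutes is only an example value).
REBOOT_GRACE = timedelta(minutes=10)


def node_considered_down(last_seen: datetime, now: datetime,
                         grace: timedelta = REBOOT_GRACE) -> bool:
    """Return True once the node has been silent for longer than the grace period."""
    return now - last_seen > grace
```

With a rule like this, a node that has been silent for 5 minutes would still be treated as "rebooting", while one that has been silent for an hour and a half would be treated as down and its VMs could be recovered elsewhere.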
Is there any way to 'unfreeze' the VMs 'elegantly'? I mean, without some hacky maneuver that risks the cluster's integrity?
UPDATE:
The cluster status shows the node status as: old timestamp - dead? with a timestamp almost an hour and a half old... why doesn't it realize the node IS dead? I'm starting to feel the HA/cluster feature is not working as it should (and as it did in the past).
Thanks.
Regards.