Discussion in 'Proxmox VE: Installation and configuration' started by adamb, Nov 16, 2015.
This is still the same behavior in 4.x
So what about this freeze state? Shouldn't the HA CTs be moved to other nodes when the node fails?
That state is only entered when you manually shut down a node.
Note: shutdown != failure
Sure - this is exactly what happens when a node fails.
Agreed, but in Proxmox 3.x, when I shut down a node manually (and I did that a lot), rgmanager would relocate all HA CTs to other nodes first, then stop, and only then would the node complete its manual shutdown.
Is there a difference in Proxmox 4.x?
Yes, this is different.
Hello, I tested HA by resetting (via IPMI) one of the hosts. The VM running on this host wasn't relocated and was put into an error state. Worse, after bringing the host back online, the virtual machine could not run; it was stuck in the error state! Only after a clean reboot of the host did the affected machine start.
I do not like it. PVE 3.4 was better.
I was afraid of such cases
In my 3.x cluster, if one node goes down for any reason, it is fenced and all CTs are moved to other nodes automatically during the downtime, and that's what I call HA.
Don't we all agree on that?
First, the error state also existed in PVE 3.4; we adapted it from rgmanager with the same triggers.
It gets placed in an error state when the following things happen:
* The VM fails to stop - this is highly unlikely, as we do a normal shutdown with a 60s timeout, after which the VM is stopped via Qemu (and that's normally a secured stop)
* The VM cannot be started after all relocate and restart tries; see http://pve.proxmox.com/wiki/Manual:_ha-manager, especially the "RECOVERY POLICY" and "ERROR RECOVERY" sections
But you should already know that; I hope nobody who wants to do HA-related work uses software without reading its documentation first.
So please check the logs and show us what really happened, and also describe your setup (how many nodes, which shared storage, ...).
At first I would suspect a VM misconfigured (at least in the sense of HA), maybe with local storage or something else that binds the VM to a host.
And if it's a bug we also need this information so that we can reproduce it and fix it, thanks.
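For readers hitting the stuck error state described above: on PVE 4.x, the usual manual recovery is to disable and then re-enable the service. A rough sketch (vm:100 is a placeholder service ID, and the exact commands may differ between 4.x point releases, so check `ha-manager help` first):

```shell
# Sketch of manual recovery for an HA service stuck in "error" state.
# vm:100 is a placeholder ID - check `ha-manager status` for yours.
ha-manager status          # inspect service states and their current nodes
ha-manager disable vm:100  # acknowledge the error and stop HA management
ha-manager enable vm:100   # hand it back to the CRM, which retries the start
```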
But a graceful shutdown is not "any reason"; it is by no means a failure, and it is planned, we agree on that too? So we should not start automatic actions by default. I could imagine that a ha-manager command which handles such a case would be better.
Or at least an option. I understand that it is inconvenient, but there are workarounds (scripting the relocate, killing the lrm), and it is simply a matter of opinion, where the lazy side argues for automatic relocation.
The solution to kill the VM and restart it on another host automatically is not a really clean solution, in my opinion.
Doing an online migration, a quite clean solution, with possibly hundreds of gigabytes of RAM would put a huge load on the infrastructure and should not be triggered automatically.
This is why we introduced the freeze state: to avoid triggering automatic actions on a planned downtime event, with the thought behind it that human intelligence is far better at planning such downtimes, and that an automatic relocation of a lot of VMs in an unnecessary case could do more harm than good.
Hello! I still use PVE 4. What about UPS? I have three hosts, each with its own UPS, and the NUT program installed on each host to monitor it. If one of the UPSes loses power, will the VMs be moved?
Not automatically, for now. You could write a small script which the NUT program calls on a failure event, so that all VMs are migrated to another host.
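A minimal sketch of such a hook, assuming NUT's `upsmon` NOTIFYCMD mechanism (which sets the `NOTIFYTYPE` environment variable); the target node name and the parsing of `ha-manager status` output are assumptions here, so verify both against your own cluster before using anything like this:

```shell
#!/bin/sh
# Hypothetical NUT NOTIFYCMD hook: on an ONBATT event, ask the HA
# manager to move every service currently running on this node away.
# TARGET_NODE and the status parsing are placeholders, not tested config.
TARGET_NODE="pve2"

case "$NOTIFYTYPE" in
ONBATT)
    # Status lines look roughly like: "service vm:100 (pve1, started)".
    # Pick the services reported on this node and migrate each one.
    ha-manager status | grep "($(hostname)," | awk '{print $2}' |
    while read -r sid; do
        ha-manager migrate "$sid" "$TARGET_NODE"
    done
    ;;
esac
```

The hook would be wired up via `NOTIFYCMD` and `NOTIFYFLAG ONBATT EXEC` in upsmon.conf; in practice you would probably also want a delay (e.g. via upssched) so a brief power blip doesn't trigger a mass migration.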
I do understand the logic to a degree but I honestly don't see an issue with the logic behind 3.4. We support these cluster remotely. Monthly reboots of HA nodes has proven to be very valuable, including the failover of the VM when the reboot happens. These reboots weed out any random issues which could arise when a legit fail over is needed.
Obviously, live migrating a VM with well over 700G of RAM without human intervention just sounds crazy.
It would be great to have the option to choose how we want the VMs to be handled when a planned shutdown takes effect. 4.0 was a big enough change in itself; I don't think changing the logic was the best idea.
If I kill pve-ha-lrm, the node typically gets fenced. If I simply stop the service, it puts the VM in a freeze state.
We are working on a patch with the following behavior:
system shutdown: stop the VMs, then move them to other nodes
system reboot: stop the VMs, then put them into freeze state
I appreciate the offer, but it wouldn't do us much good. We need the ability to choose this option for reboot more so than a shutdown.
I guess we can also make it configurable.
You guys are the best! Let me know if there is anything I can do.
I tried to stop rgmanager manually; it stopped all CTs on the node successfully, but they didn't start on other nodes as configured with failover domains. Can you please tell me how to fix this?
I'm quite new to Proxmox and I directly started using 4.x (pvetest) for my lab@home a few weeks ago.
I have a 2-node HA cluster with NFS/iSCSI storage provided by 2 NAS.
I've been able to configure my VMs to be live migrated in a few clicks (including storage), PM is quite impressive I admit.
I've added two VMs as HA resources, first one is a VDI used to remotely connect @office for support, and the second one provides DHCP/DNS/VPN/HTTP/... services to my local network ... So those two have to be up and running all the time.
But I was quite "disappointed" when I first stopped (a reboot, then a shutdown, to be sure the host reboots correctly before the maintenance) a PVE node hosting one of those HA VMs for a hardware upgrade. I thought the HA VM would first be live migrated to the second node before shutting down, but it was not. I was trained on VMware 4.x in the past, though I never had the chance to really work with it, and part of my BAU is to configure/administer/maintain Veritas Clusters; in both cases (from my memories of the VMware training and from actual VCS behavior) clusters always move resources before shutting down. Is it possible to implement this behavior in PVE (with/without an automatic fail-back option)?
Another point is if I try to migrate a HA VM to another Node, it just does nothing:
Executing HA migrate for VM 104 to node pve1
104 is running on pve2 and even with an OK status, nothing happens. I have to remove the VM from HA and then I can migrate it manually.
Am I missing something here ?
If you're using Proxmox 4 with two nodes, that is your issue. For HA you need at least 3 nodes.
Ok thanks, got the point (fencing) !
It explains the "with at least 3 nodes" requirement ...
But in my current situation, what will happen if a node crashes? Will the HA VMs be restarted on the second node despite the 2-node configuration, or will nothing happen?
Can I add a 3rd "virtual node" (installed in a VM hosted on one of the NAS, for instance), add it to the cluster together with the 2 physical nodes, and then make sure that no VM will migrate onto it?
I believe nothing will happen, as you don't have quorum. Would this third node be running as a VM on one of the two current hosts? If so, that wouldn't do much; if it's running on another host separate from these two, it would be OK, but definitely not anything I would rely on.
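To put a number on the quorum point above: corosync requires a strict majority of votes, floor(n/2) + 1, before the cluster may act, which is why a 2-node cluster cannot recover HA services after losing a node. A tiny illustration (plain Python, not Proxmox code):

```python
def votes_needed(total_votes):
    """Strict majority corosync requires for quorum: floor(n/2) + 1."""
    return total_votes // 2 + 1

# Two-node cluster: quorum needs 2 votes, so a single surviving node
# (1 vote) is inquorate and HA recovery will not start.
print(votes_needed(2))  # 2
# Three-node cluster: quorum still needs only 2 votes, so one node may fail.
print(votes_needed(3))  # 2
```

(In an emergency, `pvecm expected 1` can temporarily lower the expected votes on a surviving node, but doing so bypasses exactly the split-brain protection that quorum exists to provide.)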