Proxmox 4 HA VM Freeze State

adamb · Nov 16, 2015

When I shut down a host that is running an HA VM, it puts the VM in a "freeze" state until the node comes back online. This is not what we are used to or expect from an HA cluster. Is there a way to ensure the VM gets moved to the other available node instead of waiting for the rebooted node to come back online? Some of these servers take 5-10+ minutes to reboot.

dietmar · Nov 16, 2015

adamb said:
When I shut down a host that is running an HA VM, it puts the VM in a "freeze" state until the node comes back online. This is not what we are used to or expect from an HA cluster.

This is how it is currently implemented (but why would you shut down a node used for HA?)

adamb · Nov 16, 2015

dietmar said:
This is how it is currently implemented (but why would you shut down a node used for HA?)

We reboot one of our HA nodes monthly to ensure there are no random issues. We support doctors' offices and hospitals, so keeping our systems at 100% is critical. We have found that reboots typically bring out issues with configuration changes and things of that nature. We also reboot nodes for maintenance and updates. Given how long our hardware takes to reboot, starting the VM on the remaining node is critical to our operation. It could be the difference between 10 minutes of downtime and 2 minutes.

Obviously we can come up with new methods by simply migrating the VM before rebooting the node, but this is a bit more complicated. It would be nice to have the option to choose.
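
Editor's sketch of the "move the VM before rebooting" approach mentioned above. It only builds the command lines and does not run anything. It assumes the `ha-manager migrate` (online) and `ha-manager relocate` (stop, then start on the target) subcommands are available on your version; the service IDs `vm:100`, `vm:101` and the node name `node2` are hypothetical placeholders, as is the function name.

```python
from typing import List


def build_evacuation_commands(
    service_ids: List[str], target_node: str, live: bool = True
) -> List[List[str]]:
    """Return one ha-manager command (as an argv list) per HA service.

    live=True  -> request an online migration ("migrate")
    live=False -> request a stop/start on the target ("relocate")
    Assumption: these subcommands exist in your ha-manager version.
    """
    action = "migrate" if live else "relocate"
    return [["ha-manager", action, sid, target_node] for sid in service_ids]


if __name__ == "__main__":
    # Hypothetical service IDs and node name, for illustration only.
    for cmd in build_evacuation_commands(["vm:100", "vm:101"], "node2", live=False):
        # Print instead of executing, so a human can review before the reboot.
        print(" ".join(cmd))
```

Printing rather than executing keeps a human in the loop, which matches the concern raised in this thread about fully unattended migrations. Passing each list to `subprocess.run(cmd, check=True)` would be one way to automate it once the output has been checked.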

sigxcpu · Nov 16, 2015

dietmar said:
This is how it is currently implemented (but why would you shut down a node used for HA?)

Then we can call it HA* or Best-Effort Availability

adamb · Nov 16, 2015

sigxcpu said:
Then we can call it HA* or Best-Effort Availability

Question the idea of rebooting all you want, but we have been an IBM/HP shop for well over 25 years and have a very good understanding of how to support thousands of servers out in the field. Rebooting has always been one of those things that prevents issues.

sigxcpu · Nov 16, 2015

adamb said:
Question the idea of rebooting all you want, but we have been an IBM/HP shop for well over 25 years and have a very good understanding of how to support thousands of servers out in the field. Rebooting has always been one of those things that prevents issues.

Actually I was making fun of Proxmox's HA, not your issue. The concept of *High* Availability is not very compatible with 10-15 minutes of downtime because of a (planned) host reboot. Btw, you should try a BIG server reboot (we have a 1.5TB RAM machine) to find out that 15 minutes of reboot time is very fast

adamb · Nov 16, 2015

sigxcpu said:
Actually I was making fun of Proxmox's HA, not your issue. The concept of *High* Availability is not very compatible with 10-15 minutes of downtime because of a (planned) host reboot. Btw, you should try a BIG server reboot (we have a 1.5TB RAM machine) to find out that 15 minutes of reboot time is very fast

My mistake, I shouldn't have jumped to conclusions; we always get flak for our reboot schedule. My bad!

We actually sell our HA servers with either 768GB or 1.5TB. We know all about the painful reboots! We disable all the PXEs and optimize the boot process as much as possible, but it's still painful. The M3 line of IBMs was the worst: 15-20 minutes, it was insane.

dietmar · Nov 16, 2015

sigxcpu said:
The concept of *High* Availability is not very compatible with 10-15 minutes of downtime because of a (planned) host reboot.

But if you plan to shut down a server (you say you have a plan), you can also simply move the VMs to another server. I can't see why this needs to be done automatically.
Also, most times I reboot a server, it is online again within 30 seconds ...

adamb · Nov 16, 2015

dietmar said:
But if you plan to shut down a server (you say you have a plan), you can also simply move the VMs to another server. I can't see why this needs to be done automatically.
Also, most times I reboot a server, it is online again within 30 seconds ...

You obviously aren't running servers with large amounts of RAM. They take significantly longer to boot than servers with less RAM installed. Just because it's a planned event doesn't mean it's not automated. Moving the VM would be another step in the process and, honestly, I'm not a huge fan of automating live migration with no human intervention; it just sounds like a bad idea. I don't see the logic in freezing a VM while a host reboots. What benefit does this even provide?

sigxcpu · Nov 16, 2015

dietmar said:
Also, most times I reboot a server, it is online again within 30 seconds ...

Read above about big servers. 30 seconds is for my laptop. For a big-memory Dell, 25 minutes is for "Configuring Memory" alone. Then you add checks, PCIe device init, OS startup and VM startup.

dietmar · Nov 16, 2015

adamb said:
Just because it's a planned event doesn't mean it's not automated. Moving the VM would be another step in the process and, honestly, I'm not a huge fan of automating live migration with no human intervention; it just sounds like a bad idea. I don't see the logic in freezing a VM while a host reboots. What benefit does this even provide?

I think it is a good idea, because it is much safer to move an HA-enabled VM manually:

1.) you can carefully select the target node - using human intelligence ;-)
2.) you can verify that everything went well

Besides, I also accept patches to implement other behavior...

dietmar · Nov 16, 2015

sigxcpu said:
For a big-memory Dell, 25 minutes is for "Configuring Memory" alone. Then you add checks, PCIe device init, OS startup and VM startup.

And live migrating all that memory is faster?

sigxcpu · Nov 16, 2015

Live migrating, no, for sure. Restarting the VM on another node, yes.

adamb · Nov 16, 2015

dietmar said:
I think it is a good idea, because it is much safer to move an HA-enabled VM manually:

1.) you can carefully select the target node - using human intelligence ;-)
2.) you can verify that everything went well

Besides, I also accept patches to implement other behavior...

I disagree, but I'm not much of a programmer. Our only option will be to use "fence_node" instead of shutting down gracefully.

1. I have no need to carefully select a target node when there is only one the VM can run on.
2. We have other checks and scripts in place to ensure it went well and the VM is running.

t.lamprecht · Nov 16, 2015

adamb said:
Moving the VM would be another step in the process and, honestly, I'm not a huge fan of automating live migration with no human intervention; it just sounds like a bad idea. I don't see the logic in freezing a VM while a host reboots. What benefit does this even provide?

Automatically moving the VM would have the same problem: no human intervention. Calling a script (for example) that does this would trigger the same behaviour as if we implemented it in the HA manager; I don't see why one should be safer than the other. But I can see that it would be more comfortable for some admins.

The logic is that a reboot is a planned action and we do not want to trigger automatic things on such an action (a reason you also mentioned). A possible out-of-control feedback loop should also be avoided.

Manually unfreezing a service (e.g.: to a machine) should be thought about, but it's not that simple.

A workaround for you (not the nicest, and "no warranty"): kill the pve-ha-lrm process and then reboot; the services will then be relocated.

adamb · Nov 16, 2015

t.lamprecht said:
Automatically moving the VM would have the same problem: no human intervention. Calling a script (for example) that does this would trigger the same behaviour as if we implemented it in the HA manager; I don't see why one should be safer than the other. But I can see that it would be more comfortable for some admins.

The logic is that a reboot is a planned action and we do not want to trigger automatic things on such an action (a reason you also mentioned). A possible out-of-control feedback loop should also be avoided.

Manually unfreezing a service (e.g.: to a machine) should be thought about, but it's not that simple.

A workaround for you (not the nicest, and "no warranty"): kill the pve-ha-lrm process and then reboot; the services will then be relocated.

Good info, I appreciate it!

sigxcpu · Nov 16, 2015

BTW, is "freeze" a KVM/QEMU suspend? I want a button for that for non-HA VMs.

t.lamprecht · Nov 16, 2015

No, freeze is only a state in our ha-manager logic; it has no effect on the machine itself. It only prevents actions from the Cluster Resource Manager until the previously gracefully powered-down machine and its Local Resource Manager are online again.

Edit: and it's already possible to freeze all KVM/QEMU machines
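
Editor's sketch of the freeze logic as described in the post above: a toy model, not the ha-manager implementation. The enum and function names are hypothetical, and it covers only the two transitions mentioned in this thread (graceful shutdown, and the node's Local Resource Manager coming back).

```python
from enum import Enum


class ServiceState(Enum):
    STARTED = "started"
    FREEZE = "freeze"


class NodeEvent(Enum):
    GRACEFUL_SHUTDOWN = "graceful_shutdown"
    LRM_BACK_ONLINE = "lrm_back_online"


def next_state(state: ServiceState, event: NodeEvent) -> ServiceState:
    """Toy transition function for one HA service on one node."""
    if state is ServiceState.STARTED and event is NodeEvent.GRACEFUL_SHUTDOWN:
        # A planned shutdown marks the service as frozen instead of moving it.
        return ServiceState.FREEZE
    if state is ServiceState.FREEZE and event is NodeEvent.LRM_BACK_ONLINE:
        # Once the node's LRM is back, the service is handled normally again.
        return ServiceState.STARTED
    return state


def crm_may_act(state: ServiceState) -> bool:
    """In this model the cluster manager takes no action on frozen services."""
    return state is not ServiceState.FREEZE


if __name__ == "__main__":
    state = ServiceState.STARTED
    for event in (NodeEvent.GRACEFUL_SHUTDOWN, NodeEvent.LRM_BACK_ONLINE):
        state = next_state(state, event)
        print(event.value, "->", state.value, "| CRM may act:", crm_may_act(state))
```

The point of the model is that "freeze" is bookkeeping in the HA manager: nothing here touches the guest itself, which is why it is not the same thing as a KVM/QEMU suspend.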

sigxcpu · Nov 16, 2015

Right. I feel stupid now. I usually use the top right menu in UI, not the contextual one.

AhmedF · Nov 18, 2015

Hmm, that's new behavior to me in HA. Suppose a node went down because of some random power problem: HA used to fence that node and move all the VMs to other nodes, depending on the failover domain setup in cluster.conf.
When the original node came back online, the "nofailback" parameter decided whether to move the CTs back or keep them on the node where they were running.

This is what I know about HA in the 3.x versions. Was this changed in 4.x?
