[SOLVED] Faster failover - possible?

Razva · Sep 12, 2017

Right now it takes about 200 seconds for a VM to be restarted from a failed node. Is there any way to speedup the process? I'm looking for something more like 80-120 seconds, from the point on which the node gets disconnected from the cluster 'til the point on which the VM is fully started on a new node. Is this possible?

t.lamprecht · Sep 12, 2017

Fencing need currently ~120 seconds because thats the time where our distributed cluster lock time out.
Basically its, after 60 seconds the outage detection triggers, form now on the current master tries to get the lock.
Worst case he needs another 60 seconds for that (i.e. the lost node renewed its locked directly before it crashed), then recovery may add a few seconds but it should be in the range of 5 to 10 seconds.

The cluster lock timeout period is currently almost hard coded, so you cannot change that easily at the moment - at least not with building your own packages (pve-cluster and pve-ha-manager would be the ones), I'm afraid...

Razva · Sep 12, 2017

Understood. I guess this applies to shared storage as well as Replicated images, correct?

t.lamprecht · Sep 12, 2017

I saw that you could possible mean Replication Management not our HA management, if so my answer can be mostly disregarded...

Razva · Sep 12, 2017

t.lamprecht said:
I saw that you could possible mean Replication Management not our HA management, if so my answer can be mostly disregarded...

I was talking about HA, because - I guess? - that "how content got there" is not influencing the failover system. If I'm wrong please correct me.

Alwin · Sep 12, 2017

You can for example try to change the watchdog (eg. ipmi) timeout. Please read the documentation as any settins on this has serious implications.
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_ha_manager
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network

Razva · Sep 12, 2017

Alwin said:
You can for example change the watchdog (eg. ipmi) and/or the corosync timeout. Please read the documentation as it has serious implications.
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_ha_manager
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network

I think I'll just leave it on default and deal with 3 minutes of downtime.

t.lamprecht · Sep 12, 2017

Razva said:
I was talking about HA, because - I guess? - that "how content got there" is not influencing the failover system. If I'm wrong please correct me.

Sorry for the confusion ^^ Yes, your assumptions are correct, the (automatic) recovery of VMs/CTs on node failure is HA's work not replication...

But if you use the replication in combination with HA you should know that your only recover from the last replicated state.
Meaning, if you replicate your VM every 15 minutes you could loose up to 15 minutes of storage writes from inside the VM, which may or may not be a problem...
If this VM is just a "Compute Node" this may be OK. But if its, for example, a NFS server then you may get into trouble...
If replication and High Availability is needed then Ceph could be recommended, or glusterFS too.

b.miller · Jan 28, 2022

t.lamprecht said:
Fencing need currently ~120 seconds because thats the time where our distributed cluster lock time out.
Basically its, after 60 seconds the outage detection triggers, form now on the current master tries to get the lock.
Worst case he needs another 60 seconds for that (i.e. the lost node renewed its locked directly before it crashed), then recovery may add a few seconds but it should be in the range of 5 to 10 seconds.

The cluster lock timeout period is currently almost hard coded, so you cannot change that easily at the moment - at least not with building your own packages (pve-cluster and pve-ha-manager would be the ones), I'm afraid...

Hi Thomas -

I know I am reviving a very old thread here. But I am curious if the hard-coded lockout timers are still the same for the current version of PVE.

I am currently testing a new deployment.

1 Cluster
3 Nodes
3 OSD Ceph

After pulling the network connections, it took exactly 120sec for the watchdog to decide that the node was offline. It sent me a couple fencing emails. And then ping on the VM resumed after 110sec. So total just under 4min from the time the cables were pulled.

Does this time match the watchdog timers?

t.lamprecht · Jan 31, 2022

bamzilla16 said:
I know I am reviving a very old thread here. But I am curious if the hard-coded lockout timers are still the same for the current version of PVE.

Yes.

bamzilla16 said:
After pulling the network connections, it took exactly 120sec for the watchdog to decide that the node was offline.

The watchdog only fences the node if it would be powered but not responding (hung up or disconnected from network), it doesn't decide anything. The current HA CRM master node will mark a node as offline after 60s from the last status update of the node, from that time on the CRM will try to get the node lock and, if it could acquire it, recover the services, in the worst case that will start to happen after 120s total (i.e., 60s to start trying to fence plus 60s it takes for the lock to timeout in any case). Service recovery is a fresh start on a new node, if the VM is slow to start that needs naturally adds additional time.

So, after 120s max the VM should be recovered but is only yet starting up, so boot time is on top.

b.miller · Jan 31, 2022

t.lamprecht said:
Yes.

The watchdog only fences the node if it would be powered but not responding (hung up or disconnected from network), it doesn't decide anything. The current HA CRM master node will mark a node as offline after 60s from the last status update of the node, from that time on the CRM will try to get the node lock and, if it could acquire it, recover the services, in the worst case that will start to happen after 120s total (i.e., 60s to start trying to fence plus 60s it takes for the lock to timeout in any case). Service recovery is a fresh start on a new node, if the VM is slow to start that needs naturally adds additional time.

So, after 120s max the VM should be recovered but is only yet starting up, so boot time is on top.

Thanks for clarifying the process. This helps immensely.

PTR · Feb 20, 2022

in my setup in do have a separate hardware reaction (fencing) implemented that switches of node1 when it fails (or looks like...). Next I would need the command how to the the cluster hat is if definitely "offline", so that the replicas recovery may start on node2. What is the command to immediately start the replication recovery on node2, or the "set offline status" for node1? Thanks already!!!

t.lamprecht · Feb 21, 2022

PTR said:
What is the command to immediately start the replication recovery on node2, or the "set offline status" for node1?

Sorry, there's no such command, the cluster lock of the other node always needs to be acquired before.
In theory, you could force the lock release, but that's pretty dangerous stuff and I would not like to post detailed instructions here - you can find out everything from ha/pmxcfs code though.

Search

Search

[SOLVED] Faster failover - possible?

Razva

Renowned Member

t.lamprecht

Proxmox Staff Member

Razva

Renowned Member

t.lamprecht

Proxmox Staff Member

Razva

Renowned Member

Alwin

Proxmox Retired Staff

Razva

Renowned Member

t.lamprecht

Proxmox Staff Member

b.miller

Member

t.lamprecht

Proxmox Staff Member

b.miller

Member

PTR

New Member

t.lamprecht

Proxmox Staff Member

We value your privacy