PVE High Availability slow?

nicedevil

Hey guys,

After I was able, with the help of this forum and some googling, to get rid of all my issues with my Proxmox HA cluster over Thunderbolt 4, and an NTP issue that held me back from enabling Ceph properly... I'm now working with my new cluster.
After my first LXC/VM was set up, I tried disabling the network port on my switch for PVE02 to see how HA works on Proxmox. I have only a little experience with how long this takes on VMware, so it's pretty new stuff for me.

My HA failover (migrating an LXC to another node after the node turned red, i.e. unavailable in the web GUI) took about 5 minutes. Is this normal, or is there a way to speed things up? A normal test via right-click > Migrate to another node finished in a few seconds without any loss of ping during the process.

My setup is as follows: the network for Ceph (public + cluster) and migration is the Thunderbolt 4 network (10.0.0.81/29).

[Screenshot: network/migration configuration]
I also tried "type=insecure"; it doesn't make any difference.

The default Ethernet NIC is used for vmbr0 to provide access to my containers.

After thinking my HA failover test over 4-5 times... the Thunderbolt cable connection wasn't cut while the switch port for vmbr0 was, so maybe Proxmox wasn't really sure whether the node was lost or not, because the Ceph connection was still there (public/cluster/migration network).

I guess someone here can clear up whether this is working as intended. On Monday I can "pull" a power cable for a completely real test if needed.
 
My HA failover
To provide a little background: HA and failover are two different things. With HA, the service can and may go down briefly in order to be restarted on another node. With failover, this happens without downtime, but that is currently not supported by Proxmox VE.

I also tried "type=insecure"; it doesn't make any difference.
Yes, that's expected, because it has nothing to do with this behavior at all. All that setting says is that you trust your network and that live migration should be unencrypted.

See 5.14.1: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_guest_migration
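For reference, that setting lives in /etc/pve/datacenter.cfg. A minimal sketch, assuming the TB4 subnet mentioned above (the CIDR is only an example, adjust it to your actual migration network):

Bash:
# /etc/pve/datacenter.cfg (excerpt)
# "insecure" only skips encryption of the migration traffic;
# it does not change how quickly HA reacts to a lost node.
migration: insecure,network=10.0.0.80/29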

so maybe Proxmox wasn't really sure whether the node was lost or not, because the Ceph connection was still there (public/cluster/migration network).
This has nothing to do with the Ceph network. The question is rather: how many Corosync links have you set up, and which interfaces do they run over?

See: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
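A quick way to check this on any node (standard commands, nothing specific to your setup assumed):

Bash:
# which links (ring0_addr, ring1_addr, ...) and interfaces corosync uses
cat /etc/pve/corosync.conf
# current quorum / membership view
pvecm status
# per-link status as corosync sees it on this node
corosync-cfgtool -s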

You can find more information about HA specifically here: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_ha_manager
 
Thank you for clearing this up :)

For your last question: I have a 3-node cluster that looks like this:
[Screenshot: three-node cluster network configuration]

It is a full-mesh setup, so all 3 nodes are connected directly to each other.
 
After my first LXC/VM was set up, I tried disabling the network port on my switch for PVE02 to see how HA works on Proxmox. I have only a little experience with how long this takes on VMware, so it's pretty new stuff for me.

Just to confirm: you are disabling a switch port carrying the Corosync connection to cut that node off and see how quickly the VM/CT resumes on one of the two remaining nodes, using the TB4 network for migration?

My HA failover (migrating an LXC to another node after the node turned red, i.e. unavailable in the web GUI) took about 5 minutes. Is this normal, or is there a way to speed things up?

Generally, it is not normal. With Ceph, your filesystem should be available on the other two nodes at all times, so starting up, especially an LXC, should be fast.
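If in doubt, Ceph availability during such a test can be checked on one of the surviving nodes, for example:

Bash:
# overall health; typically HEALTH_WARN with one node down, but pools should stay writable
ceph -s
# which OSDs are up/down
ceph osd tree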

A normal test via right-click > Migrate to another node finished in a few seconds without any loss of ping during the process.

My setup is as follows: the network for Ceph (public + cluster) and migration is the Thunderbolt 4 network (10.0.0.81/29).

The TB4 migration network is done with FRR on the 3 nodes?

After thinking my HA failover test over 4-5 times... the Thunderbolt cable connection wasn't cut while the switch port for vmbr0 was...

Also the migration network being up should have no bearing on the quorum status.

What does the ha-manager status say?

What can you find about the slow migration with journalctl -u pve-ha-crm on the master and journalctl -u pve-ha-lrm on the node where the service got restarted?
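It may also help to correlate that with the quorum side of things, e.g.:

Bash:
# when did corosync see the membership change?
journalctl -u corosync --since "10min ago"
# pmxcfs / quorum messages
journalctl -u pve-cluster --since "10min ago"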
 
Just to confirm: you are disabling a switch port carrying the Corosync connection to cut that node off and see how quickly the VM/CT resumes on one of the two remaining nodes, using the TB4 network for migration?
correct

Generally, it is not normal. With Ceph, your filesystem should be available on the other two nodes at all times, so starting up, especially an LXC, should be fast.
That's why migration via right-click is that fast, OK.

The TB4 migration network is done with FRR on the 3 nodes?
correct
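For completeness, this is roughly how I can verify the mesh on each node (generic commands, assuming the usual FRR-based routed setup; the peer address is just an example from my subnet):

Bash:
# FRR should be running on all 3 nodes
systemctl status frr --no-pager
# routes learned over the TB4 mesh
vtysh -c "show ip route"
# basic reachability to a mesh peer (example address)
ping -c 3 10.0.0.82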

What does the ha-manager status say?

Bash:
root@pve03:~# ha-manager status
quorum OK
master pve03 (active, Sat Jan 20 08:10:08 2024)
lrm pve01 (active, Sat Jan 20 08:10:05 2024)
lrm pve02 (active, Sat Jan 20 08:10:06 2024)
lrm pve03 (active, Sat Jan 20 08:09:58 2024)
service ct:26002 (pve03, started)
service ct:27002 (pve03, started)
service ct:27003 (pve01, started)
service ct:27004 (pve01, started)
service ct:44002 (pve02, started)
service ct:71002 (pve01, started)
service ct:99999 (pve03, disabled)
service vm:34002 (pve02, started)

What can you find about the slow migration with journalctl -u pve-ha-crm on the master and journalctl -u pve-ha-lrm on the node where the service got restarted?

Master (PVE03)
Bash:
root@pve03:~# journalctl -u pve-ha-crm --since "10min ago"
Jan 20 08:15:48 pve03 pve-ha-crm[1229]: node 'pve02': state changed from 'online' => 'unknown'
Jan 20 08:16:38 pve03 pve-ha-crm[1229]: service 'ct:44002': state changed from 'started' to 'fence'
Jan 20 08:16:38 pve03 pve-ha-crm[1229]: service 'vm:34002': state changed from 'started' to 'fence'
Jan 20 08:16:38 pve03 pve-ha-crm[1229]: node 'pve02': state changed from 'unknown' => 'fence'
Jan 20 08:16:38 pve03 pve-ha-crm[1229]: lost lock 'ha_agent_pve02_lock - can't get cfs lock
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: successfully acquired lock 'ha_agent_pve02_lock'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: fencing: acknowledged - got agent lock for node 'pve02'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: node 'pve02': state changed from 'fence' => 'unknown'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: service 'ct:44002': state changed from 'fence' to 'recovery'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: service 'vm:34002': state changed from 'fence' to 'recovery'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: recover service 'ct:44002' from fenced node 'pve02' to node 'pve03'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: service 'ct:44002': state changed from 'recovery' to 'started'  (node = pve03)
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: recover service 'vm:34002' from fenced node 'pve02' to node 'pve01'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: service 'vm:34002': state changed from 'recovery' to 'started'  (node = pve01)

lost node PVE02
Bash:
root@pve02:~# journalctl -u pve-ha-lrm  --since "10min ago"
Jan 20 08:15:48 pve02 pve-ha-lrm[1956]: lost lock 'ha_agent_pve02_lock - cfs lock update failed - Device or resource busy
Jan 20 08:15:53 pve02 pve-ha-lrm[1956]: status change active => lost_agent_lock
-- Boot 31898d9975dd4353adab722eece0f45e --
Jan 20 08:17:53 pve02 systemd[1]: Starting pve-ha-lrm.service - PVE Local HA Resource Manager Daemon...
Jan 20 08:17:54 pve02 pve-ha-lrm[1933]: starting server
Jan 20 08:17:54 pve02 pve-ha-lrm[1933]: status change startup => wait_for_agent_lock
Jan 20 08:17:54 pve02 systemd[1]: Started pve-ha-lrm.service - PVE Local HA Resource Manager Daemon.
I guess the last lines at 08:17 are from when I enabled the switch port again.
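For what it's worth, whether pve02 actually rebooted in between (rather than just getting its link back) can be checked with:

Bash:
# a new boot ID right after the test would suggest the node was reset by the watchdog
journalctl --list-boots | tail -n 3
uptime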
 
Not sure if you are still troubleshooting this, but this looks more reasonable...

Master (PVE03)
Bash:
root@pve03:~# journalctl -u pve-ha-crm --since "10min ago"
Jan 20 08:15:48 pve03 pve-ha-crm[1229]: node 'pve02': state changed from 'online' => 'unknown'
Jan 20 08:16:38 pve03 pve-ha-crm[1229]: service 'ct:44002': state changed from 'started' to 'fence'
Jan 20 08:16:38 pve03 pve-ha-crm[1229]: service 'vm:34002': state changed from 'started' to 'fence'
Jan 20 08:16:38 pve03 pve-ha-crm[1229]: node 'pve02': state changed from 'unknown' => 'fence'
Jan 20 08:16:38 pve03 pve-ha-crm[1229]: lost lock 'ha_agent_pve02_lock - can't get cfs lock
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: successfully acquired lock 'ha_agent_pve02_lock'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: fencing: acknowledged - got agent lock for node 'pve02'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: node 'pve02': state changed from 'fence' => 'unknown'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: service 'ct:44002': state changed from 'fence' to 'recovery'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: service 'vm:34002': state changed from 'fence' to 'recovery'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: recover service 'ct:44002' from fenced node 'pve02' to node 'pve03'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: service 'ct:44002': state changed from 'recovery' to 'started'  (node = pve03)
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: recover service 'vm:34002' from fenced node 'pve02' to node 'pve01'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: service 'vm:34002': state changed from 'recovery' to 'started'  (node = pve01)

So the master acknowledges the fencing of the lost pve02 and decides to recover the service on pve03 exactly 2 minutes after pve02 was first reported lost.

lost node PVE02
Bash:
root@pve02:~# journalctl -u pve-ha-lrm  --since "10min ago"
Jan 20 08:15:48 pve02 pve-ha-lrm[1956]: lost lock 'ha_agent_pve02_lock - cfs lock update failed - Device or resource busy
Jan 20 08:15:53 pve02 pve-ha-lrm[1956]: status change active => lost_agent_lock
-- Boot 31898d9975dd4353adab722eece0f45e --
Jan 20 08:17:53 pve02 systemd[1]: Starting pve-ha-lrm.service - PVE Local HA Resource Manager Daemon...
Jan 20 08:17:54 pve02 pve-ha-lrm[1933]: starting server
Jan 20 08:17:54 pve02 pve-ha-lrm[1933]: status change startup => wait_for_agent_lock
Jan 20 08:17:54 pve02 systemd[1]: Started pve-ha-lrm.service - PVE Local HA Resource Manager Daemon.
I guess the last lines at 08:17 are from when I enabled the switch port again.

This is not as interesting as it would be to see what happened on pve03, where the service then went on to be restarted. Ideally, you should have had the service recovered about 2 minutes into the loss of pve02 (plus its startup time, if significant); it should not have been 5 minutes as mentioned in the OP.
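For example, something along these lines on the nodes that took over the services (node names and time window taken from the logs above):

Bash:
# on pve03, which recovered ct:44002; same idea on pve01 for vm:34002
journalctl -u pve-ha-lrm --since "2024-01-20 08:10" --until "2024-01-20 08:25" | grep -E "44002|status change"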
 
