PVE High Availability slow?

nicedevil

Hey guys,

After I was able, with the help of this forum and some googling, to get rid of all my issues with my Proxmox HA cluster over Thunderbolt 4, and an NTP issue that held me back from enabling Ceph properly... I'm now working with my new cluster.
After my first LXC/VM was set up, I tried disabling the network port on my switch for PVE02 to see how HA works on Proxmox. I have only a little experience with how long this takes on VMware, so it's pretty new stuff for me.

My HA failover (migrating an LXC to another node after the node turned red, i.e. unavailable in the web GUI) took about 5 minutes. Is this normal, or is there a way to speed things up? A normal test via right-click > Migrate to another node finished in a few seconds without any loss of ping during the process.

My setup is as follows: the network for Ceph (public + cluster) and migration is the Thunderbolt 4 network (10.0.0.81/29).

[Screenshot: network/migration configuration]
I also tried "type=insecure"; it doesn't make any difference.

The default Ethernet NIC is used for vmbr0 to provide access to my containers.

After thinking my HA failover test over 4-5 times... the Thunderbolt cable connection wasn't cut while the switch port for vmbr0 was, so maybe Proxmox wasn't really sure whether the node was lost or not, because the Ceph connection was still there (public/cluster/migration network).

I guess someone here can clear up whether this is working as intended. On Monday I can "pull" a power cable for a completely real test if needed.
 
My HA failover
To provide a little background: HA and failover are two different things. With HA, the service can and may go down briefly in order to be restarted on another node. With failover, this happens without downtime, but that is currently not supported by Proxmox VE.

I also tried "type=insecure"; it doesn't make any difference.
Yes, that's expected, because it has nothing to do with this behavior at all. All that setting says is that you trust your network and that live migration should be unencrypted.

See 5.14.1: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_guest_migration
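For reference, that setting lives in /etc/pve/datacenter.cfg. A minimal sketch, assuming the TB4 subnet mentioned above (the CIDR is only an example, adjust it to your actual migration network):

Bash:
# /etc/pve/datacenter.cfg (excerpt)
# "insecure" only skips encryption of the migration traffic;
# it does not change how quickly HA reacts to a lost node.
migration: insecure,network=10.0.0.80/29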

so maybe Proxmox wasn't really sure whether the node was lost or not, because the Ceph connection was still there (public/cluster/migration network).
This has nothing to do with the Ceph network. The question is rather: how many Corosync links have you set up, and which interfaces do they run over?

See: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
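A quick way to check this on any node (standard commands, nothing specific to your setup assumed):

Bash:
# which links (ring0_addr, ring1_addr, ...) and interfaces corosync uses
cat /etc/pve/corosync.conf
# current quorum / membership view
pvecm status
# per-link status as corosync sees it on this node
corosync-cfgtool -s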

You can find more information about HA specifically here: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_ha_manager
 
Thank you for clearing this up :)

For your last question: I have a 3-node cluster that looks like this:
[Screenshot: three-node cluster network configuration]

It is a full-mesh setup, so all 3 nodes are connected directly to each other.
 
After my first LXC/VM was set up, I tried disabling the network port on my switch for PVE02 to see how HA works on Proxmox. I have only a little experience with how long this takes on VMware, so it's pretty new stuff for me.

Just to confirm: you are disabling a switch port carrying the Corosync connection to cut that node off and see how quickly the VM/CT resumes on one of the two remaining nodes, using the TB4 network for migration?

My HA failover (migrating an LXC to another node after the node turned red, i.e. unavailable in the web GUI) took about 5 minutes. Is this normal, or is there a way to speed things up?

Generally, it is not normal. With Ceph, your filesystem should be available on the other two nodes at all times, so starting up, especially an LXC, should be fast.
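If in doubt, Ceph availability during such a test can be checked on one of the surviving nodes, for example:

Bash:
# overall health; typically HEALTH_WARN with one node down, but pools should stay writable
ceph -s
# which OSDs are up/down
ceph osd tree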

A normal test via right-click > Migrate to another node finished in a few seconds without any loss of ping during the process.

My setup is as follows: the network for Ceph (public + cluster) and migration is the Thunderbolt 4 network (10.0.0.81/29).

The TB4 migration network is done with FRR on the 3 nodes?

After thinking my HA failover test over 4-5 times... the Thunderbolt cable connection wasn't cut while the switch port for vmbr0 was...

Also the migration network being up should have no bearing on the quorum status.

What does the ha-manager status say?

What can you find about the slow migration with journalctl -u pve-ha-crm on the master and journalctl -u pve-ha-lrm on the node where the service got restarted?
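It may also help to correlate that with the quorum side of things, e.g.:

Bash:
# when did corosync see the membership change?
journalctl -u corosync --since "10min ago"
# pmxcfs / quorum messages
journalctl -u pve-cluster --since "10min ago"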
 
Just to confirm: you are disabling a switch port carrying the Corosync connection to cut that node off and see how quickly the VM/CT resumes on one of the two remaining nodes, using the TB4 network for migration?
correct

Generally, it is not normal. With Ceph, your filesystem should be available on the other two nodes at all times, so starting up, especially an LXC, should be fast.
That's why migration via right-click is that fast, OK.

The TB4 migration network is done with FRR on the 3 nodes?
correct
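For completeness, this is roughly how I can verify the mesh on each node (generic commands, assuming the usual FRR-based routed setup; the peer address is just an example from my subnet):

Bash:
# FRR should be running on all 3 nodes
systemctl status frr --no-pager
# routes learned over the TB4 mesh
vtysh -c "show ip route"
# basic reachability to a mesh peer (example address)
ping -c 3 10.0.0.82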

What does the ha-manager status say?

Bash:
root@pve03:~# ha-manager status
quorum OK
master pve03 (active, Sat Jan 20 08:10:08 2024)
lrm pve01 (active, Sat Jan 20 08:10:05 2024)
lrm pve02 (active, Sat Jan 20 08:10:06 2024)
lrm pve03 (active, Sat Jan 20 08:09:58 2024)
service ct:26002 (pve03, started)
service ct:27002 (pve03, started)
service ct:27003 (pve01, started)
service ct:27004 (pve01, started)
service ct:44002 (pve02, started)
service ct:71002 (pve01, started)
service ct:99999 (pve03, disabled)
service vm:34002 (pve02, started)

What can you find about the slow migration with journalctl -u pve-ha-crm on the master and journalctl -u pve-ha-lrm on the node where the service got restarted?

Master (PVE03)
Bash:
root@pve03:~# journalctl -u pve-ha-crm --since "10min ago"
Jan 20 08:15:48 pve03 pve-ha-crm[1229]: node 'pve02': state changed from 'online' => 'unknown'
Jan 20 08:16:38 pve03 pve-ha-crm[1229]: service 'ct:44002': state changed from 'started' to 'fence'
Jan 20 08:16:38 pve03 pve-ha-crm[1229]: service 'vm:34002': state changed from 'started' to 'fence'
Jan 20 08:16:38 pve03 pve-ha-crm[1229]: node 'pve02': state changed from 'unknown' => 'fence'
Jan 20 08:16:38 pve03 pve-ha-crm[1229]: lost lock 'ha_agent_pve02_lock - can't get cfs lock
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: successfully acquired lock 'ha_agent_pve02_lock'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: fencing: acknowledged - got agent lock for node 'pve02'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: node 'pve02': state changed from 'fence' => 'unknown'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: service 'ct:44002': state changed from 'fence' to 'recovery'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: service 'vm:34002': state changed from 'fence' to 'recovery'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: recover service 'ct:44002' from fenced node 'pve02' to node 'pve03'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: service 'ct:44002': state changed from 'recovery' to 'started'  (node = pve03)
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: recover service 'vm:34002' from fenced node 'pve02' to node 'pve01'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: service 'vm:34002': state changed from 'recovery' to 'started'  (node = pve01)

lost node PVE02
Bash:
root@pve02:~# journalctl -u pve-ha-lrm  --since "10min ago"
Jan 20 08:15:48 pve02 pve-ha-lrm[1956]: lost lock 'ha_agent_pve02_lock - cfs lock update failed - Device or resource busy
Jan 20 08:15:53 pve02 pve-ha-lrm[1956]: status change active => lost_agent_lock
-- Boot 31898d9975dd4353adab722eece0f45e --
Jan 20 08:17:53 pve02 systemd[1]: Starting pve-ha-lrm.service - PVE Local HA Resource Manager Daemon...
Jan 20 08:17:54 pve02 pve-ha-lrm[1933]: starting server
Jan 20 08:17:54 pve02 pve-ha-lrm[1933]: status change startup => wait_for_agent_lock
Jan 20 08:17:54 pve02 systemd[1]: Started pve-ha-lrm.service - PVE Local HA Resource Manager Daemon.
I guess the last lines at 08:17 are from when I enabled the switch port again.
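For what it's worth, whether pve02 actually rebooted in between (rather than just getting its link back) can be checked with:

Bash:
# a new boot ID right after the test would suggest the node was reset by the watchdog
journalctl --list-boots | tail -n 3
uptime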
 
Not sure if you are still troubleshooting this, but this looks more reasonable...

Master (PVE03)
Bash:
root@pve03:~# journalctl -u pve-ha-crm --since "10min ago"
Jan 20 08:15:48 pve03 pve-ha-crm[1229]: node 'pve02': state changed from 'online' => 'unknown'
Jan 20 08:16:38 pve03 pve-ha-crm[1229]: service 'ct:44002': state changed from 'started' to 'fence'
Jan 20 08:16:38 pve03 pve-ha-crm[1229]: service 'vm:34002': state changed from 'started' to 'fence'
Jan 20 08:16:38 pve03 pve-ha-crm[1229]: node 'pve02': state changed from 'unknown' => 'fence'
Jan 20 08:16:38 pve03 pve-ha-crm[1229]: lost lock 'ha_agent_pve02_lock - can't get cfs lock
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: successfully acquired lock 'ha_agent_pve02_lock'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: fencing: acknowledged - got agent lock for node 'pve02'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: node 'pve02': state changed from 'fence' => 'unknown'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: service 'ct:44002': state changed from 'fence' to 'recovery'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: service 'vm:34002': state changed from 'fence' to 'recovery'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: recover service 'ct:44002' from fenced node 'pve02' to node 'pve03'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: service 'ct:44002': state changed from 'recovery' to 'started'  (node = pve03)
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: recover service 'vm:34002' from fenced node 'pve02' to node 'pve01'
Jan 20 08:17:48 pve03 pve-ha-crm[1229]: service 'vm:34002': state changed from 'recovery' to 'started'  (node = pve01)

So the master acknowledges the fencing of the lost pve02 and decides to recover the service on pve03 exactly 2 minutes after pve02 was first reported lost.

lost node PVE02
Bash:
root@pve02:~# journalctl -u pve-ha-lrm  --since "10min ago"
Jan 20 08:15:48 pve02 pve-ha-lrm[1956]: lost lock 'ha_agent_pve02_lock - cfs lock update failed - Device or resource busy
Jan 20 08:15:53 pve02 pve-ha-lrm[1956]: status change active => lost_agent_lock
-- Boot 31898d9975dd4353adab722eece0f45e --
Jan 20 08:17:53 pve02 systemd[1]: Starting pve-ha-lrm.service - PVE Local HA Resource Manager Daemon...
Jan 20 08:17:54 pve02 pve-ha-lrm[1933]: starting server
Jan 20 08:17:54 pve02 pve-ha-lrm[1933]: status change startup => wait_for_agent_lock
Jan 20 08:17:54 pve02 systemd[1]: Started pve-ha-lrm.service - PVE Local HA Resource Manager Daemon.
I guess the last lines at 08:17 are from when I enabled the switch port again.

This is not as interesting as it would be to see what happened on pve03, where the service then went on to be restarted. Ideally, you should have had the service recovered about 2 minutes into the loss of pve02 (plus its startup time, if significant); it should not have been 5 minutes as mentioned in the OP.
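For example, something along these lines on the nodes that took over the services (node names and time window taken from the logs above):

Bash:
# on pve03, which recovered ct:44002; same idea on pve01 for vm:34002
journalctl -u pve-ha-lrm --since "2024-01-20 08:10" --until "2024-01-20 08:25" | grep -E "44002|status change"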
 
