Cluster failure due to pve-ha-crm failing after recent package update

Rilane

New Member
Aug 18, 2025
Hey there,
I'd like to collect some feedback on a recent outage we had, which was seemingly triggered by the update of pve-manager (released via the enterprise repo) on August 12th.

One of our PVE clusters, which consists of 9 nodes, had 5 of them fail over the course of 15 minutes. The cluster runs about 500 VMs, all with an HA state configured, and uses Ceph with 108 OSDs.

A rough timeline of the outage can be reconstructed from the KNET logs on the first node:

Code:
Aug 12 06:42:48 pve-01 corosync[2916]:   [KNET  ] link: host: 2 link: 0 is down
Aug 12 06:42:48 pve-01 corosync[2916]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 06:42:48 pve-01 corosync[2916]:   [KNET  ] host: host: 2 has no active links
Aug 12 06:45:55 pve-01 corosync[2916]:   [KNET  ] link: host: 4 link: 0 is down
Aug 12 06:45:55 pve-01 corosync[2916]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 12 06:45:55 pve-01 corosync[2916]:   [KNET  ] host: host: 4 has no active links
Aug 12 06:45:57 pve-01 corosync[2916]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Aug 12 06:45:57 pve-01 corosync[2916]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 06:45:57 pve-01 corosync[2916]:   [KNET  ] pmtud: Global data MTU changed to: 8885
Aug 12 06:47:59 pve-01 corosync[2916]:   [KNET  ] link: host: 5 link: 0 is down
Aug 12 06:47:59 pve-01 corosync[2916]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Aug 12 06:47:59 pve-01 corosync[2916]:   [KNET  ] host: host: 5 has no active links
Aug 12 06:50:06 pve-01 corosync[2916]:   [KNET  ] link: host: 7 link: 0 is down
Aug 12 06:50:06 pve-01 corosync[2916]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Aug 12 06:50:06 pve-01 corosync[2916]:   [KNET  ] host: host: 7 has no active links
Aug 12 06:54:42 pve-01 corosync[2916]:   [KNET  ] link: host: 2 link: 0 is down
Aug 12 06:54:42 pve-01 corosync[2916]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 06:54:42 pve-01 corosync[2916]:   [KNET  ] host: host: 2 has no active links
Aug 12 06:57:59 pve-01 corosync[2916]:   [KNET  ] link: host: 3 link: 0 is down
Aug 12 06:57:59 pve-01 corosync[2916]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 12 06:57:59 pve-01 corosync[2916]:   [KNET  ] host: host: 3 has no active links

pve-02 failed (and briefly reappeared, only to fail again), followed by pve-04, pve-05, pve-07 and pve-03. At that point only 4 of the 9 nodes were left, and the entire cluster failed/fenced, resulting in a production outage.

It all seems to begin with an automated package update on pve-02:

Code:
Aug 12 06:28:05 pve-02 systemd[1]: Starting apt-daily-upgrade.service - Daily apt upgrade and clean activities...
Aug 12 06:28:05 pve-02 unattended-upgrade[485363]: Enabled logging to syslog via daemon facility
Aug 12 06:28:05 pve-02 unattended-upgrade[485363]: Starting unattended upgrades script
Aug 12 06:28:05 pve-02 unattended-upgrade[485363]: Allowed origins are: origin=Proxmox, o=Proxmox, origin=Debian,codename=bookworm,label=Debian, origin=Debian,codename=bookworm,label=Debian-Security, origin=Debian,codename=bookworm-security,label=Debian-Security, o=Debian,a=stable-updates, o=Zabbix,c=main,n=bookworm
Aug 12 06:28:05 pve-02 unattended-upgrade[485363]: Initial blacklist:
Aug 12 06:28:05 pve-02 unattended-upgrade[485363]: Initial whitelist (not strict):
Aug 12 06:28:09 pve-02 unattended-upgrade[485363]: Packages that will be upgraded: libpve-common-perl proxmox-backup-client proxmox-backup-file-restore pve-docs pve-manager
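
For context, the "allowed origins" and blacklist reported above are driven by the unattended-upgrades apt configuration, which on a stock Debian/PVE install typically lives in /etc/apt/apt.conf.d/50unattended-upgrades. A rough, illustrative sketch of the two knobs involved (the exact patterns and file layout may differ from ours):

Code:
// Repositories unattended-upgrades is allowed to pull upgrades from
Unattended-Upgrade::Origins-Pattern {
        "origin=Proxmox";
        "origin=Debian,codename=${distro_codename},label=Debian-Security";
};

// Packages matching these patterns are never upgraded automatically;
// cluster-critical packages could be listed here to keep them on manual, staged updates
Unattended-Upgrade::Package-Blacklist {
        "pve-manager";
        "pve-ha-manager";
};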


Various PVE-related services are restarted as part of the pve-manager update; everything seems fine:

Code:
Aug 12 06:29:10 pve-02 pve-ha-lrm[3041484]: watchdog closed (disabled)
Aug 12 06:29:10 pve-02 pve-ha-lrm[3041484]: server stopped
Aug 12 06:29:11 pve-02 systemd[1]: pve-ha-lrm.service: Deactivated successfully.
Aug 12 06:29:11 pve-02 systemd[1]: Stopped pve-ha-lrm.service - PVE Local HA Resource Manager Daemon.
Aug 12 06:29:11 pve-02 systemd[1]: pve-ha-lrm.service: Consumed 1d 5h 21min 30.790s CPU time.
Aug 12 06:29:11 pve-02 systemd[1]: Starting pve-ha-lrm.service - PVE Local HA Resource Manager Daemon...
Aug 12 06:29:11 pve-02 pve-ha-lrm[488117]: starting server
Aug 12 06:29:11 pve-02 pve-ha-lrm[488117]: status change startup => wait_for_agent_lock
Aug 12 06:29:11 pve-02 systemd[1]: Started pve-ha-lrm.service - PVE Local HA Resource Manager Daemon.
Aug 12 06:29:11 pve-02 systemd[1]: Stopping pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon...
Aug 12 06:29:12 pve-02 pve-ha-crm[3041633]: received signal TERM
Aug 12 06:29:12 pve-02 pve-ha-crm[3041633]: server received shutdown request
Aug 12 06:29:12 pve-02 pve-ha-crm[3041633]: server stopped
Aug 12 06:29:13 pve-02 systemd[1]: pve-ha-crm.service: Deactivated successfully.
Aug 12 06:29:13 pve-02 systemd[1]: Stopped pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon.
Aug 12 06:29:13 pve-02 systemd[1]: pve-ha-crm.service: Consumed 10min 59.217s CPU time.
Aug 12 06:29:13 pve-02 systemd[1]: Starting pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon...
Aug 12 06:29:14 pve-02 pve-ha-crm[488126]: starting server
Aug 12 06:29:14 pve-02 pve-ha-crm[488126]: status change startup => wait_for_quorum
Aug 12 06:29:14 pve-02 systemd[1]: Started pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon.
Aug 12 06:29:18 pve-02 unattended-upgrade[485363]: All upgrades installed
Aug 12 06:29:19 pve-02 systemd[1]: apt-daily-upgrade.service: Deactivated successfully.
Aug 12 06:29:19 pve-02 systemd[1]: Finished apt-daily-upgrade.service - Daily apt upgrade and clean activities.
Aug 12 06:29:19 pve-02 systemd[1]: apt-daily-upgrade.service: Consumed 31.786s CPU time.

pve-02 resumes managing its VMs, but gets stuck in a bad loop while trying to manage the HA state. There are thousands of log entries of it changing the state of VMs from freeze to started and back. During this period there is no outage; all VMs continue to work fine.

Code:
Aug 12 06:29:36 pve-02 pve-ha-lrm[488117]: status change wait_for_agent_lock => active
Aug 12 06:31:31 pve-02 pve-ha-crm[488126]: successfully acquired lock 'ha_manager_lock'
Aug 12 06:31:31 pve-02 pve-ha-crm[488126]: watchdog active
Aug 12 06:31:31 pve-02 pve-ha-crm[488126]: status change slave => master
Aug 12 06:31:31 pve-02 pve-ha-crm[488126]: using scheduler mode 'static'
Aug 12 06:31:46 pve-02 pve-ha-crm[488126]: service 'vm:1005012': state changed from 'freeze' to 'started'
(many removed repeats)
Aug 12 06:31:58 pve-02 pve-ha-crm[488126]: service 'vm:2799002': state changed from 'freeze' to 'started'
Aug 12 06:31:58 pve-02 pve-ha-crm[488126]: service 'vm:2834011': state changed from 'freeze' to 'started'
Aug 12 06:31:59 pve-02 pve-ha-crm[488126]: service 'vm:2836011': state changed from 'freeze' to 'started'
Aug 12 06:31:59 pve-02 pve-ha-crm[488126]: service 'vm:3436013': state changed from 'freeze' to 'started'
Aug 12 06:31:59 pve-02 pve-ha-crm[488126]: service 'vm:3438011': state changed from 'freeze' to 'started'
Aug 12 06:31:59 pve-02 pve-ha-crm[488126]: service 'vm:3443011': state changed from 'freeze' to 'started'
Aug 12 06:32:00 pve-02 pve-ha-crm[488126]: service 'vm:3448011': state changed from 'freeze' to 'started'
Aug 12 06:32:00 pve-02 pve-ha-crm[488126]: loop take too long (31 seconds)
Aug 12 06:32:11 pve-02 pve-ha-crm[488126]: service 'vm:2233011': state changed from 'request_stop' to 'stopped'
Aug 12 06:32:11 pve-02 pve-ha-crm[488126]: service 'vm:2238011': state changed from 'request_stop' to 'stopped'
Aug 12 06:34:21 pve-02 pve-ha-crm[488126]: service 'vm:1008010': state changed from 'started' to 'freeze'
Aug 12 06:34:21 pve-02 pve-ha-crm[488126]: service 'vm:1098011': state changed from 'started' to 'freeze'
Aug 12 06:34:22 pve-02 pve-ha-crm[488126]: service 'vm:1135011': state changed from 'started' to 'freeze'
(many repeats)
Aug 12 06:42:34 pve-02 pve-ha-crm[488126]: service 'vm:2536011': state changed from 'started' to 'freeze'
Aug 12 06:42:34 pve-02 pve-ha-crm[488126]: service 'vm:2543011': state changed from 'started' to 'freeze'
Aug 12 06:42:35 pve-02 pve-ha-crm[488126]: service 'vm:2598011': state changed from 'started' to 'freeze'
Aug 12 06:42:35 pve-02 watchdog-mux[1971]: client watchdog expired - disable watchdog updates
Aug 12 06:42:35 pve-02 pve-ha-crm[488126]: service 'vm:2640011': state changed from 'started' to 'freeze'
Aug 12 06:42:36 pve-02 pve-ha-crm[488126]: service 'vm:2642011': state changed from 'started' to 'freeze'
Aug 12 06:42:36 pve-02 watchdog-mux[1971]: exit watchdog-mux with active connections
Aug 12 06:42:36 pve-02 systemd-journald[2334617]: Received client request to sync journal.
Aug 12 06:42:36 pve-02 kernel: watchdog: watchdog0: watchdog did not stop!
Aug 12 06:42:36 pve-02 systemd[1]: watchdog-mux.service: Deactivated successfully.
Aug 12 06:42:36 pve-02 systemd[1]: watchdog-mux.service: Consumed 3min 20.188s CPU time.
Aug 12 06:42:36 pve-02 pve-ha-crm[488126]: service 'vm:2732011': state changed from 'started' to 'freeze'
Aug 12 06:42:37 pve-02 pve-ha-crm[488126]: service 'vm:2748011': state changed from 'started' to 'freeze'
Aug 12 06:42:37 pve-02 pve-ha-crm[488126]: service 'vm:2798012': state changed from 'started' to 'freeze'
Aug 12 06:42:38 pve-02 pve-ha-crm[488126]: service 'vm:2898012': state changed from 'started' to 'freeze'
Aug 12 06:42:38 pve-02 pve-ha-crm[488126]: service 'vm:2933011': state changed from 'started' to 'freeze'
Aug 12 06:42:39 pve-02 pve-ha-crm[488126]: service 'vm:2936011': state changed from 'started' to 'freeze'
Aug 12 06:42:39 pve-02 pve-ha-crm[488126]: service 'vm:2936013': state changed from 'started' to 'freeze'
Aug 12 06:42:40 pve-02 pve-ha-crm[488126]: service 'vm:2936015': state changed from 'started' to 'freeze'
Aug 12 06:42:40 pve-02 pve-ha-crm[488126]: service 'vm:2939011': state changed from 'started' to 'freeze'
Aug 12 06:42:41 pve-02 pve-ha-crm[488126]: loop take too long (67 seconds)
Aug 12 06:42:41 pve-02 pve-ha-crm[488126]: watchdog update failed - Broken pipe
-- Boot ab6d795743444738879327aaebb0567a --


After a few minutes of this, pve-02 fences itself, as the stuck pve-ha-crm apparently stops updating the watchdog.
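
(For anyone retracing this on their own nodes: the HA and watchdog messages above can be pulled per node with something along these lines.)

Code:
# Pull HA manager/LRM and watchdog messages for the incident window on one node
journalctl -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux \
    --since "2025-08-12 06:25:00" --until "2025-08-12 07:00:00"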

pve-04 then takes over as the pve-ha-crm manager. It begins spamming the same state changes, and one minute after taking over, its watchdog kills it as well:

Code:
Aug 12 06:43:50 pve-04 pmxcfs[1836186]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve-06/cephfs: -1
Aug 12 06:44:42 pve-04 pve-ha-crm[155556]: successfully acquired lock 'ha_manager_lock'
Aug 12 06:44:42 pve-04 pve-ha-crm[155556]: watchdog active
Aug 12 06:44:42 pve-04 pve-ha-crm[155556]: status change slave => master
Aug 12 06:44:42 pve-04 pve-ha-crm[155556]: using scheduler mode 'static'
Aug 12 06:44:42 pve-04 pve-ha-crm[155556]: node 'pve-02': state changed from 'online' => 'unknown'
Aug 12 06:45:05 pve-04 pve-ha-crm[155556]: service 'vm:1136011': state changed from 'freeze' to 'started'
Aug 12 06:45:06 pve-04 pve-ha-crm[155556]: service 'vm:1136039': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:06 pve-04 pve-ha-crm[155556]: service 'vm:1136040': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:07 pve-04 pve-ha-crm[155556]: service 'vm:1142011': state changed from 'freeze' to 'started'
Aug 12 06:45:07 pve-04 pve-ha-crm[155556]: service 'vm:1142013': state changed from 'freeze' to 'started'
Aug 12 06:45:07 pve-04 pve-ha-crm[155556]: service 'vm:1233012': state changed from 'freeze' to 'started'
Aug 12 06:45:08 pve-04 pve-ha-crm[155556]: service 'vm:1236039': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:08 pve-04 pve-ha-crm[155556]: service 'vm:1242011': state changed from 'freeze' to 'started'
(many removed repeats)
Aug 12 06:45:40 pve-04 pve-ha-crm[155556]: service 'vm:2099002': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:40 pve-04 pve-ha-crm[155556]: service 'vm:2099003': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:41 pve-04 pve-ha-crm[155556]: service 'vm:2105011': state changed from 'freeze' to 'started'
Aug 12 06:45:41 pve-04 pve-ha-crm[155556]: service 'vm:2105012': state changed from 'freeze' to 'started'
Aug 12 06:45:42 pve-04 pve-ha-crm[155556]: service 'vm:2132011': state changed from 'freeze' to 'started'
Aug 12 06:45:42 pve-04 pve-ha-crm[155556]: service 'vm:2133011': state changed from 'freeze' to 'started'
Aug 12 06:45:42 pve-04 pve-ha-crm[155556]: service 'vm:2134011': state changed from 'freeze' to 'started'
Aug 12 06:45:43 pve-04 pve-ha-crm[155556]: service 'vm:2135011': state changed from 'freeze' to 'started'
Aug 12 06:45:43 pve-04 watchdog-mux[2001]: client watchdog expired - disable watchdog updates
Aug 12 06:45:43 pve-04 watchdog-mux[2001]: exit watchdog-mux with active connections
Aug 12 06:45:43 pve-04 systemd-journald[3257916]: Received client request to sync journal.
Aug 12 06:45:43 pve-04 kernel: watchdog: watchdog0: watchdog did not stop!
Aug 12 06:45:43 pve-04 systemd[1]: watchdog-mux.service: Deactivated successfully.
Aug 12 06:45:43 pve-04 systemd[1]: watchdog-mux.service: Consumed 2min 53.237s CPU time.
Aug 12 06:45:43 pve-04 pve-ha-crm[155556]: service 'vm:2136011': state changed from 'freeze' to 'started'
Aug 12 06:45:44 pve-04 pve-ha-crm[155556]: service 'vm:2136013': state changed from 'freeze' to 'started'
Aug 12 06:45:44 pve-04 pve-ha-crm[155556]: service 'vm:2136015': state changed from 'freeze' to 'started'
Aug 12 06:45:45 pve-04 pve-ha-crm[155556]: service 'vm:2137011': state changed from 'freeze' to 'started'
Aug 12 06:45:45 pve-04 pve-ha-crm[155556]: service 'vm:2138011': state changed from 'freeze' to 'started'
Aug 12 06:45:45 pve-04 pve-ha-crm[155556]: service 'vm:2139011': state changed from 'freeze' to 'started'
Aug 12 06:45:46 pve-04 pve-ha-crm[155556]: service 'vm:2140011': state changed from 'freeze' to 'started'
Aug 12 06:45:46 pve-04 pve-ha-crm[155556]: service 'vm:2141011': state changed from 'freeze' to 'started'
Aug 12 06:45:47 pve-04 pve-ha-crm[155556]: service 'vm:2142011': state changed from 'freeze' to 'started'
Aug 12 06:45:47 pve-04 pve-ha-crm[155556]: service 'vm:2143011': state changed from 'freeze' to 'started'
Aug 12 06:45:48 pve-04 pve-ha-crm[155556]: service 'vm:2148011': state changed from 'freeze' to 'started'
Aug 12 06:45:48 pve-04 pve-ha-crm[155556]: service 'vm:2198011': state changed from 'freeze' to 'started'
Aug 12 06:45:49 pve-04 pve-ha-crm[155556]: service 'vm:2198012': state changed from 'freeze' to 'started'
Aug 12 06:45:49 pve-04 pve-ha-crm[155556]: service 'vm:2199002': state changed from 'freeze' to 'started'
Aug 12 06:45:49 pve-04 pve-ha-crm[155556]: service 'vm:2199003': state changed from 'freeze' to 'started'
Aug 12 06:45:50 pve-04 pve-ha-crm[155556]: service 'vm:2205011': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:50 pve-04 pve-ha-crm[155556]: service 'vm:2205012': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:51 pve-04 pve-ha-crm[155556]: service 'vm:2237011': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:51 pve-04 pve-ha-crm[155556]: service 'vm:2240011': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:52 pve-04 pve-ha-crm[155556]: service 'vm:2242011': state changed from 'freeze' to 'request_stop'

This log was not cut off by me; the message above is the last one before the node fenced.

This exact pattern repeated on every other host that went down. One by one they took over as HA master ("status change slave => master"), attempted to manage the VM HA state, and were quickly fenced by their own watchdog, even though networking, all VMs and all services other than pve-ha-crm kept working perfectly fine until the last second.

After cold-starting the entire cluster, the issue was gone and everything worked as usual. Before this incident, the cluster had run fine through restarts, updates, etc. for about a year.

I'd appreciate thoughts on:

  • What could be done to avoid such outages?
  • This happened entirely unattended, but even if someone had been watching the failure cascade live, we have no idea what they could have done to stop it
  • What are possible causes of this outage?

Thank you
 
Hello,
This does sound like the nodes have been fenced. The cluster stack (Corosync) relies on a low-latency, stable network and should therefore have its own dedicated physical network [1]. It doesn't require much bandwidth, 1G is more than sufficient; what matters most is low latency [2].
I am not sure whether you have a dedicated NIC for corosync; if not, I would suggest starting with that and separating the storage network from it. You may also consider adding a redundant link [3], see the sketch below.

[1] https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_cluster_network
[2] https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_cluster_requirements
[3] https://pve.proxmox.com/wiki/Cluster_Manager#_adding_redundant_links_to_an_existing_cluster
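
As a rough illustration only (addresses are placeholders, and corosync.conf must be edited through /etc/pve/corosync.conf with config_version bumped, following the procedure in [3]), a node entry with a second, physically separate link looks roughly like this:

Code:
nodelist {
  node {
    # ring0_addr: existing cluster network, ring1_addr: additional, physically separate network
    name: pve-01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
    ring1_addr: 10.20.20.1
  }
  # ... one such entry per node, each carrying both addresses ...
}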
 
Hi!

Which versions of pve-manager and pve-ha-manager are you running on the nodes in the cluster?
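
If in doubt, they can be listed on each node with, for example:

Code:
# List the installed versions of the relevant Proxmox components
pveversion -v | grep -E 'pve-manager|pve-ha-manager'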

I'd appreciate thoughts on:

  • What could be done to avoid such outages?
  • This happened entirely unattended, but even if someone had been watching the failure cascade live, we have no idea what they could have done to stop it
  • What are possible causes of this outage?

When the pve-ha-manager package is upgraded (which happens as a dependency of pve-manager upgrades), the LRM is put into restart mode and the local HA resources are then put into the freeze state by the HA Manager, whose role might have moved to another node if the current node (e.g. pve-01) was the HA Manager before. After the upgrade is done, the LRM is put into active mode again and the HA Manager moves the frozen HA resources back into the started/request_stop state, depending on their configured state.
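
As a side note, this freeze/unfreeze cycle can usually be observed live during such an upgrade, for example with:

Code:
# Watch the HA stack's view: current manager node, LRM modes and per-service states
watch -n 2 'ha-manager status'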

Without knowing the versions, one likely cause I can see from the logs, which resulted in the HA Manager not updating its own watchdog and therefore fencing its own node, is the use of the static load scheduler. The static load scheduler is currently still a technology preview and doesn't handle hundreds or thousands of HA resources very well [0], especially if a lot of services change their state at once. So it shouldn't be used on production systems and is currently only recommended for small setups and testing purposes.

I'm currently preparing a patch series to fix this issue, so that the static load scheduler can handle thousands of HA resources more efficiently. With some more testing of the scheduler's stability, these changes could move the static load scheduler out of technology preview for broader production usage.

In the meantime, this means the static load scheduler shouldn't be used on large production clusters with many configured HA resources until these issues have been resolved.
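
For reference, and as a sketch only (assuming the scheduler was enabled via the cluster-wide CRS option), switching back to the default scheduler is a one-line change:

Code:
# Show the currently configured cluster resource scheduler (CRS) mode
grep '^crs:' /etc/pve/datacenter.cfg
# Expected here: "crs: ha=static"
# Changing the line to "crs: ha=basic" (or removing it, 'basic' is the default)
# switches back to the default scheduler; the same option is also exposed in the
# GUI under Datacenter -> Options.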

[0] https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_static_load_scheduler