pve-ha-crm breaking our cluster... again

Michiel_1afa · Jun 10, 2026

3x this week already: last Saturday, yesterday and this morning.

pve-ha-crm decides to die without cause, or at least without a usable message.

"watchdog update failed - Broken pipe"

The server is not doing anything special at that moment, no heavy load, no network issues, no issues on other nodes in the cluster.

journalctl for pve-ha-crm shows no messages before the reboot. By the way, the message above ("watchdog update failed - Broken pipe") is not stored in the journal, which makes debugging past incidents even harder.
pve-ha-lrm, pve-cluster and corosync all show no errors, and we have 0 lost packets on the network (taken from the switch stats, since the PVE node had just rebooted).
pvestatd has a round time of 5-8 seconds for this cluster

system load at time of reboot was ~10% cpu and ~40% ram. There was a single migration running towards this machine.

What else can I check? After more than 2 years of this issue happening randomly, it would be nice to know what the hell is going on.

As far as software versions go, I did a full update of the cluster yesterday (enterprise repo) hoping to solve this issue.

dakralex · Jun 10, 2026

Hi!

Does this happen on a specific cluster node or all cluster nodes?
What kind of watchdog is used on the node(s) where this happens?
Are there any kernel parameters set?
Does any other software on the cluster nodes compete for the /dev/watchdog device?

Otherwise, it would still be good to have the output of journalctl -k and journalctl -u corosync -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm in the time frame where this happens for context on the nodes where this happens.

Michiel_1afa · Jun 10, 2026

dakralex said:
Does this happen on a specific cluster node or all cluster nodes?

It happens often on one cluster and less often on another ("often" meaning 2-3x per month if we do not touch anything manually).

dakralex said:
What kind of watchdog is used on the node(s) where this happens?

Softdog. We're on HP machines, so the HP modules are already blacklisted per an earlier 'solution'.

dakralex said:
Are there any kernel parameters set?

Nothing custom.

dakralex said:
Does any other software on the cluster nodes compete for the /dev/watchdog device?

I hope not, it's only Proxmox installed there. The logs do not seem to indicate any problems, as stated in the first post.

dakralex said:
Otherwise, it would still be good to have the output of journalctl -k and journalctl -u corosync -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm in the time frame where this happens for context on the nodes where this happens.

journalctl -k - added as attachment.
journalctl -u.... below (of course I have data from before 8:18 this morning, but that will be irrelevant). Below is everything from 8 am until the reboot at 8:47.

Code:

Jun 10 08:18:31 pve1 pve-ha-crm[2626]: got crm command: migrate vm:170103 pve1
Jun 10 08:18:31 pve1 pve-ha-crm[2626]: migrate service 'vm:170103' to node 'pve1'
Jun 10 08:18:31 pve1 pve-ha-crm[2626]: service 'vm:170103': state changed from 'started' to 'migrate'  (node = pve3, target = pve1)
Jun 10 08:19:11 pve1 pve-ha-crm[2626]: service 'vm:170103': state changed from 'migrate' to 'started'  (node = pve1)
Jun 10 08:46:52 pve1 pve-ha-crm[2626]: got crm command: migrate vm:43602 pve1
Jun 10 08:46:52 pve1 pve-ha-crm[2626]: migrate service 'vm:43602' to node 'pve1'
Jun 10 08:46:52 pve1 pve-ha-crm[2626]: service 'vm:43602': state changed from 'started' to 'migrate'  (node = pve3, target = pve1)
Jun 10 08:47:37 pve1 watchdog-mux[1980]: client (PID 16532) did not stop watchdog - disable watchdog updates
Jun 10 08:47:37 pve1 systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=6/ABRT
Jun 10 08:47:37 pve1 systemd[1]: pve-ha-lrm.service: Failed with result 'signal'.
Jun 10 08:47:37 pve1 systemd[1]: pve-ha-lrm.service: Consumed 29min 50.272s CPU time, 234.8M memory peak.
Jun 10 08:47:38 pve1 watchdog-mux[1980]: exit watchdog-mux with active connections
Jun 10 08:47:38 pve1 systemd[1]: watchdog-mux.service: Deactivated successfully.
Jun 10 08:47:38 pve1 systemd[1]: watchdog-mux.service: Consumed 1.355s CPU time, 2M memory peak.
-- Boot 96a8addbf9aa48ba8572c1d19dd47fe7 --
Jun 10 08:51:27 pve1 systemd[1]: Started watchdog-mux.service - Proxmox VE watchdog multiplexer.

The "watchdog update failed - Broken pipe" notice showed up at 8:47:42 in our external monitoring system.
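
For reading logs like the one above, a small script can pull out which unit died and with which signal. The following is only a minimal sketch in Python, not a Proxmox tool: the function name `find_service_exits` and the regular expression are illustrative assumptions, written against the systemd lines quoted in this post.

```python
import re

# Matches systemd lines such as:
# "... systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=6/ABRT"
EXIT_RE = re.compile(
    r"systemd\[\d+\]: (?P<unit>\S+): Main process exited, "
    r"code=(?P<code>\w+), status=(?P<status>\d+)/(?P<name>\w+)"
)


def find_service_exits(lines):
    """Return (unit, code, status number, status name) for each exit line."""
    exits = []
    for line in lines:
        m = EXIT_RE.search(line)
        if m:
            exits.append((m["unit"], m["code"], int(m["status"]), m["name"]))
    return exits


# Sample lines copied from the journal excerpt above.
sample = [
    "Jun 10 08:47:37 pve1 watchdog-mux[1980]: client (PID 16532) did not stop watchdog - disable watchdog updates",
    "Jun 10 08:47:37 pve1 systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=6/ABRT",
    "Jun 10 08:47:37 pve1 systemd[1]: pve-ha-lrm.service: Failed with result 'signal'.",
]

for unit, code, status, name in find_service_exits(sample):
    print(f"{unit}: {code} by signal {status} ({name})")
```

On this excerpt the only unit reported is pve-ha-lrm.service, killed with status 6 (ABRT), so the process that exits in the journal is pve-ha-lrm rather than pve-ha-crm.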

dakralex · Jun 10, 2026

Thanks, could you also provide the kernel syslog for the boot before the node rebooted itself?

dakralex · Jun 10, 2026

Michiel_1afa said:
journalctl -u.... below (of course I have data from before 8:18 this morning, but that will be irrelevant). Below is everything from 8 am until the reboot at 8:47.

Does the pve-ha-lrm service acquire its lock on pve1 and pve3 correctly?

Michiel_1afa · Jun 10, 2026

dakralex said:
Thanks, could you also provide the kernel syslog for the boot before the node rebooted itself?

Thought you might want that: "journalctl -k -b -1" attached.

dakralex said:
Does the pve-ha-lrm service acquire its lock on pve1 and pve3 correctly?

As far as I know, yes. We scrutinize the logs daily and I'm not aware of any errors or warnings on this part. I scanned the logs on all servers for "unable to acquire lock" from Saturday until today (5 days) and got 0 hits, but 3 reboots.

Michiel_1afa · Jun 24, 2026

Hello, any new input on this? Since the last message we have had one more 'crash' of pvestatd (the process kept running, but there were no more updates).

Johannes S · Jun 25, 2026

How is your network configured?

Michiel_1afa · Jun 26, 2026

Johannes S said:
How is your network configured?

mlag -> 2 bonds -> several vlans.
corosync is configured to use a primary and backup, which are also split across both bonds.
On the corosync part my last error was months ago.

Had another incident yesterday, and I do think I got slightly closer to finding out what the heck is going on. Different datacenter and cluster, but a very similar configuration.

Code:

journalctl -b -1 -f

Jun 25 17:07:53 pve28 pvestatd[870814]: status update time (7.681 seconds)
Jun 25 17:08:04 pve28 pvestatd[870814]: status update time (7.861 seconds)
Jun 25 17:08:08 pve28 watchdog-mux[2010]: client (PID 11817) did not stop watchdog - disable watchdog updates
Jun 25 17:08:08 pve28 systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=6/ABRT
Jun 25 17:08:08 pve28 systemd[1]: pve-ha-lrm.service: Failed with result 'signal'.
Jun 25 17:08:08 pve28 systemd-journald[1336]: Received client request to sync journal.
Jun 25 17:08:08 pve28 systemd[1]: pve-ha-lrm.service: Consumed 4d 4h 53min 32.205s CPU time, 202.5M memory peak.
Jun 25 17:08:09 pve28 watchdog-mux[2010]: exit watchdog-mux with active connections
Jun 25 17:08:09 pve28 systemd-journald[1336]: Received client request to sync journal.
Jun 25 17:08:09 pve28 kernel: watchdog: watchdog0: watchdog did not stop!

I did notice another potential problem yesterday when I was rebooting our PBS server in that same location. At that point pvestatd threw some errors about the unreachable storage and would almost have timed out because of it, but I was in time to disable all the PBS storages and sort that out. I am now in the process of changing my backups to enable and disable the storages before and after each backup.
Although this was an issue, I'm unsure whether it is directly related.
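
To see how the pvestatd round times develop before an incident, the "status update time" lines can be extracted from the journal. Below is a minimal sketch in Python, assuming the log format shown above; `status_update_times` is a hypothetical helper name, not part of any Proxmox package.

```python
import re

# Matches lines such as:
# "... pvestatd[870814]: status update time (7.681 seconds)"
UPDATE_RE = re.compile(
    r"pvestatd\[\d+\]: status update time \((?P<secs>\d+(?:\.\d+)?) seconds\)"
)


def status_update_times(lines):
    """Return the reported pvestatd update durations in seconds, in log order."""
    times = []
    for line in lines:
        m = UPDATE_RE.search(line)
        if m:
            times.append(float(m["secs"]))
    return times


# Sample lines copied from the journal excerpt above.
sample = [
    "Jun 25 17:07:53 pve28 pvestatd[870814]: status update time (7.681 seconds)",
    "Jun 25 17:08:04 pve28 pvestatd[870814]: status update time (7.861 seconds)",
    "Jun 25 17:08:08 pve28 watchdog-mux[2010]: client (PID 11817) did not stop watchdog - disable watchdog updates",
]

times = status_update_times(sample)
print("samples:", len(times), "slowest:", max(times), "seconds")
```

Fed with a longer stretch of journal text, the same function would show whether the round times climb before a reboot or stay flat, as they do in this excerpt.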

dakralex · Jun 26, 2026

Michiel_1afa said:
I did notice another potential problem yesterday when I was rebooting our PBS server in that same location. At that point pvestatd threw some errors about the unreachable storage and would almost have timed out because of it, but I was in time to disable all the PBS storages and sort that out. I am now in the process of changing my backups to enable and disable the storages before and after each backup.
Although this was an issue, I'm unsure whether it is directly related.

I don't think that pvestatd is related to this issue, because neither pve-ha-crm nor pve-ha-lrm directly depend on pvestatd. I haven't found any clues why pve-ha-lrm in particular is killed with SIGABRT...

Michiel_1afa said:
Softdog. We're on HP machines, so the HP modules are already blacklisted per an earlier 'solution'.

How were the HP modules blacklisted? Was the pve-ha-lrm.service systemd unit changed in any way? I wonder why pve-ha-lrm has exited with a SIGABRT.

Michiel_1afa · Jun 26, 2026

dakralex said:
I don't think that pvestatd is related to this issue, because neither pve-ha-crm nor pve-ha-lrm directly depend on pvestatd. I haven't found any clues why pve-ha-lrm in particular is killed with SIGABRT...

That is my aim as well, and why I was asking whether it's feasible to debug that daemon in some form. The abort seems to come out of nowhere but has quite heavy consequences.

dakralex said:
How were the HP modules blacklisted? Was the pve-ha-lrm.service systemd unit changed in any way? I wonder why pve-ha-lrm has exited with a SIGABRT.

echo "blacklist hpwdt" > /etc/modprobe.d/blacklist-hp.conf
followed by update-initramfs.
However, this was only applied on one cluster as we were running different tests; clusters with and without the blacklisting are still acting up.

In another status update: removing all my PBS-linked storages did shut up pvestatd (which is a good thing).

Michiel_1afa · Jun 29, 2026

More updates from over the weekend: 2 reboots/crashes on 2 clusters.

1. Could this be a problem with the server firmware or microcode? This one also gives slightly more information:

Code:

Jun 27 11:41:22 pve30 kernel: pve-ha-lrm[113646]: segfault at 8 ip 00005aa1b48e80f9 sp 00007ffca69d7630 error 4 in perl[19a0f9,5aa1b4792000+1ae000] likely on CPU 79 (core 9, socket 1)
Jun 27 11:41:22 pve30 kernel: Code: 05 bd 05 00 00 00 45 31 d2 41 b8 05 00 00 00 89 43 0c 4c 8d 1d 00 fe 1d 00 4b 8b 94 c5 88 07 00 00 48 85 d2 0f 84 17 03 00 00 <48> 8b 02 4b 89 84 c5 88 07 00 00 41 0f b6 03 83 f8 08 0f 83 d7 01
Jun 27 11:41:22 pve30 watchdog-mux[1990]: client (PID 113646) did not stop watchdog - disable watchdog updates
Jun 27 11:41:22 pve30 systemd-journald[1305]: Received client request to sync journal.
Jun 27 11:41:22 pve30 systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=11/SEGV
Jun 27 11:41:22 pve30 systemd[1]: pve-ha-lrm.service: Failed with result 'signal'.
Jun 27 11:41:22 pve30 systemd[1]: pve-ha-lrm.service: Consumed 2d 9h 55min 32.416s CPU time, 514.7M memory peak.
Jun 27 11:41:23 pve30 watchdog-mux[1990]: exit watchdog-mux with active connections
Jun 27 11:41:23 pve30 systemd-journald[1305]: Received client request to sync journal.
Jun 27 11:41:23 pve30 kernel: watchdog: watchdog0: watchdog did not stop!
Jun 27 11:41:23 pve30 systemd[1]: watchdog-mux.service: Deactivated successfully.
Jun 27 11:41:23 pve30 systemd[1]: watchdog-mux.service: Consumed 53.642s CPU time, 1.8M memory peak.
Jun 27 11:41:24 pve30 pve-ha-crm[2635]: watchdog update failed - Broken pipe

Later that day we got another crash and reboot, from another server in another cluster, but also a segfault, this time with a few extra lines.

Code:

Jun 27 18:04:39 pve21 pve-ha-lrm[11408]: Attempt to free unreferenced scalar: SV 0x5c861557a3f0, Perl interpreter: 0x5c8612dd72a0 at /usr/share/perl5/PVE/HA/LRM.pm line 871.
Jun 27 18:04:39 pve21 pve-ha-lrm[11408]: Attempt to free unreferenced scalar: SV 0x5c861557a3f0, Perl interpreter: 0x5c8612dd72a0 at /usr/share/perl5/PVE/HA/LRM.pm line 871.
Jun 27 18:04:39 pve21 pve-ha-lrm[11408]: Attempt to free unreferenced scalar: SV 0x5c861557a3f0, Perl interpreter: 0x5c8612dd72a0 at /usr/share/perl5/PVE/HA/LRM.pm line 871.
Jun 27 18:04:39 pve21 pve-ha-lrm[11408]: Attempt to free unreferenced scalar: SV 0x5c861557a3f0, Perl interpreter: 0x5c8612dd72a0 at /usr/share/perl5/PVE/HA/LRM.pm line 871.
Jun 27 18:04:41 pve21 pve-ha-lrm[11408]: Attempt to free unreferenced scalar: SV 0x5c861557a3f0, Perl interpreter: 0x5c8612dd72a0 at /usr/share/perl5/PVE/HA/LRM.pm line 871.
Jun 27 18:04:49 pve21 kernel: pve-ha-lrm[11408]: segfault at ff00000012 ip 00005c85ec0277d0 sp 00007ffff2845510 error 4 in perl[18e7d0,5c85ebedd000+1ae000] likely on CPU 54 (core 8, socket 0)
Jun 27 18:04:49 pve21 kernel: Code: d7 49 89 cd 48 83 c5 08 45 89 c4 eb 1d 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 dd 48 8b 5d 00 48 85 db 0f 84 90 00 00 00 <0f> be 43 12 44 39 f8 75 e7 48 8b 43 08 41 f6 c4 01 75 05 4c 39 e8
Jun 27 18:04:49 pve21 watchdog-mux[2066]: client (PID 11408) did not stop watchdog - disable watchdog updates
Jun 27 18:04:49 pve21 systemd-journald[1362]: Received client request to sync journal.
Jun 27 18:04:49 pve21 systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=11/SEGV
Jun 27 18:04:49 pve21 systemd[1]: pve-ha-lrm.service: Failed with result 'signal'.
Jun 27 18:04:49 pve21 systemd[1]: pve-ha-lrm.service: Consumed 5d 18h 36min 19.347s CPU time, 226.4M memory peak.
Jun 27 18:04:50 pve21 watchdog-mux[2066]: exit watchdog-mux with active connections
Jun 27 18:04:50 pve21 systemd-journald[1362]: Received client request to sync journal.
Jun 27 18:04:50 pve21 kernel: watchdog: watchdog0: watchdog did not stop!
-- Boot f83a4174a9d0423f9f2824432c78b6a5 --

Line 871 is inside the function "sub handle_service_exitcode".

Now, this cluster is on a slightly older version (9.1.2), which could explain the difference in output, but it feels to me like both servers are experiencing the same fault.
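
To compare the two kernel segfault lines field by field, they can be decoded with a few lines of Python. This is a rough sketch, not an official tool: `decode_segfault` is a hypothetical helper, the regular expression only covers the line format shown in the logs above, and the error-code interpretation assumes the x86 page-fault bits (bit 0: page present, bit 1: write access, bit 2: user mode).

```python
import re

# Covers kernel lines of the form:
# "segfault at ADDR ip IP sp SP error ERR in OBJ[OFF,BASE+SIZE]"
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[\s]+)"
    r"\[(?P<off>[0-9a-f]+),(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Decode one kernel segfault line into a dict, or return None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)  # all numeric fields are treated as hexadecimal
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "object": m["obj"],
        "fault_addr": int(m["addr"], 16),
        "printed_offset": int(m["off"], 16),  # first value in the brackets, as printed
        "ip_offset_in_mapping": ip - base,    # instruction pointer relative to mapping base
        "page_present": bool(err & 1),        # assumed x86 bit 0
        "access": "write" if err & 2 else "read",  # assumed x86 bit 1
        "user_mode": bool(err & 4),           # assumed x86 bit 2
    }


# The two segfault lines from pve30 and pve21, copied from the logs above.
lines = [
    "Jun 27 11:41:22 pve30 kernel: pve-ha-lrm[113646]: segfault at 8 ip 00005aa1b48e80f9 sp 00007ffca69d7630 error 4 in perl[19a0f9,5aa1b4792000+1ae000] likely on CPU 79 (core 9, socket 1)",
    "Jun 27 18:04:49 pve21 kernel: pve-ha-lrm[11408]: segfault at ff00000012 ip 00005c85ec0277d0 sp 00007ffff2845510 error 4 in perl[18e7d0,5c85ebedd000+1ae000] likely on CPU 54 (core 8, socket 0)",
]

for line in lines:
    info = decode_segfault(line)
    print({k: (hex(v) if isinstance(v, int) and not isinstance(v, bool) else v)
           for k, v in info.items()})
```

Under those assumptions, both lines decode to a user-mode read of a non-present page from inside perl, at different instruction offsets relative to the mapping base (0x1560f9 on pve30 and 0x14a7d0 on pve21).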

Because we have been dealing with this issue for a long time now, we decided to disable HA completely on the 2 affected clusters. We shall see what happens, but the LRM is idle and the watchdog inactive, so I assume we won't get these reboots anymore. Unfortunately this also means we cannot investigate further, but I think we have spent enough time on this problem from our end.

dakralex · Jun 29, 2026

Michiel_1afa said:
That is my aim as well, and why I was asking whether it's feasible to debug that daemon in some form. The abort seems to come out of nowhere but has quite heavy consequences.

Unfortunately, we don't have any way to reproduce this on our end yet.

One way to get more information about this would be to look at the process's core dump, since a SIGABRT will produce one. See if there is any core dump related to these issues in /var/lib/systemd/coredump/ (see [0] for more information).

Michiel_1afa said:
echo "blacklist hpwdt" > /etc/modprobe.d/blacklist-hp.conf
followed by update-initramfs.
However, this was only applied on one cluster as we were running different tests; clusters with and without the blacklisting are still acting up.

Did the hpwdt kernel module cause any issue beforehand? As stated in our documentation [1], the hardware watchdog kernel modules are blocked by default and only need to be loaded if they are actually configured. Does the /etc/default/pve-ha-manager contain anything on the nodes where the issues were happening?

[0] https://www.man7.org/linux/man-pages/man8/systemd-coredump.8.html
[1] https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_configure_hardware_watchdog

dakralex · Jun 29, 2026

Michiel_1afa said:

Code:

Jun 27 11:41:22 pve30 kernel: pve-ha-lrm[113646]: segfault at 8 ip 00005aa1b48e80f9 sp 00007ffca69d7630 error 4 in perl[19a0f9,5aa1b4792000+1ae000] likely on CPU 79 (core 9, socket 1)
Jun 27 11:41:22 pve30 kernel: Code: 05 bd 05 00 00 00 45 31 d2 41 b8 05 00 00 00 89 43 0c 4c 8d 1d 00 fe 1d 00 4b 8b 94 c5 88 07 00 00 48 85 d2 0f 84 17 03 00 00 <48> 8b 02 4b 89 84 c5 88 07 00 00 41 0f b6 03 83 f8 08 0f 83 d7 01
Jun 27 11:41:22 pve30 watchdog-mux[1990]: client (PID 113646) did not stop watchdog - disable watchdog updates
Jun 27 11:41:22 pve30 systemd-journald[1305]: Received client request to sync journal.
Jun 27 11:41:22 pve30 systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=11/SEGV
Jun 27 11:41:22 pve30 systemd[1]: pve-ha-lrm.service: Failed with result 'signal'.
Jun 27 11:41:22 pve30 systemd[1]: pve-ha-lrm.service: Consumed 2d 9h 55min 32.416s CPU time, 514.7M memory peak.

A SIGSEGV is a very different reason for a process to be terminated, so that seems quite unrelated to the previous SIGABRT signals. Could you specify the hardware these cluster nodes are running on?

Does a longer-running memtest or a stress test with e.g. stress-ng show any signs of hardware failure?

Michiel_1afa · Jun 29, 2026

dakralex said:
Unfortunately, we don't have any way to reproduce this on our end yet.

One way to get more information about this would be to look at the process's core dump, since a SIGABRT will produce one. See if there is any core dump related to these issues in /var/lib/systemd/coredump/ (see [0] for more information).

Not anymore. systemd-coredump is not installed by default on Proxmox, and although I installed it on several nodes today, HA is disabled as mentioned, so it won't actually catch anything here.

dakralex said:
Did the hpwdt kernel module cause any issue beforehand? As stated in our documentation [1], the hardware watchdog kernel modules are blocked by default and only need to be loaded if they are actually configured. Does the /etc/default/pve-ha-manager contain anything on the nodes where the issues were happening?

[0] https://www.man7.org/linux/man-pages/man8/systemd-coredump.8.html
[1] https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_configure_hardware_watchdog

Who knows. We have tried every angle we could imagine over the past year:

https://forum.proxmox.com/threads/vzdump-namespace-support-when-used-with-pbs.152417 - unsolved - workaround with a custom backup script on our end.
https://forum.proxmox.com/threads/s...er-updated-to-pve-9-1-1-and-pbs-4-0-20.176444 - Solved with a different kernel

Thread 'Watchdog Reboots'

Jan 22, 2026

Something we've seen in version 9 is an increase in watchdog reboots - in fact from none to many.

Last few entries of the journal show:

Code:

Jan 22 04:39:09 hv-5-i watchdog-mux[1504]: client watchdog is about to expire
Jan 22 04:39:09 hv-5-i systemd-journald[841]: Received client request to sync journal.
Jan 22 04:38:07 hv-5-i pveupdate[2200016]: <root@pam> starting task UPID:hv-5-i:00219207:010ADDDA:6971A9AF:aptupdate::root@pam:
Jan 22 04:38:08 hv-5-i pveupdate[2200071]: update new package list: /var/lib/pve-manager/pkgupdates
Jan 22 04:38:13 hv-5-i pveupdate[2200016]: <root@pam> end task...

Thread 'pvestatd crashes every few days'

Apr 28, 2025

Some details

Every few days (at least once a week) the pvestatd service crashes.
I can still log in via the Proxmox GUI (and via ssh), but the containers are all displayed with a "?"
As soon as I restart the pvestatd service (which also works from the GUI), I can see the status of all CT/VMs again.
Most of the CT/VMs are working and running fine, but not all of them.
There is no scheme, sometimes VM X is still running but the services on it are stopped.

Technical Details

- Hardware: Minisforum MS-01
- CPU: 13th Gen Intel(R) Core(TM) i9-13900H (from cat...

and then this topic.

Over all this time, we completely replaced the hardware (even going from Intel to AMD), replaced the networking gear, and upgraded every aspect, and the problems remain exactly the same.

dakralex said:
Could you specify the hardware these cluster nodes are running on?

These were HP DL380 G10 (Intel), and now the main clusters are DL380 G11 (AMD).
All servers in these clusters are configured identically.
