pve-ha-crm breaking our cluster... again

Michiel_1afa · Jun 10, 2026

3x this week already: last Saturday, yesterday and this morning.

pve-ha-crm decides to die without cause, or at least without a usable message.

"watchdog update failed - Broken pipe"

The server is not doing anything special at that moment, no heavy load, no network issues, no issues on other nodes in the cluster.

journalctl for pve-ha-crm does not show any messages before reboot - btw, the above message "watchdog update failed - Broken pipe" does not get stored in the journal so this makes debugging historic issues even harder.
pve-ha-lrm, pve-cluster and corosync all show no errors, and we have 0 lost packets on the network (from switch stats, as the pve node just rebooted).
pvestatd has a round time of 5-8 seconds for this cluster

system load at time of reboot was ~10% cpu and ~40% ram. There was a single migration running towards this machine.

What else can I check? After more than 2 years of this issue happening randomly, it would now be nice to know what the hell is going on.

As far as software versions go, I did a full update of the cluster yesterday (enterprise repo) hoping to solve this issue.

dakralex · Jun 10, 2026

Hi!

Does this happen on a specific cluster node or all cluster nodes?
What kind of watchdog is used on the node(s) where this happens?
Are there any kernel parameters set?
Does any other software on the cluster nodes compete for the /dev/watchdog device?

Otherwise, it would still be good to have the output of journalctl -k and journalctl -u corosync -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm in the time frame where this happens for context on the nodes where this happens.
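
To narrow that output to the time frame in question, journalctl accepts --since and --until. A minimal sketch, assuming a helper name (ha_journal_cmd) and example timestamps that are placeholders only; it prints the command instead of running it, so it can be reviewed and adjusted first:

```shell
#!/bin/sh
# ha_journal_cmd is a hypothetical helper name. It only PRINTS the journalctl
# invocation for the HA-related units, limited to a time window.
# $1 = window start, $2 = window end (any timestamp format journalctl accepts).
ha_journal_cmd() {
    start=$1
    end=$2
    printf 'journalctl --no-pager -u corosync -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm --since "%s" --until "%s"\n' \
        "$start" "$end"
}

# Placeholder window; replace with the time around the actual reboot.
ha_journal_cmd "2026-06-10 08:00" "2026-06-10 08:50"
```

If the node has already rebooted, adding -b -1 to the printed command restricts it to the previous boot.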

Michiel_1afa · Jun 10, 2026

dakralex said:
Does this happen on a specific cluster node or all cluster nodes?

It happens often on one cluster, less often on another ("often" meaning 2-3x per month, if we do not touch anything manually).

dakralex said:
What kind of watchdog is used on the node(s) where this happens?

Softdog. We're on HP machines, so the HP modules are already blacklisted per an earlier 'solution'.

dakralex said:
Are there any kernel parameters set?

Nothing custom.

dakralex said:
Does any other software on the cluster nodes compete for the /dev/watchdog device?

I hope not; it's only Proxmox installed there, and the logs do not seem to indicate any problems, as stated in the first post.

dakralex said:
Otherwise, it would still be good to have the output of journalctl -k and journalctl -u corosync -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm in the time frame where this happens for context on the nodes where this happens.

journalctl -k - added as attachment.
journalctl -u.... below (of course I have data from before 8:18 this morning, but that will be irrelevant). Below is everything from 8 am until the reboot at 8:47.

Code:

Jun 10 08:18:31 pve1 pve-ha-crm[2626]: got crm command: migrate vm:170103 pve1
Jun 10 08:18:31 pve1 pve-ha-crm[2626]: migrate service 'vm:170103' to node 'pve1'
Jun 10 08:18:31 pve1 pve-ha-crm[2626]: service 'vm:170103': state changed from 'started' to 'migrate'  (node = pve3, target = pve1)
Jun 10 08:19:11 pve1 pve-ha-crm[2626]: service 'vm:170103': state changed from 'migrate' to 'started'  (node = pve1)
Jun 10 08:46:52 pve1 pve-ha-crm[2626]: got crm command: migrate vm:43602 pve1
Jun 10 08:46:52 pve1 pve-ha-crm[2626]: migrate service 'vm:43602' to node 'pve1'
Jun 10 08:46:52 pve1 pve-ha-crm[2626]: service 'vm:43602': state changed from 'started' to 'migrate'  (node = pve3, target = pve1)
Jun 10 08:47:37 pve1 watchdog-mux[1980]: client (PID 16532) did not stop watchdog - disable watchdog updates
Jun 10 08:47:37 pve1 systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=6/ABRT
Jun 10 08:47:37 pve1 systemd[1]: pve-ha-lrm.service: Failed with result 'signal'.
Jun 10 08:47:37 pve1 systemd[1]: pve-ha-lrm.service: Consumed 29min 50.272s CPU time, 234.8M memory peak.
Jun 10 08:47:38 pve1 watchdog-mux[1980]: exit watchdog-mux with active connections
Jun 10 08:47:38 pve1 systemd[1]: watchdog-mux.service: Deactivated successfully.
Jun 10 08:47:38 pve1 systemd[1]: watchdog-mux.service: Consumed 1.355s CPU time, 2M memory peak.
-- Boot 96a8addbf9aa48ba8572c1d19dd47fe7 --
Jun 10 08:51:27 pve1 systemd[1]: Started watchdog-mux.service - Proxmox VE watchdog multiplexer.

The "watchdog update failed - Broken pipe" notice showed up at 8:47:42 in our external monitoring system.

dakralex · Jun 10, 2026

Thanks, could you also provide the kernel syslog for the boot before the node rebooted itself?

dakralex · Jun 10, 2026

Michiel_1afa said:
journalctl -u.... below (of course I have data from before 8:18 this morning, but that will be irrelevant). Below is everything from 8 am until the reboot at 8:47.

Does the pve-ha-lrm service acquire its lock on pve1 and pve3 correctly?

Michiel_1afa · Jun 10, 2026

dakralex said:
Thanks, could you also provide the kernel syslog for the boot before the node rebooted itself?

Thought you might want that: "journalctl -k -b -1" attached.

dakralex said:
Does the pve-ha-lrm service acquire its lock on pve1 and pve3 correctly?

As far as I know, yes. We scrutinize the logs daily and I'm not aware of any errors or warnings on this part. I did a scan for "unable to acquire lock" in the logs on all servers from Saturday until today (5 days) and got 0 hits, but 3 reboots.
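
For reference, a scan like that can be reproduced with something along these lines. This is only a sketch: count_matches is a made-up helper name, and the demo input is two lines taken from the log excerpt earlier in this thread, not a full journal.

```shell
#!/bin/sh
# count_matches (hypothetical name): count the lines on stdin that contain
# the fixed string given as $1.
count_matches() {
    # -F matches the string literally, -c prints the number of matching lines.
    # grep exits with status 1 when nothing matches, so that status is masked.
    grep -cF -- "$1" || true
}

# Demo on two lines from the excerpt posted above.
count_matches "unable to acquire lock" <<'EOF'
Jun 10 08:18:31 pve1 pve-ha-crm[2626]: got crm command: migrate vm:170103 pve1
Jun 10 08:19:11 pve1 pve-ha-crm[2626]: service 'vm:170103': state changed from 'migrate' to 'started'  (node = pve1)
EOF
```

On a node, the input would come from the journal instead, e.g. journalctl --no-pager --since "5 days ago" | count_matches "unable to acquire lock".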

Michiel_1afa · Wednesday at 14:48

Hello, any new input on this? Since the last message we have had one more 'crash' of pvestatd (the process kept running, but there were no more updates).

Johannes S · 2026-06-25T19:43:53+0200

How is your network configured?

Michiel_1afa · 2026-06-26T08:25:34+0200

Johannes S said:
How is your network configured?

mlag -> 2 bonds -> several vlans.
corosync is configured to use a primary and backup, which are also split across both bonds.
On the corosync part my last error was months ago.

Had another incident yesterday, and I do think I got slightly closer to finding out what the heck is going on. Different datacenter, different cluster, but very similar configuration.

Code:

journalctl -b -1 -f

Jun 25 17:07:53 pve28 pvestatd[870814]: status update time (7.681 seconds)
Jun 25 17:08:04 pve28 pvestatd[870814]: status update time (7.861 seconds)
Jun 25 17:08:08 pve28 watchdog-mux[2010]: client (PID 11817) did not stop watchdog - disable watchdog updates
Jun 25 17:08:08 pve28 systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=6/ABRT
Jun 25 17:08:08 pve28 systemd[1]: pve-ha-lrm.service: Failed with result 'signal'.
Jun 25 17:08:08 pve28 systemd-journald[1336]: Received client request to sync journal.
Jun 25 17:08:08 pve28 systemd[1]: pve-ha-lrm.service: Consumed 4d 4h 53min 32.205s CPU time, 202.5M memory peak.
Jun 25 17:08:09 pve28 watchdog-mux[2010]: exit watchdog-mux with active connections
Jun 25 17:08:09 pve28 systemd-journald[1336]: Received client request to sync journal.
Jun 25 17:08:09 pve28 kernel: watchdog: watchdog0: watchdog did not stop!

I did notice another potential problem yesterday when I was rebooting our PBS server in that same location. pvestatd at that point threw some errors about the unreachable storage and would almost time out for that reason, but I was there in time to disable all the PBS storages to sort out that problem. I'm now in the process of changing my backups to enable and disable the storages before and after backups.
Though this was an issue, I'm unsure if it is directly related.
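
What I have in mind for the enable/disable part is roughly the sketch below, using the --disable option of pvesm set. The storage ID pbs-example and the helper names are placeholders, and by default the script only prints the commands (DRY_RUN=1) instead of touching any storage.

```shell
#!/bin/sh
# run: print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_storage_disabled (hypothetical helper):
# $1 = storage ID, $2 = 1 to disable the storage, 0 to enable it.
set_storage_disabled() {
    run pvesm set "$1" --disable "$2"
}

# "pbs-example" is a placeholder storage ID.
set_storage_disabled pbs-example 0   # enable before the backup window
# ... the backup job would run here ...
set_storage_disabled pbs-example 1   # disable again afterwards
```

Running it with DRY_RUN=0 on a node would execute the pvesm calls for real, so the printed commands are worth checking first.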

dakralex · 2026-06-26T13:53:09+0200

Michiel_1afa said:
I did notice another potential problem yesterday when I was rebooting our PBS server in that same location. pvestatd at that point threw some errors about the unreachable storage and would almost time out for that reason, but I was there in time to disable all the PBS storages to sort out that problem. I'm now in the process of changing my backups to enable and disable the storages before and after backups.
Though this was an issue, I'm unsure if it is directly related.

I don't think that pvestatd is related to this issue, because neither pve-ha-crm nor pve-ha-lrm directly depend on pvestatd. I haven't found any clues why pve-ha-lrm in particular is killed with SIGABRT...

Michiel_1afa said:
Softdog. We're on HP machines, so the HP modules are already blacklisted per an earlier 'solution'.

How were the HP modules blacklisted? Was the pve-ha-lrm.service systemd unit changed in any way? I wonder why pve-ha-lrm exited with a SIGABRT.

Michiel_1afa · 2026-06-26T14:00:38+0200

dakralex said:
I don't think that pvestatd is related to this issue, because neither pve-ha-crm nor pve-ha-lrm directly depend on pvestatd. I haven't found any clues why pve-ha-lrm in particular is killed with SIGABRT...

That is my aim as well, and it is why I was asking whether it's feasible to debug that daemon in some form. The abort seems to come out of nowhere but has quite heavy consequences.

dakralex said:
How were the HP modules blacklisted? Was the pve-ha-lrm.service systemd unit changed in any way? I wonder why pve-ha-lrm exited with a SIGABRT.

echo "blacklist hpwdt" > /etc/modprobe.d/blacklist-hp.conf
and update-initramfs.
However, this was only applied on one cluster, as we were doing different tests. Clusters with and without the blacklisting are still acting up.
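
To double-check which watchdog module a node actually ended up with, the lsmod output can be filtered for the two modules mentioned in this thread. A small sketch; the helper name is made up, and the demo input is an invented lsmod-style sample (sizes and counts are not from a real node):

```shell
#!/bin/sh
# loaded_watchdog_modules (hypothetical name): read lsmod-style output on
# stdin and print softdog and/or hpwdt if they appear as loaded modules.
loaded_watchdog_modules() {
    awk '$1 == "softdog" || $1 == "hpwdt" { print $1 }'
}

# Demo with an invented sample; on a node use: lsmod | loaded_watchdog_modules
printf 'Module Size Used by\nsoftdog 16384 2\nbonding 200000 0\n' | loaded_watchdog_modules
```

If hpwdt still shows up after blacklisting, the blacklist has not taken effect for the running system.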

In another status update: removing all my PBS-linked storages did shut up pvestatd (which is a good thing).
