pve-ha-crm breaking our cluster... again

Michiel_1afa

Three times this week already: last Saturday, yesterday, and this morning.

pve-ha-crm decides to die without cause, or at least without a usable message.

"watchdog update failed - Broken pipe"

The server is not doing anything special at that moment, no heavy load, no network issues, no issues on other nodes in the cluster.

journalctl for pve-ha-crm does not show any messages before reboot - btw, the above message "watchdog update failed - Broken pipe" does not get stored in the journal so this makes debugging historic issues even harder.
pve-ha-lrm, pve-cluster and corosync all show no errors, and we have 0 lost packets on the network (taken from the switch stats, as the pve node just rebooted).
pvestatd has a round time of 5-8 seconds for this cluster

system load at time of reboot was ~10% cpu and ~40% ram. There was a single migration running towards this machine.

What else can I check? After more than 2 years of this issue happening randomly, it would be nice to know what the hell is going on.

As far as software versions go, I did a full update of the cluster yesterday (enterprise repo) hoping to solve this issue.
 
Hi!

Does this happen on a specific cluster node or all cluster nodes?
What kind of watchdog is used on the node(s) where this happens?
Are there any kernel parameters set?
Does any other software on the cluster nodes compete for the /dev/watchdog device?

Otherwise, it would still be good to have the output of journalctl -k and journalctl -u corosync -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm in the time frame where this happens for context on the nodes where this happens.
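For anyone collecting the same data, a minimal sketch of pulling those units for a fixed time window could look like this. The helper name `collect_ha_logs` and the example times are assumptions made up for illustration; only the journalctl options themselves are standard.

```shell
# Hypothetical helper (not part of Proxmox VE): dump the HA-related units
# for one time window so the output can be attached to a post.
collect_ha_logs() {
    start=$1   # anything journalctl accepts for --since, e.g. "08:00"
    end=$2     # anything journalctl accepts for --until, e.g. "08:50"
    journalctl --no-pager -o short-precise \
        -u corosync -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm \
        --since "$start" --until "$end"
}

# Example calls (times and file names are placeholders):
# collect_ha_logs "08:00" "08:50" > ha-units.log
# Kernel messages of the boot before the reset:
# journalctl --no-pager -k -b -1 > kernel-prev-boot.log
```

Keeping all four units in one invocation interleaves their messages by timestamp, which makes the order of events around the reset easier to read.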
 
Does this happen on a specific cluster node or all cluster nodes?
It happens often on one cluster and less often on another (often meaning 2-3x per month, if we do not touch anything manually).
What kind of watchdog is used on the node(s) where this happens?
Softdog. We're on HP machines, so the HP modules are already blacklisted per an earlier 'solution'.
Are there any kernel parameters set?
Nothing custom.
Does any other software on the cluster nodes compete for the /dev/watchdog device?
I hope not. It's only Proxmox installed there, and the logs do not seem to indicate any problems, as stated in the first post.
Otherwise, it would still be good to have the output of journalctl -k and journalctl -u corosync -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm in the time frame where this happens for context on the nodes where this happens.
journalctl -k - added as attachment.
journalctl -u ... is below. Of course I have data from before 8:18 this morning, but that will be irrelevant; below is everything from 8 am until the reboot at 8:47.

Code:
Jun 10 08:18:31 pve1 pve-ha-crm[2626]: got crm command: migrate vm:170103 pve1
Jun 10 08:18:31 pve1 pve-ha-crm[2626]: migrate service 'vm:170103' to node 'pve1'
Jun 10 08:18:31 pve1 pve-ha-crm[2626]: service 'vm:170103': state changed from 'started' to 'migrate'  (node = pve3, target = pve1)
Jun 10 08:19:11 pve1 pve-ha-crm[2626]: service 'vm:170103': state changed from 'migrate' to 'started'  (node = pve1)
Jun 10 08:46:52 pve1 pve-ha-crm[2626]: got crm command: migrate vm:43602 pve1
Jun 10 08:46:52 pve1 pve-ha-crm[2626]: migrate service 'vm:43602' to node 'pve1'
Jun 10 08:46:52 pve1 pve-ha-crm[2626]: service 'vm:43602': state changed from 'started' to 'migrate'  (node = pve3, target = pve1)
Jun 10 08:47:37 pve1 watchdog-mux[1980]: client (PID 16532) did not stop watchdog - disable watchdog updates
Jun 10 08:47:37 pve1 systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=6/ABRT
Jun 10 08:47:37 pve1 systemd[1]: pve-ha-lrm.service: Failed with result 'signal'.
Jun 10 08:47:37 pve1 systemd[1]: pve-ha-lrm.service: Consumed 29min 50.272s CPU time, 234.8M memory peak.
Jun 10 08:47:38 pve1 watchdog-mux[1980]: exit watchdog-mux with active connections
Jun 10 08:47:38 pve1 systemd[1]: watchdog-mux.service: Deactivated successfully.
Jun 10 08:47:38 pve1 systemd[1]: watchdog-mux.service: Consumed 1.355s CPU time, 2M memory peak.
-- Boot 96a8addbf9aa48ba8572c1d19dd47fe7 --
Jun 10 08:51:27 pve1 systemd[1]: Started watchdog-mux.service - Proxmox VE watchdog multiplexer.

The "watchdog update failed - Broken pipe" notice showed up at 8:47:42 in our external monitoring system.
 

Thanks, could you also provide the kernel syslog for the boot before the node rebooted itself?
 
journalctl -u ... is below. Of course I have data from before 8:18 this morning, but that will be irrelevant; below is everything from 8 am until the reboot at 8:47.
Does the pve-ha-lrm service acquire its lock on pve1 and pve3 correctly?
 
Thanks, could you also provide the kernel syslog for the boot before the node rebooted itself?
Thought you might want that: "journalctl -k -b -1" attached.

Does the pve-ha-lrm service acquire its lock on pve1 and pve3 correctly?
As far as I know, yes. We scrutinize the logs daily and I'm not aware of any errors or warnings on this part. I scanned the logs on all servers for "unable to acquire lock" from Saturday until today (5 days) and got 0 hits, but 3 reboots.
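A scan like that can be sketched as a small filter over journal output. The function name `count_lock_errors` is hypothetical, and the search string is the one quoted above, not a verified list of every lock-related message the HA stack can emit.

```shell
# Hypothetical helper: count journal lines mentioning a failed lock
# acquisition. Reads journal text on stdin and prints the number of hits.
count_lock_errors() {
    # grep -c exits non-zero on zero matches; the count is still printed.
    grep -c 'unable to acquire lock' || true
}

# Example call on one node (the time range is a placeholder):
# journalctl --no-pager --since "5 days ago" | count_lock_errors
```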
 

How is your network configured?
mlag -> 2 bonds -> several vlans.
corosync is configured to use a primary and backup, which are also split across both bonds.
On the corosync part my last error was months ago.

Had another incident yesterday, and I do think I got slightly closer to finding out what the heck is going on. Different datacenter and cluster, but a very similar configuration.

Code:
journalctl -b -1 -f

Jun 25 17:07:53 pve28 pvestatd[870814]: status update time (7.681 seconds)
Jun 25 17:08:04 pve28 pvestatd[870814]: status update time (7.861 seconds)
Jun 25 17:08:08 pve28 watchdog-mux[2010]: client (PID 11817) did not stop watchdog - disable watchdog updates
Jun 25 17:08:08 pve28 systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=6/ABRT
Jun 25 17:08:08 pve28 systemd[1]: pve-ha-lrm.service: Failed with result 'signal'.
Jun 25 17:08:08 pve28 systemd-journald[1336]: Received client request to sync journal.
Jun 25 17:08:08 pve28 systemd[1]: pve-ha-lrm.service: Consumed 4d 4h 53min 32.205s CPU time, 202.5M memory peak.
Jun 25 17:08:09 pve28 watchdog-mux[2010]: exit watchdog-mux with active connections
Jun 25 17:08:09 pve28 systemd-journald[1336]: Received client request to sync journal.
Jun 25 17:08:09 pve28 kernel: watchdog: watchdog0: watchdog did not stop!

I did notice another potential problem yesterday while rebooting our PBS server in that same location. pvestatd threw some errors about the unreachable storage and came close to timing out because of it, but I was in time to disable all the PBS storages and sort that out. I'm now in the process of changing my backups to enable and disable storages before and after each run.
Although this was an issue, I'm unsure whether it is directly related.
 
I did notice another potential problem yesterday while rebooting our PBS server in that same location. pvestatd threw some errors about the unreachable storage and came close to timing out because of it, but I was in time to disable all the PBS storages and sort that out. I'm now in the process of changing my backups to enable and disable storages before and after each run.
Although this was an issue, I'm unsure whether it is directly related.
I don't think that pvestatd is related to this issue, because neither pve-ha-crm nor pve-ha-lrm directly depend on pvestatd. I haven't found any clues why pve-ha-lrm in particular is killed with SIGABRT...

Softdog. We're on HP machines, so the HP modules are already blacklisted per an earlier 'solution'.
How were the HP modules blacklisted? Was the pve-ha-lrm.service systemd unit changed in any way? I wonder why pve-ha-lrm exited with a SIGABRT.
 
I don't think that pvestatd is related to this issue, because neither pve-ha-crm nor pve-ha-lrm directly depend on pvestatd. I haven't found any clues why pve-ha-lrm in particular is killed with SIGABRT...
That is my aim as well, and it is why I was asking whether it's feasible to debug that daemon in some form. The abort seems to come out of nowhere but has quite heavy consequences.
How were the HP modules blacklisted? Was the pve-ha-lrm.service systemd unit changed in any way? I wonder why pve-ha-lrm exited with a SIGABRT.
echo "blacklist hpwdt" > /etc/modprobe.d/blacklist-hp.conf
and then update-initramfs.
However, this was only applied on one cluster, as we were running different tests. Clusters with and without the blacklisting are still acting up.

In another status update, removing all my PBS linked storages did shut up pvestatd (which is a good thing)
 
More updates from over the weekend: 2 reboots/crashes on 2 clusters.

1. This could be a problem with the server firmware or microcode? This one also gives slightly more information

Code:
Jun 27 11:41:22 pve30 kernel: pve-ha-lrm[113646]: segfault at 8 ip 00005aa1b48e80f9 sp 00007ffca69d7630 error 4 in perl[19a0f9,5aa1b4792000+1ae000] likely on CPU 79 (core 9, socket 1)
Jun 27 11:41:22 pve30 kernel: Code: 05 bd 05 00 00 00 45 31 d2 41 b8 05 00 00 00 89 43 0c 4c 8d 1d 00 fe 1d 00 4b 8b 94 c5 88 07 00 00 48 85 d2 0f 84 17 03 00 00 <48> 8b 02 4b 89 84 c5 88 07 00 00 41 0f b6 03 83 f8 08 0f 83 d7 01
Jun 27 11:41:22 pve30 watchdog-mux[1990]: client (PID 113646) did not stop watchdog - disable watchdog updates
Jun 27 11:41:22 pve30 systemd-journald[1305]: Received client request to sync journal.
Jun 27 11:41:22 pve30 systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=11/SEGV
Jun 27 11:41:22 pve30 systemd[1]: pve-ha-lrm.service: Failed with result 'signal'.
Jun 27 11:41:22 pve30 systemd[1]: pve-ha-lrm.service: Consumed 2d 9h 55min 32.416s CPU time, 514.7M memory peak.
Jun 27 11:41:23 pve30 watchdog-mux[1990]: exit watchdog-mux with active connections
Jun 27 11:41:23 pve30 systemd-journald[1305]: Received client request to sync journal.
Jun 27 11:41:23 pve30 kernel: watchdog: watchdog0: watchdog did not stop!
Jun 27 11:41:23 pve30 systemd[1]: watchdog-mux.service: Deactivated successfully.
Jun 27 11:41:23 pve30 systemd[1]: watchdog-mux.service: Consumed 53.642s CPU time, 1.8M memory peak.
Jun 27 11:41:24 pve30 pve-ha-crm[2635]: watchdog update failed - Broken pipe

We then got another crash and reboot, from another server in another cluster, but also a segfault, this time with a few extra lines.
Code:
Jun 27 18:04:39 pve21 pve-ha-lrm[11408]: Attempt to free unreferenced scalar: SV 0x5c861557a3f0, Perl interpreter: 0x5c8612dd72a0 at /usr/share/perl5/PVE/HA/LRM.pm line 871.
Jun 27 18:04:39 pve21 pve-ha-lrm[11408]: Attempt to free unreferenced scalar: SV 0x5c861557a3f0, Perl interpreter: 0x5c8612dd72a0 at /usr/share/perl5/PVE/HA/LRM.pm line 871.
Jun 27 18:04:39 pve21 pve-ha-lrm[11408]: Attempt to free unreferenced scalar: SV 0x5c861557a3f0, Perl interpreter: 0x5c8612dd72a0 at /usr/share/perl5/PVE/HA/LRM.pm line 871.
Jun 27 18:04:39 pve21 pve-ha-lrm[11408]: Attempt to free unreferenced scalar: SV 0x5c861557a3f0, Perl interpreter: 0x5c8612dd72a0 at /usr/share/perl5/PVE/HA/LRM.pm line 871.
Jun 27 18:04:41 pve21 pve-ha-lrm[11408]: Attempt to free unreferenced scalar: SV 0x5c861557a3f0, Perl interpreter: 0x5c8612dd72a0 at /usr/share/perl5/PVE/HA/LRM.pm line 871.
Jun 27 18:04:49 pve21 kernel: pve-ha-lrm[11408]: segfault at ff00000012 ip 00005c85ec0277d0 sp 00007ffff2845510 error 4 in perl[18e7d0,5c85ebedd000+1ae000] likely on CPU 54 (core 8, socket 0)
Jun 27 18:04:49 pve21 kernel: Code: d7 49 89 cd 48 83 c5 08 45 89 c4 eb 1d 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 dd 48 8b 5d 00 48 85 db 0f 84 90 00 00 00 <0f> be 43 12 44 39 f8 75 e7 48 8b 43 08 41 f6 c4 01 75 05 4c 39 e8
Jun 27 18:04:49 pve21 watchdog-mux[2066]: client (PID 11408) did not stop watchdog - disable watchdog updates
Jun 27 18:04:49 pve21 systemd-journald[1362]: Received client request to sync journal.
Jun 27 18:04:49 pve21 systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=11/SEGV
Jun 27 18:04:49 pve21 systemd[1]: pve-ha-lrm.service: Failed with result 'signal'.
Jun 27 18:04:49 pve21 systemd[1]: pve-ha-lrm.service: Consumed 5d 18h 36min 19.347s CPU time, 226.4M memory peak.
Jun 27 18:04:50 pve21 watchdog-mux[2066]: exit watchdog-mux with active connections
Jun 27 18:04:50 pve21 systemd-journald[1362]: Received client request to sync journal.
Jun 27 18:04:50 pve21 kernel: watchdog: watchdog0: watchdog did not stop!
-- Boot f83a4174a9d0423f9f2824432c78b6a5 --

Line 871 is inside the function "sub handle_service_exitcode".

Now this cluster is on a bit older version (9.1.2) which could explain the difference in output, but it feels to me both servers experience the same fault.
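To compare the two kernel segfault lines, a rough sketch for picking them apart is shown below. The function name `decode_segfault` is made up for illustration. The reading of the bracketed numbers is an assumption based on these two traces only: the first number differs from ip minus base by the same constant (0x44000) in both, which would fit it being an offset into the perl binary on disk. On x86, error code 4 denotes a user-mode read of a page that is not mapped.

```shell
# Hypothetical helper: extract the instruction pointer and mapping base from
# a kernel "segfault at ..." line (read on stdin) and compute the offset of
# the faulting instruction inside the mapping.
decode_segfault() {
    line=$(cat)
    # Instruction pointer: the hex value after " ip ".
    ip=$(printf '%s\n' "$line" | sed -n 's/.* ip \([0-9a-f]*\) .*/\1/p')
    # name[first,base+size]: first number (assumed file offset) and base.
    first=$(printf '%s\n' "$line" | sed -n 's/.*\[\([0-9a-f]*\),[0-9a-f]*+[0-9a-f]*\].*/\1/p')
    base=$(printf '%s\n' "$line" | sed -n 's/.*\[[0-9a-f]*,\([0-9a-f]*\)+[0-9a-f]*\].*/\1/p')
    printf 'ip=0x%s base=0x%s mapping_offset=0x%x file_offset=0x%s\n' \
        "$ip" "$base" "$((0x$ip - 0x$base))" "$first"
}

# Example call against the previous boot's kernel log:
# journalctl --no-pager -k -b -1 | grep 'segfault at' | tail -n 1 | decode_segfault
```

For the pve30 trace this gives a mapping offset of 0x1560f9, and for the pve21 trace 0x14a7d0, so the two crashes did not hit the same instruction in perl.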

Because we have been dealing with this issue for a long time now, we decided to disable HA completely on the 2 affected clusters. We shall see what happens, but the LRM is idle and the watchdog inactive, so I assume we won't get these reboots anymore. Unfortunately this also means we cannot investigate further, but I think we have spent enough time on this problem from our end.
 
That is my aim as well, and it is why I was asking whether it's feasible to debug that daemon in some form. The abort seems to come out of nowhere but has quite heavy consequences.
Unfortunately, we don't have any way to reproduce this on our end yet.

One way to be able to have some more information about this would be to look at the process' coredump as a SIGABRT will cause this. See if there is any coredump related to these issues in /var/lib/systemd/coredump/ (see [0] for more information).
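A minimal sketch of that check follows. The helper name `find_ha_coredumps` is hypothetical, and the `core.<comm>.` file name prefix is an assumption about how systemd-coredump names its files; the directory is the one mentioned above.

```shell
# Hypothetical helper: list stored core files that belong to the HA daemons.
# Takes an optional directory argument, defaulting to the systemd-coredump
# storage location.
find_ha_coredumps() {
    dir=${1:-/var/lib/systemd/coredump}
    find "$dir" -maxdepth 1 -type f -name 'core.pve-ha-*' 2>/dev/null | sort
}

# Example calls:
# find_ha_coredumps
# With the systemd-coredump package installed, coredumpctl can query the
# same data by process name:
# coredumpctl list pve-ha-lrm
# coredumpctl info pve-ha-lrm
```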

echo "blacklist hpwdt" > /etc/modprobe.d/blacklist-hp.conf
and then update-initramfs.
However, this was only applied on one cluster, as we were running different tests. Clusters with and without the blacklisting are still acting up.
Did the hpwdt kernel module cause any issue beforehand? As stated in our documentation [1], the hardware watchdog kernel modules are blocked by default and only need to be loaded if they are actually configured. Does the /etc/default/pve-ha-manager contain anything on the nodes where the issues were happening?

[0] https://www.man7.org/linux/man-pages/man8/systemd-coredump.8.html
[1] https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_configure_hardware_watchdog
 
Code:
Jun 27 11:41:22 pve30 kernel: pve-ha-lrm[113646]: segfault at 8 ip 00005aa1b48e80f9 sp 00007ffca69d7630 error 4 in perl[19a0f9,5aa1b4792000+1ae000] likely on CPU 79 (core 9, socket 1)
Jun 27 11:41:22 pve30 kernel: Code: 05 bd 05 00 00 00 45 31 d2 41 b8 05 00 00 00 89 43 0c 4c 8d 1d 00 fe 1d 00 4b 8b 94 c5 88 07 00 00 48 85 d2 0f 84 17 03 00 00 <48> 8b 02 4b 89 84 c5 88 07 00 00 41 0f b6 03 83 f8 08 0f 83 d7 01
Jun 27 11:41:22 pve30 watchdog-mux[1990]: client (PID 113646) did not stop watchdog - disable watchdog updates
Jun 27 11:41:22 pve30 systemd-journald[1305]: Received client request to sync journal.
Jun 27 11:41:22 pve30 systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=11/SEGV
Jun 27 11:41:22 pve30 systemd[1]: pve-ha-lrm.service: Failed with result 'signal'.
Jun 27 11:41:22 pve30 systemd[1]: pve-ha-lrm.service: Consumed 2d 9h 55min 32.416s CPU time, 514.7M memory peak.
The SIGSEGV is a very different cause to end a process here, so that seems quite unrelated to the previous SIGABRT signals. Could you specify the hardware these cluster nodes are running on?

Does a longer-running memtest or a stresstest with e.g. stress-ng show any signs of hardware failure?
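To see at a glance which signal ended the HA daemons on each occasion, a small filter over the journal can help. This is only a sketch: the function name `summarize_ha_exits` is made up, and it relies on the systemd message format visible in the logs in this thread.

```shell
# Hypothetical helper: read journal text on stdin and print one line per
# pve-ha-lrm/pve-ha-crm main process exit: timestamp, host, unit, status.
summarize_ha_exits() {
    awk '/pve-ha-(lrm|crm)\.service: Main process exited/ {
        unit = $6
        sub(/:$/, "", unit)          # strip the trailing colon from the unit
        print $1, $2, $3, $4, unit, $NF
    }'
}

# Example call (the time range is a placeholder):
# journalctl --no-pager --since "30 days ago" | summarize_ha_exits
```

Run on each node, this makes it easy to tell status=6/ABRT events apart from status=11/SEGV events without reading the full logs.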
 
Unfortunately, we don't have any way to reproduce this on our end yet.

One way to be able to have some more information about this would be to look at the process' coredump as a SIGABRT will cause this. See if there is any coredump related to these issues in /var/lib/systemd/coredump/ (see [0] for more information).
Not anymore. coredump is not installed by default on Proxmox, and although I installed it on several nodes today, HA is disabled as mentioned, so it won't actually catch anything here.
Did the hpwdt kernel module cause any issue beforehand? As stated in our documentation [1], the hardware watchdog kernel modules are blocked by default and only need to be loaded if they are actually configured. Does the /etc/default/pve-ha-manager contain anything on the nodes where the issues were happening?

[0] https://www.man7.org/linux/man-pages/man8/systemd-coredump.8.html
[1] https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_configure_hardware_watchdog
Who knows. We have tried every angle we could imagine over the past year.

https://forum.proxmox.com/threads/vzdump-namespace-support-when-used-with-pbs.152417 - unsolved - workaround with a custom backup script on our end.
https://forum.proxmox.com/threads/s...er-updated-to-pve-9-1-1-and-pbs-4-0-20.176444 - Solved with a different kernel

and then this topic.

Over all this time, we completely replaced the hardware (even going from Intel to AMD), replaced the networking gear, and upgraded every aspect, and the problems remain exactly the same.

Could you specify the hardware these cluster nodes are running on?
These were HP DL380 G10 (Intel), and now the main clusters are DL380 G11 (AMD).
All servers in these clusters are configured identically.
 