pve-ha-crm breaking our cluster... again

Michiel_1afa

Three times this week already: last Saturday, yesterday, and this morning.

pve-ha-crm decides to die without cause, or at least without a usable message.

"watchdog update failed - Broken pipe"

The server is not doing anything special at that moment, no heavy load, no network issues, no issues on other nodes in the cluster.

journalctl for pve-ha-crm does not show any messages before the reboot. Incidentally, the message above ("watchdog update failed - Broken pipe") is not stored in the journal, which makes debugging past incidents even harder.
pve-ha-lrm, pve-cluster and corosync all show no errors, and we have 0 lost packets on the network (per the switch stats, since the PVE node itself had just rebooted).
pvestatd has a round time of 5-8 seconds for this cluster.

System load at the time of the reboot was ~10% CPU and ~40% RAM. A single migration towards this machine was running.

What else can I check? After more than 2 years of this issue happening at random, it would be nice to finally know what the hell is going on.

As far as software versions go, I did a full update of the cluster yesterday (enterprise repo) hoping to solve this issue.
 
Hi!

Does this happen on a specific cluster node or all cluster nodes?
What kind of watchdog is used on the node(s) where this happens?
Are there any kernel parameters set?
Does any other software on the cluster nodes compete for the /dev/watchdog device?

Otherwise, it would still be good to have the output of journalctl -k and journalctl -u corosync -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm in the time frame where this happens for context on the nodes where this happens.
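To pull exactly that window from the journal, something like the following could be used. This is a minimal sketch: the function name `ha_journal_window` and the `JOURNALCTL` override are made up for illustration, and the timestamps in the example call are placeholders.

```shell
#!/bin/sh
# Hypothetical helper (not part of Proxmox VE): show the journal of the
# HA-related units named in this thread for a given time window.
# JOURNALCTL can be overridden, e.g. with "echo" for a dry run.
ha_journal_window() {
    # $1 = start time, $2 = end time, in any format journalctl accepts
    ${JOURNALCTL:-journalctl} --no-pager --since "$1" --until "$2" \
        -u corosync -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm
}

# Example call (placeholder timestamps):
#   ha_journal_window "YYYY-MM-DD 08:00" "YYYY-MM-DD 08:50"
```

Running it once per affected node keeps the output comparable between nodes.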
 
Does this happen on a specific cluster node or all cluster nodes?
It happens often on one cluster and less often on another (often meaning 2-3x per month, if we do not touch anything manually).
What kind of watchdog is used on the node(s) where this happens?
Softdog. We're on HP machines, so the HP modules are already blacklisted per an earlier 'solution'.
Are there any kernel parameters set?
Nothing custom.
Does any other software on the cluster nodes compete for the /dev/watchdog device?
I hope not; only Proxmox is installed there, and the logs do not seem to indicate any problems, as stated in the first post.
Otherwise, it would still be good to have the output of journalctl -k and journalctl -u corosync -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm in the time frame where this happens for context on the nodes where this happens.
journalctl -k: added as attachment.
journalctl -u ...: see below. I do have data from before 8:18 this morning, but that will be irrelevant; below is everything from 8:00 until the reboot at 8:47.

Code:
Jun 10 08:18:31 pve1 pve-ha-crm[2626]: got crm command: migrate vm:170103 pve1
Jun 10 08:18:31 pve1 pve-ha-crm[2626]: migrate service 'vm:170103' to node 'pve1'
Jun 10 08:18:31 pve1 pve-ha-crm[2626]: service 'vm:170103': state changed from 'started' to 'migrate'  (node = pve3, target = pve1)
Jun 10 08:19:11 pve1 pve-ha-crm[2626]: service 'vm:170103': state changed from 'migrate' to 'started'  (node = pve1)
Jun 10 08:46:52 pve1 pve-ha-crm[2626]: got crm command: migrate vm:43602 pve1
Jun 10 08:46:52 pve1 pve-ha-crm[2626]: migrate service 'vm:43602' to node 'pve1'
Jun 10 08:46:52 pve1 pve-ha-crm[2626]: service 'vm:43602': state changed from 'started' to 'migrate'  (node = pve3, target = pve1)
Jun 10 08:47:37 pve1 watchdog-mux[1980]: client (PID 16532) did not stop watchdog - disable watchdog updates
Jun 10 08:47:37 pve1 systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=6/ABRT
Jun 10 08:47:37 pve1 systemd[1]: pve-ha-lrm.service: Failed with result 'signal'.
Jun 10 08:47:37 pve1 systemd[1]: pve-ha-lrm.service: Consumed 29min 50.272s CPU time, 234.8M memory peak.
Jun 10 08:47:38 pve1 watchdog-mux[1980]: exit watchdog-mux with active connections
Jun 10 08:47:38 pve1 systemd[1]: watchdog-mux.service: Deactivated successfully.
Jun 10 08:47:38 pve1 systemd[1]: watchdog-mux.service: Consumed 1.355s CPU time, 2M memory peak.
-- Boot 96a8addbf9aa48ba8572c1d19dd47fe7 --
Jun 10 08:51:27 pve1 systemd[1]: Started watchdog-mux.service - Proxmox VE watchdog multiplexer.

The "watchdog update failed - Broken pipe" notice showed up at 8:47:42 in our external monitoring system.
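The lines that matter in a log like the one above are few and easy to miss between routine messages. A small filter, sketched here under the assumption that the message texts stay exactly as they appear in this thread, reduces a journal dump to those events. The function name `watchdog_events` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical filter: keep only the watchdog- and abort-related lines
# seen in this thread. Reads journal text on stdin, writes matches to stdout.
watchdog_events() {
    # "|| true" keeps the exit status at 0 when nothing matches
    grep -E 'did not stop watchdog|exit watchdog-mux with active connections|status=6/ABRT|watchdog did not stop' || true
}

# Example call against the previous boot:
#   journalctl -b -1 --no-pager | watchdog_events
```

Note that in the excerpt above, the process reported as killed with ABRT is pve-ha-lrm, so it may be worth filtering the journal of all HA units rather than pve-ha-crm alone.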
 

Thanks, could you also provide the kernel syslog for the boot before the node rebooted itself?
 
journalctl -u ...: see below. I do have data from before 8:18 this morning, but that will be irrelevant; below is everything from 8:00 until the reboot at 8:47.
Does the pve-ha-lrm service acquire its lock on pve1 and pve3 correctly?
 
Thanks, could you also provide the kernel syslog for the boot before the node rebooted itself?
Thought you might want that: "journalctl -k -b -1" attached.

Does the pve-ha-lrm service acquire its lock on pve1 and pve3 correctly?
As far as I know, yes. We scrutinize the logs daily and I'm not aware of any errors or warnings on this part. I scanned the logs on all servers for "unable to acquire lock" from Saturday until today (5 days) and got 0 hits, but 3 reboots.
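A scan like this can be repeated on each node. The sketch below is one hedged way to do it; `count_lock_errors` is a made-up name and the search string is the one quoted in the post.

```shell
#!/bin/sh
# Hypothetical helper: count journal lines containing the lock error text.
# Reads journal text on stdin and prints the number of matching lines.
count_lock_errors() {
    # grep -c exits non-zero when the count is 0, so mask that exit status
    grep -c 'unable to acquire lock' || true
}

# Example call for the last 5 days on one node:
#   journalctl --since "5 days ago" --no-pager | count_lock_errors
# Boots recorded in the journal, for comparison with the number of reboots:
#   journalctl --list-boots
```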
 

How is your network configured?
mlag -> 2 bonds -> several vlans.
corosync is configured to use a primary and backup, which are also split across both bonds.
On the corosync part my last error was months ago.

Had another incident yesterday, and I think I got slightly closer to finding out what the heck is going on. Different datacenter and cluster, but a very similar configuration.

Code:
journalctl -b -1 -f

Jun 25 17:07:53 pve28 pvestatd[870814]: status update time (7.681 seconds)
Jun 25 17:08:04 pve28 pvestatd[870814]: status update time (7.861 seconds)
Jun 25 17:08:08 pve28 watchdog-mux[2010]: client (PID 11817) did not stop watchdog - disable watchdog updates
Jun 25 17:08:08 pve28 systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=6/ABRT
Jun 25 17:08:08 pve28 systemd[1]: pve-ha-lrm.service: Failed with result 'signal'.
Jun 25 17:08:08 pve28 systemd-journald[1336]: Received client request to sync journal.
Jun 25 17:08:08 pve28 systemd[1]: pve-ha-lrm.service: Consumed 4d 4h 53min 32.205s CPU time, 202.5M memory peak.
Jun 25 17:08:09 pve28 watchdog-mux[2010]: exit watchdog-mux with active connections
Jun 25 17:08:09 pve28 systemd-journald[1336]: Received client request to sync journal.
Jun 25 17:08:09 pve28 kernel: watchdog: watchdog0: watchdog did not stop!

I did notice another potential problem yesterday while rebooting our PBS server in that same location. At that point pvestatd threw some errors about the unreachable storage and almost timed out because of it, but I was in time to disable all the PBS storages and sort that out. I'm now in the process of changing my backup jobs to enable the storages before a backup and disable them afterwards.
Though this was an issue, I'm unsure whether it is directly related.
 
I did notice another potential problem yesterday while rebooting our PBS server in that same location. At that point pvestatd threw some errors about the unreachable storage and almost timed out because of it, but I was in time to disable all the PBS storages and sort that out. I'm now in the process of changing my backup jobs to enable the storages before a backup and disable them afterwards.
Though this was an issue, I'm unsure whether it is directly related.
I don't think that pvestatd is related to this issue, because neither pve-ha-crm nor pve-ha-lrm directly depend on pvestatd. I haven't found any clues why pve-ha-lrm in particular is killed with SIGABRT...

Softdog. We're on HP machines, so the HP modules are already blacklisted per an earlier 'solution'.
How were the HP modules blacklisted? Was the pve-ha-lrm.service systemd unit changed in any way? I wonder why pve-ha-lrm exited with a SIGABRT.
 
I don't think that pvestatd is related to this issue, because neither pve-ha-crm nor pve-ha-lrm directly depend on pvestatd. I haven't found any clues why pve-ha-lrm in particular is killed with SIGABRT...
That is my aim as well, and it is why I was asking whether it's feasible to debug that daemon in some form; the abort seems to come out of nowhere but has quite heavy consequences.
How were the HP modules blacklisted? Was the pve-ha-lrm.service systemd unit changed in any way? I wonder why pve-ha-lrm exited with a SIGABRT.
echo "blacklist hpwdt" > /etc/modprobe.d/blacklist-hp.conf
and then update-initramfs.
However, this was only applied on one cluster, as we were running different tests; clusters with and without the blacklisting are still acting up.
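For comparing nodes with and without the blacklist, a quick check of what is actually configured can help. The following is a sketch only: `is_blacklisted` is a made-up helper, and it looks only at `*.conf` files in one directory (default `/etc/modprobe.d`), so blacklist entries elsewhere would be missed.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a "blacklist <module>" line exists in any
# *.conf file of the given directory (default: /etc/modprobe.d).
is_blacklisted() {
    module=$1
    dir=${2:-/etc/modprobe.d}
    grep -Eqs "^[[:space:]]*blacklist[[:space:]]+${module}[[:space:]]*\$" "$dir"/*.conf
}

# Example calls:
#   is_blacklisted hpwdt && echo "hpwdt is blacklisted"
#   lsmod | grep -E 'hpwdt|softdog'    # which watchdog modules are loaded
```

On Debian-based systems the initramfs is typically regenerated with `update-initramfs -u` after such a change, and the blacklist normally only takes effect after a reboot.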

In another status update: removing all my PBS-linked storages did shut up pvestatd (which is a good thing).
 