pve-ha-crm breaking our cluster... again

Michiel_1afa

Well-Known Member
Mar 5, 2021
Three times this week already: last Saturday, yesterday and this morning.

pve-ha-crm decides to die without cause, or at least without a usable message.

"watchdog update failed - Broken pipe"

The server is not doing anything special at that moment, no heavy load, no network issues, no issues on other nodes in the cluster.

journalctl for pve-ha-crm does not show any messages before the reboot. Incidentally, the message above ("watchdog update failed - Broken pipe") is not stored in the journal, which makes debugging past incidents even harder.
pve-ha-lrm, pve-cluster and corosync all show no errors, and we have 0 lost packets on the network (taken from switch stats, since the PVE node had just rebooted).
pvestatd has a round time of 5-8 seconds for this cluster

System load at the time of the reboot was ~10% CPU and ~40% RAM. A single migration towards this machine was running.

What else can I check? After more than 2 years of this issue happening randomly, it would now be nice to know what the hell is going on.

As far as software versions go, I did a full update of the cluster yesterday (enterprise repo) hoping to solve this issue.
 
Hi!

Does this happen on a specific cluster node or all cluster nodes?
What kind of watchdog is used on the node(s) where this happens?
Are there any kernel parameters set?
Does any other software on the cluster nodes compete for the /dev/watchdog device?

Otherwise, it would still be good to have the output of journalctl -k and journalctl -u corosync -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm in the time frame where this happens for context on the nodes where this happens.
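As a rough sketch of how such a time window could be collected: the helper name `ha_journal` and the `DRY_RUN` switch below are made up for illustration, while `--since`, `--until` and `--no-pager` are standard journalctl options. It assumes the journal is persistent, so that entries from before the reboot are still available.

```shell
# Hypothetical helper (not part of Proxmox VE): build the journalctl call for
# one time window. With DRY_RUN=1 it only prints the command.
ha_journal() {
    win_start=$1
    win_end=$2
    set -- journalctl --no-pager \
        -u corosync -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm \
        --since "$win_start" --until "$win_end"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"    # printed without quoting, for review only
    else
        "$@"
    fi
}
```

Called with two timestamps that bracket the reboot, it runs the same unit selection as above, restricted to that window. Adding `-k` in a second call with the same window would cover the kernel messages.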
 
Does this happen on a specific cluster node or all cluster nodes?
It happens often on one cluster and less often on another ("often" being 2-3x per month if we do not touch anything manually).
What kind of watchdog is used on the node(s) where this happens?
Softdog. We're on HP machines, so the HP modules are already blacklisted per an earlier 'solution'.
Are there any kernel parameters set?
Nothing custom.
Does any other software on the cluster nodes compete for the /dev/watchdog device?
I hope not, it's only Proxmox installed there. The logs do not seem to indicate any problems, as stated in the first post.
Otherwise, it would still be good to have the output of journalctl -k and journalctl -u corosync -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm in the time frame where this happens for context on the nodes where this happens.
journalctl -k - added as attachment.
journalctl -u....: see below. (We of course have data from before 8:18 this morning, but that will be irrelevant.) Below is everything from 8 am until the reboot at 8:47.

Code:
Jun 10 08:18:31 pve1 pve-ha-crm[2626]: got crm command: migrate vm:170103 pve1
Jun 10 08:18:31 pve1 pve-ha-crm[2626]: migrate service 'vm:170103' to node 'pve1'
Jun 10 08:18:31 pve1 pve-ha-crm[2626]: service 'vm:170103': state changed from 'started' to 'migrate'  (node = pve3, target = pve1)
Jun 10 08:19:11 pve1 pve-ha-crm[2626]: service 'vm:170103': state changed from 'migrate' to 'started'  (node = pve1)
Jun 10 08:46:52 pve1 pve-ha-crm[2626]: got crm command: migrate vm:43602 pve1
Jun 10 08:46:52 pve1 pve-ha-crm[2626]: migrate service 'vm:43602' to node 'pve1'
Jun 10 08:46:52 pve1 pve-ha-crm[2626]: service 'vm:43602': state changed from 'started' to 'migrate'  (node = pve3, target = pve1)
Jun 10 08:47:37 pve1 watchdog-mux[1980]: client (PID 16532) did not stop watchdog - disable watchdog updates
Jun 10 08:47:37 pve1 systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=6/ABRT
Jun 10 08:47:37 pve1 systemd[1]: pve-ha-lrm.service: Failed with result 'signal'.
Jun 10 08:47:37 pve1 systemd[1]: pve-ha-lrm.service: Consumed 29min 50.272s CPU time, 234.8M memory peak.
Jun 10 08:47:38 pve1 watchdog-mux[1980]: exit watchdog-mux with active connections
Jun 10 08:47:38 pve1 systemd[1]: watchdog-mux.service: Deactivated successfully.
Jun 10 08:47:38 pve1 systemd[1]: watchdog-mux.service: Consumed 1.355s CPU time, 2M memory peak.
-- Boot 96a8addbf9aa48ba8572c1d19dd47fe7 --
Jun 10 08:51:27 pve1 systemd[1]: Started watchdog-mux.service - Proxmox VE watchdog multiplexer.

The "watchdog update failed - Broken pipe" notice showed up at 8:47:42 in our external monitoring system.
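To put the timestamps above side by side, here is a minimal sketch that converts same-day `HH:MM:SS` values to seconds and subtracts them. The function name `secs_between` is made up for illustration, and the input times are copied from the log excerpt and the monitoring notice.

```shell
# Seconds between two HH:MM:SS timestamps on the same day.
secs_between() {
    printf '%s %s\n' "$1" "$2" | awk '{
        split($1, a, ":"); split($2, b, ":")
        print (b[1]*3600 + b[2]*60 + b[3]) - (a[1]*3600 + a[2]*60 + a[3])
    }'
}

secs_between 08:46:52 08:47:37   # migrate command -> watchdog-mux disables updates: 45
secs_between 08:47:37 08:47:42   # watchdog-mux message -> external "Broken pipe" notice: 5
```

So the watchdog-mux message appears 45 seconds after the migrate command, and the external notice follows 5 seconds later. This only restates the timing; it says nothing about the cause.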
 

Thanks, could you also provide the kernel syslog for the boot before the node rebooted itself?
 
journalctl -u....: see below. (We of course have data from before 8:18 this morning, but that will be irrelevant.) Below is everything from 8 am until the reboot at 8:47.
Does the pve-ha-lrm service acquire its lock on pve1 and pve3 correctly?
 
Thanks, could you also provide the kernel syslog for the boot before the node rebooted itself?
Thought you might want that: "journalctl -k -b -1" attached.

Does the pve-ha-lrm service acquire its lock on pve1 and pve3 correctly?
As far as I know, yes. We scrutinize the logs daily and I'm not aware of any errors or warnings on this part. I scanned the logs on all servers for "unable to acquire lock" from Saturday until today (5 days) and got 0 hits, but 3 reboots.
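For anyone who wants to repeat that scan, a minimal sketch follows. The function name `count_lock_errors` is made up for illustration, the search string is the one quoted above, and the journalctl line in the comment assumes a persistent journal on each node.

```shell
# Count lines on stdin that mention a failed lock acquisition.
count_lock_errors() {
    # grep -c prints 0 but exits with status 1 when nothing matches,
    # so the exit status is neutralised here.
    grep -c "unable to acquire lock" || true
}

# Assumed usage, run on each node:
#   journalctl --no-pager --since "5 days ago" -u pve-ha-lrm -u pve-ha-crm | count_lock_errors
```

A result of 0 on every node would match what is described above.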
 
