Greetings community!
We have a PVE 7.1 cluster of 3 nodes with shared LVM over FC storage and HA groups configured. Nothing special is configured for fencing, i.e. everything is commented out in /etc/default/pve-ha-manager.
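For context, the file is untouched on all three nodes; if I'm quoting the stock content correctly (this is from memory, not copied off the node), it only carries the commented-out watchdog module selection:
Code:
# /etc/default/pve-ha-manager (stock content, quoted from memory)
# select watchdog module (default is softdog)
#WATCHDOG_MODULE=ipmi_watchdog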
Today I accidentally removed one of the multipath devices on one node via the dmsetup remove command, so the overlying storage went into a "question mark" state and all VMs on that node became inoperable. We decided to reboot the node gracefully; it came back online, but then started to reboot every ~10 minutes. Once we realized the node kept rebooting, we decided to migrate its VMs to other nodes, so we switched their HA groups accordingly. Some VMs migrated, but another reboot happened, and after the node came up its LRM got stuck in the "wait for agent lock" state, while the CRM on the master node complained that it couldn't find a couple of VM config files on the node they had migrated from, even though they were already present on the node they had migrated to. A simple mv between directories in /etc/pve/nodes/*/qemu-server helped to "unstick" the node, and the rest of the VMs migrated successfully.
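For the record, the mv was just relocating the stray config file under the directory of the node that is supposed to own the VM; a sketch with placeholders rather than the real IDs:
Code:
# example only: <vmid>, <old-node> and <new-node> are placeholders
mv /etc/pve/nodes/<old-node>/qemu-server/<vmid>.conf /etc/pve/nodes/<new-node>/qemu-server/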
After the node had been drained, investigation led to journalctl -u watchdog-mux.service:
Code:
-- Boot 4a56ff782af0403e98aa1dbe9513be73 --
Feb 01 14:23:20 pve-n02 systemd[1]: Started Proxmox VE watchdog multiplexer.
Feb 01 14:23:20 pve-n02 watchdog-mux[2438]: Watchdog driver 'Software Watchdog', version 0
Feb 01 14:33:00 pve-n02 systemd[1]: Stopping Proxmox VE watchdog multiplexer...
Feb 01 14:33:00 pve-n02 watchdog-mux[2438]: got terminate request
Feb 01 14:33:00 pve-n02 watchdog-mux[2438]: clean exit
Feb 01 14:33:00 pve-n02 systemd[1]: watchdog-mux.service: Succeeded.
Feb 01 14:33:00 pve-n02 systemd[1]: Stopped Proxmox VE watchdog multiplexer.
-- Boot 3c6adf94c9b04fe98308cce835e669f5 --
Feb 01 14:34:26 pve-n02 systemd[1]: Started Proxmox VE watchdog multiplexer.
Feb 01 14:34:26 pve-n02 watchdog-mux[2444]: Watchdog driver 'Software Watchdog', version 0
According to https://pve.proxmox.com/wiki/High_Availability#ha_manager_fencing, it's ha-manager which should update the softdog, but it's unclear which exact part of the stack does it (it does not in my case). The storage issue has been resolved (not sure it was relevant at all), and there are no errors in the dmesg/journalctl output on either the problem node or the master node. The cluster is operational between the problem node's reboots, and the status checkmark in the cluster summary stays green the whole time, even while the problem node is offline rebooting; CRM and Corosync do, however, detect the node state changes, according to the logs.
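In case it helps with the troubleshooting, this is what I intend to check next to see which component actually holds the watchdog and what the HA stack thinks about this node; the list below is my own guess at the relevant places to look, not something taken from the wiki page:
Code:
# overall HA view (master, lrm states, service states)
ha-manager status

# the services that make up the fencing chain on the node itself
systemctl status watchdog-mux.service pve-ha-lrm.service pve-ha-crm.service

# is softdog loaded, and which process keeps /dev/watchdog open
# (fuser comes from psmisc and may need to be installed)
lsmod | grep softdog
fuser -v /dev/watchdog

# LRM/CRM logs around the unexpected reboots
journalctl -u pve-ha-lrm -u pve-ha-crm --since "1 hour ago"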
Please advise how to troubleshoot this issue and get the node functioning properly again.
UPD: I've decided to disable the reboots for further investigation, but the corresponding softdog module option doesn't work and the node still reboots:
Code:
root@pve-n02:~# cat /etc/modprobe.d/softdog.conf
options softdog soft_noboot=1
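My understanding is that a file in /etc/modprobe.d/ only takes effect when the module is (re)loaded, so it cannot change a softdog instance that is already loaded; this is roughly how I plan to verify whether the parameter was actually picked up (the commands and messages below are my assumptions, and reloading softdog is only safe while nothing has the watchdog armed):
Code:
# does modprobe see the option line from /etc/modprobe.d/softdog.conf at all?
modprobe -c | grep softdog

# the value the module was loaded with should appear in its kernel init message
dmesg | grep -i softdog

# the option only applies on (re)load; the watchdog has to be released first,
# otherwise removing the module will fail because /dev/watchdog is still open
systemctl stop pve-ha-lrm.service pve-ha-crm.service watchdog-mux.service
modprobe -r softdog && modprobe softdog
dmesg | grep -i softdog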