Greetings community!
We have a PVE 7.1 cluster of 3 nodes with shared LVM over FC storage and HA groups configured. Nothing special is configured for fencing, i.e. everything is commented out in /etc/default/pve-ha-manager.
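For context, the file is untouched on all three nodes; if I'm quoting the stock content correctly (this is from memory, not copied off the node), it only carries the commented-out watchdog module selection:
Code:
# /etc/default/pve-ha-manager (stock content, quoted from memory)
# select watchdog module (default is softdog)
#WATCHDOG_MODULE=ipmi_watchdog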
Today I accidentally removed one of the multipath devices on one node via the dmsetup remove command, so the overlying storage went into a "question mark" state and all VMs on that node became inoperable. We decided to reboot the node gracefully; it came back online, but then started to reboot every ~10 minutes. Once we realized the node kept rebooting, we decided to migrate its VMs to other nodes, so we switched their HA groups accordingly. Some VMs migrated, but another reboot happened, and after the node came up its LRM got stuck in the "wait for agent lock" state, while the CRM on the master node complained that it couldn't find a couple of VM config files on the node they had migrated from, even though they were already present on the node they had migrated to. A simple mv between directories in /etc/pve/nodes/*/qemu-server helped to "unstick" the node, and the rest of the VMs migrated successfully.
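For the record, the mv was just relocating the stray config file under the directory of the node that is supposed to own the VM; a sketch with placeholders rather than the real IDs:
Code:
# example only: <vmid>, <old-node> and <new-node> are placeholders
mv /etc/pve/nodes/<old-node>/qemu-server/<vmid>.conf /etc/pve/nodes/<new-node>/qemu-server/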
After the node had been drained, investigation led to journalctl -u watchdog-mux.service:
Code:
-- Boot 4a56ff782af0403e98aa1dbe9513be73 --
Feb 01 14:23:20 pve-n02 systemd[1]: Started Proxmox VE watchdog multiplexer.
Feb 01 14:23:20 pve-n02 watchdog-mux[2438]: Watchdog driver 'Software Watchdog', version 0
Feb 01 14:33:00 pve-n02 systemd[1]: Stopping Proxmox VE watchdog multiplexer...
Feb 01 14:33:00 pve-n02 watchdog-mux[2438]: got terminate request
Feb 01 14:33:00 pve-n02 watchdog-mux[2438]: clean exit
Feb 01 14:33:00 pve-n02 systemd[1]: watchdog-mux.service: Succeeded.
Feb 01 14:33:00 pve-n02 systemd[1]: Stopped Proxmox VE watchdog multiplexer.
-- Boot 3c6adf94c9b04fe98308cce835e669f5 --
Feb 01 14:34:26 pve-n02 systemd[1]: Started Proxmox VE watchdog multiplexer.
Feb 01 14:34:26 pve-n02 watchdog-mux[2444]: Watchdog driver 'Software Watchdog', version 0
According to https://pve.proxmox.com/wiki/High_Availability#ha_manager_fencing, it's ha-manager which should update the softdog, but it's unclear which exact part of the stack does it (it does not in my case). The storage issue has been resolved (not sure it was relevant at all), and there are no errors in the dmesg/journalctl output on either the problem node or the master node. The cluster is operational between the problem node's reboots, and the status checkmark in the cluster summary stays green the whole time, even while the problem node is offline rebooting; CRM and Corosync do, however, detect the node state changes, according to the logs.
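In case it helps with the troubleshooting, this is what I intend to check next to see which component actually holds the watchdog and what the HA stack thinks about this node; the list below is my own guess at the relevant places to look, not something taken from the wiki page:
Code:
# overall HA view (master, lrm states, service states)
ha-manager status

# the services that make up the fencing chain on the node itself
systemctl status watchdog-mux.service pve-ha-lrm.service pve-ha-crm.service

# is softdog loaded, and which process keeps /dev/watchdog open
# (fuser comes from psmisc and may need to be installed)
lsmod | grep softdog
fuser -v /dev/watchdog

# LRM/CRM logs around the unexpected reboots
journalctl -u pve-ha-lrm -u pve-ha-crm --since "1 hour ago"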
Please advise how to troubleshoot this issue and get the node functioning properly again.
UPD: I've decided to disable the reboots for further investigation, but the corresponding softdog module option doesn't work and the node still reboots:
Code:
root@pve-n02:~# cat /etc/modprobe.d/softdog.conf
options softdog soft_noboot=1
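My understanding is that a file in /etc/modprobe.d/ only takes effect when the module is (re)loaded, so it cannot change a softdog instance that is already loaded; this is roughly how I plan to verify whether the parameter was actually picked up (the commands and messages below are my assumptions, and reloading softdog is only safe while nothing has the watchdog armed):
Code:
# does modprobe see the option line from /etc/modprobe.d/softdog.conf at all?
modprobe -c | grep softdog

# the value the module was loaded with should appear in its kernel init message
dmesg | grep -i softdog

# the option only applies on (re)load; the watchdog has to be released first,
# otherwise removing the module will fail because /dev/watchdog is still open
systemctl stop pve-ha-lrm.service pve-ha-crm.service watchdog-mux.service
modprobe -r softdog && modprobe softdog
dmesg | grep -i softdog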