proxmox 7 pve-ha-lrm failed

Mad_Max

Member
Jul 8, 2021
After the upgrade, an error occurred on one of the nodes: the VMs located on this node and included in the HA group are not moved to other nodes in the cluster.
When the service is restarted, it recovers, but as soon as you try to move a VM to another node, it crashes again. If you exclude the VM from HA, it is migrated without problems. On the other nodes this problem is not observed.

Please help, my Linux/Proxmox skills are very limited (((.

Code:
root@vp1:~# ha-manager status
quorum OK
master vp4 (active, Thu Jul 8 15:59:04 2021)
lrm master (idle, Thu Jul 8 15:59:11 2021)
lrm vp1 (old timestamp - dead?, Thu Jul 8 15:41:28 2021)
lrm vp2 (active, Thu Jul 8 15:59:14 2021)
lrm vp3 (active, Thu Jul 8 15:59:05 2021)
lrm vp4 (active, Thu Jul 8 15:59:04 2021)
service vm:100 (vp2, started)
service vm:101 (vp3, started)
service vm:102 (vp4, started)
service vm:103 (vp3, started)
service vm:104 (vp1, freeze) -- this VM is on the failed node and gets the 'freeze' status every time the LRM crashes


Code:
-- Boot 0804184aa15d464695d6acc11ae0d013 --
Jul 08 14:26:48 vp1 systemd[1]: Starting PVE Local HA Resource Manager Daemon...
Jul 08 14:26:49 vp1 pve-ha-lrm[2907]: starting server
Jul 08 14:26:49 vp1 pve-ha-lrm[2907]: status change startup => wait_for_agent_lock
Jul 08 14:26:49 vp1 systemd[1]: Started PVE Local HA Resource Manager Daemon.
Jul 08 14:33:01 vp1 pve-ha-lrm[2907]: successfully acquired lock 'ha_agent_vp1_lock'
Jul 08 14:33:01 vp1 pve-ha-lrm[2907]: ERROR: unable to open watchdog socket - No such file or directory
Jul 08 14:33:01 vp1 pve-ha-lrm[2907]: restart LRM, freeze all services
Jul 08 14:33:01 vp1 pve-ha-lrm[2907]: server stopped
Jul 08 14:33:01 vp1 systemd[1]: pve-ha-lrm.service: Main process exited, code=exited, status=255/EXCEPTION
Jul 08 14:33:01 vp1 systemd[1]: pve-ha-lrm.service: Failed with result 'exit-code'.
Jul 08 14:42:22 vp1 systemd[1]: Starting PVE Local HA Resource Manager Daemon...
Jul 08 14:42:23 vp1 pve-ha-lrm[16360]: starting server
Jul 08 14:42:23 vp1 pve-ha-lrm[16360]: status change startup => wait_for_agent_lock
Jul 08 14:42:23 vp1 systemd[1]: Started PVE Local HA Resource Manager Daemon.
Jul 08 14:42:29 vp1 pve-ha-lrm[16360]: successfully acquired lock 'ha_agent_vp1_lock'
Jul 08 14:42:29 vp1 pve-ha-lrm[16360]: ERROR: unable to open watchdog socket - No such file or directory
Jul 08 14:42:29 vp1 pve-ha-lrm[16360]: restart LRM, freeze all services
Jul 08 14:42:29 vp1 pve-ha-lrm[16360]: server stopped
Jul 08 14:42:29 vp1 systemd[1]: pve-ha-lrm.service: Main process exited, code=exited, status=255/EXCEPTION
Jul 08 14:42:29 vp1 systemd[1]: pve-ha-lrm.service: Failed with result 'exit-code'.
Jul 08 14:43:06 vp1 systemd[1]: Starting PVE Local HA Resource Manager Daemon...
Jul 08 14:43:07 vp1 pve-ha-lrm[17441]: starting server
Jul 08 14:43:07 vp1 pve-ha-lrm[17441]: status change startup => wait_for_agent_lock
Jul 08 14:43:07 vp1 systemd[1]: Started PVE Local HA Resource Manager Daemon.
Jul 08 14:44:33 vp1 pve-ha-lrm[17441]: successfully acquired lock 'ha_agent_vp1_lock'
Jul 08 14:44:33 vp1 pve-ha-lrm[17441]: ERROR: unable to open watchdog socket - No such file or directory
Jul 08 14:44:33 vp1 pve-ha-lrm[17441]: restart LRM, freeze all services
Jul 08 14:44:33 vp1 pve-ha-lrm[17441]: server stopped
Jul 08 14:44:33 vp1 systemd[1]: pve-ha-lrm.service: Main process exited, code=exited, status=255/EXCEPTION
Jul 08 14:44:33 vp1 systemd[1]: pve-ha-lrm.service: Failed with result 'exit-code'.
Jul 08 15:24:21 vp1 systemd[1]: Starting PVE Local HA Resource Manager Daemon...
Jul 08 15:24:22 vp1 pve-ha-lrm[60122]: starting server
Jul 08 15:24:22 vp1 pve-ha-lrm[60122]: status change startup => wait_for_agent_lock
Jul 08 15:24:22 vp1 systemd[1]: Started PVE Local HA Resource Manager Daemon.
Jul 08 15:24:44 vp1 systemd[1]: Stopping PVE Local HA Resource Manager Daemon...
Jul 08 15:24:45 vp1 pve-ha-lrm[60122]: received signal TERM
Jul 08 15:24:45 vp1 pve-ha-lrm[60122]: restart LRM, freeze all services
Jul 08 15:24:45 vp1 pve-ha-lrm[60122]: server stopped
Jul 08 15:24:46 vp1 systemd[1]: pve-ha-lrm.service: Succeeded.
Jul 08 15:24:46 vp1 systemd[1]: Stopped PVE Local HA Resource Manager Daemon.
Jul 08 15:24:46 vp1 systemd[1]: pve-ha-lrm.service: Consumed 1.408s CPU time.
Jul 08 15:24:46 vp1 systemd[1]: Starting PVE Local HA Resource Manager Daemon...
Jul 08 15:24:47 vp1 pve-ha-lrm[60684]: starting server
Jul 08 15:24:47 vp1 pve-ha-lrm[60684]: status change startup => wait_for_agent_lock
Jul 08 15:24:47 vp1 systemd[1]: Started PVE Local HA Resource Manager Daemon.
Jul 08 15:33:25 vp1 pve-ha-lrm[60684]: successfully acquired lock 'ha_agent_vp1_lock'
Jul 08 15:33:25 vp1 pve-ha-lrm[60684]: ERROR: unable to open watchdog socket - No such file or directory
Jul 08 15:33:25 vp1 pve-ha-lrm[60684]: restart LRM, freeze all services
Jul 08 15:33:25 vp1 pve-ha-lrm[60684]: server stopped
Jul 08 15:33:25 vp1 systemd[1]: pve-ha-lrm.service: Main process exited, code=exited, status=255/EXCEPTION
Jul 08 15:33:25 vp1 systemd[1]: pve-ha-lrm.service: Failed with result 'exit-code'.
Jul 08 15:38:08 vp1 systemd[1]: Starting PVE Local HA Resource Manager Daemon...
Jul 08 15:38:09 vp1 pve-ha-lrm[71994]: starting server
Jul 08 15:38:09 vp1 pve-ha-lrm[71994]: status change startup => wait_for_agent_lock
Jul 08 15:38:09 vp1 systemd[1]: Started PVE Local HA Resource Manager Daemon.
Jul 08 15:39:25 vp1 pve-ha-lrm[71994]: successfully acquired lock 'ha_agent_vp1_lock'
Jul 08 15:39:25 vp1 pve-ha-lrm[71994]: ERROR: unable to open watchdog socket - No such file or directory
Jul 08 15:39:25 vp1 pve-ha-lrm[71994]: restart LRM, freeze all services
Jul 08 15:39:25 vp1 pve-ha-lrm[71994]: server stopped
Jul 08 15:39:25 vp1 systemd[1]: pve-ha-lrm.service: Main process exited, code=exited, status=255/EXCEPTION
Jul 08 15:39:25 vp1 systemd[1]: pve-ha-lrm.service: Failed with result 'exit-code'.
Jul 08 15:41:12 vp1 systemd[1]: Starting PVE Local HA Resource Manager Daemon...
Jul 08 15:41:12 vp1 pve-ha-lrm[74821]: starting server
Jul 08 15:41:12 vp1 pve-ha-lrm[74821]: status change startup => wait_for_agent_lock
Jul 08 15:41:12 vp1 systemd[1]: Started PVE Local HA Resource Manager Daemon.
Jul 08 15:41:28 vp1 pve-ha-lrm[74821]: successfully acquired lock 'ha_agent_vp1_lock'
Jul 08 15:41:28 vp1 pve-ha-lrm[74821]: ERROR: unable to open watchdog socket - No such file or directory
Jul 08 15:41:28 vp1 pve-ha-lrm[74821]: restart LRM, freeze all services
Jul 08 15:41:28 vp1 pve-ha-lrm[74821]: server stopped
Jul 08 15:41:28 vp1 systemd[1]: pve-ha-lrm.service: Main process exited, code=exited, status=255/EXCEPTION
Jul 08 15:41:28 vp1 systemd[1]: pve-ha-lrm.service: Failed with result 'exit-code'.
 
Hi,

Jul 08 14:33:01 vp1 pve-ha-lrm[2907]: successfully acquired lock 'ha_agent_vp1_lock'
Jul 08 14:33:01 vp1 pve-ha-lrm[2907]: ERROR: unable to open watchdog socket - No such file or directory

The LRM sees that there's work to do and tries to get active, but it seems that the watchdog-mux.service did not come up on that node, so the LRM fails when trying to connect to it (it's a hard requirement for self-fencing).

Can you post the output of systemctl status watchdog-mux.service?
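
As a quick sketch of what I would check on the affected node (the socket path below is the usual default to my knowledge, so treat it as an assumption):

Bash:
# Is the watchdog multiplexer running at all?
systemctl status watchdog-mux.service
# The LRM talks to it over a unix socket; if the service is down, this is missing
ls -l /run/watchdog-mux.sock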
 
Code:
● watchdog-mux.service - Proxmox VE watchdog multiplexer
     Loaded: loaded (/lib/systemd/system/watchdog-mux.service; static)
     Active: failed (Result: exit-code) since Thu 2021-07-08 21:18:30 MSK; 9h ago
    Process: 332594 ExecStart=/usr/sbin/watchdog-mux (code=exited, status=1/FAILURE)
   Main PID: 332594 (code=exited, status=1/FAILURE)
        CPU: 2ms

Jul 08 21:18:30 vp1 systemd[1]: Started Proxmox VE watchdog multiplexer.
Jul 08 21:18:30 vp1 watchdog-mux[332594]: watchdog set timeout: Invalid argument
Jul 08 21:18:30 vp1 systemd[1]: watchdog-mux.service: Main process exited, code=exited, status=1/FAILURE
Jul 08 21:18:30 vp1 systemd[1]: watchdog-mux.service: Failed with result 'exit-code'.


How can I restore this?
 
What's the general vendor of the server and what CPU is in it?

To make sense of that error we first need to find out which watchdog module is loaded.
So can you please also post the output of lsmod here, so I can check the list of loaded modules?

After we've addressed whatever is wrong with the watchdog module, you can just start that service again.
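
For reference, one quick way to narrow the lsmod output down to likely watchdog drivers (just a convenience filter on my part, not an official check):

Bash:
# Show only modules whose names hint at a watchdog driver
lsmod | grep -Ei 'wdt|watchdog|softdog|ipmi'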
 
Code:
Module                  Size  Used by
tcp_diag               16384  0
inet_diag              24576  1 tcp_diag
ceph                  442368  1
libceph               409600  1 ceph
fscache               380928  1 ceph
ebtable_filter         16384  0
ebtables               36864  1 ebtable_filter
ip_set                 53248  0
ip6table_raw           16384  0
iptable_raw            16384  0
ip6table_filter        16384  0
ip6_tables             32768  2 ip6table_filter,ip6table_raw
sctp                  356352  2
ip6_udp_tunnel         16384  1 sctp
udp_tunnel             20480  1 sctp
iptable_filter         16384  0
bpfilter               16384  0
bonding               172032  0
tls                    90112  1 bonding
nfnetlink_log          20480  1
nfnetlink              20480  3 ip_set,nfnetlink_log
snd_hda_codec_hdmi     65536  1
intel_rapl_msr         20480  0
intel_rapl_common      24576  1 intel_rapl_msr
sb_edac                24576  0
snd_hda_codec_realtek   143360  1
snd_hda_codec_generic    81920  1 snd_hda_codec_realtek
x86_pkg_temp_thermal    20480  0
intel_powerclamp       20480  0
ledtrig_audio          16384  1 snd_hda_codec_generic
snd_hda_intel          53248  0
snd_intel_dspcfg       28672  1 snd_hda_intel
soundwire_intel        40960  1 snd_intel_dspcfg
soundwire_generic_allocation    16384  1 soundwire_intel
soundwire_cadence      32768  1 soundwire_intel
kvm_intel             282624  12
nouveau              2002944  1
video                  49152  1 nouveau
drm_ttm_helper         16384  1 nouveau
snd_hda_codec         147456  4 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec_realtek
kvm                   823296  1 kvm_intel
ttm                    73728  2 drm_ttm_helper,nouveau
snd_hda_core           94208  5 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek
drm_kms_helper        245760  1 nouveau
snd_hwdep              16384  1 snd_hda_codec
soundwire_bus          77824  3 soundwire_intel,soundwire_generic_allocation,soundwire_cadence
snd_soc_core          286720  1 soundwire_intel
cec                    53248  1 drm_kms_helper
snd_compress           24576  1 snd_soc_core
ac97_bus               16384  1 snd_soc_core
snd_pcm_dmaengine      16384  1 snd_soc_core
irqbypass              16384  4 kvm
snd_pcm               118784  8 snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,soundwire_intel,snd_compress,snd_soc_core,snd_hda_core,snd_pcm_dmaengine
rc_core                57344  1 cec
crct10dif_pclmul       16384  1
ghash_clmulni_intel    16384  0
fb_sys_fops            16384  1 drm_kms_helper
syscopyarea            16384  1 drm_kms_helper
snd_timer              40960  1 snd_pcm
sysfillrect            16384  1 drm_kms_helper
aesni_intel           372736  8
sysimgblt              16384  1 drm_kms_helper
snd                    94208  10 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hwdep,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek,snd_timer,snd_compress,snd_soc_core,snd_pcm
soundcore              16384  1 snd
crypto_simd            16384  1 aesni_intel
cryptd                 24576  2 crypto_simd,ghash_clmulni_intel
glue_helper            16384  1 aesni_intel
rapl                   20480  0
ioatdma                57344  0
intel_wmi_thunderbolt    20480  0
intel_cstate           20480  0
pcspkr                 16384  0
joydev                 28672  0
input_leds             16384  0
efi_pstore             16384  0
mxm_wmi                16384  1 nouveau
zfs                  4186112  6
zunicode              331776  1 zfs
mac_hid                16384  0
zzstd                 532480  1 zfs
zlua                  151552  1 zfs
zavl                   16384  1 zfs
icp                   294912  1 zfs
zcommon                98304  2 zfs,icp
znvpair                98304  2 zfs,zcommon
spl                   102400  6 zfs,icp,zzstd,znvpair,zcommon,zavl
vhost_net              32768  1
vhost                  53248  1 vhost_net
vhost_iotlb            16384  1 vhost
tap                    24576  1 vhost_net
ib_iser                40960  0
rdma_cm               118784  1 ib_iser
iw_cm                  49152  1 rdma_cm
ib_cm                 122880  1 rdma_cm
ib_core               360448  4 rdma_cm,iw_cm,ib_iser,ib_cm
iscsi_tcp              24576  0
libiscsi_tcp           32768  1 iscsi_tcp
libiscsi               65536  3 libiscsi_tcp,iscsi_tcp,ib_iser
scsi_transport_iscsi   126976  5 libiscsi_tcp,iscsi_tcp,ib_iser,libiscsi
nct7904                20480  0
coretemp               20480  0
drm                   548864  5 drm_kms_helper,drm_ttm_helper,ttm,nouveau
sunrpc                544768  1
ip_tables              32768  2 iptable_filter,iptable_raw
x_tables               49152  7 ebtables,ip6table_filter,ip6table_raw,iptable_filter,ip6_tables,iptable_raw,ip_tables
autofs4                45056  2
btrfs                1331200  0
blake2b_generic        20480  0
xor                    24576  1 btrfs
raid6_pq              114688  1 btrfs
dm_thin_pool           69632  1
dm_persistent_data     73728  1 dm_thin_pool
dm_bio_prison          20480  1 dm_thin_pool
dm_bufio               32768  1 dm_persistent_data
libcrc32c              16384  4 dm_persistent_data,btrfs,libceph,sctp
hid_generic            16384  0
usbmouse               16384  0
usbkbd                 16384  0
usbhid                 57344  0
hid                   135168  2 usbhid,hid_generic
crc32_pclmul           16384  0
ixgbe                 339968  0
xfrm_algo              16384  1 ixgbe
xhci_pci               20480  0
igb                   229376  0
mdio                   16384  1 ixgbe
ahci                   40960  4
xhci_pci_renesas       20480  1 xhci_pci
i2c_i801               32768  0
ehci_pci               20480  0
i2c_algo_bit           16384  2 igb,nouveau
lpc_ich                24576  0
i2c_smbus              20480  1 i2c_i801
dca                    16384  3 igb,ioatdma,ixgbe
xhci_hcd              290816  1 xhci_pci
ehci_hcd               86016  1 ehci_pci
libahci                36864  1 ahci
aacraid               118784  9
wmi                    32768  3 intel_wmi_thunderbolt,mxm_wmi,nouveau
 
Hmm, weird, I don't see any watchdog module I recognize - is there even a watchdog device?

Bash:
ls -l /dev/watchdog*

If not please try to load the softdog and then start the watchdog-mux again:

Bash:
modprobe softdog
systemctl start watchdog-mux.service
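
And as a sanity check afterwards (my own addition, assuming the softdog module loads cleanly):

Bash:
# softdog should announce itself in the kernel log and show up in lsmod
dmesg | grep -i softdog
lsmod | grep softdog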
 
Code:
root@vp1:~# ls -l /dev/watchdog*
crw------- 1 root root 10, 130 Jul 8 14:25 /dev/watchdog
crw------- 1 root root 244, 0 Jul 8 14:25 /dev/watchdog0
root@vp1:~# modprobe softdog
root@vp1:~# systemctl start watchdog-mux.service

It did not help; when trying to start the migration, the LRM crashed again.

Code:
root@vp1:~# ha-manager status
quorum OK
master vp4 (active, Fri Jul  9 14:05:52 2021)
lrm master (idle, Fri Jul  9 14:05:54 2021)
lrm vp1 (old timestamp - dead?, Fri Jul  9 14:04:56 2021)
lrm vp2 (active, Fri Jul  9 14:05:48 2021)
lrm vp3 (idle, Fri Jul  9 14:05:54 2021)
lrm vp4 (active, Fri Jul  9 14:05:47 2021)
service vm:100 (vp2, started)
service vm:101 (vp4, started)
service vm:102 (vp4, started)
service vm:103 (vp4, started)
service vm:104 (vp1, migrate)
 
It did not help (
Code:
root@vp1:~# modprobe softdog
root@vp1:~# systemctl start watchdog-mux.service
root@vp1:~# systemctl status watchdog-mux.service
● watchdog-mux.service - Proxmox VE watchdog multiplexer
     Loaded: loaded (/lib/systemd/system/watchdog-mux.service; static)
     Active: failed (Result: exit-code) since Fri 2021-07-09 14:21:07 MSK; 2s ago
    Process: 1365323 ExecStart=/usr/sbin/watchdog-mux (code=exited, status=1/FAILURE)
   Main PID: 1365323 (code=exited, status=1/FAILURE)
        CPU: 2ms

Jul 09 14:21:07 vp1 systemd[1]: Started Proxmox VE watchdog multiplexer.
Jul 09 14:21:07 vp1 watchdog-mux[1365323]: watchdog set timeout: Invalid argument
Jul 09 14:21:07 vp1 systemd[1]: watchdog-mux.service: Main process exited, code=exited, status=1/FAILURE
Jul 09 14:21:07 vp1 systemd[1]: watchdog-mux.service: Failed with result 'exit-code'.
 
Code:
systemctl list-units --failed
UNIT                   LOAD   ACTIVE SUB    DESCRIPTION
● pve-ha-lrm.service   loaded failed failed PVE Local HA Resource Manager Daemon
● watchdog-mux.service loaded failed failed Proxmox VE watchdog multiplexer

Can you somehow reinstall these services?
 
Can you somehow reinstall these services?
Yes, with apt install --reinstall pve-ha-manager, but I'd be a bit surprised if that'd help - worth a try though.
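
If you want to try it, roughly like this (the service restart afterwards is my addition, not part of the reinstall itself):

Bash:
# Reinstall the HA manager package, then try to bring the watchdog and LRM back up
apt install --reinstall pve-ha-manager
systemctl restart watchdog-mux.service pve-ha-lrm.service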

To me it seems that something is off with your watchdog, which is IMO weird, as it seems the softdog is used. That one is simple and the same on all kernels, so if it broke, I'd expect it to break in a way that far more people would notice.
 
Thomas, you were right, it didn't work. How can I fix this problem? Or do I need to reinstall Proxmox entirely? I'd rather not do that; I have large OSD volumes and I'm afraid of losing them.
Code:
Jul 16 10:28:13 vp1 systemd[1]: Starting PVE Local HA Resource Manager Daemon...
Jul 16 10:28:13 vp1 pve-ha-lrm[81465]: starting server
Jul 16 10:28:13 vp1 pve-ha-lrm[81465]: status change startup => wait_for_agent_lock
Jul 16 10:28:13 vp1 systemd[1]: Started PVE Local HA Resource Manager Daemon.
Jul 16 10:30:09 vp1 pve-ha-lrm[81465]: successfully acquired lock 'ha_agent_vp1_lock'
Jul 16 10:30:09 vp1 pve-ha-lrm[81465]: ERROR: unable to open watchdog socket - No such file or directory
Jul 16 10:30:09 vp1 pve-ha-lrm[81465]: restart LRM, freeze all services
Jul 16 10:30:09 vp1 pve-ha-lrm[81465]: server stopped
Jul 16 10:30:09 vp1 systemd[1]: pve-ha-lrm.service: Main process exited, code=exited, status=255/EXCEPTION
Jul 16 10:30:09 vp1 systemd[1]: pve-ha-lrm.service: Failed with result 'exit-code'.
 

Attachments

  • 11.txt
    19.1 KB · Views: 5
Just so weird, why only on one node? How did they get set up?

Can you also please post all output of:

Bash:
head -n -0 /etc/modprobe.d/*
head -n -0 /lib/modprobe.d/*
head -n -0 /etc/default/pve-ha-manager

Can be long text, so you may need to send it as attachment here.

Also, did you reboot that node since the upgrade, or since these issues started happening?
 
Thomas, for me this is also strange; besides, this node was the fourth of five to be updated.
Yes, I have rebooted several times, with no problem other than this one.
Output in attachment
 

Attachments

  • 11.txt
    9.4 KB · Views: 5
Hi, this just happened to me as well. I upgraded three nodes in a cluster as follows.
pve6 nautilus -> pve6 octopus -> pve7 octopus.
The final step broke the watchdog on one of the nodes. All identical hardware (Supermicro), same HW watchdog setting (disabled).
That node doesn't load the softdog module anymore. I tried all the suggestions that Max tried, and I get the same errors.
Any ideas what I can do?
 
If I do a
Code:
grep -r softdog /var/log/*
on a working node I see this in the logs following a reboot:-
Code:
syslog:Aug  6 13:03:05 pve-clstr-02 kernel: [   13.261757] softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
syslog:Aug  6 13:03:05 pve-clstr-02 kernel: [   13.261761] softdog:              soft_reboot_cmd=<not set> soft_active_on_boot=0
On the 'broken' node, I see nothing.
Hope that helps diagnose it?

Also on good nodes :-
Code:
root@pve-clstr-02:/etc# dmesg | grep dog
[    0.457309] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[   13.261757] softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
[   13.261761] softdog:              soft_reboot_cmd=<not set> soft_active_on_boot=0
On bad nodes
Code:
root@pve-clstr-01:/etc# dmesg | grep dog
[    0.462960] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
 
Last edited:
HAHAHA! Yes! I fixed it. What I lack in talent, I make up for in perseverance.

OK, here's what to do.
I noticed that softdog wasn't starting, and without that, watchdog-mux.service fails with an 'invalid argument' error message. Without softdog loading, HA won't work. However, I noticed that there was still a /dev/watchdog0 on my broken node. If I ran modprobe softdog, this made a new /dev/watchdog1, which isn't enough to get the watchdog-mux service running.
So, what driver is controlling my rogue /dev/watchdog0 ?
Code:
ls -l /dev/watchdog0
This tells you major/minor numbers for the device, in my case 244:0.
From there, look around in /sys/dev/char for these numbers. Then:
Code:
readlink /sys/dev/char/244\:0/device/driver
../../../../../bus/i2c/drivers/nct7904
It seems it's loading another watchdog, in my case the driver for the nct7904 device on my system. From there, I found that some idiot (D'oh) had added this module to /etc/modules to try to use it to monitor the motherboard sensors, and by removing this entry, everything is working again. The module includes a HW watchdog, I presume, which overrides softdog. The nct7904 module must have a different setup, which caused the 'invalid argument' error.
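
For anyone who wants to repeat that lookup without typing the numbers by hand, here is a rough sketch of the same steps as a loop (my own convenience script, not from the original procedure; it assumes the /sys/dev/char layout described above):

Bash:
#!/bin/bash
# Print the driver behind each watchdog character device
for dev in /dev/watchdog*; do
    # stat prints the major/minor numbers in hex for device nodes; convert them to decimal for sysfs
    majmin=$(printf '%d:%d' "0x$(stat -c %t "$dev")" "0x$(stat -c %T "$dev")")
    driver=$(readlink -f "/sys/dev/char/$majmin/device/driver" 2>/dev/null)
    printf '%s (%s) -> %s\n' "$dev" "$majmin" "${driver:-no driver link (e.g. softdog)}"
done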
I hope this helps someone!
 
Last edited:
The module includes a HW watchdog, I presume, which overrides softdog.
Yes, that's the case, and quite a few HW watchdogs are a bit error-prone; that's why we blacklist them all by default with a file shipped by each pve-kernel-x.y package, generated at compile time. For example, check /lib/modprobe.d/blacklist_pve-kernel-5.11.22-3-pve.conf for a list of all modules that would enable a watchdog; if one of them is manually loaded, the softdog isn't used by default.
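
A quick way to check whether a given module (like the nct7904 here) shows up in those blacklist files, or is being loaded explicitly from somewhere else (a plain grep; adjust the module name to your case):

Bash:
# Is the module mentioned in any modprobe config, and is anything loading it explicitly?
grep -r nct7904 /lib/modprobe.d/ /etc/modprobe.d/ /etc/modules 2>/dev/null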
 
