PVE 4 with HA

There is an easy way to test if the watchdog works correctly:

# echo 1 >/dev/watchdog

This should trigger a reboot within 60 seconds. Does that work?
 
I just tried your simple test:

root@pve:~# echo 1 >/dev/watchdog
-bash: /dev/watchdog: Device or resource busy

Is that the correct response?
 
Ah, yes - seems watchdog-mux is still running. Try:

# systemctl stop watchdog-mux.service
# echo 1 >/dev/watchdog
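
Before stopping the service, it can help to see what is actually holding the device. The following is only a minimal sketch, assuming the `fuser` tool (from the psmisc package) is installed and the commands are run as root; `watchdog_report` is a hypothetical helper name, not something shipped with PVE.

```shell
# Sketch: report why a write to the watchdog device might fail with
# "Device or resource busy". watchdog_report is a hypothetical helper.
watchdog_report() {
    dev="${1:-/dev/watchdog}"

    # The device node only exists once a watchdog driver (e.g. softdog) is loaded.
    if [ ! -c "$dev" ]; then
        echo "missing: $dev"
        return 1
    fi

    # Show whether the softdog kernel module is loaded.
    if lsmod | grep -q '^softdog'; then
        echo "softdog: loaded"
    else
        echo "softdog: not loaded"
    fi

    # List the processes that have the device open (assumes fuser is available).
    if command -v fuser >/dev/null 2>&1; then
        fuser -v "$dev" 2>&1 || echo "no process has $dev open"
    fi
    return 0
}
```

Calling `watchdog_report` with no argument checks /dev/watchdog. If a process such as watchdog-mux shows up in the list, that would explain the "busy" message.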

 
Hi Dietmar,

The nodes on which I am testing PVE HA do not have an IPMI port, so I am using the Linux softdog.


There is an easy way to test if the watchdog works correctly:

# echo 1 >/dev/watchdog

This should trigger a reboot within 60 seconds. Does that work?


Before unplugging the network cable of node2, I tested "echo 1 > /dev/watchdog" on all 3 nodes (node1, node2 and node3). All 3 nodes answered: "-bash: /dev/watchdog: Device or resource busy"
Then I unplugged the network cable of node2. HA works: it migrated the VMs on node2 evenly to node1 and node3. After some minutes, I plugged the network cable back into node2. The node rejoined the quorum, but the VMs were not moved back to node2.

I then ran "echo 1 > /dev/watchdog" on node2, and this time the command was accepted (while node1 and node3 still report "Device or resource busy").

After running the echo command on node2, I checked the syslog and found the same error I had before:

Aug 6 14:39:53 node2 kernel: [ 1320.827845] watchdog watchdog0: watchdog did not stop!
Aug 6 14:39:57 node2 pve-ha-lrm[1171]: watchdog update failed - Broken pipe

The last line ("Broken pipe") keeps repeating on node2. The node does NOT reboot within 60 seconds.

Each time I run "echo 1 > /dev/watchdog", the same errors appear in the syslog again.
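
To see how often these two messages occur, the log can be filtered with grep. This is only a rough sketch: `count_watchdog_errors` is a hypothetical helper name, and /var/log/syslog as the default log path is an assumption that may differ on other setups.

```shell
# Sketch: count the two watchdog-related messages quoted above in a log file.
# count_watchdog_errors is a hypothetical helper; the default path is an assumption.
count_watchdog_errors() {
    log="${1:-/var/log/syslog}"
    # -c prints the number of matching lines; each -e adds one fixed pattern.
    grep -c -e 'watchdog did not stop!' -e 'watchdog update failed' "$log"
}
```

Running it before and after the echo command shows whether each attempt adds new entries.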

Are there any parameters that need to be set in the BIOS?

Thanks

Shafeek
 
Hi Mir,

As Dietmar wrote above, you might need to do this first:
# systemctl stop watchdog-mux.service
and then
# echo 1 > /dev/watchdog


Thanks for this reply. Stopping watchdog-mux first does not change anything. It ends with the same error as before.

Thanks also for the fallback; I will check it.

A+

Shafeek