Temp disable of watchdog (Softdog)

dazza76 · Aug 18, 2023

I have a scenario where I need to do some upgrades on switches and Fibre Channel devices, and I know from experience that this causes the watchdog to get annoyed and reboot the host.

Questions:
Is it possible to just stop the watchdog-mux service so that it won't reboot?
Is there another way to temporarily disable the watchdog process?
Can I increase the timeout to a value that lets the host ride it out, e.g. 5 minutes?

Cheers

Chris · Aug 18, 2023

Hi,
so your issue is that the cluster network is down while performing the upgrades? In that case I would recommend using a second Corosync link for redundancy, so nodes don't get fenced.

aaron · Aug 18, 2023

Two other options, besides having stable Corosync network(s):
Disable the HA services. They will be up and running again after a reboot, though!
First disable pve-ha-lrm on all nodes, then pve-ha-crm. Once done, enable them in the same order.
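
A minimal sketch of that ordering, assuming "disable" here means stopping the systemd units and that the nodes are reachable as root over SSH. The node names in NODES are placeholders, and DRY_RUN and the helper functions are made up for illustration. With DRY_RUN=1 (the default) it only prints the commands it would run:

```shell
#!/bin/sh
# Sketch only: run the same systemctl action for pve-ha-lrm on every node
# first, then for pve-ha-crm. NODES is a hypothetical list of hostnames.
NODES="${NODES:-node1 node2 node3}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

ha_services() {
    # $1 is the systemctl action: stop or start.
    action="$1"
    for svc in pve-ha-lrm pve-ha-crm; do
        for node in $NODES; do
            run ssh "root@$node" systemctl "$action" "$svc"
        done
    done
}

ha_services stop
```

After the maintenance, `ha_services start` brings the services back in the same order (LRM on all nodes, then CRM).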

Alternatively, set all HA resources to ignored. After about 10 minutes the CRMs should switch back to idle mode, which means they won't fence if they lose the cluster connection.
This needs to be done rather manually for now, but it can be automated with a small CLI one-liner:

Code:

for res in $(ha-manager status | grep service | awk '{print $2}'); do echo "setting to 'ignored': $res"; ha-manager set $res --state ignored; done

Once done, set all HA resources back to started:

Code:

for res in $(ha-manager status | grep service | awk '{print $2}'); do echo "setting to 'started': $res"; ha-manager set $res --state started; done

dazza76 · Aug 20, 2023

Thanks Aaron, will give that a go. It's actually the FC switch that seems to be the more sensitive of the switches.
I can replicate it by flapping the FC connection manually, i.e.
echo "0000:3c:00.0" > /sys/bus/pci/drivers/qla2xxx/unbind
To activate the port again:
echo "0000:3c:00.0" > /sys/bus/pci/drivers/qla2xxx/bind

If I bounce one a few times, the host hard resets.
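
The bounce can be scripted roughly like this. It is only a sketch: the function name, the repeat count and the pause are made up for illustration, and DRIVER_DIR just defaults to the qla2xxx sysfs path from above. Nothing runs until the function is called, and on an HA node calling it may well trigger the hard reset described here.

```shell
#!/bin/sh
# Sketch only: unbind and rebind one HBA port a few times.
# DRIVER_DIR can be overridden; it defaults to the qla2xxx driver directory.
DRIVER_DIR="${DRIVER_DIR:-/sys/bus/pci/drivers/qla2xxx}"

flap_port() {
    # $1 = PCI address, $2 = number of flaps, $3 = pause in seconds
    addr="$1"
    count="${2:-3}"
    pause="${3:-10}"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "flap $i/$count: unbinding $addr"
        echo "$addr" > "$DRIVER_DIR/unbind"
        sleep "$pause"
        echo "flap $i/$count: binding $addr"
        echo "$addr" > "$DRIVER_DIR/bind"
        sleep "$pause"
        i=$((i + 1))
    done
}
```

For example, `flap_port 0000:3c:00.0 3 10` would bounce the port three times with 10 seconds between each step.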

Yes, there are 2, and only 1 flaps at a time, but it's enough to set off the watchdog.

I may look at enabling the hardware watchdog, as it may give me a little more control.
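
As far as I understand the Proxmox HA documentation, the watchdog module is selected in /etc/default/pve-ha-manager and a reboot is needed afterwards. The module name below is only an example and depends on the hardware, so treat this as a sketch rather than a tested config:

```shell
# /etc/default/pve-ha-manager
# select watchdog module (softdog is used when nothing is set)
# ipmi_watchdog is an example; use the module that matches your hardware
WATCHDOG_MODULE=ipmi_watchdog
```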

Cheers
D
