Temp disable of watchdog (Softdog)

dazza76

Renowned Member
May 25, 2010
I have a scenario where I need to upgrade some switches and Fibre Channel devices, and I know from experience that this causes the watchdog to get annoyed and reboot the host.

Questions:
Is it possible to just stop the watchdog-mux service so that the host won't reboot?
Is there another way to temporarily disable the watchdog process?
Can I increase the timeout to a value that lets the host ride it out, e.g. 5 minutes?


Cheers
 
Hi,
so your issue is that the cluster network goes down while you perform the upgrades? In that case I would recommend using a second Corosync link for redundancy, so the nodes don't get fenced.
 
Two other options, besides having stable Corosync network(s):
Disable the HA services. Note that they will be up and running again after a reboot!
First disable pve-ha-lrm on all nodes, then pve-ha-crm. Once done, enable them again in the same order.
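As a rough sketch of that first option: the helper below wraps the per-node step. It assumes that "disable" here means stopping the units with systemctl (which would fit the note that they come back after a reboot). The unit names pve-ha-lrm and pve-ha-crm are the ones named above; the function name ha_unit and the DRY_RUN variable are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Proxmox): run a systemctl action for
# one HA unit on the local node, or only print it when DRY_RUN=1.
# ASSUMPTION: "disabling" the HA services means `systemctl stop <unit>`.
ha_unit() {
    action="$1"   # "stop" or "start"
    unit="$2"     # "pve-ha-lrm" or "pve-ha-crm"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "systemctl $action $unit"
    else
        systemctl "$action" "$unit"
    fi
}

# Order described in the post (run as root):
#   1. on every node:  ha_unit stop pve-ha-lrm
#   2. on every node:  ha_unit stop pve-ha-crm
# After the maintenance, same order with "start".
```

Running it with DRY_RUN=1 first only prints the command, which is a cheap way to check the order before touching a production node.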

Alternatively, set all HA resources to ignored. After about 10 minutes the CRMs should switch back to idle mode, which means they won't fence if they lose the cluster connection.
This has to be done rather manually for now, but it can be automated with a small CLI one-liner:
Code:
for res in $(ha-manager status | grep service | awk '{print $2}'); do echo "setting to 'ignored': $res"; ha-manager set $res --state ignored; done

Once done, set all HA resources back to started:
Code:
for res in $(ha-manager status | grep service | awk '{print $2}'); do echo "setting to 'started': $res"; ha-manager set $res --state started; done
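Before starting the maintenance it may be worth confirming that the HA daemons really have gone idle after setting the resources to ignored. The sketch below reads `ha-manager status` output on stdin and checks the master and lrm lines. The function name all_ha_idle is hypothetical, and the line format it expects (for example `lrm node1 (idle, <timestamp>)`) is an assumption about the status output, so compare it against what your version prints.

```shell
#!/bin/sh
# Hypothetical check: succeed only if every "master" and "lrm" line in
# the `ha-manager status` output on stdin reports the idle state.
# ASSUMPTION: such lines look like "lrm <node> (idle, <timestamp>)".
all_ha_idle() {
    awk '
        $1 == "master" || $1 == "lrm" {
            seen = 1
            if ($3 !~ /^\(idle/) busy = 1   # third field is "(idle," when idle
        }
        END {
            if (seen && !busy) exit 0
            exit 1
        }
    '
}

# Usage on a cluster node:
#   ha-manager status | all_ha_idle && echo "HA is idle"
```

It fails when no master or lrm line is found at all, so unexpected output is not mistaken for "idle".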
 
Thanks Aaron, will give that a go. It's actually the FC switch that seems to be the more sensitive of the switches.
I can replicate it by flapping the FC connection manually, i.e.
echo "0000:3c:00.0" > /sys/bus/pci/drivers/qla2xxx/unbind
To activate the port again:
echo "0000:3c:00.0" > /sys/bus/pci/drivers/qla2xxx/bind

If I bounce one a few times, the host hard resets.
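For repeatable testing, the bounce could be wrapped in a small loop, roughly like the sketch below. The function name flap_port, its count and delay arguments, and the DRIVER_DIR override are all made up for illustration; the default path is the qla2xxx one from the commands above. Only run it on a host you are prepared to see reset.

```shell
#!/bin/sh
# Hypothetical helper: unbind and rebind a PCI device from its driver a
# number of times. DRIVER_DIR defaults to the qla2xxx sysfs directory.
flap_port() {
    pci_addr="$1"      # PCI address, e.g. 0000:3c:00.0
    count="${2:-3}"    # number of unbind/bind cycles
    delay="${3:-5}"    # seconds to wait after each step
    dir="${DRIVER_DIR:-/sys/bus/pci/drivers/qla2xxx}"
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "$pci_addr" > "$dir/unbind"   # take the port down
        sleep "$delay"
        echo "$pci_addr" > "$dir/bind"     # bring it back up
        sleep "$delay"
        i=$((i + 1))
    done
}

# Example (as root): flap_port 0000:3c:00.0 3 5
```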

Yes, there are 2, and only 1 flaps at a time, but it's enough to set off the watchdog.

I may look at enabling the hardware watchdog, as it may give me a little more control.

Cheers
D