Proxmox watchdog - how to increase countdown time

m4rek11

Well-Known Member
Jan 3, 2020
Hello

I have a problem with the watchdog countdown time resetting.

I have enabled the watchdog by following: https://pve.proxmox.com/wiki/High_A...x#Dell_IDrac_.28module_.22ipmi_watchdog.22.29

I got:
WATCHDOG_MODULE=ipmi_watchdog
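
For reference, on my node this is set in /etc/default/pve-ha-manager (the file the wiki section above refers to), and I verified after a reboot that the module is actually loaded, roughly like this:

Code:
grep WATCHDOG_MODULE /etc/default/pve-ha-manager
lsmod | grep ipmi_watchdog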

The default settings are:

Code:
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      10 sec
Present Countdown:      9 sec

I would like to increase the "Initial Countdown", so I run:
Code:
ipmiutil wdt -a 1 -t 300

After that, the output is:

Code:
ipmiutil wdt ver 3.17
-- BMC version 2.65, IPMI version 2.0
wdt data: 44 03 00 00 64 00 60 00
Watchdog timer is started for use with SMS/OS. Logging
               pretimeout is 0 seconds, pre-action is None
               timeout is 10 seconds, counter is 9 seconds
               action is Power cycle
Setting watchdog timer to 300 seconds ...
wdt data: 44 01 00 00 b8 0b b8 0b
Watchdog timer is started for use with SMS/OS. Logging
               pretimeout is 0 seconds, pre-action is None
               timeout is 300 seconds, counter is 300 seconds
               action is Hard Reset

and with the command ipmitool mc watchdog get the results are:

Code:
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      300 sec
Present Countdown:      299 sec

The time was changed - that's OK.

But after stopping the watchdog-mux service, the countdown starts from 10 seconds, not from 300, and the settings were reset to 10 seconds again.
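
For reference, this is roughly how I checked it (standard service name on PVE, output omitted):

Code:
systemctl stop watchdog-mux
ipmitool mc watchdog get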

So, my question is: how can I permanently increase the time to 300 seconds?

Yours faithfully,
Marek.
 
You can't change it, because HA restarts the VM on another node after around 1 minute.

(so if you forced the watchdog to 300s, the VM could be started on 2 nodes at the same time, causing data corruption)
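
If you want to watch that recovery happen, the HA stack's view is available with the usual tools (run on a node that stays up; the exact output depends on your cluster):

Code:
ha-manager status
journalctl -u pve-ha-crm -u pve-ha-lrm -f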

Thank you for your answer.
 
Interesting. I'm still wondering if anything has changed; my nodes restart after they lose their network connection.
 
That's the expected and intended behavior for a PVE cluster with active High Availability.

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing

Do you have a question?
Yes, I do have one. I've added a redundant link for Corosync, hoping that if Link 0 goes down, the cluster could rely on Link 1, which should prevent the watchdog from restarting the nodes. I guess I was wrong; it didn't turn out that way... Is there a way to achieve this?

Take, for instance, 5 nodes, with Link 0 being 172.16.0.0/24 and Link 1 on 192.168.0.0/24, and High Availability configured. If Link 0 goes down, communication can still be achieved on Link 1, so I don't see the point of the watchdog initiating a restart.
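
For context, each node entry in my corosync.conf looks roughly like this (node name and host addresses are placeholders for this example):

Code:
node {
  name: pve1
  nodeid: 1
  quorum_votes: 1
  ring0_addr: 172.16.0.1
  ring1_addr: 192.168.0.1
}

and on my cluster the totem section also has a matching interface block with linknumber: 1.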
 

Yes, that's the way it should work :-)

Of course this requires both links to be on independent wires and intact. Take a look at your current status to verify, like so:
Code:
~# corosync-cfgtool  -n
Local node ID 8, transport knet
nodeid: 2 reachable
   LINK: 0 udp (10.3.16.8->10.3.16.9) enabled connected mtu: 1397
   LINK: 1 udp (10.11.16.8->10.11.16.9) enabled connected mtu: 1397

nodeid: 4 reachable
   LINK: 0 udp (10.3.16.8->10.3.16.10) enabled connected mtu: 1397
   LINK: 1 udp (10.11.16.8->10.11.16.10) enabled connected mtu: 1397
...
and so on
ALL nodes on your cluster do list two "enabled connected" lines, right?

If one of these rings (= all NICs on ONE link) gets cut off, my cluster will stay online. This is the way it works for me.
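
Besides -n, the local per-link status and the overall quorum state are worth a quick look as well (output omitted, it differs per setup):

Code:
~# corosync-cfgtool -s
~# pvecm status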
 
I've been running a series of simulations. It seems that if I configure both links during the creation of the cluster, I get the same results you have; however, if I add the second link after the creation of the cluster by editing the corosync.conf file, Link 1 shows disconnected. Let me run it one more time to be sure I didn't mess anything up. I have to configure this for production, so I want to make sure I get it right before I f**** it.
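
In case it helps someone else, the rough procedure I'm following to add the second link afterwards (based on the admin guide; the addresses are just my example ranges):

Code:
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
# edit corosync.conf.new: add a ring1_addr line to every node entry
# and increase config_version by 1, then activate it:
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf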