How does the new watchdog work in Proxmox 4?

Melanxolik

I have 3 identical nodes and bought 3 licenses:
root@cluster-2-1:~# pvecm nodes

Membership information
----------------------
Nodeid Votes Name
1 1 cluster-2-1 (local)
2 1 cluster-2-2
3 1 cluster-2-3
root@cluster-2-1:~#

root@cluster-2-1:~# dmesg |grep -i watch
[ 0.097129] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
[ 2.586248] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
[ 2.636518] IPMI Watchdog: Unable to register misc device
[ 2.636540] IPMI Watchdog: set timeout error: -22
[ 2.636542] IPMI Watchdog: driver initialized
root@cluster-2-1:~#



NODE1:
root@cluster-2-1:~# ipmitool mc watchdog get
Watchdog Timer Use: Reserved (0x00)
Watchdog Timer Is: Stopped
Watchdog Timer Actions: No action (0x00)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x00
Initial Countdown: 0 sec
Present Countdown: 0 sec
root@cluster-2-1:~#

root@cluster-2-1:~# cat /etc/modprobe.d/impi_watchdog.conf
options ipmi_watchdog action=power_cycle start_now=1
root@cluster-2-1:~#

NODE2:
root@cluster-2-2:~# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x00
Initial Countdown: 10 sec
Present Countdown: 9 sec
root@cluster-2-2:~#

root@cluster-2-2:~# cat /etc/modprobe.d/impi_watchdog.conf
options ipmi_watchdog action=power_cycle start_now=1


NODE3:
root@cluster-2-3:~# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x04)
Watchdog Timer Is: Stopped
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x00
Initial Countdown: 900 sec
Present Countdown: 900 sec
root@cluster-2-3:~#

root@cluster-2-3:~# cat /etc/modprobe.d/impi_watchdog.conf
options ipmi_watchdog action=power_cycle start_now=1


These 3 nodes are fully identical; the only differences are the hostname and the Proxmox license.

How can I diagnose whether the watchdog module is working?

Base Board Information
Manufacturer: Supermicro
Product Name: X10SRi-F

I don't understand why the watchdog does not work on nodes 1 and 3.
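
To compare the three nodes at a glance, the `ipmitool mc watchdog get` output shown above can be reduced to the fields that differ. This is only a sketch: the function name `watchdog_summary` is made up for illustration, and it assumes the field labels look like the ones pasted in this post.

```shell
#!/bin/sh
# Summarize `ipmitool mc watchdog get` output read from stdin.
# Prints one line: <timer state>|<action>|<initial countdown>
# Assumes the field labels shown in this thread; the function name is hypothetical.
watchdog_summary() {
    awk -F': *' '
        /^Watchdog Timer Is:/      { state = $2 }
        /^Watchdog Timer Actions:/ { action = $2 }
        /^Initial Countdown:/      { initial = $2 }
        END { printf "%s|%s|%s\n", state, action, initial }
    '
}

# Usage on a node (not executed here):
#   ipmitool mc watchdog get | watchdog_summary
```

Running it on each node and comparing the three lines makes it obvious which nodes have a stopped timer or a different action.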
 
Look in the BIOS, there may be some Watchdog specific settings.

A Supermicro test node I have shows the same dmesg output, and both the watchdog and the
Code:
ipmitool mc watchdog get
command work just fine. Did you reboot after installing ipmitool?

For IPMI watchdogs look at:
http://pve.proxmox.com/wiki/High_Availability_Cluster_4.x#IPMI_Watchdog

You can test the watchdog with:
Code:
echo "A" | socat - UNIX-CONNECT:/var/run/watchdog-mux.sock
About a minute after executing this command, the node should reset.
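
Before running that test, it may help to confirm that the watchdog-mux socket exists at all, since the `socat` command needs it. A minimal sketch follows; the socket path is the one from the command above and the function name `watchdog_mux_ready` is hypothetical.

```shell
#!/bin/sh
# Report whether the watchdog-mux socket is present.
# The default path is taken from the socat test in this thread;
# the function name is made up for illustration.
watchdog_mux_ready() {
    sock="${1:-/var/run/watchdog-mux.sock}"
    if [ -S "$sock" ]; then
        echo "ready: $sock"
    else
        echo "missing: $sock"
        return 1
    fi
}

# Usage on a node (not executed here):
#   watchdog_mux_ready && echo "safe to run the socat test"
```

If the socket is reported missing, the socat test cannot reach the watchdog multiplexer on that node.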
 
I figured out why it did not work.
One of the nodes has:

root@cluster-2-2:~# dpkg -l|grep pve-ha
ii pve-ha-manager 1.0-9 amd64 Proxmox VE HA Manager
root@cluster-2-2:~#




But the other two nodes have:

root@cluster-2-1:~# dpkg -l|grep pve-ha
rc pve-ha-manager 1.0-9 amd64 Proxmox VE HA Manager
root@cluster-2-1:~#

I reinstalled pve-ha-manager and rebooted the nodes. After that I have:
root@cluster-2-1:~# ll /var/run/watchdog-mux.sock
srw------- 1 root root 0 Oct 20 11:26 /var/run/watchdog-mux.sock
root@cluster-2-1:~#
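
The difference between the two `dpkg -l` lines above is the status column: `ii` means the package is installed, while `rc` means it was removed and only its configuration files remain. A small sketch for checking this on every node; the function name `pkg_state` is hypothetical.

```shell
#!/bin/sh
# Classify a package from `dpkg -l` output read on stdin.
# $1 is the package name; the function name is made up for illustration.
pkg_state() {
    awk -v pkg="$1" '
        $2 == pkg {
            found = 1
            if ($1 == "ii")      print "installed"
            else if ($1 == "rc") print "removed, config files remain"
            else                 print "other (" $1 ")"
            exit
        }
        END { if (!found) print "not listed" }
    '
}

# Usage on a node (not executed here):
#   dpkg -l | pkg_state pve-ha-manager
```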


But I don't understand this situation:
root@cluster-2-1:~# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x00
Initial Countdown: 10 sec
Present Countdown: 9 sec
root@cluster-2-1:~#

and on the other node:
root@cluster-2-2:~# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x04)
Watchdog Timer Is: Stopped
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 10 sec
Present Countdown: 0 sec
root@cluster-2-2:~#

If I check the parameter the module was loaded with:
root@cluster-2-2:~# cat /sys/module/ipmi_watchdog/parameters/action
power_cycle
root@cluster-2-2:~#

I don't understand how this works; it's new to me.
 
Does nobody have an answer to this problem?


And I have another problem. If I run:
echo "A" | socat - UNIX-CONNECT:/var/run/watchdog-mux.sock
the node reboots. But if I disable the management network (ifconfig eth1 down), the node does not reboot. I don't understand why the watchdog can't reset the node when the management network is down.

How can I turn on debug mode for Proxmox and the watchdog?
 
The watchdog triggers after 60 seconds of lost quorum, so if the interface that corosync uses is down, the node will reboot (provided an HA-enabled service runs on the cluster; otherwise the watchdog isn't active). Corosync has problems with ifdown/ifup, so pull the plug to test, or kill the corosync process if you do not have physical access to the machine.

Or do you mean the echo test does not work after you run ifdown on eth1? That would be strange; in that case, logs would be appreciated.
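
Since the watchdog depends on quorum, it can be useful to watch the quorum state while testing. The sketch below assumes `pvecm status` prints a line beginning with `Quorate:` followed by Yes or No; the function name `is_quorate` is hypothetical.

```shell
#!/bin/sh
# Extract the quorum state from `pvecm status` output read on stdin.
# Assumes a line of the form "Quorate:   Yes" (or No); prints yes, no, or unknown.
# The function name is made up for illustration.
is_quorate() {
    awk '
        /^Quorate:/ { print tolower($2); found = 1; exit }
        END { if (!found) print "unknown" }
    '
}

# Usage on a node (not executed here):
#   pvecm status | is_quorate
```

If the node still reports quorum after the interface is taken down, the 60-second countdown described above never starts.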
 
