from softdog to ipmi_watchdog without reboot

woodstock

Renowned Member
Hi everybody,

I'm looking for a way to switch from softdog to the ipmi_watchdog module without restarting my cluster nodes.
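
For context, a quick way to see what is currently active is to list the loaded watchdog modules and check which process holds the device (fuser comes from psmisc and may need to be installed first):

Code:
# which watchdog kernel modules are loaded right now
lsmod | grep -E 'softdog|ipmi_watchdog'

# which process has /dev/watchdog open (should be watchdog-mux)
fuser -v /dev/watchdog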

pveversion:
Code:
proxmox-ve: 4.4-94 (running kernel: 4.4.76-1-pve)
pve-manager: 4.4-18 (running version: 4.4-18/ef2610e8)
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.76-1-pve: 4.4.76-94
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-52
qemu-server: 4.0-112
pve-firmware: 1.1-11
libpve-common-perl: 4.0-96
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-101
pve-firewall: 2.0-33
pve-ha-manager: 1.0-41
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
ceph: 0.94.10-1~bpo80+1


I have already configured the ipmi_watchdog module as described in https://pve.proxmox.com/wiki/High_A...PMI_Watchdog_.28module_.22ipmi_watchdog.22.29

Code:
# cat /etc/default/pve-ha-manager
# select watchdog module (default is softdog)
WATCHDOG_MODULE=ipmi_watchdog

# cat /etc/modprobe.d/ipmi_watchdog.conf
options ipmi_watchdog action=power_cycle panic_wdt_timeout=10

# cat /etc/default/grub
[…]
GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS0,115200n8 nmi_watchdog=0"
[…]
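
As far as I know, the nmi_watchdog=0 part only takes effect on the next boot and requires regenerating the bootloader config first, so it is independent of the module swap below:

Code:
# rebuild /boot/grub/grub.cfg so the changed kernel command line is used on the next boot
update-grub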


I found a forum thread (Post #7 in https://forum.proxmox.com/threads/total-cluster-reboot-corosync-failures.34681/#post-170705) describing a reload of the softdog module with different parameters.
I took this and changed it to:

Code:
# stop watchdog mux clients:
systemctl stop pve-ha-lrm.service pve-ha-crm.service

# stop watchdog multiplexer to allow removing the watchdog module
systemctl stop watchdog-mux.service

# remove softdog module
rmmod softdog

# load ipmi watchdog module
modprobe ipmi_watchdog

# start the services again
systemctl start watchdog-mux.service
systemctl start pve-ha-lrm.service pve-ha-crm.service


Is this the correct way to go?
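
To verify the swap afterwards, I would check that softdog is gone and that watchdog-mux reports a different driver than 'Software Watchdog' (just a sanity check, not a persistence test):

Code:
# softdog should no longer be listed, ipmi_watchdog should be
lsmod | grep -E 'softdog|ipmi_watchdog'

# the most recent "Watchdog driver ..." line should no longer say 'Software Watchdog'
journalctl -u watchdog-mux | grep -i 'watchdog driver' | tail -n 1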

What makes me nervous is that two watchdog devices are listed in /dev:

Code:
# ll /dev/watchdog*
crw------- 1 root root  10, 130 Oct  5 19:07 /dev/watchdog
crw------- 1 root root 250,   0 Oct  5 19:07 /dev/watchdog0


dmesg shows the NMI watchdog:

Code:
# dmesg | grep watch
[    0.470701] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.


But softdog is currently in use:

Code:
# systemctl status watchdog-mux.service
* watchdog-mux.service - Proxmox VE watchdog multiplexer
   Loaded: loaded (/lib/systemd/system/watchdog-mux.service; static)
   Active: active (running) since Thu 2017-10-05 19:07:31 CEST; 16h ago
 Main PID: 2243 (watchdog-mux)
   CGroup: /system.slice/watchdog-mux.service
           └─2243 /usr/sbin/watchdog-mux

Oct 05 19:07:31 node3 systemd[1]: Started Proxmox VE watchdog multiplexer.
Oct 05 19:07:31 node3 watchdog-mux[2243]: Watchdog driver 'Software Watchdog', version 0


Do I have to insert the following line into my procedure?

Code:
echo 0 > /proc/sys/kernel/nmi_watchdog


Hoping for some advice ...
 
You don't need to change the NMI watchdog; depending on your setup, there may be multiple watchdogs active. But without a reboot you will not know whether your settings are persistent, and some hardware watchdogs need a reboot to activate their configuration.

As you are using PVE in a cluster, why not move the guests away from the node and do the reboot? In my opinion, it defeats the purpose of having a cluster in the first place if I can't migrate the guests off a node and reboot the PVE host.
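
For example (VMID 100 and node2 are just placeholders here), a guest can be moved away beforehand, or handed to the HA stack for migration if it is HA-managed:

Code:
# live-migrate a running VM to another cluster node
qm migrate 100 node2 --online

# for an HA-managed guest, request the migration through the HA manager instead
ha-manager migrate vm:100 node2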