Hi everybody,
I'm looking for a way to switch from softdog to the IPMI watchdog (ipmi_watchdog module) without rebooting my cluster nodes.
pveversion:
Code:
proxmox-ve: 4.4-94 (running kernel: 4.4.76-1-pve)
pve-manager: 4.4-18 (running version: 4.4-18/ef2610e8)
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.76-1-pve: 4.4.76-94
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-52
qemu-server: 4.0-112
pve-firmware: 1.1-11
libpve-common-perl: 4.0-96
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-101
pve-firewall: 2.0-33
pve-ha-manager: 1.0-41
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
ceph: 0.94.10-1~bpo80+1
I already configured the ipmi module as mentioned in https://pve.proxmox.com/wiki/High_A...PMI_Watchdog_.28module_.22ipmi_watchdog.22.29
Code:
# cat /etc/default/pve-ha-manager
# select watchdog module (default is softdog)
WATCHDOG_MODULE=ipmi_watchdog
# cat /etc/modprobe.d/ipmi_watchdog.conf
options ipmi_watchdog action=power_cycle panic_wdt_timeout=10
# cat /etc/default/grub
[…]
GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS0,115200n8 nmi_watchdog=0"
[…]
I found a forum thread (Post #7 in https://forum.proxmox.com/threads/total-cluster-reboot-corosync-failures.34681/#post-170705) describing a reload of the softdog module with different parameters.
I took this and changed it to:
Code:
# stop watchdog mux clients:
systemctl stop pve-ha-lrm.service pve-ha-crm.service
# stop watchdog multiplexer to allow removing the watchdog module
systemctl stop watchdog-mux.service
# remove softdog module
rmmod softdog
# load ipmi watchdog module
modprobe ipmi_watchdog
# start services again
systemctl start watchdog-mux.service
systemctl start pve-ha-lrm.service pve-ha-crm.service
Is this the correct way to go?
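To rehearse the sequence before touching a live node, I put together the dry-run sketch below. It only prints the commands from my procedure unless DRY_RUN=0 is set, and it stops at the first step that fails. The `run` helper, the `switch_watchdog` function and the `DRY_RUN` variable are my own illustrative names, not part of Proxmox, and I have not verified that the sequence itself is safe on a node with active HA resources.

```shell
#!/bin/sh
# Dry-run wrapper around the switch procedure (illustrative sketch).
# With DRY_RUN=1 (the default) every command is only printed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Same order as the manual procedure; abort on the first failing step.
switch_watchdog() {
    run systemctl stop pve-ha-lrm.service pve-ha-crm.service || return 1
    run systemctl stop watchdog-mux.service || return 1
    run rmmod softdog || return 1
    run modprobe ipmi_watchdog || return 1
    run systemctl start watchdog-mux.service || return 1
    run systemctl start pve-ha-lrm.service pve-ha-crm.service
}

switch_watchdog
```

Running it with DRY_RUN=0 would execute the real commands, so that should only be done once the order has been confirmed.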
What makes me nervous is that two watchdog devices are listed in /dev:
Code:
# ll /dev/watchdog*
crw------- 1 root root 10, 130 Oct 5 19:07 /dev/watchdog
crw------- 1 root root 250, 0 Oct 5 19:07 /dev/watchdog0
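As far as I can tell, the two nodes do not have to mean two separate watchdogs: major 10, minor 130 is the fixed misc-device number of the legacy /dev/watchdog node, while /dev/watchdog0 is the node the kernel's watchdog core creates with a dynamic major, and both can refer to the same driver. The small sketch below just classifies a node by the numbers shown in the listing; the function name `classify_watchdog_node` is my own and the interpretation is my assumption, so please correct me if it is wrong.

```shell
#!/bin/sh
# Classify a watchdog device node by its major/minor numbers (decimal).
# 10:130 is the fixed misc-device number of the legacy /dev/watchdog node;
# anything else is treated here as a /dev/watchdogN node with a dynamic major,
# where the minor number is the watchdog index N.
classify_watchdog_node() {
    major="$1"
    minor="$2"
    if [ "$major" -eq 10 ] && [ "$minor" -eq 130 ]; then
        echo "legacy misc node"
    else
        echo "watchdog core node, index $minor"
    fi
}

# Numbers taken from the listing above.
classify_watchdog_node 10 130
classify_watchdog_node 250 0
```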
dmesg shows the NMI watchdog:
Code:
# dmesg | grep watch
[ 0.470701] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
But softdog is currently in use:
Code:
# systemctl status watchdog-mux.service
* watchdog-mux.service - Proxmox VE watchdog multiplexer
Loaded: loaded (/lib/systemd/system/watchdog-mux.service; static)
Active: active (running) since Thu 2017-10-05 19:07:31 CEST; 16h ago
Main PID: 2243 (watchdog-mux)
CGroup: /system.slice/watchdog-mux.service
└─2243 /usr/sbin/watchdog-mux
Oct 05 19:07:31 node3 systemd[1]: Started Proxmox VE watchdog multiplexer.
Oct 05 19:07:31 node3 watchdog-mux[2243]: Watchdog driver 'Software Watchdog', version 0
Do I have to insert the following line into my procedure?
Code:
echo 0 > /proc/sys/kernel/nmi_watchdog
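Before adding that line, the current state can be read back from the same file. The sketch below does only that and changes nothing; the function name `nmi_watchdog_state` and the optional path argument (there so it can be tried on a test file) are my own assumptions.

```shell
#!/bin/sh
# Report the NMI watchdog state from a sysctl-style file (read-only check).
# The optional argument overrides the path; the default is the real sysctl file.
nmi_watchdog_state() {
    file="${1:-/proc/sys/kernel/nmi_watchdog}"
    if [ ! -r "$file" ]; then
        echo "unknown"
        return 1
    fi
    value=$(cat "$file")
    case "$value" in
        0) echo "disabled" ;;
        1) echo "enabled" ;;
        *) echo "unknown" ;;
    esac
}

# Print the state on this machine; ignore the error if the file is missing.
nmi_watchdog_state || true
```

If this prints "enabled" after the grub change, the nmi_watchdog=0 kernel parameter has presumably not taken effect yet because the node has not been rebooted.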
Hoping for some advice ...