4.1 HA software watchdog reset does not work

liska_

Hi,
I am trying the new Proxmox 4.1 in a test environment without hardware watchdogs.
Unfortunately, I cannot make a failed node restart.
I tried a few commands I found on this forum, such as
ifconfig vmbr1 down
kill -9 corosync

Each time, the node got disconnected but stayed powered on. Sometimes I got these errors:
watchdog update failed - Broken pipe
pve-ha-lrm lost lock 'ha_agent_sun_lock - can't get cfs lock
unable to write lrm status file - unable to open file '/etc/pve/nodes/sun/lrm_status.tmp.2610' - Device or resource busy

or these
<code>
Jan 5 12:35:04 sun pve-ha-lrm[3028]: successfully acquired lock 'ha_agent_sun_lock'
Jan 5 12:35:04 sun pve-ha-lrm[3028]: watchdog active
Jan 5 12:35:04 sun pve-ha-lrm[3028]: status change wait_for_agent_lock => active
Jan 5 12:35:04 sun watchdog-mux[5077]: watchdog set timeout: Invalid argument
Jan 5 12:35:04 sun systemd[1]: watchdog-mux.service: main process exited, code=exited, status=1/FAILURE
Jan 5 12:35:04 sun systemd[1]: Unit watchdog-mux.service entered failed state.
Jan 5 12:35:04 sun watchdog-mux[5080]: watchdog set timeout: Invalid argument
Jan 5 12:35:04 sun systemd[1]: watchdog-mux.service: main process exited, code=exited, status=1/FAILURE
Jan 5 12:35:04 sun systemd[1]: Unit watchdog-mux.service entered failed state.
Jan 5 12:35:04 sun pve-ha-lrm[5078]: starting service ct:161
Jan 5 12:35:04 sun watchdog-mux[5082]: watchdog set timeout: Invalid argument
Jan 5 12:35:04 sun systemd[1]: watchdog-mux.service: main process exited, code=exited, status=1/FAILURE
Jan 5 12:35:04 sun systemd[1]: Unit watchdog-mux.service entered failed state.
Jan 5 12:35:04 sun watchdog-mux[5085]: watchdog set timeout: Invalid argument
Jan 5 12:35:04 sun systemd[1]: watchdog-mux.service: main process exited, code=exited, status=1/FAILURE
Jan 5 12:35:04 sun systemd[1]: Unit watchdog-mux.service entered failed state.
</code>

When I tried
echo "A" | socat - UNIX-CONNECT:/var/run/watchdog-mux.sock
I got error: socat[9367] E connect(5, AF=1 "/var/run/watchdog-mux.sock", 28): Connection refused

Where can I find more information about why a node is not restarting?
Thanks a lot for your help
 
This is the output of pveversion -v:
<code>
proxmox-ve: 4.1-30 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-4 (running version: 4.1-4/ccba54b0)
pve-kernel-2.6.32-27-pve: 2.6.32-121
pve-kernel-4.2.6-1-pve: 4.2.6-30
pve-kernel-3.10.0-1-pve: 3.10.0-5
pve-kernel-2.6.32-23-pve: 2.6.32-109
pve-kernel-4.2.0-1-pve: 4.2.0-13
pve-kernel-4.2.3-1-pve: 4.2.3-18
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-30
qemu-server: 4.0-43
pve-firmware: 1.1-7
libpve-common-perl: 4.0-42
libpve-access-control: 4.0-10
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-18
pve-container: 1.0-36
pve-firewall: 2.0-14
pve-ha-manager: 1.0-16
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 0.13-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5.2-2
fence-agents-pve: 4.0.20-1
</code>

Fence agents were missing on two of the three nodes, but nothing changed after I installed them. All of these nodes were upgraded from older versions of PVE as described on the PVE wiki.
 
What kind of watchdog driver do you use? Normally 'softdog' is used, but it seems you loaded something else. Check with dmesg, or send the output of

# lsmod
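
If the full listing is too long to read through, a filter along these lines could pick out likely watchdog drivers. This is only a minimal sketch: watchdog_modules is a made-up helper name, and the name patterns are an assumption, not a complete list of watchdog modules.

```shell
#!/bin/sh
# List loaded kernel modules whose names look like watchdog drivers.
# Reads lsmod-style output on stdin, so it can be tried on saved output too.
# ASSUMPTION: watchdog_modules is a hypothetical helper, and the name
# patterns below are illustrative rather than exhaustive.
watchdog_modules() {
    # Drop the "Module Size Used by" header, keep only the module names.
    awk '$1 != "Module" { print $1 }' |
        grep -E -i '(softdog|watchdog|wdt)' || true
}

# Usage: lsmod | watchdog_modules
```

An empty result would only mean that no module with such a name is loaded; it says nothing about drivers built into the kernel.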
 
I wanted to try just the software watchdog, as these nodes are just desktops. I did not change anything from the default config.
In dmesg I can see this line on all nodes:
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter

This is the lsmod output from one node (the other nodes are similar), but I can't find anything related to a watchdog:
<code>
veth 16384 0
ip_set 45056 0
ip6table_filter 16384 0
ip6_tables 28672 1 ip6table_filter
binfmt_misc 20480 1
iptable_filter 16384 0
ip_tables 28672 1 iptable_filter
x_tables 36864 4 ip6table_filter,ip_tables,iptable_filter,ip6_tables
nfsd 319488 13
auth_rpcgss 61440 1 nfsd
nfs_acl 16384 1 nfsd
nfs 258048 0
lockd 94208 2 nfs,nfsd
grace 16384 2 nfsd,lockd
fscache 65536 1 nfs
sunrpc 331776 19 nfs,nfsd,auth_rpcgss,lockd,nfs_acl
ib_iser 53248 0
rdma_cm 45056 1 ib_iser
iw_cm 45056 1 rdma_cm
ib_cm 45056 1 rdma_cm
ib_sa 32768 2 rdma_cm,ib_cm
ib_mad 49152 2 ib_cm,ib_sa
ib_core 102400 6 rdma_cm,ib_cm,ib_sa,iw_cm,ib_mad,ib_iser
ib_addr 20480 2 rdma_cm,ib_core
iscsi_tcp 20480 0
libiscsi_tcp 24576 1 iscsi_tcp
libiscsi 57344 3 libiscsi_tcp,iscsi_tcp,ib_iser
scsi_transport_iscsi 98304 4 iscsi_tcp,ib_iser,libiscsi
nfnetlink_log 20480 1
nfnetlink 16384 3 nfnetlink_log,ip_set
ses 20480 0
enclosure 16384 1 ses
zfs 2813952 5
zunicode 331776 1 zfs
zcommon 57344 1 zfs
znvpair 90112 2 zfs,zcommon
spl 102400 3 zfs,zcommon,znvpair
zavl 16384 1 zfs
snd_hda_codec_hdmi 49152 1
ppdev 20480 0
intel_rapl 20480 0
iosf_mbi 16384 1 intel_rapl
x86_pkg_temp_thermal 16384 0
intel_powerclamp 16384 0
kvm_intel 167936 3
kvm 516096 1 kvm_intel
crct10dif_pclmul 16384 0
snd_hda_codec_realtek 86016 1
crc32_pclmul 16384 0
snd_hda_codec_generic 73728 1 snd_hda_codec_realtek
aesni_intel 167936 0
aes_x86_64 20480 1 aesni_intel
lrw 16384 1 aesni_intel
gf128mul 16384 1 lrw
glue_helper 16384 1 aesni_intel
ablk_helper 16384 1 aesni_intel
cryptd 20480 2 aesni_intel,ablk_helper
i915 1138688 2
snd_hda_intel 36864 0
psmouse 126976 0
snd_hda_codec 135168 4 snd_hda_codec_realtek,snd_hda_codec_hdmi,snd_hda_codec_generic,snd_hda_intel
serio_raw 16384 0
pcspkr 16384 0
snd_hda_core 65536 5 snd_hda_codec_realtek,snd_hda_codec_hdmi,snd_hda_codec_generic,snd_hda_codec,snd_hda_intel
drm_kms_helper 126976 1 i915
snd_hwdep 16384 1 snd_hda_codec
snd_pcm 102400 4 snd_hda_codec_hdmi,snd_hda_codec,snd_hda_intel,snd_hda_core
snd_timer 32768 1 snd_pcm
mei_me 36864 0
drm 356352 3 i915,drm_kms_helper
snd 86016 8 snd_hda_codec_realtek,snd_hwdep,snd_timer,snd_hda_codec_hdmi,snd_pcm,snd_hda_codec_generic,snd_hda_codec,snd_hda_intel
mei 102400 1 mei_me
soundcore 16384 1 snd
i2c_i801 24576 0
i2c_algo_bit 16384 1 i915
shpchp 36864 0
lpc_ich 24576 0
parport_pc 32768 0
parport 49152 2 ppdev,parport_pc
8250_fintek 16384 0
soc_button_array 16384 0
video 36864 1 i915
mac_hid 16384 0
tpm_infineon 20480 0
vhost_net 20480 0
vhost 36864 1 vhost_net
macvtap 20480 1 vhost_net
macvlan 24576 1 macvtap
it87 49152 0
hwmon_vid 16384 1 it87
coretemp 16384 0
autofs4 40960 2
uas 24576 0
usb_storage 69632 1 uas
ahci 36864 7
libahci 32768 1 ahci
e1000e 237568 0
ptp 20480 1 e1000e
pps_core 20480 1 ptp
</code>
 
Strange, not even the softdog module is loaded. Does the watchdog-mux service start correctly?

# systemctl status watchdog-mux.service
 
<code>
systemctl status watchdog-mux.service
● watchdog-mux.service - Proxmox VE watchdog multiplexer
Loaded: loaded (/lib/systemd/system/watchdog-mux.service; static)
Active: failed (Result: start-limit) since Tue 2016-01-05 12:35:04 CET; 1 day 2h ago
Process: 5087 ExecStart=/usr/sbin/watchdog-mux (code=exited, status=1/FAILURE)
Main PID: 5087 (code=exited, status=1/FAILURE)

Jan 05 12:35:04 sun systemd[1]: watchdog-mux.service: main process exited, code=exited, status=1/FAILURE
Jan 05 12:35:04 sun systemd[1]: Unit watchdog-mux.service entered failed state.
Jan 05 12:35:04 sun watchdog-mux[5087]: watchdog set timeout: Invalid argument
Jan 05 12:35:04 sun systemd[1]: watchdog-mux.service start request repeated too quickly, refusing to start.
Jan 05 12:35:04 sun systemd[1]: Failed to start Proxmox VE watchdog multiplexer.
</code>

And when I tried to restart this service
<code>
● watchdog-mux.service - Proxmox VE watchdog multiplexer
Loaded: loaded (/lib/systemd/system/watchdog-mux.service; static)
Active: failed (Result: exit-code) since Wed 2016-01-06 14:56:15 CET; 1s ago
Process: 29844 ExecStart=/usr/sbin/watchdog-mux (code=exited, status=1/FAILURE)
Main PID: 29844 (code=exited, status=1/FAILURE)

Jan 06 14:56:15 sun watchdog-mux[29844]: watchdog set timeout: Invalid argument
Jan 06 14:56:15 sun systemd[1]: watchdog-mux.service: main process exited, code=exited, status=1/FAILURE
Jan 06 14:56:15 sun systemd[1]: Unit watchdog-mux.service entered failed state.
</code>
 
Why is the softdog module not loaded? Can you load it manually?

# modprobe softdog

Can you start the watchdog-mux service after that?
 
As I said, I have no idea unfortunately;(
lsmod | grep dog
softdog 16384 0

In syslog I can see
kernel: [ 391.830718] softdog: Software Watchdog Timer: 0.08 initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)

I tried adding this module to /etc/modules and rebooting, but it did not get loaded and there are no logs about it. These logs are the same before and after the reboot:

<code>
systemctl status watchdog-mux.service
● watchdog-mux.service - Proxmox VE watchdog multiplexer
Loaded: loaded (/lib/systemd/system/watchdog-mux.service; static)
Active: failed (Result: exit-code) since Thu 2016-01-07 10:05:04 CET; 3s ago
Process: 21697 ExecStart=/usr/sbin/watchdog-mux (code=exited, status=1/FAILURE)
Main PID: 21697 (code=exited, status=1/FAILURE)

Jan 07 10:05:04 sun watchdog-mux[21697]: watchdog set timeout: Invalid argument
Jan 07 10:05:04 sun systemd[1]: watchdog-mux.service: main process exited, code=exited, status=1/FAILURE
Jan 07 10:05:04 sun systemd[1]: Unit watchdog-mux.service entered failed state.
</code>

Thanks a lot for your help
 
If you want to force loading it via /etc/modules,
you need to remove the softdog module from the blacklist file

/lib/modprobe.d/blacklist_pve-kernel-4.2.x-x-pve.conf
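
To see which files actually blacklist the module, a search over the modprobe configuration directories could look like this. It is a minimal sketch: find_blacklist is a made-up helper name, and the directories in the usage line are my assumption about where modprobe reads its configuration on a Debian-based system.

```shell
#!/bin/sh
# Print every *.conf file in the given directories that blacklists a module.
# ASSUMPTION: find_blacklist is a hypothetical helper, not a Proxmox tool.
find_blacklist() {
    module=$1
    shift
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        # Match lines like "blacklist softdog"; commented-out entries
        # (starting with "#") do not match.
        grep -l -E "^[[:space:]]*blacklist[[:space:]]+${module}([[:space:]]|\$)" \
            "$dir"/*.conf 2>/dev/null
    done
    return 0
}

# Usage: find_blacklist softdog /etc/modprobe.d /lib/modprobe.d
```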
 
Great, that solved my problem. I just removed it from the blacklist and restarted the node, and after killing corosync the node got restarted.
But why is it blacklisted by default if it is needed? It is like that on every node.
And what happens after a kernel upgrade? Will the softdog module be blacklisted again?

<code>
systemctl status watchdog-mux.service
● watchdog-mux.service - Proxmox VE watchdog multiplexer
Loaded: loaded (/lib/systemd/system/watchdog-mux.service; static)
Active: active (running) since Thu 2016-01-07 14:04:08 CET; 1min 46s ago
Main PID: 2454 (watchdog-mux)
CGroup: /system.slice/watchdog-mux.service
└─2454 /usr/sbin/watchdog-mux

Jan 07 14:04:08 sun watchdog-mux[2454]: Watchdog driver 'Software Watchdog', version 0

</code>

Thank you very much Spirit
 
Are you sure you have not defined any watchdog in
/etc/default/pve-ha-manager
?

Because the muxer does not load softdog if another module is defined in this config file:

if (stat(WATCHDOG_DEV, &fs) == -1) {
    char *wd_module = getenv("WATCHDOG_MODULE");
    if (wd_module) {
        char *cmd = NULL;
        if ((asprintf(&cmd, "modprobe -q %s", wd_module) == -1)) {
            perror("assemble modprobe command failed");
            exit(EXIT_FAILURE);
        }
        system(cmd);
    } else {
        system("modprobe -q softdog"); // load softdog by default
    }
}
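
Read as shell, the logic quoted above amounts to roughly the following. Note that nothing is loaded at all when the watchdog device node already exists. This is a minimal sketch and not the muxer's actual code: module_to_load is a made-up name, and I am assuming that WATCHDOG_DEV is /dev/watchdog and that the config file consists of plain KEY=VALUE lines.

```shell
#!/bin/sh
# Print the module the quoted logic would try to load, or nothing when the
# watchdog device already exists (in that case no modprobe is run).
# ASSUMPTION: module_to_load is a hypothetical helper; the config file is
# treated as plain KEY=VALUE lines.
module_to_load() {
    dev=$1
    conf=$2
    if [ -e "$dev" ]; then
        return 0
    fi
    module=""
    if [ -r "$conf" ]; then
        # Take the last uncommented WATCHDOG_MODULE= assignment, if any.
        module=$(sed -n 's/^[[:space:]]*WATCHDOG_MODULE=//p' "$conf" | tail -n 1)
    fi
    # Fall back to softdog when no module is configured.
    echo "${module:-softdog}"
}

# Usage (device path is an assumption):
#   module_to_load /dev/watchdog /etc/default/pve-ha-manager
```

With the default config file, where the WATCHDOG_MODULE line is commented out, this prints softdog as long as the device node is missing.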
 
This is my /etc/default/pve-ha-manager file; it's the same on all nodes. I just wanted to try softdog, so I did not change anything there.

<code>
# select watchdog module (default is softdog)
#WATCHDOG_MODULE=ipmi_watchdog
</code>

Is there any place where I could find any logs regarding this?
 
I don't think there are any logs from the muxer.

The muxer service should run "modprobe -q softdog" to load the module. (The blacklist does not block a manual modprobe, only automatic loading from /etc/modules.)

So, if you can run "modprobe -q softdog", I don't see any reason why the muxer can't do it...
 
Yeah, that works, and there are no logs anywhere. It is working now; we will see after the upgrade.
 
Hi,
unfortunately it has stopped working again. I googled for a solution but have not found anything about this.
I tried pulling the cable out, killing corosync, and shutting down the interface, but no restart happened.

When I restarted the watchdog service with the softdog module not blacklisted, I got this:
<code>
Jan 18 12:32:20 sun systemd[1]: Cannot add dependency job for unit watchdog-mux.socket, ignoring: Unit watchdog-mux.socket failed to load: No such file or directory.
Jan 18 12:32:20 sun watchdog-mux[3300]: got terminate request
Jan 18 12:32:20 sun watchdog-mux[3300]: clean exit
Jan 18 12:32:20 sun watchdog-mux[3512]: Watchdog driver 'Software Watchdog', version 0
sun kernel: [ 67.816056] softdog: Software Watchdog Timer: 0.08 initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
</code>
The command modprobe -q softdog does not produce any output, and watchdog-mux.service is running.

When I added softdog to the blacklist again, the service went into a failed state and these messages appeared in syslog:
<code>
Jan 18 13:23:49 sun systemd[1]: Cannot add dependency job for unit watchdog-mux.socket, ignoring: Unit watchdog-mux.socket failed to load: No such file or directory.
Jan 18 13:23:49 sun watchdog-mux[5115]: watchdog set timeout: Invalid argument
Jan 18 13:23:49 sun systemd[1]: watchdog-mux.service: main process exited, code=exited, status=1/FAILURE
Jan 18 13:23:49 sun systemd[1]: Unit watchdog-mux.service entered failed state.
</code>

I have the latest updates from pve-no-subscription applied:
proxmox-ve: 4.1-33 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-5 (running version: 4.1-5/f910ef5c)

Thanks a lot for your help
 
Hi,

I'm also getting these in my logs:

systemd[1]: Cannot add dependency job for unit watchdog-mux.socket, ignoring: Unit watchdog-mux.socket failed to load: No such file or directory.
systemd[1]: Failed to reset devices.list on /system.slice: Invalid argument

# systemctl status watchdog-mux.service
● watchdog-mux.service - Proxmox VE watchdog multiplexer
Loaded: loaded (/lib/systemd/system/watchdog-mux.service; static)
Active: active (running) since Sun 2016-01-24 21:29:03 EET; 4h 13min ago
Main PID: 2079 (watchdog-mux)
CGroup: /system.slice/watchdog-mux.service
└─2079 /usr/sbin/watchdog-mux

Jan 24 21:29:04 dmz01 watchdog-mux[2079]: Watchdog driver 'Software Watchdog', version 0

# pveversion -v
proxmox-ve: 4.1-34 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-5 (running version: 4.1-5/f910ef5c)
pve-kernel-4.2.6-1-pve: 4.2.6-34
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-31
qemu-server: 4.0-47
pve-firmware: 1.1-7
libpve-common-perl: 4.0-45
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-3
pve-container: 1.0-39
pve-firewall: 2.0-15
pve-ha-manager: 1.0-19
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-6
lxcfs: 0.13-pve3
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie
drbdmanage: 0.91-1
 
The watchdog-mux.socket unit was removed in the latest updates:

https://git.proxmox.com/?p=pve-ha-manager.git;a=commit;h=f8a3fc80af299e613c21c9b67e29aee8cc807018

Maybe it has not been disabled on your nodes.
(systemctl disable watchdog-mux.socket should fix this warning.)
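
To check whether a leftover link is what triggers the warning, something like the following could list symlinks for the unit whose target no longer exists. It is a minimal sketch: dangling_unit_links is a made-up helper name, and the idea that a stale enablement symlink causes the message is my assumption, not something confirmed in this thread.

```shell
#!/bin/sh
# List dangling symlinks for a unit name below a systemd configuration tree.
# ASSUMPTION: dangling_unit_links is a hypothetical helper for illustration.
dangling_unit_links() {
    unit=$1
    dir=$2
    [ -d "$dir" ] || return 0
    # -type l selects symlinks; "test -e" follows the link and fails when
    # the target file no longer exists.
    find "$dir" -type l -name "$unit" ! -exec test -e {} \; -print
}

# Usage: dangling_unit_links watchdog-mux.socket /etc/systemd/system
```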
 
