[SOLVED] softdog reboots while having quorum

grin

Nov 17 10:09:26 bran corosync[4681]: [TOTEM ] A processor failed, forming new configuration.
Nov 17 10:09:29 bran corosync[4681]: [TOTEM ] A new membership (10.20.30.40:1884) was formed. Members left: 4
Nov 17 10:09:29 bran corosync[4681]: [TOTEM ] Failed to receive the leave message. failed: 4
Nov 17 10:09:29 bran corosync[4681]: [QUORUM] Members[3]: 1 3 2
Nov 17 10:09:29 bran corosync[4681]: [MAIN ] Completed service synchronization, ready to provide service.
...
Nov 17 10:10:04 bran watchdog-mux[2482]: client watchdog expired - disable watchdog updates
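To put a number on the timing in the log above, here is a minimal sketch that computes how long after the membership change the watchdog expired. It only does arithmetic on three of the quoted lines; the year (2016) is an assumption, since syslog timestamps carry none, and the helper names `parse_ts` and `first_ts` are made up for illustration. It does not say anything about what the configured watchdog timeout actually is.

```python
from datetime import datetime

# Three lines copied from the log excerpt above.
LOG = """\
Nov 17 10:09:26 bran corosync[4681]: [TOTEM ] A processor failed, forming new configuration.
Nov 17 10:09:29 bran corosync[4681]: [QUORUM] Members[3]: 1 3 2
Nov 17 10:10:04 bran watchdog-mux[2482]: client watchdog expired - disable watchdog updates
"""


def parse_ts(line, year=2016):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp.

    The year is an assumption (syslog omits it); it does not affect
    the differences computed below.
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def first_ts(lines, needle):
    """Return the timestamp of the first line containing `needle`."""
    for line in lines:
        if needle in line:
            return parse_ts(line)
    raise ValueError(f"no log line contains {needle!r}")


lines = LOG.splitlines()
failed = first_ts(lines, "A processor failed")
quorate = first_ts(lines, "[QUORUM] Members")
expired = first_ts(lines, "client watchdog expired")

# Seconds from each corosync event to the watchdog expiry.
gap_from_failure = (expired - failed).total_seconds()
gap_from_quorum = (expired - quorate).total_seconds()

print(f"failure -> watchdog expired: {gap_from_failure:.0f} s")
print(f"quorum  -> watchdog expired: {gap_from_quorum:.0f} s")
```

So the watchdog client expired 35 seconds after corosync had already reported a quorate 3-node membership, which is the odd part.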


(Versions below. The node was in the middle of an upgrade, hence the 4.2 / 4.3 mismatches.)
proxmox-ve: 4.3-71 (running kernel: 4.4.21-1-pve)
pve-manager: 4.2-18 (running version: 4.2-18/158720b9)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.13-2-pve: 4.4.13-58
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.4.16-1-pve: 4.4.16-64
lvm2: 2.02.164-1
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-86
pve-firmware: 1.1-9
libpve-common-perl: 4.0-79
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-68
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6.1-2
pve-container: 1.0-80
pve-firewall: 2.0-29
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.4-1
lxcfs: 2.0.3-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
zfsutils: 0.6.5.7-pve10~bpo80
ceph: 0.94.9-1~bpo80+1
 
@dietmar additional info:
the current version (Environment 4.3-1/e7cdc165, on the cluster I am reporting from) happily reboots the node when I run either
- systemctl restart corosync
or
- systemctl restart pve-cluster

for no apparent reason (apart from the obvious watchdog timeout), possibly (and hopefully) the same cause you've already fixed.

This is unfortunate for me, since the problem described in https://forum.proxmox.com/threads/all-functions-became-slooow-corosync-problem.30332/ needs a corosync restart to unfsck the node. Instead of being a "quick fix", the restart often reboots the node, costing 3-5 minutes of boot time (smart machine, slow boot; that's called technical advancement).