[SOLVED] softdog reboots while having quorum

grin

Nov 17 10:09:26 bran corosync[4681]: [TOTEM ] A processor failed, forming new configuration.
Nov 17 10:09:29 bran corosync[4681]: [TOTEM ] A new membership (10.20.30.40:1884) was formed. Members left: 4
Nov 17 10:09:29 bran corosync[4681]: [TOTEM ] Failed to receive the leave message. failed: 4
Nov 17 10:09:29 bran corosync[4681]: [QUORUM] Members[3]: 1 3 2
Nov 17 10:09:29 bran corosync[4681]: [MAIN ] Completed service synchronization, ready to provide service.
...
Nov 17 10:10:04 bran watchdog-mux[2482]: client watchdog expired - disable watchdog updates
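
For anyone comparing their own logs: a rough Python sketch that measures how long after the membership change the watchdog client expired, using three of the lines above. The year is a placeholder (syslog lines carry none), and the helper names `stamp` and `first_match` are made up for illustration; this only does timestamp arithmetic and says nothing about why the watchdog expired.

```python
from datetime import datetime

# Three lines copied from the log excerpt above.
LOG = """\
Nov 17 10:09:26 bran corosync[4681]: [TOTEM ] A processor failed, forming new configuration.
Nov 17 10:09:29 bran corosync[4681]: [QUORUM] Members[3]: 1 3 2
Nov 17 10:10:04 bran watchdog-mux[2482]: client watchdog expired - disable watchdog updates
"""

# Assumption: the log has no year, so any fixed year works for same-day differences.
ASSUMED_YEAR = 2016


def stamp(line):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp of a line."""
    return datetime.strptime(f"{ASSUMED_YEAR} {line[:15]}", "%Y %b %d %H:%M:%S")


def first_match(lines, needle):
    """Return the timestamp of the first line containing needle, or None."""
    for line in lines:
        if needle in line:
            return stamp(line)
    return None


lines = LOG.splitlines()
failed = first_match(lines, "A processor failed")
quorum = first_match(lines, "[QUORUM] Members")
expired = first_match(lines, "client watchdog expired")

# Seconds between the two corosync events and the watchdog expiry.
gap_from_failure = (expired - failed).total_seconds()
gap_from_quorum = (expired - quorum).total_seconds()

print(f"processor failed -> watchdog expired: {gap_from_failure:.0f} s")
print(f"quorum re-formed -> watchdog expired: {gap_from_quorum:.0f} s")
```

For the excerpt above this gives 38 s from the failed processor and 35 s from the new 3-member quorum, i.e. the watchdog expired well after quorum had been re-established.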


(Versions below. The node was in the middle of an upgrade, which is why 4.2 and 4.3 packages are mixed.)
proxmox-ve: 4.3-71 (running kernel: 4.4.21-1-pve)
pve-manager: 4.2-18 (running version: 4.2-18/158720b9)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.13-2-pve: 4.4.13-58
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.4.16-1-pve: 4.4.16-64
lvm2: 2.02.164-1
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-86
pve-firmware: 1.1-9
libpve-common-perl: 4.0-79
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-68
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6.1-2
pve-container: 1.0-80
pve-firewall: 2.0-29
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.4-1
lxcfs: 2.0.3-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
zfsutils: 0.6.5.7-pve10~bpo80
ceph: 0.94.9-1~bpo80+1
 
@dietmar additional info:
The current version on the cluster I am reporting from (Environment 4.3-1/e7cdc165) happily reboots the node when I run either of:
- systemctl restart corosync
- systemctl restart pve-cluster

There is no apparent reason for it apart from the obvious watchdog timeout; possibly (and hopefully) it is the same cause you have already fixed.

This is unfortunate for me, because the problem described in https://forum.proxmox.com/threads/all-functions-became-slooow-corosync-problem.30332/ needs a corosync restart to unstick the node. Instead of being a "quick fix", the restart often reboots the node, which costs 3-5 minutes of boot time (smart machine, slow boot; that's called technical advancement).
 
