Watchdog Reboots

XN-Matt

Aug 21, 2017
Something we've seen since upgrading to version 9 is an increase in watchdog reboots: in fact, from none to many.

The last few journal entries before the reboot show:

```
Jan 22 04:38:07 hv-5-i pveupdate[2200016]: <root@pam> starting task UPID:hv-5-i:00219207:010ADDDA:6971A9AF:aptupdate::root@pam:
Jan 22 04:38:08 hv-5-i pveupdate[2200071]: update new package list: /var/lib/pve-manager/pkgupdates
Jan 22 04:38:13 hv-5-i pveupdate[2200016]: <root@pam> end task UPID:hv-5-i:00219207:010ADDDA:6971A9AF:aptupdate::root@pam: OK
Jan 22 04:39:09 hv-5-i watchdog-mux[1504]: client watchdog is about to expire
Jan 22 04:39:09 hv-5-i systemd-journald[841]: Received client request to sync journal.
```

The node is not losing network connection as far as we know, but I'm at a loss as to how to isolate the cause.

We even changed from softdog to iTCO_wdt to see if that would help, but it rebooted again today.
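For reference, this is roughly how we made the switch, via the standard Proxmox HA watchdog setting (a sketch; double-check against your own setup before applying):

```shell
# /etc/default/pve-ha-manager
# Tell watchdog-mux to load the Intel TCO hardware watchdog
# instead of the default softdog module.
WATCHDOG_MODULE=iTCO_wdt
```

A reboot is needed afterwards so watchdog-mux picks up the new module; `lsmod | grep -i tco` confirms it loaded.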

It had one active VM at the time, deliberately so, to see whether it would happen under minimal load.

Setup is 4 x ceph nodes and 3 x compute nodes running the VMs. This is one of the latter. Each server has primary/secondary links for WAN, ceph and pve nets. Switch shows no port drops or losses either.

I see this has come up a few times, and disabling the watchdog isn't workable as we use HA in the cluster. This setup had also been ultra-stable for over a year before the upgrade to 9.x.
 
No entries around the time of the reboot.

```
Jan 20 04:04:06 hv-5-i corosync[1826]:   [QUORUM] Sync members[7]: 1 2 3 4 5 6 7
Jan 20 04:04:06 hv-5-i corosync[1826]:   [QUORUM] Sync joined[6]: 1 2 3 4 6 7
Jan 20 04:04:06 hv-5-i corosync[1826]:   [TOTEM ] A new membership (1.759) was formed. Members joined: 1 2 3 4 6 7
Jan 20 04:04:06 hv-5-i corosync[1826]:   [QUORUM] This node is within the primary component and will provide service.
Jan 20 04:04:06 hv-5-i corosync[1826]:   [QUORUM] Members[7]: 1 2 3 4 5 6 7
Jan 20 04:04:06 hv-5-i corosync[1826]:   [MAIN  ] Completed service synchronization, ready to provide service.
-- Boot 6fd6ce7cb48041ec96100b83e762c2cc --
Jan 22 04:42:05 hv-5-i systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Jan 22 04:42:05 hv-5-i (corosync)[1812]: corosync.service: Referenced but unset environment variable evaluates to an empty string: COROSYNC_OPTIONS
Jan 22 04:42:05 hv-5-i corosync[1812]:   [MAIN  ] Corosync Cluster Engine  starting up
Jan 22 04:42:05 hv-5-i corosync[1812]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog augeas systemd xmlconf vqsim nozzle snmp pie relro bindnow
Jan 22 04:42:05 hv-5-i corosync[1812]:   [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Jan 22 04:42:05 hv-5-i corosync[1812]:   [MAIN  ] Please migrate config file to nodelist.
```

The server rebooted around 04:39:09 on 22nd Jan.

Networks/NICs are as described above, but they do not appear to be the issue.
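Since a hard watchdog reset often loses the final seconds of logging, we've also been making sure the journal is persistent and pulling the previous boot explicitly (standard systemd tooling, nothing Proxmox-specific):

```shell
# If /var/log/journal exists, Storage=auto keeps logs across reboots;
# otherwise set Storage=persistent in journald.conf.
mkdir -p /var/log/journal
grep -E '^\s*Storage=' /etc/systemd/journald.conf
systemctl restart systemd-journald

# Jump to the end of the previous boot's journal
journalctl -b -1 -e
```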