Watchdog Reboots

XN-Matt

Aug 21, 2017
Something we've seen since upgrading to version 9 is an increase in watchdog reboots: in fact, from none to many.

The last few journal entries before the reboot show:

```
Jan 22 04:38:07 hv-5-i pveupdate[2200016]: <root@pam> starting task UPID:hv-5-i:00219207:010ADDDA:6971A9AF:aptupdate::root@pam:
Jan 22 04:38:08 hv-5-i pveupdate[2200071]: update new package list: /var/lib/pve-manager/pkgupdates
Jan 22 04:38:13 hv-5-i pveupdate[2200016]: <root@pam> end task UPID:hv-5-i:00219207:010ADDDA:6971A9AF:aptupdate::root@pam: OK
Jan 22 04:39:09 hv-5-i watchdog-mux[1504]: client watchdog is about to expire
Jan 22 04:39:09 hv-5-i systemd-journald[841]: Received client request to sync journal.
```

The node is not losing network connection as far as we know, but I'm at a loss as to how to isolate the cause.

We even changed from softdog to iTCO_wdt to see if that would help, but it rebooted again today.
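For reference, this is roughly how we made the switch, via the standard Proxmox HA watchdog setting (a sketch; double-check against your own setup before applying):

```shell
# /etc/default/pve-ha-manager
# Tell watchdog-mux to load the Intel TCO hardware watchdog
# instead of the default softdog module.
WATCHDOG_MODULE=iTCO_wdt
```

A reboot is needed afterwards so watchdog-mux picks up the new module; `lsmod | grep -i tco` confirms it loaded.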

It had one active VM at the time, deliberately so, to see whether it would happen under minimal load.

Setup is 4 x ceph nodes and 3 x compute nodes running the VMs. This is one of the latter. Each server has primary/secondary links for WAN, ceph and pve nets. Switch shows no port drops or losses either.

I see this has come up a few times, and disabling the watchdog isn't workable as we use HA in the cluster. This setup had also been ultra-stable for over a year before the upgrade to 9.x.
 
No entries around the time of the reboot.

```
Jan 20 04:04:06 hv-5-i corosync[1826]:   [QUORUM] Sync members[7]: 1 2 3 4 5 6 7
Jan 20 04:04:06 hv-5-i corosync[1826]:   [QUORUM] Sync joined[6]: 1 2 3 4 6 7
Jan 20 04:04:06 hv-5-i corosync[1826]:   [TOTEM ] A new membership (1.759) was formed. Members joined: 1 2 3 4 6 7
Jan 20 04:04:06 hv-5-i corosync[1826]:   [QUORUM] This node is within the primary component and will provide service.
Jan 20 04:04:06 hv-5-i corosync[1826]:   [QUORUM] Members[7]: 1 2 3 4 5 6 7
Jan 20 04:04:06 hv-5-i corosync[1826]:   [MAIN  ] Completed service synchronization, ready to provide service.
-- Boot 6fd6ce7cb48041ec96100b83e762c2cc --
Jan 22 04:42:05 hv-5-i systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Jan 22 04:42:05 hv-5-i (corosync)[1812]: corosync.service: Referenced but unset environment variable evaluates to an empty string: COROSYNC_OPTIONS
Jan 22 04:42:05 hv-5-i corosync[1812]:   [MAIN  ] Corosync Cluster Engine  starting up
Jan 22 04:42:05 hv-5-i corosync[1812]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog augeas systemd xmlconf vqsim nozzle snmp pie relro bindnow
Jan 22 04:42:05 hv-5-i corosync[1812]:   [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Jan 22 04:42:05 hv-5-i corosync[1812]:   [MAIN  ] Please migrate config file to nodelist.
```

The server rebooted around 04:39:09 on 22nd Jan.

Networks/NICs are as described above, but they do not appear to be the issue.
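Since a hard watchdog reset often loses the final seconds of logging, we've also been making sure the journal is persistent and pulling the previous boot explicitly (standard systemd tooling, nothing Proxmox-specific):

```shell
# If /var/log/journal exists, Storage=auto keeps logs across reboots;
# otherwise set Storage=persistent in journald.conf.
mkdir -p /var/log/journal
grep -E '^\s*Storage=' /etc/systemd/journald.conf
systemctl restart systemd-journald

# Jump to the end of the previous boot's journal
journalctl -b -1 -e
```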