Something we've seen since upgrading to version 9 is an increase in watchdog reboots - in fact, from none to many.
The last few entries of the journal show:
```
Jan 22 04:39:09 hv-5-i watchdog-mux[1504]: client watchdog is about to expire
Jan 22 04:39:09 hv-5-i systemd-journald[841]: Received client request to sync journal.
Jan 22 04:38:07 hv-5-i pveupdate[2200016]: <root@pam> starting task UPID:hv-5-i:00219207:010ADDDA:6971A9AF:aptupdate::root@pam:
Jan 22 04:38:08 hv-5-i pveupdate[2200071]: update new package list: /var/lib/pve-manager/pkgupdates
Jan 22 04:38:13 hv-5-i pveupdate[2200016]: <root@pam> end task UPID:hv-5-i:00219207:010ADDDA:6971A9AF:aptupdate::root@pam: OK
```
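For reference, pulling more context from the boot that got fenced is just standard journalctl against the previous boot (unit names per the stock PVE services):

```
# last 200 entries from the previous boot, i.e. the one the watchdog killed
journalctl -b -1 -n 200 --no-pager

# watchdog and HA manager messages from that same boot
journalctl -b -1 -u watchdog-mux -u pve-ha-lrm -u pve-ha-crm --no-pager
```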
As far as we know the node is not losing network connectivity, but I'm at a loss as to how to isolate the cause.
We even changed from softdog to iTCO_wdt to see if that would help, but the node rebooted again today.
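The module switch was done via /etc/default/pve-ha-manager, which is how the HA stack expects it - roughly this, plus a quick check after reboot that the Intel TCO driver really owns the device (in case we got it wrong):

```
# /etc/default/pve-ha-manager
WATCHDOG_MODULE=iTCO_wdt
```

```
# confirm which driver is loaded and what the device reports
lsmod | grep -e softdog -e iTCO
wdctl /dev/watchdog0
```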
It had only one active VM at the time - deliberately, to see whether it would still happen under minimal load.
The setup is 4 x Ceph nodes and 3 x compute nodes running the VMs; this node is one of the latter. Each server has primary/secondary links for the WAN, Ceph, and PVE networks, and the switch shows no port drops or losses on any of them.
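As far as we understand it, watchdog-mux only lets the timer expire when pve-ha-lrm stops feeding it, which usually points at corosync/quorum trouble rather than a literal NIC drop - so alongside the switch counters we've started watching the knet links too. Something like:

```
# quorum and membership as the cluster sees it
pvecm status

# per-link status of the corosync knet links
corosync-cfgtool -s

# link flaps or token timeouts from the boot that fenced
journalctl -b -1 -u corosync --no-pager | grep -iE 'link|token|knet'
```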
I see this has come up a few times, and disabling the watchdog isn't workable for us since we use HA in the cluster - but this setup had also been ultra stable for over a year before the upgrade to 9.x.