Watchdog Reboots

XN-Matt

Something we've seen in version 9 is an increase in watchdog reboots - in fact from none to many.

Last few entries of the journal show:

```
Jan 22 04:39:09 hv-5-i watchdog-mux[1504]: client watchdog is about to expire
Jan 22 04:39:09 hv-5-i systemd-journald[841]: Received client request to sync journal.
Jan 22 04:38:07 hv-5-i pveupdate[2200016]: <root@pam> starting task UPID:hv-5-i:00219207:010ADDDA:6971A9AF:aptupdate::root@pam:
Jan 22 04:38:08 hv-5-i pveupdate[2200071]: update new package list: /var/lib/pve-manager/pkgupdates
Jan 22 04:38:13 hv-5-i pveupdate[2200016]: <root@pam> end task UPID:hv-5-i:00219207:010ADDDA:6971A9AF:aptupdate::root@pam: OK
```

The node is not losing network connection as far as we know, but I'm at a loss as to how we can isolate the cause.

We even changed from softdog to iTCO_wdt to see if that would help, but the node rebooted again today.
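
For reference, this is roughly how we double-check which module is actually in play (assuming the standard setup where the module is selected via /etc/default/pve-ha-manager):

```
# Rough sanity check of the watchdog setup (paths/unit names assumed to be the PVE defaults)
grep -v '^#' /etc/default/pve-ha-manager   # e.g. WATCHDOG_MODULE=iTCO_wdt
lsmod | grep -E 'softdog|iTCO_wdt'         # which watchdog module is actually loaded
journalctl -b -u watchdog-mux              # watchdog-mux messages for the current boot
```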

It had only one active VM at the time, deliberately, to see whether it would still happen under minimal load.

The setup is 4 x Ceph nodes and 3 x compute nodes running the VMs; this is one of the latter. Each server has primary/secondary links for the WAN, Ceph and PVE networks. The switch shows no port drops or losses either.

I see this has come up a few times, and disabling the watchdog isn't workable as we use HA in a cluster. This setup had also been ultra-stable for over a year before the upgrade to 9.x.
 
No entries around the time of the reboot.

Code:
Jan 20 04:04:06 hv-5-i corosync[1826]:   [QUORUM] Sync members[7]: 1 2 3 4 5 6 7
Jan 20 04:04:06 hv-5-i corosync[1826]:   [QUORUM] Sync joined[6]: 1 2 3 4 6 7
Jan 20 04:04:06 hv-5-i corosync[1826]:   [TOTEM ] A new membership (1.759) was formed. Members joined: 1 2 3 4 6 7
Jan 20 04:04:06 hv-5-i corosync[1826]:   [QUORUM] This node is within the primary component and will provide service.
Jan 20 04:04:06 hv-5-i corosync[1826]:   [QUORUM] Members[7]: 1 2 3 4 5 6 7
Jan 20 04:04:06 hv-5-i corosync[1826]:   [MAIN  ] Completed service synchronization, ready to provide service.
-- Boot 6fd6ce7cb48041ec96100b83e762c2cc --
Jan 22 04:42:05 hv-5-i systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Jan 22 04:42:05 hv-5-i (corosync)[1812]: corosync.service: Referenced but unset environment variable evaluates to an empty string: COROSYNC_OPTIONS
Jan 22 04:42:05 hv-5-i corosync[1812]:   [MAIN  ] Corosync Cluster Engine  starting up
Jan 22 04:42:05 hv-5-i corosync[1812]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog augeas systemd xmlconf vqsim nozzle snmp pie relro bindnow
Jan 22 04:42:05 hv-5-i corosync[1812]:   [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Jan 22 04:42:05 hv-5-i corosync[1812]:   [MAIN  ] Please migrate config file to nodelist.

Server rebooted around 04:39:09 on 22nd Jan.

Networks/NICs are as described, but this does not appear to be the issue.
 
We are experiencing the same issue as described by XN-Matt while upgrading the nodes in a cluster one by one from 8.4 to 9.1. Just after rebooting one of the nodes as part of completing the dist-upgrade, almost all other nodes (a mix of v8 and v9) in the cluster fenced themselves:

Code:
2026-01-22T15:10:47+0100 bt-01-node-b-wp-a watchdog-mux[2119]: client watchdog expired - disable watchdog updates
2026-01-22T15:10:47+0100 bt-01-node-b-wp-a watchdog-mux[2119]: client watchdog expired - disable watchdog updates
2026-01-22T15:10:47+0100 bt-01-node-c-wp-a watchdog-mux[2149]: client watchdog expired - disable watchdog updates
2026-01-22T15:10:37+0100 tp-01-node-c-wp-a watchdog-mux[1705]: client watchdog is about to expire
2026-01-22T15:10:47+0100 tp-01-node-c-wp-a watchdog-mux[1705]: client watchdog expired - disable watchdog updates
2026-01-22T15:10:38+0100 tp-01-node-d-wp-a watchdog-mux[1724]: client watchdog is about to expire
2026-01-22T15:10:48+0100 tp-01-node-d-wp-a watchdog-mux[1724]: client watchdog expired - disable watchdog updates
2026-01-22T15:10:37+0100 tp-02-node-a-wp-a watchdog-mux[1761]: client watchdog is about to expire
2026-01-22T15:10:47+0100 tp-02-node-a-wp-a watchdog-mux[1761]: client watchdog expired - disable watchdog updates
2026-01-22T15:10:41+0100 tp-02-node-b-wp-a watchdog-mux[1748]: client watchdog is about to expire
2026-01-22T15:10:51+0100 tp-02-node-b-wp-a watchdog-mux[1748]: client watchdog expired - disable watchdog updates
2026-01-22T15:10:37+0100 tp-02-node-c-wp-a watchdog-mux[1726]: client watchdog is about to expire
2026-01-22T15:10:47+0100 tp-02-node-c-wp-a watchdog-mux[1726]: client watchdog expired - disable watchdog updates

It's also strange that on some nodes `watchdog-mux` didn't even log that it was about to expire.
 
Now you mention it, we saw that on our last upgrade.

Nodes dying during apt-get dist-upgrade. Never seen that before. We just thought it was because they were busy.
 
We have had this same issue since replacing our Intel-based nodes with AMD ones. Lately we have unexpected reboots at least weekly on one or more nodes.
For us this always happens during the backup window (lucky?), and we see high IO delay right before the PVE host fences itself.
We are still working on narrowing down the actual cause, but with essentially no logging from softdog/the watchdog stack it is very hard.
 
It would be nice to get some Proxmox involvement, as this is clearly an issue; this thread and others report the same problem.
 
We are solely Intel, so I don't think the chipset is relevant.
Good to know this affects everyone equally :-)

It would be nice to get some Proxmox involvement, as this is clearly an issue; this thread and others report the same problem.

We have had discussions on this topic in the past on this forum. It would be nice to have a way to see the softdog status, and to get logging of when the watchdog clients (CRM/LRM) decide NOT to ping the watchdog, for whatever reason.
So far this whole thing is a big black box (at least to me), with servers rebooting at what looks like random.
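
The closest workaround for visibility I know of is tailing the journal for the warnings that do get logged; a rough watch, assuming the default PVE unit/identifier names seen elsewhere in this thread:

Code:
journalctl -f -t watchdog-mux -t pve-ha-crm -t pve-ha-lrm | grep --line-buffered -E 'about to expire|expired|loop take too long'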
 
It seems there were some issues fixed in the last week in ha-manager's update loop, which is what keeps the watchdog timer updated:

  • Bug 7133 - pve-ha-crm: if many HA resources are defined, migration from HA groups to rules may delay update loop (commit)
    • The call to update_service_config(...) for the HA resources without
      group assignments cause unnecessary updates to the config and can become
      costly with higher HA resource counts, which might prevent the CRM to
      update its watchdog in time, so skip these updates.
  • manager: group migration: bulk update changes to resource config (commit)
    • The migration process from HA groups to HA rules might require a lot of
      small updates to individual HA resource configs. These updates have been
      done per-HA resource, which is quite inefficient and can cause the CRM
      to fail to update its watchdog in time.

During one of our incidents one of the nodes actually logged something about this:


Code:
jan 22 15:09:14 tp-01-node-a-wp-a pve-ha-crm[2699]: loop take too long (56 seconds)
jan 22 15:15:19 tp-01-node-a-wp-a pve-ha-crm[2699]: loop take too long (360 seconds)
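
For anyone who wants to check their own nodes for the same pattern, a rough journal query like this should surface it (identifiers taken from the logs above; assumes a persistent journal, which these nodes clearly have):

Code:
journalctl --since "14 days ago" -t pve-ha-crm -t pve-ha-lrm | grep 'loop take too long'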
 
Interesting. I wonder when those issues were introduced. The number of VMs and HA resources we have has not changed since these issues started occurring.

Our issues started mid-Nov.
 
Please provide the full logs preceding the unexpected reboot (at least the 10 minutes prior), as well as pveversion -v output, and state whether the reboot happened during an upgrade. Thanks!
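
For example, something along these lines should do (fill in a window starting roughly 10 minutes before the reboot):

Code:
journalctl -b -1 --since "YYYY-MM-DD HH:MM" > watchdog-reboot.log
pveversion -v > pveversion.txt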
 
Full logs as in the whole of the journal, or just specific to a process?

This usually happens out of the blue, but it has occurred during an upgrade too.

pveversion -v

Code:
proxmox-ve: 9.1.0 (running kernel: 6.17.4-2-pve)
pve-manager: 9.1.4 (running version: 9.1.4/5ac30304265fbd8e)
proxmox-kernel-helper: 9.0.4
proxmox-kernel-6.17.4-2-pve-signed: 6.17.4-2
proxmox-kernel-6.17: 6.17.4-2
proxmox-kernel-6.17.2-1-pve-signed: 6.17.2-1
proxmox-kernel-6.8: 6.8.12-16
proxmox-kernel-6.8.12-16-pve-signed: 6.8.12-16
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 19.2.3-pve2
corosync: 3.1.9-pve2
criu: 4.1.1-1
frr-pythontools: 10.4.1-1+pve1
ifupdown2: 3.3.0-1+pmx11
intel-microcode: 3.20251111.1~deb13u1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.1
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.5
libpve-apiclient-perl: 3.4.2
libpve-cluster-api-perl: 9.0.7
libpve-cluster-perl: 9.0.7
libpve-common-perl: 9.1.4
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.5
libpve-network-perl: 1.2.4
libpve-rs-perl: 0.11.4
libpve-storage-perl: 9.1.0
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.5-3
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.1.1-1
proxmox-backup-file-restore: 4.1.1-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.1
proxmox-kernel-helper: 9.0.4
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.3
proxmox-widget-toolkit: 5.1.5
pve-cluster: 9.0.7
pve-container: 6.0.18
pve-docs: 9.1.2
pve-edk2-firmware: 4.2025.05-2
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.4
pve-firmware: 3.17-2
pve-ha-manager: 5.1.0
pve-i18n: 3.6.6
pve-qemu-kvm: 10.1.2-5
pve-xtermjs: 5.5.0-3
qemu-server: 9.1.3
smartmontools: 7.4-pve1
spiceterm: 3.4.1
swtpm: 0.8.0+pve3
vncterm: 1.9.1
zfsutils-linux: 2.3.4-pve1
 
20 minutes before the last reboot:

Code:
Jan 22 04:18:52 hv-5-i pvestatd[1861]: status update time (6.001 seconds)
Jan 22 04:19:02 hv-5-i pvestatd[1861]: status update time (6.327 seconds)
Jan 22 04:19:19 hv-5-i systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
Jan 22 04:19:19 hv-5-i systemd-tmpfiles[2185931]: /usr/lib/tmpfiles.d/legacy.conf:14: Duplicate line for path "/run/lock", ignoring.
Jan 22 04:19:19 hv-5-i systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Jan 22 04:19:19 hv-5-i systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
Jan 22 04:21:52 hv-5-i pvestatd[1861]: status update time (6.356 seconds)
Jan 22 04:26:36 hv-5-i pmxcfs[1661]: [status] notice: received log
Jan 22 04:26:43 hv-5-i pmxcfs[1661]: [status] notice: received log
Jan 22 04:31:36 hv-5-i pmxcfs[1661]: [status] notice: received log
Jan 22 04:31:42 hv-5-i pmxcfs[1661]: [status] notice: received log
Jan 22 04:34:05 hv-5-i smartd[1495]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 61 to 60
Jan 22 04:34:05 hv-5-i smartd[1495]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 64 to 59
Jan 22 04:34:05 hv-5-i smartd[1495]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 64 to 69
Jan 22 04:34:05 hv-5-i smartd[1495]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 66 to 67
Jan 22 04:35:05 hv-5-i systemd[1]: Starting man-db.service - Daily man-db regeneration...
Jan 22 04:35:26 hv-5-i systemd[1]: man-db.service: Deactivated successfully.
Jan 22 04:35:26 hv-5-i systemd[1]: Finished man-db.service - Daily man-db regeneration.
Jan 22 04:38:05 hv-5-i systemd[1]: Starting pve-daily-update.service - Daily PVE download activities...
Jan 22 04:39:09 hv-5-i watchdog-mux[1504]: client watchdog is about to expire
Jan 22 04:39:09 hv-5-i systemd-journald[841]: Received client request to sync journal.
Jan 22 04:38:07 hv-5-i pveupdate[2200016]: <root@pam> starting task UPID:hv-5-i:00219207:010ADDDA:6971A9AF:aptupdate::root@pam:
Jan 22 04:38:08 hv-5-i pveupdate[2200071]: update new package list: /var/lib/pve-manager/pkgupdates
Jan 22 04:38:13 hv-5-i pveupdate[2200016]: <root@pam> end task UPID:hv-5-i:00219207:010ADDDA:6971A9AF:aptupdate::root@pam: OK

And one prior to that:

Code:

Jan 18 03:59:19 hv-5-i pmxcfs[1575]: [status] notice: received log
Jan 18 03:59:20 hv-5-i pmxcfs[1575]: [status] notice: received log
Jan 18 04:00:01 hv-5-i pmxcfs[1575]: [status] notice: received log
Jan 18 04:00:01 hv-5-i pmxcfs[1575]: [status] notice: received log
Jan 18 04:00:01 hv-5-i pmxcfs[1575]: [status] notice: received log
Jan 18 04:00:04 hv-5-i pmxcfs[1575]: [status] notice: received log
Jan 18 04:00:05 hv-5-i pvescheduler[1171714]: <root@pam> starting task UPID:hv-5-i:0011E103:08356242:696C5AC5:vzdump::root@pam:
Jan 18 04:00:05 hv-5-i pvescheduler[1171715]: INFO: starting new backup job: vzdump --all 1 --prune-backups 'keep-last=2,keep-monthly=1,kee>
Jan 18 04:00:05 hv-5-i pvescheduler[1171715]: INFO: Starting Backup of VM 117 (qemu)
Jan 18 04:00:09 hv-5-i pmxcfs[1575]: [status] notice: received log
Jan 18 04:00:09 hv-5-i pmxcfs[1575]: [status] notice: received log
Jan 18 04:00:43 hv-5-i pmxcfs[1575]: [status] notice: received log
Jan 18 04:00:44 hv-5-i pmxcfs[1575]: [status] notice: received log
Jan 18 04:01:02 hv-5-i pvescheduler[1171715]: INFO: Finished Backup of VM 117 (00:00:57)
Jan 18 04:01:02 hv-5-i pvescheduler[1171715]: INFO: Starting Backup of VM 120 (qemu)
Jan 18 04:02:39 hv-5-i pvestatd[1883]: status update time (9.305 seconds)
Jan 18 04:03:52 hv-5-i systemd[1]: Starting pve-daily-update.service - Daily PVE download activities...
Jan 18 04:03:53 hv-5-i pveupdate[1174857]: <root@pam> starting task UPID:hv-5-i:0011ED50:0835BB50:696C5BA9:aptupdate::root@pam:
Jan 18 04:03:55 hv-5-i pveupdate[1174857]: <root@pam> end task UPID:hv-5-i:0011ED50:0835BB50:696C5BA9:aptupdate::root@pam: command 'apt-get>
Jan 18 04:03:55 hv-5-i systemd[1]: pve-daily-update.service: Deactivated successfully.
Jan 18 04:03:55 hv-5-i systemd[1]: Finished pve-daily-update.service - Daily PVE download activities.
Jan 18 04:03:55 hv-5-i systemd[1]: pve-daily-update.service: Consumed 2.557s CPU time, 143M memory peak.
Jan 18 04:05:16 hv-5-i watchdog-mux[1675]: client watchdog is about to expire
 
Was there a backup running in the first case as well?
 
Logs don’t show any.

Just to note, this issue occurs at any time of the day. Our backups only run at very specific times. There is no correlation.
 
`journalctl -b -1` (previous boot log), cleaned up and anonymized, from ~15 minutes before the restart.

Code:
Jan 24 01:49:28 pve25 vzdump[2026007]: <root@pam> starting task UPID:pve25:001EEA18:06F4E8D3:69741718:vzdump::root@pam:
Jan 24 01:49:29 pve25 vzdump[2026008]: INFO: starting new backup job: vzdump 157501 xx xx xx xx xx xx xx --mailnotification failure --mailto a@b.c --quiet 1 --storage backup-CXX --prune-backups 'keep-all=1' --mode snapshot --remove 0 --lockwait 30
Jan 24 01:49:29 pve25 vzdump[2026008]: INFO: Starting Backup of VM 157501 (qemu)
Jan 24 01:49:50 pve25 pvestatd[2942]: status update time (34.464 seconds)
Jan 24 01:50:24 pve25 pvestatd[2942]: status update time (33.775 seconds)
Jan 24 01:50:58 pve25 pvestatd[2942]: status update time (33.765 seconds)
Jan 24 01:51:01 pve25 CRON[2085012]: (root) CMD (fa-pve ping-health-status >/dev/null 2>&1 || true)
Jan 24 01:51:12 pve25 runuser[2085212]: pam_unix(runuser:session): session opened for user alert(uid=984) by root(uid=0)
Jan 24 01:51:12 pve25 runuser[2085212]: pam_unix(runuser:session): session closed for user alert
Jan 24 01:51:32 pve25 pvestatd[2942]: status update time (34.293 seconds)
Jan 24 01:52:06 pve25 pvestatd[2942]: status update time (34.036 seconds)
Jan 24 01:52:21 pve25 sshd-session[2124486]: Connection closed by xx.xx.xx.xx port 33380 [preauth]
Jan 24 01:52:41 pve25 pvestatd[2942]: status update time (34.557 seconds)
Jan 24 01:53:15 pve25 pvestatd[2942]: status update time (33.919 seconds)
Jan 24 01:53:49 pve25 pvestatd[2942]: status update time (33.755 seconds)
Jan 24 01:54:01 pve25 CRON[2183561]: (root) CMD (fa-pve ping-health-status >/dev/null 2>&1 || true)
Jan 24 01:54:13 pve25 runuser[2188956]: pam_unix(runuser:session): session opened for user alert(uid=984) by root(uid=0)
Jan 24 01:54:13 pve25 runuser[2188956]: pam_unix(runuser:session): session closed for user alert
Jan 24 01:54:23 pve25 pvestatd[2942]: status update time (34.033 seconds)
Jan 24 01:54:58 pve25 pvestatd[2942]: status update time (35.261 seconds)
Jan 24 01:55:01 pve25 CRON[2222882]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jan 24 01:55:32 pve25 pvestatd[2942]: status update time (34.334 seconds)
Jan 24 01:56:07 pve25 pvestatd[2942]: status update time (35.012 seconds)
Jan 24 01:56:16 pve25 pmxcfs[2723]: [status] notice: received log
Jan 24 01:56:43 pve25 pvestatd[2942]: status update time (35.749 seconds)
Jan 24 01:57:01 pve25 CRON[2281991]: (root) CMD (fa-pve ping-health-status >/dev/null 2>&1 || true)
Jan 24 01:57:13 pve25 runuser[2294779]: pam_unix(runuser:session): session opened for user alert(uid=984) by root(uid=0)
Jan 24 01:57:13 pve25 runuser[2294779]: pam_unix(runuser:session): session closed for user alert
Jan 24 01:57:17 pve25 pvestatd[2942]: status update time (34.327 seconds)
Jan 24 01:57:21 pve25 sshd-session[2301539]: Connection closed by xx.xx.xx.xx port 42064 [preauth]
Jan 24 01:57:53 pve25 pvestatd[2942]: status update time (35.266 seconds)
Jan 24 01:58:27 pve25 pvestatd[2942]: status update time (34.926 seconds)
Jan 24 01:59:02 pve25 pvestatd[2942]: status update time (34.959 seconds)
Jan 24 01:59:21 pve25 pve-ha-lrm[2360716]: VM 157501 qmp command failed - VM 157501 qmp command 'query-status' failed - got timeout
Jan 24 01:59:21 pve25 pve-ha-lrm[2360716]: VM 157501 qmp command 'query-status' failed - got timeout
Jan 24 01:59:31 pve25 pve-ha-lrm[2360811]: VM 157501 qmp command failed - VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 01:59:31 pve25 pve-ha-lrm[2360811]: VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 01:59:40 pve25 pvestatd[2942]: status update time (37.769 seconds)
Jan 24 01:59:42 pve25 pve-ha-lrm[2374993]: VM 157501 qmp command failed - VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 01:59:42 pve25 pve-ha-lrm[2374993]: VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 01:59:51 pve25 pve-ha-lrm[2380334]: VM 157501 qmp command failed - VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 01:59:51 pve25 pve-ha-lrm[2380334]: VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 02:00:01 pve25 CRON[2380531]: (root) CMD (fa-pve ping-health-status >/dev/null 2>&1 || true)
Jan 24 02:00:01 pve25 pve-ha-lrm[2380457]: VM 157501 qmp command failed - VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 02:00:01 pve25 pve-ha-lrm[2380457]: VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 02:00:11 pve25 pve-ha-lrm[2384229]: VM 157501 qmp command failed - VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 02:00:11 pve25 pve-ha-lrm[2384229]: VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 02:00:13 pve25 runuser[2393971]: pam_unix(runuser:session): session opened for user alert(uid=984) by root(uid=0)
Jan 24 02:00:13 pve25 runuser[2393971]: pam_unix(runuser:session): session closed for user alert
Jan 24 02:00:17 pve25 pvestatd[2942]: status update time (36.646 seconds)
Jan 24 02:00:22 pve25 pve-ha-lrm[2400007]: VM 157501 qmp command failed - VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 02:00:22 pve25 pve-ha-lrm[2400007]: VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 02:00:31 pve25 pve-ha-lrm[2400268]: VM 157501 qmp command failed - VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 02:00:31 pve25 pve-ha-lrm[2400268]: VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 02:00:41 pve25 pve-ha-lrm[2400439]: VM 157501 qmp command failed - VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 02:00:41 pve25 pve-ha-lrm[2400439]: VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 02:00:51 pve25 pve-ha-lrm[2408069]: VM 157501 qmp command failed - VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 02:00:51 pve25 pve-ha-lrm[2408069]: VM 157501 qmp command 'query-status' failed - unable to connect to VM 157501 qmp socket - timeout after 51 retries
Jan 24 02:00:54 pve25 pvestatd[2942]: status update time (37.213 seconds)
Jan 24 02:00:54 pve25 vzdump[2026008]: INFO: Finished Backup of VM 157501 (00:11:25)
Jan 24 02:00:54 pve25 vzdump[2026008]: INFO: Starting Backup of VM 157502 (qemu)
Jan 24 02:01:01 pve25 CRON[2420196]: (root) CMD (if which fa-pve >/dev/null 2>&1; then fa-pve set-qemu-swappiness >/dev/null ; fi)
Jan 24 02:01:12 pve25 vzdump[2026008]: INFO: Finished Backup of VM 157502 (00:00:18)
Jan 24 02:01:12 pve25 vzdump[2026008]: INFO: Starting Backup of VM 157503 (qemu)
Jan 24 02:01:25 pve25 pveproxy[997981]: worker exit
Jan 24 02:01:25 pve25 pveproxy[12371]: worker 997981 finished
Jan 24 02:01:25 pve25 pveproxy[12371]: starting 1 worker(s)
Jan 24 02:01:25 pve25 pveproxy[12371]: worker 2434688 started
Jan 24 02:01:29 pve25 pvestatd[2942]: status update time (34.631 seconds)
Jan 24 02:01:56 pve25 vzdump[2026008]: INFO: Finished Backup of VM 157503 (00:00:44)
Jan 24 02:01:56 pve25 vzdump[2026008]: INFO: Starting Backup of VM 157504 (qemu)
Jan 24 02:02:04 pve25 pvestatd[2942]: status update time (34.946 seconds)
Jan 24 02:02:21 pve25 sshd-session[2460082]: Connection closed by xx.xx.xx.xx port 34958 [preauth]
Jan 24 02:02:39 pve25 pvestatd[2942]: status update time (34.883 seconds)
Jan 24 02:02:55 pve25 vzdump[2026008]: INFO: Finished Backup of VM 157504 (00:00:59)
Jan 24 02:02:55 pve25 vzdump[2026008]: INFO: Starting Backup of VM 157505 (qemu)
Jan 24 02:03:01 pve25 CRON[2480261]: (root) CMD (fa-pve ping-health-status >/dev/null 2>&1 || true)
Jan 24 02:03:13 pve25 runuser[2497436]: pam_unix(runuser:session): session opened for user alert(uid=984) by root(uid=0)
Jan 24 02:03:13 pve25 runuser[2497436]: pam_unix(runuser:session): session closed for user alert
Jan 24 02:03:16 pve25 pvestatd[2942]: status update time (37.505 seconds)
Jan 24 02:04:07 pve25 watchdog-mux[2282]: client watchdog is about to expire
Jan 24 02:04:07 pve25 systemd-journald[1639]: Received client request to sync journal.
Jan 24 02:04:09 pve25 pveproxy[1294414]: problem with client ::ffff:xx.xx.xx.xx; Connection reset by peer
Jan 24 02:04:09 pve25 pveproxy[2434688]: problem with client ::ffff:xx.xx.xx.xx; Connection reset by peer
Jan 24 02:04:09 pve25 prometheus-node-exporter[2271]: time=2026-01-24T01:04:09.243Z level=ERROR source=http.go:225 msg="error encoding and sending metric family: write tcp xx.xx.xx.xx:9100->xx.xx.xx.xx:40384: write: broken pipe"
--- snip more prometheus-node-exporter messages ---
Jan 24 02:04:09 pve25 pveproxy[859413]: problem with client ::ffff:xx.xx.xx.xx; Connection reset by peer
Jan 24 02:04:09 pve25 pve-firewall[2941]: firewall update time (44.933 seconds)
Jan 24 02:04:13 pve25 pve-ha-crm[2979]: loop take too long (58 seconds)
Jan 24 02:04:13 pve25 rsyslogd[2278]: rsyslogd[internal_messages]: 327 messages lost due to rate-limiting (500 allowed within 5 seconds)
Jan 24 02:04:13 pve25 vzdump[2026008]: INFO: Finished Backup of VM 157505 (00:01:18)
Jan 24 02:04:13 pve25 vzdump[2026008]: INFO: Starting Backup of VM 157506 (qemu)
Jan 24 02:04:13 pve25 corosync[2889]:   [MAIN  ] Q empty, queued:0 sent:3706.
Jan 24 02:04:14 pve25 watchdog-mux[2282]: client watchdog was updated before expiring
Jan 24 02:04:23 pve25 pve-ha-lrm[12386]: loop take too long (57 seconds)
Jan 24 02:04:25 pve25 pvestatd[2942]: status update time (68.770 seconds)
Jan 24 02:05:14 pve25 watchdog-mux[2282]: client watchdog is about to expire
Jan 24 02:05:14 pve25 systemd-journald[1639]: Received client request to sync journal.
Jan 24 02:05:24 pve25 watchdog-mux[2282]: client watchdog expired - disable watchdog updates
Jan 24 02:05:25 pve25 watchdog-mux[2282]: exit watchdog-mux with active connections
Jan 24 02:05:25 pve25 systemd-journald[1639]: Received client request to sync journal.
Jan 24 02:05:25 pve25 kernel: watchdog: watchdog0: watchdog did not stop!
Jan 24 02:05:25 pve25 systemd[1]: watchdog-mux.service: Deactivated successfully.
Jan 24 02:05:25 pve25 systemd[1]: watchdog-mux.service: Consumed 32.346s CPU time, 2.6M memory peak.
Jan 24 02:05:28 pve25 pveproxy[1294414]: problem with client ::ffff:xx.xx.xx.xx; Connection reset by peer
Jan 24 02:05:28 pve25 pveproxy[2434688]: problem with client ::ffff:xx.xx.xx.xx; Connection reset by peer
Jan 24 02:05:28 pve25 pveproxy[859413]: problem with client ::ffff:xx.xx.xx.xx; Connection reset by peer
Jan 24 02:05:28 pve25 CRON[2520048]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)

pveversion -v

Code:
proxmox-ve: 9.1.0 (running kernel: 6.14.11-4-pve)
pve-manager: 9.1.1 (running version: 9.1.1/42db4a6cf33dac83)
proxmox-kernel-helper: 9.0.4
proxmox-kernel-6.17.2-2-pve-signed: 6.17.2-2
proxmox-kernel-6.17: 6.17.2-2
proxmox-kernel-6.14.11-4-pve-signed: 6.14.11-4
proxmox-kernel-6.14: 6.14.11-4
amd64-microcode: 3.20250311.1
ceph: 19.2.3-pve2
ceph-fuse: 19.2.3-pve2
corosync: 3.1.9-pve2
criu: 4.1.1-1
frr-pythontools: 10.3.1-1+pve4
ifupdown2: 3.3.0-1+pmx11
intel-microcode: 3.20250812.1~deb13u1
ksmtuned: 4.20150326+nmu1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.1
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.4
libpve-apiclient-perl: 3.4.2
libpve-cluster-api-perl: 9.0.7
libpve-cluster-perl: 9.0.7
libpve-common-perl: 9.0.15
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.5
libpve-network-perl: 1.2.3
libpve-rs-perl: 0.11.3
libpve-storage-perl: 9.1.0
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.5-3
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
openvswitch-switch: 3.5.0-1+b1
proxmox-backup-client: 4.0.20-1
proxmox-backup-file-restore: 4.0.20-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.1
proxmox-kernel-helper: 9.0.4
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.3
proxmox-widget-toolkit: 5.1.2
pve-cluster: 9.0.7
pve-container: 6.0.18
pve-docs: 9.1.1
pve-edk2-firmware: not correctly installed
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.4
pve-firmware: 3.17-2
pve-ha-manager: 5.0.8
pve-i18n: 3.6.2
pve-qemu-kvm: 10.1.2-4
pve-xtermjs: 5.5.0-3
qemu-server: 9.1.0
smartmontools: 7.4-pve1
spiceterm: 3.4.1
swtpm: 0.8.0+pve3
vncterm: 1.9.1
zfsutils-linux: 2.3.4-pve1

Haven't updated since our kernel problems (a separate issue; kernel 6.14 is pinned right now). However, this is not something new: as mentioned, we have had several occurrences of these reboots since we enabled HA about 6 months ago, in both version 8 and 9 of PVE.

Looks like in my case the bugfixes linked above might solve the issue in this environment.
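
One way to check whether the installed pve-ha-manager already contains those commits (assuming the package ships the usual Debian changelog) is:

Code:
pveversion -v | grep pve-ha-manager                                    # installed version
zcat /usr/share/doc/pve-ha-manager/changelog.Debian.gz | head -n 40    # changes in that version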
 
Your system seems to be very overloaded, and there are log messages stating that the HA cycle took almost a minute right before the watchdog expired. Since it is the HA stack that keeps the watchdog from expiring, I suspect this is the cause on your system.
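
A quick way to gauge how starved a node gets around the backup window is the kernel's pressure stall counters (assuming the running kernel has PSI enabled):

Code:
# sustained high avg60/avg300 values here around the backup window would support the overload theory
cat /proc/pressure/cpu /proc/pressure/io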