Watchdog Reboots

fabian · Jan 27, 2026

XN-Matt said:
I assume you mean @Michiel_1afa?

The host that this happened on for us had 1 VM running.

Yes. For your system, I am currently unsure what triggers it.

Michiel_1afa · Jan 28, 2026

fabian said:
Your system seems to be very overloaded, and there are log messages stating that the HA cycle took almost a minute right before the watchdog expired. Since it's the HA stack that keeps the watchdog from expiring, I suspect this to be the cause on your system.

It seems so, but it is absolutely not. Our normal CPU load during the day is kept very low, and the same goes for memory.
This 'overload' is caused purely by IO delay on the mounted backup volume, which is understandably slower during backup windows.
This should not, however, cause a complete PVE node to reboot just because the backup takes a bit longer...
Looking at the CPU and memory graphs in the backup windows, usage is lower than during usage hours, so I do not see why that should cause timeouts in the watchdog.

katamadone · Jan 28, 2026

We had the same here: two different server reboots.
- The first time we didn't see the watchdog (no syslog in place then), but everything else (corosync etc.) shows exactly the same behaviour as the second reboot.
- The second time we saw the watchdog message written via syslog, but not in journalctl.

We currently have the feeling it is possibly connected with the pom we run as an LXC on NFS. When the backup of that LXC is running we do see "IO Delay", but not with any of the other, bigger VM backups.
That's why we changed the vzdump conf.

Michiel_1afa · Jan 28, 2026

at0rvk said:
It seems like there were some issues that got fixed in the last week regarding ha-manager's update loop, which should update the watchdog timer:

Bug 7133 - pve-ha-crm: if many HA resources are defined, migration from HA groups to rules may delay update loop (commit)

The call to update_service_config(...) for the HA resources without
group assignments cause unnecessary updates to the config and can become
costly with higher HA resource counts, which might prevent the CRM to
update its watchdog in time, so skip these updates.

manager: group migration: bulk update changes to resource config (commit)

The migration process from HA groups to HA rules might require a lot of
small updates to individual HA resource configs. These updates have been
done per-HA resource, which is quite inefficient and can cause the CRM
to fail to update its watchdog in time.

During one of our incidents one of the nodes actually logged something about this:

Code:

jan 22 15:09:14 tp-01-node-a-wp-a pve-ha-crm[2699]: loop take too long (56 seconds)
jan 22 15:15:19 tp-01-node-a-wp-a pve-ha-crm[2699]: loop take too long (360 seconds)
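To see how often this warning shows up around an incident, the log can be filtered for it. The following is a minimal Python sketch, assuming the message format matches the two lines quoted above; the regular expression is inferred from those samples only, and `slow_loops` and `min_seconds` are hypothetical names chosen for illustration.

```python
import re

# Pattern inferred from the two log lines quoted in this post (an assumption,
# not taken from the pve-ha-manager source).
SLOW_LOOP = re.compile(
    r"^(?P<stamp>\w{3} +\d+ [\d:]{8}) (?P<host>\S+) "
    r"(?P<unit>[\w-]+)\[\d+\]: loop take too long \((?P<secs>\d+) seconds\)"
)


def slow_loops(lines, min_seconds=0):
    """Return (timestamp, host, unit, seconds) for each slow-loop warning."""
    hits = []
    for line in lines:
        m = SLOW_LOOP.match(line.strip())
        if m and int(m["secs"]) >= min_seconds:
            hits.append((m["stamp"], m["host"], m["unit"], int(m["secs"])))
    return hits


# The two lines from the incident above.
SAMPLE = [
    "jan 22 15:09:14 tp-01-node-a-wp-a pve-ha-crm[2699]: loop take too long (56 seconds)",
    "jan 22 15:15:19 tp-01-node-a-wp-a pve-ha-crm[2699]: loop take too long (360 seconds)",
]

if __name__ == "__main__":
    for stamp, host, unit, secs in slow_loops(SAMPLE, min_seconds=30):
        print(f"{stamp} {host} {unit}: loop took {secs}s")
```

In practice the lines would come from an exported journal or syslog file rather than the hard-coded `SAMPLE` list.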

The current version of pve-ha-manager is 5.1.0, which does not contain any of these patches. There is also no 'testing' version available yet, and the patches seem like a lot to apply manually. Is there any timeline for when a 5.1.1 would land in testing?

fabian · Jan 28, 2026

The package in question hasn't been bumped yet; once it is, the referenced bug will be updated.

Michiel_1afa said:
It seems so, but it is absolutely not. Our normal CPU load during the day is kept very low, and the same goes for memory.
This 'overload' is caused purely by IO delay on the mounted backup volume, which is understandably slower during backup windows.
This should not, however, cause a complete PVE node to reboot just because the backup takes a bit longer...
Looking at the CPU and memory graphs in the backup windows, usage is lower than during usage hours, so I do not see why that should cause timeouts in the watchdog.

Yes, the overload might very well be at the storage level. The log indicates that both pvestatd and the HA loops/cycles take way too long.
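One way to tell storage stalls apart from CPU or memory load is the kernel's pressure stall information (PSI). The following is a minimal Python sketch, assuming the node's kernel exposes `/proc/pressure/io`; `parse_pressure` and `read_io_pressure` are hypothetical helper names, and the example values are made up to illustrate the documented layout.

```python
def parse_pressure(text):
    """Parse PSI text (the layout used by /proc/pressure/io) into nested dicts."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()  # kind is "some" or "full"
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


def read_io_pressure(path="/proc/pressure/io"):
    """Read and parse the live I/O pressure file (requires PSI support)."""
    with open(path) as fh:
        return parse_pressure(fh.read())


# Illustrative input only; the numbers are invented, not measured.
EXAMPLE = (
    "some avg10=12.50 avg60=8.10 avg300=3.00 total=123456789\n"
    "full avg10=9.75 avg60=6.00 avg300=2.20 total=98765432\n"
)

if __name__ == "__main__":
    stats = parse_pressure(EXAMPLE)
    print(f"share of time all tasks stalled on I/O (10s avg): {stats['full']['avg10']}%")
```

A high `full` value during the backup window alongside low CPU graphs would be consistent with the stall being on the storage side.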

XN-Matt · Feb 16, 2026

Do we have anything further on this?

Another watchdog reboot yesterday.

Michiel_1afa · Feb 16, 2026

From my side: we had another reboot this Friday, on 2 PVE hosts. We traced the issue to our Ceph HDD pool, which was slow at that moment. We're spreading out some of the copy/sync jobs that write to the HDD pool so that not all hosts hit it at once, but it still feels silly that a node has to reboot because an attached storage is slow.
Still waiting for the patches from comment https://forum.proxmox.com/threads/watchdog-reboots.179523/post-833893 to be released. I will be applying those manually.

XN-Matt · Feb 16, 2026

Thanks.

Not seen any slowness on our Ceph pools when it happens. Very little/low usage on the last one. Sub 1/2MB/s.

Michiel_1afa · Mar 2, 2026

New information.

We had 2 host reboots this weekend, the same host both times. What is weird about the 2nd reboot is that we have a very clear "OOM" problem, even though the host currently has 3x the amount of memory that all VMs combined are allowed to use.

Our backup for this host starts at 2am, but at 3:05, when it started backing up an RDS host on this machine, memory usage went crazy. Around 3:20 it started killing VMs one after another, until the watchdog gave up and rebooted the whole host.

Last night this also happened, but not to the extent that the watchdog came into action:

Please note, we have no stopped VMs that get spun up by the backup. All VMs on this host are active and running, so it's not like we start up a number of VMs at that time.

alexskysilk · Mar 2, 2026

Michiel_1afa said:
Jan 24 02:02:55 pve25 vzdump[2026008]: INFO: Starting Backup of VM 157505 (qemu)

Assuming the logs provided are up to the time of death, this is the likely cause. Are you backing up to an NFS target?

Michiel_1afa · Mar 2, 2026

alexskysilk said:
Assuming the logs provided are up to the time of death, this is the likely cause. Are you backing up to an NFS target?

No, that's to a PBS server.

alexskysilk · Mar 2, 2026

Michiel_1afa said:
No, that's to a PBS server.

Are you sharing interfaces between PBS and corosync? If so, separate them.

Michiel_1afa · Mar 2, 2026

alexskysilk said:
Are you sharing interfaces between PBS and corosync? If so, separate them.

Yes I am, and no I cannot. We have link0 on the storage interfaces and link1 on the front-facing ones, and I don't have more interfaces available.

alexskysilk · Mar 2, 2026

Then you need to use QoS and/or limit the bandwidth of your backups; otherwise this will happen again.
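For the bandwidth-limit route, vzdump has a `bwlimit` option that is given in KiB/s. The following is a minimal Python sketch of the unit conversion only, assuming a nominal link speed and a share of it that backups may use; both inputs and the helper name `bwlimit_kib_per_s` are hypothetical, and how much headroom corosync actually needs on a shared link is not something this calculation can answer.

```python
def bwlimit_kib_per_s(link_gbit, backup_share):
    """Convert a share of link capacity into KiB/s.

    link_gbit: nominal link speed in Gbit/s (decimal, 10**9 bits per second).
    backup_share: fraction of the link that backups may use, 0 < share <= 1.
    """
    if not 0 < backup_share <= 1:
        raise ValueError("backup_share must be in (0, 1]")
    bytes_per_s = link_gbit * 1_000_000_000 / 8
    return int(bytes_per_s * backup_share / 1024)


if __name__ == "__main__":
    # Assumed example: a 10 Gbit/s link with half of it allowed for backups.
    print(bwlimit_kib_per_s(10, 0.5))  # 610351
```

The resulting number is only a starting point; the share would need to be tuned against observed corosync latency during backup windows.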

