Watchdog Reboots

your system seems to be very overloaded, and there are log messages stating that the HA cycle took almost a minute right before the watchdog expires - since it's the HA stack that keeps the watchdog from expiring, I suspect this to be the cause on your system..
it seems, but it is absolutely not, our normal cpu load during the day is kept very low, same with memory.
This 'overloaded' is caused purely by io delay on the mounted backup volume, which makes sense is slower during backup windows.
This should however not cause a complete PVE node to reboot because the backup takes a bit longer...
Looking at CPU and Mem graphs in the backup windows, its lower then during usage hours so I do not see why that should cause timeouts in the watchdog.
 
We had the same here. Two different server reboots.
- once we haven't seen the watchdog. (no syslog in place then) But all other stuff (corosync etc.) hints and shows the exact behaviour than the second reboot.
- the second time we've seen the watchdog got written via syslog, but not in the journalctl

We had/have currently the feeling it's eventually connected with the pom we use as lxc on an nfs. When the backup of the LXC is running we do have "IO Delay" but not with no other bigger vm backups.
That's why we did change the vzdump conf.
1769587391312.png
 
It seems like there were some issues that got fixed in the last week regarding ha-manager's update loop, which should update the watchdog timer:

  • Bug 7133 - pve-ha-crm: if many HA resources are defined, migration from HA groups to rules may delay update loop (commit)
    • The call to update_service_config(...) for the HA resources without
      group assignments cause unnecessary updates to the config and can become
      costly with higher HA resource counts, which might prevent the CRM to
      update its watchdog in time, so skip these updates.
  • manager: group migration: bulk update changes to resource config (commit)
    • The migration process from HA groups to HA rules might require a lot of
      small updates to individual HA resource configs. These updates have been
      done per-HA resource, which is quite inefficient and can cause the CRM
      to fail to update its watchdog in time.

During one of our incidents one of the nodes actually logged something about this:


Code:
jan 22 15:09:14 tp-01-node-a-wp-a pve-ha-crm[2699]: loop take too long (56 seconds)
jan 22 15:15:19 tp-01-node-a-wp-a pve-ha-crm[2699]: loop take too long (360 seconds)
The current version of pve-ha-manager is 5.1.0 which does not contain any of these patches. Also there is no 'testing' version available yet, the patches seem a bit much to all do manual, do we have any timeline when a 5.1.1 would come in testing?
 
the package in question hasn't been bumped yet, once it is the referenced bug will be updated.

it seems, but it is absolutely not, our normal cpu load during the day is kept very low, same with memory.
This 'overloaded' is caused purely by io delay on the mounted backup volume, which makes sense is slower during backup windows.
This should however not cause a complete PVE node to reboot because the backup takes a bit longer...
Looking at CPU and Mem graphs in the backup windows, its lower then during usage hours so I do not see why that should cause timeouts in the watchdog.

yes, the overload might very well be on the storage level. the log indicates that both pvestatd and the ha loops/cycles take way too long.