yes. for your system I am currently unsure what triggers it.
Your system seems to be very overloaded, and there are log messages stating that the HA cycle took almost a minute right before the watchdog expired. Since it's the HA stack that keeps the watchdog from expiring, I suspect this to be the cause on your system.

The current version of pve-ha-manager is 5.1.0, which does not contain any of these patches. There is also no 'testing' version available yet, and the patches seem a bit much to apply manually. Is there any timeline for when a 5.1.1 would land in testing?

It seems like there were some issues that got fixed in the last week regarding ha-manager's update loop, which should update the watchdog timer:
- Bug 7133 - pve-ha-crm: if many HA resources are defined, migration from HA groups to rules may delay update loop (commit)
  - The call to update_service_config(...) for the HA resources without group assignments cause unnecessary updates to the config and can become costly with higher HA resource counts, which might prevent the CRM to update its watchdog in time, so skip these updates.
- manager: group migration: bulk update changes to resource config (commit)
  - The migration process from HA groups to HA rules might require a lot of small updates to individual HA resource configs. These updates have been done per-HA resource, which is quite inefficient and can cause the CRM to fail to update its watchdog in time.
During one of our incidents one of the nodes actually logged something about this:
Code:
jan 22 15:09:14 tp-01-node-a-wp-a pve-ha-crm[2699]: loop take too long (56 seconds)
jan 22 15:15:19 tp-01-node-a-wp-a pve-ha-crm[2699]: loop take too long (360 seconds)
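
For anyone who wants to check their own nodes for the same pattern, here is a rough Python sketch that scans log lines for the "loop take too long" message and compares the reported duration with a watchdog timeout. The 60 second timeout, the 80% warning ratio and the function name `slow_loops` are my own assumptions for illustration, not values taken from pve-ha-manager, so adjust them to whatever applies to your setup.

```python
import re

# Assumption: the watchdog has to be refreshed at least once every 60 seconds.
# This is an illustrative value, not one read from the pve-ha-manager source.
ASSUMED_WATCHDOG_TIMEOUT_S = 60

# Matches syslog-style lines such as:
#   jan 22 15:09:14 <host> pve-ha-crm[2699]: loop take too long (56 seconds)
LOOP_RE = re.compile(
    r"^(?P<stamp>\w{3} +\d+ [\d:]{8}) (?P<host>\S+) "
    r"(?P<daemon>pve-ha-\w+)\[\d+\]: loop take too long \((?P<secs>\d+) seconds\)"
)


def slow_loops(lines, warn_ratio=0.8):
    """Yield (stamp, host, daemon, seconds, verdict) for each slow-loop message."""
    for line in lines:
        match = LOOP_RE.match(line.strip())
        if match is None:
            continue  # not a slow-loop message, ignore it
        secs = int(match["secs"])
        if secs >= ASSUMED_WATCHDOG_TIMEOUT_S:
            verdict = "exceeds assumed timeout"
        elif secs >= warn_ratio * ASSUMED_WATCHDOG_TIMEOUT_S:
            verdict = "close to assumed timeout"
        else:
            verdict = "ok"
        yield match["stamp"], match["host"], match["daemon"], secs, verdict


# The two lines logged during our incident.
sample = [
    "jan 22 15:09:14 tp-01-node-a-wp-a pve-ha-crm[2699]: loop take too long (56 seconds)",
    "jan 22 15:15:19 tp-01-node-a-wp-a pve-ha-crm[2699]: loop take too long (360 seconds)",
]

for row in slow_loops(sample):
    print(row)
```

To run it against a real node, the output of `journalctl -u pve-ha-crm` can be fed in line by line instead of `sample`. Depending on the journal's timestamp format the regular expression may need adjusting.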
It seems so, but it absolutely is not: our normal CPU load during the day is kept very low, same with memory.
This 'overloaded' state is caused purely by IO delay on the mounted backup volume, which is understandably slower during backup windows.
This should however not cause a complete PVE node to reboot just because the backup takes a bit longer...
Looking at the CPU and memory graphs during the backup windows, the load is lower than during usage hours, so I do not see why that should cause timeouts in the watchdog.