Three times this week already (last Saturday, yesterday and this morning),
pve-ha-crm has died without cause, or at least without a usable message:
"watchdog update failed - Broken pipe"
The server is not doing anything special when this happens: no heavy load, no network issues, and no issues on the other nodes in the cluster.
journalctl for pve-ha-crm does not show any messages before the reboot. By the way, the message above ("watchdog update failed - Broken pipe") is not stored in the journal, which makes debugging past incidents even harder.
pve-ha-lrm, pve-cluster and corosync all show no errors, and we have 0 lost packets on the network (according to the switch stats, since the PVE node itself had just rebooted).
pvestatd has a round time of 5-8 seconds for this cluster.
System load at the time of the reboot was ~10% CPU and ~40% RAM. There was a single migration running towards this machine.
What else can I check? After more than 2 years of this issue happening randomly, it would be nice to know what the hell is going on.
As far as software versions go, I did a full update of the cluster yesterday (enterprise repo) hoping to solve this issue.