Proxmox nodes rebooting in cluster

cgirlamo

New Member
Oct 14, 2025
Hello,

We have had several of our Proxmox nodes crash and reboot over the last few weeks. It's always a different node, and generally we'll get a watchdog notice and then the node will crash. Has anyone else encountered similar issues? Is there a setting/parameter we can tweak to fix this?

We are running Proxmox VE 9.0, and our nodes are in a cluster with a total of 11 nodes.

Thanks,
 
If you can, you should look at using a hardware watchdog; that gives you a lot more options.
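For example, the watchdog used by the HA stack can be switched from the default softdog to a hardware module in /etc/default/pve-ha-manager (a minimal sketch; ipmi_watchdog is just an example, use whatever module your boards actually provide):

# /etc/default/pve-ha-manager
# Select a hardware watchdog module instead of the default softdog.
# The right module depends on your hardware (e.g. ipmi_watchdog, iTCO_wdt).
WATCHDOG_MODULE=ipmi_watchdog

The watchdog-mux service reads this at startup, so the node typically needs a reboot before the new module is in use.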
 
We have had several of our Proxmox nodes crash and reboot over the last few weeks.

First question: is High Availability enabled? (Assuming you are running a cluster, which you did not tell us.)

If yes: does corosync have a separate Ethernet connection? How many rings are established?

If not: there are more possible reasons than I want to list here...
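If it helps, you can check what corosync currently sees on each node (assuming a standard PVE install; both tools ship with the cluster stack):

# Show quorum/cluster state and the configured membership
pvecm status

# Show the status of each corosync link (ring) on this node
corosync-cfgtool -s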
 
First question: is High Availability enabled? (Assuming you are running a cluster, which you did not tell us.)

If yes: does corosync have a separate Ethernet connection? How many rings are established?

If not: there are more possible reasons than I want to list here...
Hello UdoB,

Yes, High Availability is enabled, and we are running this as a cluster. Looking through the corosync.conf file, it looks like corosync does not have a separate Ethernet connection, and only one ring is established.

Thanks,
 
it looks like corosync does not have a separate Ethernet connection, and only one ring is established.
It may be that the corosync connection had a hard time doing its job on a possibly congested link.

Please read up on this; the recommendations/requirements for corosync are clear: low latency at all times, and if possible a second "ring" on a separate physical network. --> https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
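As a rough sketch of what that looks like in /etc/pve/corosync.conf (node names and addresses are placeholders, not taken from your cluster; follow the linked procedure when editing the file):

# Excerpt: ring0 on the regular network, ring1 on a dedicated corosync network
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.11
    ring1_addr: 10.10.10.11
  }
  # ... one node { } entry per cluster node ...
}

totem {
  cluster_name: mycluster
  config_version: 5   # must be increased with every change
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}

The important part is that the second ring lives on its own NIC/switch, so corosync still has a clean path when the main network is saturated (e.g. by backup or storage traffic).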

If it fails, then the node (running an LRM) will fence itself --> it reboots hard --> https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#ha_manager_fencing

If that happened during the previous boot, you should find hints in the journal via journalctl -b -1 -e
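For example (just a sketch of how you might narrow it down; the grep pattern is only a suggestion):

# Journal of the previous boot, jumping to the end
journalctl -b -1 -e

# Or filter the previous boot for the usual suspects
journalctl -b -1 | grep -iE 'watchdog|fence|corosync|quorum'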
 