Proxmox nodes rebooting in cluster

cgirlamo

New Member
Oct 14, 2025
Hello,

We have had several of our Proxmox nodes crash and reboot over the last few weeks. It's always a different node, and generally we'll get a watchdog notice and then the node will crash. Has anyone else encountered similar issues? Is there a setting/parameter we can tweak to fix this?

We are running Proxmox VE 9.0, and our nodes are in a cluster with a total of 11 nodes.

Thanks,
 
If you can, you should look at using a hardware watchdog; that gives you a lot more options.
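For example, the watchdog used by the HA stack can be switched from the default softdog to a hardware module in /etc/default/pve-ha-manager (a minimal sketch; ipmi_watchdog is just an example, use whatever module your boards actually provide):

# /etc/default/pve-ha-manager
# Select a hardware watchdog module instead of the default softdog.
# The right module depends on your hardware (e.g. ipmi_watchdog, iTCO_wdt).
WATCHDOG_MODULE=ipmi_watchdog

The watchdog-mux service reads this at startup, so the node typically needs a reboot before the new module is in use.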
 
We have had several of our Proxmox nodes crash and reboot over the last few weeks.

First question: is High Availability enabled? (Assuming you are running a cluster, which you did not tell us.)

If yes: does corosync have a separate Ethernet connection? How many rings are established?

If not: there are more possible reasons than I want to list here...
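If it helps, you can check what corosync currently sees on each node (assuming a standard PVE install; both tools ship with the cluster stack):

# Show quorum/cluster state and the configured membership
pvecm status

# Show the status of each corosync link (ring) on this node
corosync-cfgtool -s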
 
First question: is High Availability enabled? (Assuming you are running a cluster, which you did not tell us.)

If yes: does corosync have a separate Ethernet connection? How many rings are established?

If not: there are more possible reasons than I want to list here...
Hello UdoB,

Yes, High Availability is enabled, and we are running this as a cluster. Looking through the corosync.conf file, it looks like corosync does not have a separate Ethernet connection, and only one ring is established.

Thanks,
 
it looks like corosync does not have a separate Ethernet connection, and only one ring is established.
It may be that the corosync connection had a hard time doing its job on a possibly congested link.

Please read up on this; the recommendations/requirements for corosync are clear: low latency at all times, and if possible a second "ring" on a separate physical network. --> https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
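As a rough sketch of what that looks like in /etc/pve/corosync.conf (node names and addresses are placeholders, not taken from your cluster; follow the linked procedure when editing the file):

# Excerpt: ring0 on the regular network, ring1 on a dedicated corosync network
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.11
    ring1_addr: 10.10.10.11
  }
  # ... one node { } entry per cluster node ...
}

totem {
  cluster_name: mycluster
  config_version: 5   # must be increased with every change
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}

The important part is that the second ring lives on its own NIC/switch, so corosync still has a clean path when the main network is saturated (e.g. by backup or storage traffic).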

If it fails, then the node (running an LRM) will fence itself --> it reboots hard --> https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#ha_manager_fencing

If that happened during the previous boot, you should find hints in the journal via journalctl -b -1 -e
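For example (just a sketch of how you might narrow it down; the grep pattern is only a suggestion):

# Journal of the previous boot, jumping to the end
journalctl -b -1 -e

# Or filter the previous boot for the usual suspects
journalctl -b -1 | grep -iE 'watchdog|fence|corosync|quorum'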
 