[SOLVED] The PVE cluster mysteriously rebooted almost simultaneously

yaotian.zheng

New Member
Apr 10, 2025
I built a PVE cluster from three nodes. When the management gateway address became unreachable (the core switch was rebooting), all three PVE nodes restarted almost simultaneously.
Querying the logs on each node with `journalctl -b -0 | grep reboot` shows: "kernel: softdog: soft_reboot_cmd=<not set> soft_active_on_boot=0" and "cron[3572]: (CRON) INFO (Running @reboot jobs)". From these logs it looks like the softdog kernel module triggered the reboot. How can I prevent this from happening again? Attached below are screenshots of my network topology and the log queries.
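For anyone debugging a similar reboot, a few commands run on a PVE node can confirm whether the HA watchdog was armed and firing. This is just a diagnostic sketch; output will vary per system:

```
# Is the softdog kernel module loaded? (PVE's fallback HA watchdog)
lsmod | grep softdog

# Is the watchdog multiplexer that arms it running?
systemctl status watchdog-mux

# Search the PREVIOUS boot's journal for watchdog/fencing/quorum messages
journalctl -b -1 | grep -iE 'watchdog|fence|quorum'
```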
Attachments: network topology.png, log1.jpg, log2.jpg
 
Which network is used for corosync, and could the nodes have lost communication with each other?
 
This is called fencing. If corosync goes down, all nodes that can't establish quorum will reboot within 60s.

When you set up your cluster you'll usually be prompted to add a primary and secondary corosync ring (network connection) so that nodes aren't isolated and fencing doesn't happen.

The best practice would be for each node to have a dedicated corosync NIC, and either a second dedicated corosync NIC, or a priority 7 VLAN over another NIC. The links don't need bandwidth, they just need high-priority/low-latency and can't both go down at the same time.

If your primary switch can go down for maintenance, run one of the corosync rings through a second switch so the cluster stays quorate while you work on the primary.

However, nodes that don't have HA active / are not part of HA will not be affected.

Screenshot 2025-04-09 at 10.50.38 PM.png
 
Hello, is it still possible to add a second corosync ring (network connection) after the PVE cluster has been created?
 
See the Proxmox docs:

5.8.1. Adding Redundant Links To An Existing Cluster
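For reference, the change described in that section amounts to adding a second address per node and a second interface in /etc/pve/corosync.conf. A sketch for one node follows; the names, addresses, and link numbers are examples only, and remember the docs require incrementing config_version when editing:

```
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.1   # existing corosync link
    ring1_addr: 10.1.0.1   # new redundant link (example address)
  }
  # ...add a ring1_addr for each of the other nodes as well...
}

totem {
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}
```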


By the way, your diagram shows 4 PVE nodes, not 3 as your post suggests. An even node count is not good practice for maintaining quorum: with 4 nodes, quorum requires 3 votes, so if 2 nodes go down the remaining 2 cannot reach a majority and the entire cluster loses quorum. You could add a fifth node, or even a QDevice, to maintain the required quorum.
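The quorum arithmetic behind that advice can be sketched in a few lines (a generic strict-majority calculation, not PVE code):

```python
# Majority quorum math for a corosync-style cluster:
# a strict majority of votes is required to stay quorate.

def quorum(votes: int) -> int:
    """Minimum votes needed for a strict majority."""
    return votes // 2 + 1

def tolerated_failures(votes: int) -> int:
    """How many members can fail before quorum is lost."""
    return votes - quorum(votes)

for n in (3, 4, 5):
    print(f"{n} votes: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 3 votes: quorum=2, tolerates 1 failure(s)
# 4 votes: quorum=3, tolerates 1 failure(s)
# 5 votes: quorum=3, tolerates 2 failure(s)
```

Note that 4 votes tolerate no more failures than 3, which is why the usual fix is a fifth node or a QDevice contributing an external vote.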

 
Thank you very much for your reminder. I added the second corosync ring (network connection) according to the documentation.