[SOLVED] The PVE cluster mysteriously rebooted almost simultaneously

yaotian.zheng

New Member
Apr 10, 2025
I built a PVE cluster from three nodes. When the management gateway address became unreachable (the core switch was rebooting), all three PVE nodes restarted almost simultaneously.
Querying the logs on each node with `journalctl -b -0 | grep reboot` shows: "kernel: softdog: soft_reboot_cmd=<not set> soft_active_on_boot=0" and "cron[3572]: (CRON) INFO (Running @reboot jobs)". From these logs it looks like the softdog kernel module triggered the reboot. How can I prevent this from happening again? Attached below are screenshots of my network topology and the log queries.
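For anyone debugging a similar reboot, a few commands run on a PVE node can confirm whether the HA watchdog was armed and firing. This is just a diagnostic sketch; output will vary per system:

```
# Is the softdog kernel module loaded? (PVE's fallback HA watchdog)
lsmod | grep softdog

# Is the watchdog multiplexer that arms it running?
systemctl status watchdog-mux

# Search the PREVIOUS boot's journal for watchdog/fencing/quorum messages
journalctl -b -1 | grep -iE 'watchdog|fence|quorum'
```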
Attachments: network topology.png, log1.jpg, log2.jpg
 
Which network is used for corosync, and could the nodes have lost communication with each other?
 
This is called fencing. If corosync goes down, all nodes that can't establish quorum will reboot within 60s.

When you set up your cluster you'll usually be prompted to add a primary and secondary corosync ring (network connection) so that nodes aren't isolated and fencing doesn't happen.

The best practice would be for each node to have a dedicated corosync NIC, and either a second dedicated corosync NIC, or a priority 7 VLAN over another NIC. The links don't need bandwidth, they just need high-priority/low-latency and can't both go down at the same time.

If your primary switch can go down for maintenance, run one of the corosync rings through a second switch so the cluster stays quorate while you work on the primary.

However, nodes that don't have HA active / are not part of HA will not be affected.

Screenshot 2025-04-09 at 10.50.38 PM.png
 
Hello, is it still possible to add a second corosync ring (network connection) after the PVE cluster has been created?
 
See the Proxmox docs:

5.8.1. Adding Redundant Links To An Existing Cluster
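For reference, the change described in that section amounts to adding a second address per node and a second interface in /etc/pve/corosync.conf. A sketch for one node follows; the names, addresses, and link numbers are examples only, and remember the docs require incrementing config_version when editing:

```
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.1   # existing corosync link
    ring1_addr: 10.1.0.1   # new redundant link (example address)
  }
  # ...add a ring1_addr for each of the other nodes as well...
}

totem {
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}
```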


By the way, your diagram shows 4 PVE nodes, not 3 as your post suggests. An even node count is not good practice for maintaining quorum: with 4 nodes, quorum requires 3 votes, so if 2 nodes go down the remaining 2 cannot reach a majority and the entire cluster loses quorum. You could add a fifth node, or even a QDevice, to maintain the required quorum.
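The quorum arithmetic behind that advice can be sketched in a few lines (a generic strict-majority calculation, not PVE code):

```python
# Majority quorum math for a corosync-style cluster:
# a strict majority of votes is required to stay quorate.

def quorum(votes: int) -> int:
    """Minimum votes needed for a strict majority."""
    return votes // 2 + 1

def tolerated_failures(votes: int) -> int:
    """How many members can fail before quorum is lost."""
    return votes - quorum(votes)

for n in (3, 4, 5):
    print(f"{n} votes: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 3 votes: quorum=2, tolerates 1 failure(s)
# 4 votes: quorum=3, tolerates 1 failure(s)
# 5 votes: quorum=3, tolerates 2 failure(s)
```

Note that 4 votes tolerate no more failures than 3, which is why the usual fix is a fifth node or a QDevice contributing an external vote.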

 
Thank you very much for your reminder. I added the second corosync ring (network connection) according to the documentation.