When the PVE cluster reaches a certain scale, it triggers split-brain situations. The forum provides a solution for this, but it is not user-friendly. I would like to confirm why this type of problem occurs.
To reproduce the scenario, one can deploy a PVE environment in bulk using virtual or physical machines. In my case, I have 50 nodes, which are then grouped into a cluster. All these operations proceed normally. However, after restarting all the nodes, a split-brain failure occurs, exhibiting the following symptoms:
Providing logs and other environmental data repeatedly holds little significance, as this is not an isolated issue. The community has reported it, but there hasn't been a satisfactory resolution. If you wish to reproduce this problem, you can follow the steps I mentioned above, which will consistently reproduce the failure.
To reproduce the scenario, one can deploy a PVE environment in bulk using virtual or physical machines. In my case, I have 50 nodes, which are then grouped into a cluster. All these operations proceed normally. However, after restarting all the nodes, a split-brain failure occurs, exhibiting the following symptoms:
- All nodes remain in a split-brain state and continuously attempt to elect a leader.
- The corosync service utilizes CPU usage of 100% or more.
- The internal network of the cluster becomes chaotic, with significant latency spikes, reaching over 500ms in an environment with an expected latency of less than 1ms.
- When performing operations such as adding or removing nodes to the cluster at this scale, there is also a chance of triggering the aforementioned issue. The network disruption can be attributed specifically to the actions taken within the cluster, as other causes have been ruled out.
Providing logs and other environmental data repeatedly holds little significance, as this is not an isolated issue. The community has reported it, but there hasn't been a satisfactory resolution. If you wish to reproduce this problem, you can follow the steps I mentioned above, which will consistently reproduce the failure.