PVE cluster split-brain

sungl

New Member
Oct 18, 2023
Once a PVE cluster reaches a certain scale, it runs into split-brain situations. The forum offers workarounds for this, but they are not user-friendly. I would like to understand why this type of problem occurs.

To reproduce the scenario, deploy PVE in bulk on virtual or physical machines. In my case there are 50 nodes, which are then joined into a single cluster. All of these operations complete normally. However, after restarting all of the nodes, a split-brain failure occurs with the following symptoms:

  1. All nodes remain in a split-brain state and continuously attempt to elect a leader.
  2. The corosync service consumes 100% CPU or more.
  3. The cluster's internal network becomes chaotic, with latency spiking to over 500 ms in an environment where it is normally below 1 ms.
  4. Adding or removing nodes at this scale can also trigger the same issue. The network disruption is attributable specifically to the cluster operations themselves, as other causes have been ruled out.

Repeatedly providing logs and other environment data seems of little value, since this is not an isolated issue; the community has reported it before without a satisfactory resolution. If you wish to reproduce the problem, follow the steps above and the failure will occur consistently.
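
While the cluster is in this state, it can be observed on any node with standard tools; nothing below is specific to my setup, and the output will of course differ per environment:

Code:
# cluster and quorum state as PVE sees it
pvecm status

# corosync membership and vote state
corosync-quorumtool -s

# follow the membership / election messages live
journalctl -u corosync -f

# CPU usage of the corosync process
top -b -n 1 | grep corosync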
 
Code:
root@node1:~# pveversion -v
proxmox-ve: 8.1.0 (running kernel: 6.5.11-4-pve)
pve-manager: 8.1.3 (running version: 8.1.3/b46aac3b42da5d15)
proxmox-kernel-helper: 8.0.9
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
proxmox-kernel-6.5: 6.5.11-4
ceph-fuse: 17.2.7-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx7
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.4
libpve-rs-perl: 0.8.7
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.0.4-1
proxmox-backup-file-restore: 3.0.4-1
proxmox-kernel-helper: 8.0.9
proxmox-mail-forward: 0.2.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.2
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-1
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.1.2
pve-qemu-kvm: 8.1.2-4
pve-xtermjs: 5.3.0-2
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.0-pve3
 
The following configuration is consistent across all nodes.
Code:

root@node1:~# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.11
  }
  node {
    name: node10
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 192.168.1.20
  }
  node {
    name: node11
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 192.168.1.21
  }
  node {
    name: node12
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 192.168.1.22
  }
  node {
    name: node13
    nodeid: 13
    quorum_votes: 1
    ring0_addr: 192.168.1.23
  }
  node {
    name: node14
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 192.168.1.24
  }
  node {
    name: node15
    nodeid: 15
    quorum_votes: 1
    ring0_addr: 192.168.1.25
  }
  node {
    name: node16
    nodeid: 16
    quorum_votes: 1
    ring0_addr: 192.168.1.26
  }
  node {
    name: node17
    nodeid: 17
    quorum_votes: 1
    ring0_addr: 192.168.1.27
  }
  node {
    name: node18
    nodeid: 18
    quorum_votes: 1
    ring0_addr: 192.168.1.28
  }
  node {
    name: node19
    nodeid: 19
    quorum_votes: 1
    ring0_addr: 192.168.1.29
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.12
  }
  node {
    name: node20
    nodeid: 20
    quorum_votes: 1
    ring0_addr: 192.168.1.30
  }
  node {
    name: node21
    nodeid: 21
    quorum_votes: 1
    ring0_addr: 192.168.1.31
  }
  node {
    name: node22
    nodeid: 22
    quorum_votes: 1
    ring0_addr: 192.168.1.32
  }
  node {
    name: node23
    nodeid: 23
    quorum_votes: 1
    ring0_addr: 192.168.1.33
  }
  node {
    name: node24
    nodeid: 24
    quorum_votes: 1
    ring0_addr: 192.168.1.34
  }
  node {
    name: node25
    nodeid: 25
    quorum_votes: 1
    ring0_addr: 192.168.1.35
  }
  node {
    name: node26
    nodeid: 26
    quorum_votes: 1
    ring0_addr: 192.168.1.36
  }
  node {
    name: node27
    nodeid: 27
    quorum_votes: 1
    ring0_addr: 192.168.1.37
  }
  node {
    name: node28
    nodeid: 28
    quorum_votes: 1
    ring0_addr: 192.168.1.38
  }
  node {
    name: node29
    nodeid: 29
    quorum_votes: 1
    ring0_addr: 192.168.1.39
  }
  node {
    name: node3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.1.13
  }
  node {
    name: node30
    nodeid: 30
    quorum_votes: 1
    ring0_addr: 192.168.1.40
  }
  node {
    name: node31
    nodeid: 31
    quorum_votes: 1
    ring0_addr: 192.168.1.41
  }
  node {
    name: node32
    nodeid: 32
    quorum_votes: 1
    ring0_addr: 192.168.1.42
  }
  node {
    name: node33
    nodeid: 33
    quorum_votes: 1
    ring0_addr: 192.168.1.43
  }
  node {
    name: node34
    nodeid: 34
    quorum_votes: 1
    ring0_addr: 192.168.1.44
  }
  node {
    name: node35
    nodeid: 35
    quorum_votes: 1
    ring0_addr: 192.168.1.45
  }
  node {
    name: node36
    nodeid: 36
    quorum_votes: 1
    ring0_addr: 192.168.1.46
  }
  node {
    name: node37
    nodeid: 37
    quorum_votes: 1
    ring0_addr: 192.168.1.47
  }
  node {
    name: node38
    nodeid: 38
    quorum_votes: 1
    ring0_addr: 192.168.1.48
  }
  node {
    name: node39
    nodeid: 39
    quorum_votes: 1
    ring0_addr: 192.168.1.49
  }
  node {
    name: node4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.1.14
  }
  node {
    name: node40
    nodeid: 40
    quorum_votes: 1
    ring0_addr: 192.168.1.50
  }
  node {
    name: node41
    nodeid: 41
    quorum_votes: 1
    ring0_addr: 192.168.1.51
  }
  node {
    name: node42
    nodeid: 42
    quorum_votes: 1
    ring0_addr: 192.168.1.52
  }
  node {
    name: node43
    nodeid: 43
    quorum_votes: 1
    ring0_addr: 192.168.1.53
  }
  node {
    name: node44
    nodeid: 44
    quorum_votes: 1
    ring0_addr: 192.168.1.54
  }
  node {
    name: node45
    nodeid: 45
    quorum_votes: 1
    ring0_addr: 192.168.1.55
  }
  node {
    name: node46
    nodeid: 46
    quorum_votes: 1
    ring0_addr: 192.168.1.56
  }
  node {
    name: node47
    nodeid: 47
    quorum_votes: 1
    ring0_addr: 192.168.1.57
  }
  node {
    name: node48
    nodeid: 48
    quorum_votes: 1
    ring0_addr: 192.168.1.58
  }
  node {
    name: node49
    nodeid: 49
    quorum_votes: 1
    ring0_addr: 192.168.1.59
  }
  node {
    name: node5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.1.15
  }
  node {
    name: node50
    nodeid: 50
    quorum_votes: 1
    ring0_addr: 192.168.1.60
  }
  node {
    name: node6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.1.16
  }
  node {
    name: node7
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.1.17
  }
  node {
    name: node8
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 192.168.1.18
  }
  node {
    name: node9
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 192.168.1.19
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cluster
  config_version: 50
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
Hi,
as @fabian already mentioned in a different thread of yours [0], a cluster of that size is pushing the limits of what the underlying cluster technology can provide and really needs a fast, low-latency network backing it. If that is not the case, the network can become congested and communication breaks down, which, judging from your description, might very well be the cause of your problems. A cluster that is trying to re-establish connections is much noisier than a cluster in a mostly idle state, which means that bottlenecks show up in that scenario first; see the sketch after the links below for one way to check the link health.

[0] https://forum.proxmox.com/threads/proxmox-ve-cluster-experiences-unexpected-restart.135101/
[1] https://forum.proxmox.com/threads/max-cluster-nodes-with-pve6.55277/
[2] https://forum.proxmox.com/threads/very-large-proxmox-cluster-possibility.55439/
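
As a rough sketch of what checking the link health could look like on a node (standard corosync 3 tooling; the exact stats key names may differ between corosync versions):

Code:
# state of the local node's knet links
corosync-cfgtool -s

# quorum and membership view
corosync-quorumtool -s

# per-link knet statistics (average/max latency, down counts)
corosync-cmapctl -m stats | grep -E 'latency|down_count'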
 
Thank you for your reply. I would like to know the recommended range for the number of nodes in a stable cluster and the minimum requirements for the network environment. :cool:
 
To reproduce the scenario, one can deploy a PVE environment in bulk using virtual or physical machines.
If you are doing your experiments in a virtual environment, you also have to keep in mind CPU and disk contention, as well as hypervisor scheduling. A disk/IO delay could cause the VM OS to be "hung" for a few seconds, and that all adds up. This is especially true during a mass VM boot, which you seem to be causing by rebooting all nodes at once.
Your overall complaint may well have merit, but the testing procedure may be contributing considerably to the symptoms.
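
A quick way to sanity-check this from inside the nodes (generic Linux counters, nothing PVE-specific; /proc/pressure requires a kernel with PSI enabled):

Code:
# "st" (steal time) in the CPU line shows how long the hypervisor kept vCPUs waiting
top -b -n 1 | head -n 5

# pressure stall information: time tasks spent stalled on CPU and on IO
cat /proc/pressure/cpu /proc/pressure/io

# per-second view of iowait (wa) and steal (st)
vmstat 1 5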


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Yes, both physical hardware and virtual machines have been tested, and the issues described above appear in both. The troubleshooting is currently being replicated in a test environment, and the conclusions are consistent.
 
Unfortunately, we have not done systematic benchmarks for setups of this size, but we know of users who managed to get comparable clusters stable. I would recommend looking at [2] from my last post, where Fabian posted a bit about his own testing, and looking through the forum, as others might have shared their experience.
 
Thank you. I will continue to keep an eye on it.