Sascha72036

Member
Aug 22, 2016
Hello everyone,

we're running a 48-node PVE cluster with this setup:
AMD EPYC 7402P, 512 GB memory, Intel X520-DA2 or Mellanox ConnectX-3 NIC, Ceph pool with NVMe only, 2x 10 Gbit/s interfaces (for cluster traffic) + 2x 1 Gbit/s (for public traffic).
As a few others have recently reported in the forum, there are massive problems with larger clusters (>36 nodes).
The main problem is (probably) a bug in corosync: all nodes start flooding each other with UDP traffic on the corosync port.
Changing the transport to SCTP in corosync.conf doesn't seem to be the solution. It resolves the UDP flood, of course, but we run into other problems.

Before we split our cluster: does anyone have an idea what else we could do?
Or is splitting the cluster currently the best solution?

Best regards
Sascha
 

spirit

Famous Member
Apr 2, 2010
www.odiso.com
pveversion -v?

Have you tried increasing the token timeout value? Something like 10000, for example (token: 10000).

Do you have a dedicated link for corosync, or is it mixed with other traffic?

Does the flood happen only sometimes? Are all nodes flooding, or is one specific node flooding the others?
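
To illustrate, raising the token timeout is a small change in the totem section of /etc/pve/corosync.conf. A sketch (on PVE, remember to also bump config_version so the change propagates to all nodes):

```
totem {
  ...
  token: 10000    # total token timeout in ms (current corosync default is 3000)
}
```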
 

Sascha72036

We have one dual-port 10G NIC in each node. The 10G NIC is shared between corosync and Ceph traffic (separated into two different VLANs).
The floods start for no apparent reason. With SCTP instead of knet we have no more floods, but after we restart corosync, some NICs in our cluster reset due to TX timeouts (maybe a problem with MTU discovery).

Not all nodes are flooding at the same time, and each time different nodes have their NICs blocked by floods from some other nodes.

Thank you very much for your help. I will try increasing token, token_coefficient and send_join. What would be good values here?

pveversion -v:

proxmox-ve: 6.3-1 (running kernel: 5.4.98-1-pve)
pve-manager: 6.3-6 (running version: 6.3-6/2184247e)
pve-kernel-5.4: 6.3-8
pve-kernel-helper: 6.3-8
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.103-1-pve: 5.4.103-1
pve-kernel-5.4.98-1-pve: 5.4.98-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
ceph: 14.2.19-pve1
ceph-fuse: 14.2.19-pve1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-9
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.1-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-1
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-5
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-10
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

corosync.conf:

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pvecluster
  config_version: 47
  interface {
    knet_transport: sctp
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  token_coefficient: 3000
  version: 2
  netmtu: 1500
}
 

spirit

Be careful with token_coefficient, because the real token timeout is computed like this:

https://manpages.debian.org/buster/corosync/corosync.conf.5.en.html

Code:
 real token timeout is then computed as token + (number_of_nodes - 2) * token_coefficient.

The default token value is 3000 since the last corosync update (it was 1000 previously).

But with your token_coefficient: 3000 and 48 nodes, you are already at 3000 + (48 - 2) * 3000 = 141000 ms... that seems pretty high (not sure about the impact).

The corosync devs recommend only changing the base token value; something between 3000 and 10000 should be enough. (I'm running 20-node clusters with token: 1000 without any problem.)
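
As a quick sanity check, the man-page formula can be evaluated in a shell. The numbers below come from this thread; 650 ms is the token_coefficient default per corosync.conf(5):

```shell
# real token timeout per corosync.conf(5):
#   token + (number_of_nodes - 2) * token_coefficient
token=3000     # base token timeout (ms), current corosync default
coeff=3000     # token_coefficient from the posted corosync.conf (man-page default: 650)
nodes=48
echo $(( token + (nodes - 2) * coeff ))   # -> 141000 ms
```

With the default coefficient of 650 the same 48-node cluster would land at 32900 ms, which is why changing only the base token value is the gentler knob.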
 
