Hello everyone,
We're running a 48-node PVE cluster with this setup:
AMD EPYC 7402P, 512 GB memory, Intel X520-DA2 or Mellanox ConnectX-3 NICs, an NVMe-only Ceph pool, 2x 10 Gbit/s interfaces (for cluster traffic) + 2x 1 Gbit/s (for public traffic).
As a few others have recently reported in the forum, there are massive problems with larger clusters (>36 nodes).
The main problem is (probably) a bug in corosync: all nodes start flooding each other with UDP packets on the corosync port.
Changing the transport to SCTP in corosync.conf doesn't seem to be the solution. It stops the UDP flood, of course, but we run into other problems.
Before we split our cluster: does anyone have an idea what else we could try?
Or is splitting the cluster currently the best solution?
Best regards
Sascha