We have 23 cluster nodes, each with a 2x25G or 2x40G bonded interface.
These bonds carry all traffic, including the cluster filesystem and the storage backend connections.
One VM appears to be using about 800 Mbit/s of the links. In total, perhaps 1 Gbit/s is in use on one node's connection towards the 100G switch.
Why does this traffic seem to bring down the whole cluster?
If the traffic stays heavy for a while, we see disrupted cluster communication: VMs cannot be started or stopped, and the whole cluster feels unstable.
All nodes have 1 TB of RAM and 128 cores, and are not overcommitted on memory or cores.
Is there a limitation in the bonding interfaces or the bridges, so that 1G is a limit even with 25G available on the bonds?
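
To rule out a bond member that negotiated a lower speed than expected, something like the following rough sketch could be run on a node. It reads the standard Linux sysfs files `/sys/class/net/<bond>/bonding/slaves` and `/sys/class/net/<iface>/speed`. The bond name `bond0` is only an assumption and has to be adjusted to the real setup; the helper names are made up for illustration.

```python
from __future__ import annotations

from pathlib import Path


def slave_speeds(bond: str, sysfs_root: Path = Path("/sys/class/net")) -> dict[str, int | None]:
    """Return {member_name: negotiated speed in Mbit/s} for one bond.

    A value of None means the speed could not be read (for example
    the link is down, or the driver reports -1).
    """
    slaves_file = sysfs_root / bond / "bonding" / "slaves"
    speeds: dict[str, int | None] = {}
    for slave in slaves_file.read_text().split():
        try:
            speed = int((sysfs_root / slave / "speed").read_text().strip())
        except (OSError, ValueError):
            speed = None
        # The kernel reports -1 when the speed is unknown.
        speeds[slave] = speed if speed is not None and speed > 0 else None
    return speeds


def utilization_percent(traffic_mbit: float, link_mbit: float) -> float:
    """Share of a link used by the given traffic, in percent."""
    return 100.0 * traffic_mbit / link_mbit


if __name__ == "__main__":
    bond_name = "bond0"  # assumption: adjust to the actual bond name
    try:
        for name, speed in slave_speeds(bond_name).items():
            if speed is None:
                print(f"{name}: speed unknown (link down?)")
            else:
                used = utilization_percent(800, speed)  # ~800 Mbit/s from the VM
                print(f"{name}: {speed} Mbit/s, 800 Mbit/s would be {used:.1f}% of it")
    except FileNotFoundError:
        print(f"{bond_name} not found, adjust bond_name")
```

If every member reports 25000 or 40000, the 800 Mbit/s from the VM is only a few percent of one link, so the link speed itself would not explain the problem.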