Network traffic: 1G of traffic blocks the cluster despite a 25G backbone

immo

We have 23 cluster nodes, each with a 2x 25G or 2x 40G bonded interface.
These bonds carry all traffic, including the cluster filesystem and the storage backend connection.
There seems to be one VM that uses ~800 Mbit/s of the links. In the end, maybe 1 Gbit/s of the links towards the 100G switch is in use on one node connection.

Why does this traffic seem to bring down the whole cluster?
If the traffic is heavy for a while, we see disturbed cluster communication: VMs cannot be started or stopped, and the whole cluster feels a bit dizzy.

All nodes have 1 TB RAM and 128 cores, and are not overbooked on memory or cores.

Is there a limitation in the bonding interface or the bridges, so that 1G is a limit even with 25G available on the bonds?
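
To put the numbers above in perspective, here is a minimal Python sketch (not from the original posts) that turns interface byte counters into Mbit/s and link utilisation. It assumes a Linux node that exposes counters under `/sys/class/net/<iface>/statistics/`; the function names and the default interval are illustrative assumptions.

```python
import time
from pathlib import Path


def read_bytes(iface: str, direction: str = "rx") -> int:
    """Read the kernel's cumulative byte counter for one interface."""
    path = Path("/sys/class/net") / iface / "statistics" / f"{direction}_bytes"
    return int(path.read_text())


def rate_mbit(bytes_before: int, bytes_after: int, seconds: float) -> float:
    """Convert a byte-counter delta over a time window into Mbit/s."""
    return (bytes_after - bytes_before) * 8 / seconds / 1_000_000


def utilisation_percent(rate: float, link_speed_mbit: float) -> float:
    """Share of the nominal link speed that a given rate represents."""
    return 100 * rate / link_speed_mbit


def sample_rate_mbit(iface: str, direction: str = "rx", interval: float = 1.0) -> float:
    """Sample the counter twice and return the average rate in between."""
    before = read_bytes(iface, direction)
    time.sleep(interval)
    after = read_bytes(iface, direction)
    return rate_mbit(before, after, interval)
```

For example, `utilisation_percent(800, 25_000)` is about 3.2, so 800 Mbit/s is roughly 3% of a single 25G member link, which suggests raw link capacity alone is unlikely to be the limit.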
 
Hi,

Which type of bonding do you use? LACP or LB-RR?
Which traffic goes over the 25 Gbit and which over the 40 Gbit interfaces?
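
One way to answer the bonding question on a node is to read the bonding driver's status file. The sketch below is only an illustration: it assumes the usual `/proc/net/bonding/<bond>` text layout (`Bonding Mode:`, `Transmit Hash Policy:`, `Slave Interface:`, `Speed:`, `MII Status:`), and the function name is hypothetical. The hash policy is worth a look because with LACP a single flow is normally pinned to one member link.

```python
def parse_bonding(text: str) -> dict:
    """Extract mode, hash policy and per-member details from bonding status text."""
    info = {"mode": None, "hash_policy": None, "slaves": []}
    current = None  # the member interface section we are currently inside
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            info["mode"] = value
        elif key == "Transmit Hash Policy":
            info["hash_policy"] = value
        elif key == "Slave Interface":
            current = {"name": value, "speed": None, "status": None}
            info["slaves"].append(current)
        elif current is not None and key == "Speed":
            current["speed"] = value
        elif current is not None and key == "MII Status":
            current["status"] = value
    return info
```

It could be fed with the contents of `/proc/net/bonding/bond0`, where `bond0` stands in for whatever the bond is called on the node.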
 
We use LACP, and there are nodes with 2x 25G and some with 2x 40G towards the redundant switches, which run MC-LAG.
From these switches the data network traffic is forwarded to the data traffic network switches, and the storage servers are connected directly to these switches.

ALL traffic goes over these links: storage, data and corosync. We know that corosync should be separated, but this wasn't an issue over the last year, so it hasn't been touched yet.
 
Do you use DAC cables or SFPs with fiber?
Have you checked the switches for errors, and the transmit and receive power on the SFP ports?
 
The switches are all OK. I also see nothing in the Linux logs except QMP socket errors like the following, but there are a lot of them whenever the cluster misbehaves:
VM 486 qmp command 'query-proxmox-support' failed - unable to connect to VM 486 qmp socket - timeout after 31 retries
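
To see which VMs are affected and how often, log lines of this shape can be tallied per VM. This is a minimal sketch that matches only the message format quoted above; the function name is hypothetical, and where the lines come from (journal export, syslog file) is left to the reader.

```python
import re
from collections import Counter

# Matches the QMP timeout message format quoted in the post above.
QMP_TIMEOUT = re.compile(
    r"VM (\d+) qmp command '([^']+)' failed - "
    r"unable to connect to VM \d+ qmp socket - timeout after (\d+) retries"
)


def count_qmp_timeouts(lines) -> Counter:
    """Count QMP socket timeout messages per VM ID."""
    counts = Counter()
    for line in lines:
        match = QMP_TIMEOUT.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Comparing the counts with the times when the heavy traffic was running would show whether the timeouts really follow the traffic.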

Today a colleague told me that this specific traffic has been present for ~2 months now, and the problem has happened only three times. But if this traffic is stopped (large SW transfers over a VM that is just a software router with internal VLAN interfaces), the cluster behaves normally.
 
I still recommend that you check the transmit and receive power of the SFPs.
If the optical power is too low, this can also lead to such effects.
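
Optical power readings are often shown in mW and compared against limits given in dBm. As a small aid, this sketch converts between the two with the standard relation dBm = 10 * log10(mW). The threshold argument is an assumption that has to come from the transceiver's datasheet; no value is implied here.

```python
import math


def mw_to_dbm(power_mw: float) -> float:
    """Convert optical power from milliwatts to dBm."""
    if power_mw <= 0:
        return float("-inf")  # no measurable light
    return 10 * math.log10(power_mw)


def is_signal_weak(rx_power_mw: float, min_rx_dbm: float) -> bool:
    """True if received power is below the given minimum.

    min_rx_dbm is a hypothetical parameter: take it from the datasheet
    of the specific SFP module.
    """
    return mw_to_dbm(rx_power_mw) < min_rx_dbm
```

For example, 1 mW is 0 dBm and 0.5 mW is about -3 dBm.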