[SOLVED] Mellanox ConnectX-5 EN - 100G running at 40G

Mar 7, 2022
Hi there;

I have 3 Proxmox nodes connected via Mellanox 100GbE ConnectX-5 EN QSFP28 cards in cross-connect mode, using 3-meter 100G DAC cables.

The card is an MCX516A-CCAT.

Bash:
lspci -vv -s 01:00.0
01:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
        Subsystem: Mellanox Technologies Mellanox ConnectX®-5 MCX516A-CCAT
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 154
        IOMMU group: 80
        Region 0: Memory at f8000000 (64-bit, prefetchable) [size=32M]
        Expansion ROM at fbc00000 [disabled] [size=1M]
        Capabilities: [60] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 512 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (downgraded), Width x16 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn+
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [48] Vital Product Data
                Product Name: CX516A - ConnectX-5 QSFP28
                Read-only fields:
                        [PN] Part number: MCX516A-CCAT
                        [EC] Engineering changes: B2
                        [V2] Vendor specific: MCX516A-CCAT
                        [SN] Serial number: REDACTED
                        [V3] Vendor specific: 4299dfae3b13eb118000043f72dc1e64
                        [VA] Vendor specific: MLX:MODL=CX516A:MN=MLNX:CSKU=V2:UUID=V3CI=V0
                        [V0] Vendor specific: PCIeGen3 x16
                        [RV] Reserved: checksum good, 2 byte(s) reserved
                End
        Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
                Vector table: BAR=0 offset=00002000
                PBA: BAR=0 offset=00003000
        Capabilities: [c0] Vendor Specific Information: Len=18 <?>
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                AERCap: First Error Pointer: 04, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [1c0 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [230 v1] Access Control Services
                ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Kernel driver in use: mlx5_core
        Kernel modules: mlx5_core


The card is recognized at 40 Gbit/s.

  • Any ideas how to change the speed to 100 Gbit/s?
  • Am I missing a driver? This is my first time working with Mellanox cards. (A quick driver/firmware check is sketched below.)
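
For what it's worth, the driver and firmware actually bound to the port can be checked with ethtool; a quick check, using the interface name that appears further down in this thread (the lshw output below already reports mlx5_core with firmware 16.29.1016):

Bash:
# show the driver, driver version and NIC firmware bound to the port
ethtool -i enp1s0f0np0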
 
Last edited:
Bash:
ethtool enp1s0f0np0
Settings for enp1s0f0np0:
        Supported ports: [ Backplane ]
        Supported link modes:   1000baseKX/Full
                                10000baseKR/Full
                                40000baseKR4/Full
                                40000baseCR4/Full
                                40000baseSR4/Full
                                40000baseLR4/Full
                                25000baseCR/Full
                                25000baseKR/Full
                                25000baseSR/Full
                                50000baseCR2/Full
                                50000baseKR2/Full
                                100000baseKR4/Full
                                100000baseSR4/Full
                                100000baseCR4/Full
                                100000baseLR4_ER4/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Supported FEC modes: None        RS      BASER
        Advertised link modes:  1000baseKX/Full
                                10000baseKR/Full
                                40000baseKR4/Full
                                40000baseCR4/Full
                                40000baseSR4/Full
                                40000baseLR4/Full
                                25000baseCR/Full
                                25000baseKR/Full
                                25000baseSR/Full
                                50000baseCR2/Full
                                50000baseKR2/Full
                                100000baseKR4/Full
                                100000baseSR4/Full
                                100000baseCR4/Full
                                100000baseLR4_ER4/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Advertised FEC modes: None       RS      BASER
        Link partner advertised link modes:  Not reported
        Link partner advertised pause frame use: No
        Link partner advertised auto-negotiation: Yes
        Link partner advertised FEC modes: Not reported
        Speed: 100000Mb/s
        Duplex: Full
        Auto-negotiation: on
        Port: Direct Attach Copper
        PHYAD: 0
        Transceiver: internal
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00000004 (4)
                               link
        Link detected: yes

All 3 nodes show

Speed: 100000Mb/s


Bash:
lshw -class network
  *-network:0
       description: Ethernet interface
       product: MT27800 Family [ConnectX-5]
       vendor: Mellanox Technologies
       physical id: 0
       bus info: pci@0000:01:00.0
       logical name: enp1s0f0np0
       version: 00
       serial: 04:3f:72:dc:1e:64
       capacity: 40Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical 1000bt-fd 10000bt-fd 25000bt-fd 40000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=mlx5_core driverversion=5.13.19-3-pve duplex=full firmware=16.29.1016 (MT_0000000012) latency=0 link=yes multicast=yes slave=yes
       resources: irq:154 memory:f8000000-f9ffffff memory:fbc00000-fbcfffff

lshw shows a capacity of 40 Gbit/s.
Is the lshw readout wrong?
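
As a cross-check, the negotiated link speed can also be read straight from sysfs; a small sketch, assuming the interface name from the output above:

Bash:
# negotiated link speed in Mb/s as the kernel sees it
cat /sys/class/net/enp1s0f0np0/speed

# ethtool's view of the current link
ethtool enp1s0f0np0 | grep -E 'Speed|Link detected'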


This is based on the 3-node "Full Mesh Network for Ceph Server" guide:
https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Broadcast_setup

Bash:
cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto enp195s0f0
iface enp195s0f0 inet manual
        ovs_type OVSPort
        ovs_bridge vmbr0
#i350 - 1Gbit - Green

iface enx2ad320d35c01 inet manual
#USB - 1Gbit - Purple

auto enp195s0f1
iface enp195s0f1 inet manual
        ovs_type OVSPort
        ovs_bridge vmbr1
#i350 - 1Gbit - Green

iface enx22b1a71ebb7e inet manual
#https://www.thomas-krenn.com/de/wiki/Virtuelles_Netzwerkinterface_enx_von_Supermicro_Mainboards

auto enp67s0f0
iface enp67s0f0 inet manual
        mtu 9000
#BCM57840 - 10Gbit - Light Blue

auto enp67s0f1
iface enp67s0f1 inet manual
        mtu 9000
#BCM57840 - 10Gbit - Light Blue

iface enx72af4034154b inet manual

auto enp1s0f0np0
iface enp1s0f0np0 inet manual
        mtu 65000
#MT27800(X-5) 100Gbit - Black

auto enp1s0f1np1
iface enp1s0f1np1 inet manual
        mtu 65000
#MT27800(X-5) 100Gbit - Black

auto vlan0
iface vlan0 inet static
        address 10.61.0.101/24
        ovs_type OVSIntPort
        ovs_bridge vmbr0
#Port: Corosync

auto vlan1
iface vlan1 inet static
        address 10.61.1.101/24
        ovs_type OVSIntPort
        ovs_bridge vmbr1
#Port:Admin PRX01

auto vlan42
iface vlan42 inet static
        address 192.168.42.101/24
        gateway 192.168.42.254
        ovs_type OVSIntPort
        ovs_bridge vmbr1
#Port: WWW PRX01

auto vlan5
iface vlan5 inet static
        address 10.61.5.101/24
        ovs_type OVSIntPort
        ovs_bridge vmbr2
        ovs_mtu 9000
#Port: NAS Backups

auto bond2
iface bond2 inet manual
        ovs_bonds enp67s0f0 enp67s0f1
        ovs_type OVSBond
        ovs_bridge vmbr2
        ovs_mtu 9000
        ovs_options bond_mode=balance-tcp lacp=active
#2x10Gbit/s

auto bond4
iface bond4 inet static
        address 10.61.4.101/24
        bond-slaves enp1s0f0np0 enp1s0f1np1
        bond-miimon 100
        bond-mode broadcast
        mtu 65000
#BOND: ZFS-Sync

auto vmbr1
iface vmbr1 inet manual
        ovs_type OVSBridge
        ovs_ports enp195s0f1 vlan1 vlan42
#Bridge: Administrative-Net

auto vmbr0
iface vmbr0 inet manual
        ovs_type OVSBridge
        ovs_ports enp195s0f0 vlan0
#Bridge: Corosync

auto vmbr2
iface vmbr2 inet manual
        ovs_type OVSBridge
        ovs_ports vlan5 bond2
        ovs_mtu 9000
#Bridge: Clients

Any iperf3 run I do tops out at a combined bandwidth of 21 Gbit/s or less.


Out of curiosity, and based on this thread about a ConnectX-6:
https://forum.proxmox.com/threads/m...imited-to-bitrate-34gbits-s.96378/post-421663

Might this be caused by the AMD EPYC 7502P 32-core processor's C-states?

Code:
 cat /sys/module/intel_idle/parameters/max_cstate
9
 
Last edited:
Any iperf3 run I do tops out at a combined bandwidth of 21 Gbit/s or less.

Do you run iperf3 with multiple parallel connections? You won't be able to reach 100 Gbit/s with only one connection.

Also, iperf3 is single-threaded even with multiple connections, so it could be CPU-saturated (one core at 100%).
https://fasterdata.es.net/performan...ubleshooting-tools/iperf/multi-stream-iperf3/

iperf2 with multiple connections uses multiple threads, so it scales better across CPU cores.
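
For example, a rough sketch of driving several iperf3 instances in parallel on separate ports, each pinned to its own core with taskset (ports and core IDs are only placeholders):

Bash:
# server side: one listener per port, daemonized, each on its own core
for p in 5101 5102 5103; do
    taskset -c $((p - 5101)) iperf3 -s -D -p "$p"
done

# client side: one stream per listener, run in parallel
for p in 5101 5102 5103; do
    taskset -c $((p - 5101)) iperf3 -c 10.61.4.101 -p "$p" -t 10 &
done
wait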



About C-states: on my EPYC server I'm using these GRUB options to force the CPU to max frequency:

GRUB_CMDLINE_LINUX="idle=poll intel_idle.max_cstate=0 intel_pstate=disable"
 
About C-states: on my EPYC server I'm using these GRUB options to force the CPU to max frequency:

GRUB_CMDLINE_LINUX="idle=poll intel_idle.max_cstate=0 intel_pstate=disable"

Just a note that both the intel_idle and intel_pstate modules are only active on Intel CPUs (as the names suggest), so those parameters won't do anything on your EPYC server. 'processor.max_cstate=2' might be what you want.
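
A quick way to confirm which cpuidle driver is actually active, plus a sketch of where that AMD-relevant parameter would go (the value 2 is just the suggestion above, not a tested recommendation):

Bash:
# on an EPYC box this should report acpi_idle, not intel_idle
cat /sys/devices/system/cpu/cpuidle/current_driver

# /etc/default/grub - then run update-grub and reboot
GRUB_CMDLINE_LINUX="processor.max_cstate=2"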
 
So, I applied the following tuning:
https://www.thomas-krenn.com/de/wiki/AMD_EPYC_Performance_Tuning

specifically page 13 of:
https://developer.amd.com/wp-content/resources/56739_Linux Network tuning v0.20.pdf
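
I can't paste the PDF here, but for context, a sketch of the kind of knobs such network-tuning guides touch (NIC ring buffers and kernel socket buffers); the values below are only illustrative, not taken from the document:

Bash:
# inspect, then enlarge, the NIC ring buffers
ethtool -g enp1s0f0np0
ethtool -G enp1s0f0np0 rx 8192 tx 8192

# raise kernel socket buffer limits (example values)
sysctl -w net.core.rmem_max=268435456
sysctl -w net.core.wmem_max=268435456
sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"
sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"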





According to page 8 of:
https://developer.amd.com/wp-content/resources/56745_0.75.pdf

"Setting APBDIS (to disable APB) and specifying a fixed SOC P-state of 0 will force the Infinity Fabric and memory controllers into full-power mode"

"DF C-states"
  • Disabled: do not allow Infinity Fabric to go to a low-power state when the processor has entered Cx states
  • Enabled: allow Infinity Fabric to go to a low-power state when the processor has entered Cx states

Page 16:
Preferred I/O Settings
Benefit: Allows devices on a single PCIe bus to obtain improved DMA write performance.



AMD CBS > CPU Common Options > Local APIC Mode: x2APIC
AMD CBS > NBio Common Options > SMU Common Options > Determinism Control: Manual
AMD CBS > NBio Common Options > SMU Common Options > Determinism Slider: Performance
AMD CBS > NBio Common Options > SMU Common Options > APBDIS: 1
AMD CBS > NBio Common Options > SMU Common Options > DF Cstates: Disabled
AMD CBS > NBio Common Options > SMU Common Options > Fixed SOC PState: P0


The Preferred I/O setting is eluding me - it seems not to be available in the ASUS RS500A-E10 BIOS.


This resulted in the following speeds:
Bash:
 s1:  Connecting to host 10.61.4.101, port 5101
s3:  Connecting to host 10.61.4.101, port 5103
s2:  Connecting to host 10.61.4.101, port 5102
s1:  [  5] local 10.61.4.102 port 60346 connected to 10.61.4.101 port 5101
s3:  [  5] local 10.61.4.102 port 50850 connected to 10.61.4.101 port 5103
s2:  [  5] local 10.61.4.102 port 34372 connected to 10.61.4.101 port 5102
s1:  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
s1:  [  5]   0.00-1.00   sec  2.19 GBytes  18.8 Gbits/sec  130   2.21 MBytes
s2:  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
s2:  [  5]   0.00-1.00   sec  2.07 GBytes  17.8 Gbits/sec  677   2.10 MBytes
s3:  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
s3:  [  5]   0.00-1.00   sec  2.11 GBytes  18.2 Gbits/sec  675   1.61 MBytes
s1:  [  5]   1.00-2.00   sec  2.18 GBytes  18.7 Gbits/sec    3   2.50 MBytes
s3:  [  5]   1.00-2.00   sec  2.09 GBytes  18.0 Gbits/sec   37   1.55 MBytes
s2:  [  5]   1.00-2.00   sec  2.15 GBytes  18.5 Gbits/sec  143   2.35 MBytes
s1:  [  5]   2.00-3.00   sec  2.16 GBytes  18.6 Gbits/sec  215   1.77 MBytes
s2:  [  5]   2.00-3.00   sec  2.17 GBytes  18.6 Gbits/sec    4   2.81 MBytes
s3:  [  5]   2.00-3.00   sec  2.09 GBytes  18.0 Gbits/sec   55   1.87 MBytes
s1:  [  5]   3.00-4.00   sec  2.20 GBytes  18.9 Gbits/sec    9   2.36 MBytes
s2:  [  5]   3.00-4.00   sec  2.02 GBytes  17.4 Gbits/sec   87   2.81 MBytes
s3:  [  5]   3.00-4.00   sec  2.20 GBytes  18.9 Gbits/sec    0   2.50 MBytes
s1:  [  5]   4.00-5.00   sec  2.85 GBytes  24.5 Gbits/sec    2   2.72 MBytes
s3:  [  5]   4.00-5.00   sec  1.78 GBytes  15.3 Gbits/sec   29   1.26 MBytes
s2:  [  5]   4.00-5.00   sec  1.80 GBytes  15.5 Gbits/sec    5   1.24 MBytes
s1:  [  5]   5.00-6.00   sec  3.05 GBytes  26.2 Gbits/sec    0   2.95 MBytes
s2:  [  5]   5.00-6.00   sec  1.69 GBytes  14.5 Gbits/sec    0    961 KBytes
s3:  [  5]   5.00-6.00   sec  1.69 GBytes  14.5 Gbits/sec    0   1.42 MBytes
s1:  [  5]   6.00-7.00   sec  2.66 GBytes  22.9 Gbits/sec    1   2.97 MBytes
s2:  [  5]   6.00-7.00   sec  1.71 GBytes  14.7 Gbits/sec  135   1.60 MBytes
s3:  [  5]   6.00-7.00   sec  2.00 GBytes  17.2 Gbits/sec  329   1.29 MBytes
s1:  [  5]   7.00-8.00   sec  2.17 GBytes  18.6 Gbits/sec  977    594 KBytes
s2:  [  5]   7.00-8.00   sec  2.25 GBytes  19.3 Gbits/sec    1   2.41 MBytes
s3:  [  5]   7.00-8.00   sec  1.98 GBytes  17.0 Gbits/sec  233   2.15 MBytes
s1:  [  5]   8.00-9.00   sec  2.54 GBytes  21.8 Gbits/sec   81   2.28 MBytes
s3:  [  5]   8.00-9.00   sec  1.92 GBytes  16.5 Gbits/sec    1   2.19 MBytes
s2:  [  5]   8.00-9.00   sec  1.95 GBytes  16.8 Gbits/sec    3   1.13 MBytes
s1:  [  5]   9.00-10.00  sec  2.28 GBytes  19.6 Gbits/sec  484   2.30 MBytes
s1:  - - - - - - - - - - - - - - - - - - - - - - - - -
s1:  [ ID] Interval           Transfer     Bitrate         Retr
s1:  [  5]   0.00-10.00  sec  24.3 GBytes  20.9 Gbits/sec  1902             sender
s1:  [  5]   0.00-10.00  sec  24.3 GBytes  20.9 Gbits/sec                  receiver
s1:
s1:  iperf Done.
s2:  [  5]   9.00-10.00  sec  2.11 GBytes  18.2 Gbits/sec    1   1.59 MBytes
s2:  - - - - - - - - - - - - - - - - - - - - - - - - -
s2:  [ ID] Interval           Transfer     Bitrate         Retr
s2:  [  5]   0.00-10.00  sec  19.9 GBytes  17.1 Gbits/sec  1056             sender
s2:  [  5]   0.00-10.00  sec  19.9 GBytes  17.1 Gbits/sec                  receiver
s2:
s2:  iperf Done.
s3:  [  5]   9.00-10.00  sec  2.02 GBytes  17.3 Gbits/sec  163   1.87 MBytes
s3:  - - - - - - - - - - - - - - - - - - - - - - - - -
s3:  [ ID] Interval           Transfer     Bitrate         Retr
s3:  [  5]   0.00-10.00  sec  19.9 GBytes  17.1 Gbits/sec  1522             sender
s3:  [  5]   0.00-10.00  sec  19.9 GBytes  17.1 Gbits/sec                  receiver
s3:
s3:  iperf Done.

Combined speed: ~55 Gbit/s of unidirectional transfer.

That is roughly a 30 Gbit/s increase, but still a far cry from AMD's own test results with a ConnectX-5.

Bash:
root@PRX02:~# iperf -c 10.61.4.101 -e -P 8
[  3] local 10.61.4.102%bond4 port 59286 connected with 10.61.4.101 port 5001 (MSS=8948) (ct=0.21 ms)
[  5] local 10.61.4.102%bond4 port 59292 connected with 10.61.4.101 port 5001 (MSS=8948) (ct=0.16 ms)
[ 12] local 10.61.4.102%bond4 port 59298 connected with 10.61.4.101 port 5001 (MSS=8948) (ct=0.13 ms)
[  8] local 10.61.4.102%bond4 port 59296 connected with 10.61.4.101 port 5001 (MSS=8948) (ct=0.23 ms)
[ 15] local 10.61.4.102%bond4 port 59304 connected with 10.61.4.101 port 5001 (MSS=8948) (ct=0.11 ms)
------------------------------------------------------------
Client connecting to 10.61.4.101, TCP port 5001 with pid 5086 (8 flows)
Write buffer size:  128 KByte
TCP window size: 1.30 MByte (default)
------------------------------------------------------------
[  6] local 10.61.4.102%bond4 port 59290 connected with 10.61.4.101 port 5001 (MSS=8948) (ct=0.23 ms)
[  4] local 10.61.4.102%bond4 port 59288 connected with 10.61.4.101 port 5001 (MSS=8948) (ct=0.20 ms)
[ 13] local 10.61.4.102%bond4 port 59300 connected with 10.61.4.101 port 5001 (MSS=8948) (ct=0.13 ms)
[ ID] Interval            Transfer    Bandwidth       Write/Err  Rtry     Cwnd/RTT        NetPwr
[ 13] 0.0000-10.0011 sec  4.77 GBytes  4.10 Gbits/sec  39088/0       1206      550K/260 us  1970258.23
[  5] 0.0000-10.0011 sec  9.24 GBytes  7.93 Gbits/sec  75665/0        507      821K/1588 us  624452.85
[  4] 0.0000-10.0012 sec  9.26 GBytes  7.95 Gbits/sec  75841/0        476      865K/519 us  1915084.38
[  8] 0.0000-10.0010 sec  8.23 GBytes  7.07 Gbits/sec  67395/0        917     1118K/473 us  1867352.24
[ 15] 0.0000-10.0013 sec  9.23 GBytes  7.93 Gbits/sec  75604/0        278      882K/305 us  3248560.58
[  3] 0.0000-10.0015 sec  9.25 GBytes  7.94 Gbits/sec  75762/0        517      637K/196 us  5065630.97
[ 12] 0.0000-10.0004 sec  6.82 GBytes  5.86 Gbits/sec  55878/0        532     1057K/282 us  2597024.02
[  6] 0.0000-10.0005 sec  7.48 GBytes  6.42 Gbits/sec  61254/0       1456      681K/361 us  2223857.22
[ ID] Interval       Transfer     Bandwidth
[SUM] 0.0000-10.0004 sec  64.3 GBytes  55.2 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) = 0.105/0.173/0.227/0.075 ms (tot/err) = 8/0

Also at ~55 Gbit/s unidirectional - but with a bunch of retries.

Bash:
cpupower idle-info
CPUidle driver: acpi_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 3
Available idle states: POLL C1 C2
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 546
Duration: 93108
C1:
Flags/Description: ACPI FFH MWAIT 0x0
Latency: 1
Usage: 9892
Duration: 4549089
C2:
Flags/Description: ACPI IOPORT 0x814
Latency: 400
Usage: 15079
Duration: 972196824

If I am reading this correctly, the CPU is still entering different idle states and ignoring the BIOS settings?



Edit 10-ish:

Followed page 7 of this guide:
http://developer.amd.com/wp-content/resources/56420.pdf

For an HPC cluster with a high-performance, low-latency interconnect such as Mellanox, disable the C2 idle state.

Bash:
apt install linux-cpupower
cpupower idle-set -d 2

Set the CPU governor to 'performance':
Bash:
cpupower frequency-set -g performance
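
Note that cpupower settings do not survive a reboot; a minimal sketch for reapplying them at boot via a systemd oneshot unit (the unit name is just an example):

Bash:
cat > /etc/systemd/system/cpu-perf.service <<'EOF'
[Unit]
Description=Disable C2 idle state and set performance governor

[Service]
Type=oneshot
ExecStart=/usr/bin/cpupower idle-set -d 2
ExecStart=/usr/bin/cpupower frequency-set -g performance

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now cpu-perf.service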

The speed seems to be stable at 55 Gbit/s now. No more retries. And I can finally measure the CPU usage via htop.
But a single flow will no longer exceed 6 Gbit/s.


Sidenote: using 64 iperf threads from Node 2 and 32 iperf threads from Node 3, I can get Node 1 to utilize 78% of its 32C/64T. Node 2 still gets a combined 55 Gbit/s and Node 3 a combined 32 Gbit/s.

Pretty sure that means...


RESULT: Yes - I am CPU-limited using iperf(1) with the Mellanox card in a Linux "broadcast" bond.

Question: Would one of the Open vSwitch options be less CPU-intensive?
 
Last edited: