[SOLVED] Mellanox ConnectX-5 EN - 100G running at 40G

Mar 7, 2022
Hi there;

I have 3 Proxmox nodes connected via Mellanox 100GbE ConnectX-5 EN QSFP28 cards in cross-connect mode, using 3-meter 100G DAC cables.

The card is an MCX516A-CCAT. Here is the output of lspci -vv -s 01:00.0:
Bash:
01:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
        Subsystem: Mellanox Technologies Mellanox ConnectX®-5 MCX516A-CCAT
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 154
        IOMMU group: 80
        Region 0: Memory at f8000000 (64-bit, prefetchable) [size=32M]
        Expansion ROM at fbc00000 [disabled] [size=1M]
        Capabilities: [60] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 512 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (downgraded), Width x16 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn+
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [48] Vital Product Data
                Product Name: CX516A - ConnectX-5 QSFP28
                Read-only fields:
                        [PN] Part number: MCX516A-CCAT
                        [EC] Engineering changes: B2
                        [V2] Vendor specific: MCX516A-CCAT
                        [SN] Serial number: REDACTED
                        [V3] Vendor specific: 4299dfae3b13eb118000043f72dc1e64
                        [VA] Vendor specific: MLX:MODL=CX516A:MN=MLNX:CSKU=V2:UUID=V3CI=V0
                        [V0] Vendor specific: PCIeGen3 x16
                        [RV] Reserved: checksum good, 2 byte(s) reserved
                End
        Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
                Vector table: BAR=0 offset=00002000
                PBA: BAR=0 offset=00003000
        Capabilities: [c0] Vendor Specific Information: Len=18 <?>
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                AERCap: First Error Pointer: 04, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [1c0 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [230 v1] Access Control Services
                ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Kernel driver in use: mlx5_core
        Kernel modules: mlx5_core


The card is recognized at 40 Gbit/s.

  • Any ideas how to change the speed to 100 Gbit/s?
  • Am I missing a driver? This is my first time working with Mellanox cards (a quick way to check is sketched below).
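A note in passing: the inbox mlx5_core driver that ships with the Proxmox kernel is what drives ConnectX-5 cards (the lspci output above already shows "Kernel driver in use: mlx5_core"), so a missing driver is unlikely. A quick, generic sanity check, assuming the interface name enp1s0f0np0 used further down, would be:

Bash:
# Show driver, driver version and NIC firmware for the 100G port
ethtool -i enp1s0f0np0

If that reports mlx5_core plus a firmware version, the card is being driven correctly and the real question is which link speed was actually negotiated.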
 
Bash:
ethtool enp1s0f0np0
Settings for enp1s0f0np0:
        Supported ports: [ Backplane ]
        Supported link modes:   1000baseKX/Full
                                10000baseKR/Full
                                40000baseKR4/Full
                                40000baseCR4/Full
                                40000baseSR4/Full
                                40000baseLR4/Full
                                25000baseCR/Full
                                25000baseKR/Full
                                25000baseSR/Full
                                50000baseCR2/Full
                                50000baseKR2/Full
                                100000baseKR4/Full
                                100000baseSR4/Full
                                100000baseCR4/Full
                                100000baseLR4_ER4/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Supported FEC modes: None        RS      BASER
        Advertised link modes:  1000baseKX/Full
                                10000baseKR/Full
                                40000baseKR4/Full
                                40000baseCR4/Full
                                40000baseSR4/Full
                                40000baseLR4/Full
                                25000baseCR/Full
                                25000baseKR/Full
                                25000baseSR/Full
                                50000baseCR2/Full
                                50000baseKR2/Full
                                100000baseKR4/Full
                                100000baseSR4/Full
                                100000baseCR4/Full
                                100000baseLR4_ER4/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Advertised FEC modes: None       RS      BASER
        Link partner advertised link modes:  Not reported
        Link partner advertised pause frame use: No
        Link partner advertised auto-negotiation: Yes
        Link partner advertised FEC modes: Not reported
        Speed: 100000Mb/s
        Duplex: Full
        Auto-negotiation: on
        Port: Direct Attach Copper
        PHYAD: 0
        Transceiver: internal
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00000004 (4)
                               link
        Link detected: yes

All 3 nodes show

Speed: 100000Mb/s


Bash:
lshw -class network
  *-network:0
       description: Ethernet interface
       product: MT27800 Family [ConnectX-5]
       vendor: Mellanox Technologies
       physical id: 0
       bus info: pci@0000:01:00.0
       logical name: enp1s0f0np0
       version: 00
       serial: 04:3f:72:dc:1e:64
       capacity: 40Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical 1000bt-fd 10000bt-fd 25000bt-fd 40000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=mlx5_core driverversion=5.13.19-3-pve duplex=full firmware=16.29.1016 (MT_0000000012) latency=0 link=yes multicast=yes slave=yes
       resources: irq:154 memory:f8000000-f9ffffff memory:fbc00000-fbcfffff

The lshw output shows 40 Gbit/s.
Is the readout of lshw wrong?
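Most likely the lshw readout is just misleading: its "capacity" field is derived from the link modes lshw knows how to decode, presumably capped at 40 Gbit/s because older releases predate the 100GbE link-mode definitions. The speed the kernel actually negotiated can be read directly, for example:

Bash:
# Negotiated link speed in Mb/s (100000 = 100 Gbit/s)
cat /sys/class/net/enp1s0f0np0/speed
# Same information via ethtool
ethtool enp1s0f0np0 | grep -E 'Speed|Link detected'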


This is based on the 3-node Full Mesh Network for Ceph Server guide:
https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Broadcast_setup

Bash:
cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto enp195s0f0
iface enp195s0f0 inet manual
        ovs_type OVSPort
        ovs_bridge vmbr0
#i350 - 1Gbit - Green

iface enx2ad320d35c01 inet manual
#USB - 1Gbit - Purple

auto enp195s0f1
iface enp195s0f1 inet manual
        ovs_type OVSPort
        ovs_bridge vmbr1
#i350 - 1Gbit - Green

iface enx22b1a71ebb7e inet manual
#https://www.thomas-krenn.com/de/wiki/Virtuelles_Netzwerkinterface_enx_von_Supermicro_Mainboards

auto enp67s0f0
iface enp67s0f0 inet manual
        mtu 9000
#BCM57840 - 10Gbit - Light Blue

auto enp67s0f1
iface enp67s0f1 inet manual
        mtu 9000
#BCM57840 - 10Gbit - Light Blue

iface enx72af4034154b inet manual

auto enp1s0f0np0
iface enp1s0f0np0 inet manual
        mtu 65000
#MT27800(X-5) 100Gbit - Black

auto enp1s0f1np1
iface enp1s0f1np1 inet manual
        mtu 65000
#MT27800(X-5) 100Gbit - Black

auto vlan0
iface vlan0 inet static
        address 10.61.0.101/24
        ovs_type OVSIntPort
        ovs_bridge vmbr0
#Port: Corosync

auto vlan1
iface vlan1 inet static
        address 10.61.1.101/24
        ovs_type OVSIntPort
        ovs_bridge vmbr1
#Port:Admin PRX01

auto vlan42
iface vlan42 inet static
        address 192.168.42.101/24
        gateway 192.168.42.254
        ovs_type OVSIntPort
        ovs_bridge vmbr1
#Port: WWW PRX01

auto vlan5
iface vlan5 inet static
        address 10.61.5.101/24
        ovs_type OVSIntPort
        ovs_bridge vmbr2
        ovs_mtu 9000
#Port: NAS Backups

auto bond2
iface bond2 inet manual
        ovs_bonds enp67s0f0 enp67s0f1
        ovs_type OVSBond
        ovs_bridge vmbr2
        ovs_mtu 9000
        ovs_options bond_mode=balance-tcp lacp=active
#2x10Gbit/s

auto bond4
iface bond4 inet static
        address 10.61.4.101/24
        bond-slaves enp1s0f0np0 enp1s0f1np1
        bond-miimon 100
        bond-mode broadcast
        mtu 65000
#BOND: ZFS-Sync

auto vmbr1
iface vmbr1 inet manual
        ovs_type OVSBridge
        ovs_ports enp195s0f1 vlan1 vlan42
#Bridge: Administrative-Net

auto vmbr0
iface vmbr0 inet manual
        ovs_type OVSBridge
        ovs_ports enp195s0f0 vlan0
#Bridge: Corosync

auto vmbr2
iface vmbr2 inet manual
        ovs_type OVSBridge
        ovs_ports vlan5 bond2
        ovs_mtu 9000
#Bridge: Clients

Any iperf3 run I do tops out at a combined bandwidth of 21 Gbit/s or less.


Out of curiosity, and based on this thread about a ConnectX-6:
https://forum.proxmox.com/threads/m...imited-to-bitrate-34gbits-s.96378/post-421663

Might this be caused by the AMD EPYC 7502P 32-Core Processor's C-states?

Code:
 cat /sys/module/intel_idle/parameters/max_cstate
9
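
For completeness, the idle driver actually in use and the C-states it exposes can be read from sysfs regardless of CPU vendor (these are the standard cpuidle attributes):

Bash:
# Active cpuidle driver (acpi_idle on this EPYC box, see the cpupower output further down)
cat /sys/devices/system/cpu/cpuidle/current_driver
# Idle states offered for CPU 0
cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name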
 
Any iperf3 run I do tops out at a combined bandwidth of 21 Gbit/s or less.

Do you run iperf3 with multiple parallel connections?
You won't be able to reach 100 Gbit/s with only 1 connection.

Also, iperf3 is single-threaded even with multiple connections, so it could be CPU-saturated (1 core at 100%).
https://fasterdata.es.net/performan...ubleshooting-tools/iperf/multi-stream-iperf3/

iperf2 with multiple connections uses multiple threads, so it scales better on the CPU side.
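
A minimal sketch of both approaches, reusing the 10.61.4.101 target and the 5101-5103 port scheme that shows up in the results further down:

Bash:
# Receiving node: one iperf3 server process per port
for p in 5101 5102 5103; do iperf3 -s -D -p $p; done

# Sending node: one iperf3 client per port, run in parallel
for p in 5101 5102 5103; do iperf3 -c 10.61.4.101 -p $p -t 10 & done; wait

# Alternative: iperf2 spreads its -P streams across threads within a single process
iperf -c 10.61.4.101 -P 8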



About C-states: on my EPYC server I'm using these GRUB options:

GRUB_CMDLINE_LINUX="idle=poll intel_idle.max_cstate=0 intel_pstate=disable"

to force the CPU to max frequency.
 
Just a note that both intel_idle and intel_pstate modules are only active on Intel CPUs (as per the names) so those commands won't do anything on your EPYC server. 'processor.max_cstate=2' might be what you want.
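
For reference, a sketch of how such a kernel parameter is typically applied on a GRUB-booted Debian/Proxmox node (nodes booting via systemd-boot keep their command line in /etc/kernel/cmdline instead):

Bash:
# Append the option to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet processor.max_cstate=2"
# then regenerate the boot configuration and reboot
update-grub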
 
So, I ran the following tuning:
https://www.thomas-krenn.com/de/wiki/AMD_EPYC_Performance_Tuning

Specifically, I used page 13 of:
https://developer.amd.com/wp-content/resources/56739_Linux Network tuning v0.20.pdf
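
For context, the knobs such network-tuning guides usually touch on the NIC side are the ring buffers and offload settings, all reachable via ethtool. The values below are only an example; the "Pre-set maximums" reported by ethtool -g are authoritative:

Bash:
# Current and maximum ring buffer sizes, plus offload settings, for the 100G port
ethtool -g enp1s0f0np0
ethtool -k enp1s0f0np0
# Example: enlarge the RX/TX rings up to the reported maximum
ethtool -G enp1s0f0np0 rx 8192 tx 8192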





According to page 8 of
https://developer.amd.com/wp-content/resources/56745_0.75.pdf

"Setting APBDIS (to disable APB) and specifying a fixed SOC P-state of 0 will force the Infinity Fabric and memory controllers into full-power mode"

"DF C-states"
• Disabled: do not allow Infinity Fabric to go to a low-power state when the processor has entered Cx states
• Enabled: allow Infinity Fabric to go to a low-power state when the processor has entered Cx states

Page 16:
Preferred I/O Settings
Benefit: Allows devices on a single PCIe bus to obtain improved DMA write performance.



AMD CBS > CPU Common Options > Local APIC Mode: x2APIC
AMD CBS > NBio Common Options > SMU Common Options > Determinism Control: Manual
AMD CBS > NBio Common Options > SMU Common Options > Determinism Slider: Performance
AMD CBS > NBio Common Options > SMU Common Options > APBDIS: 1
AMD CBS > NBio Common Options > SMU Common Options > DF Cstates: Disabled
AMD CBS > NBio Common Options > SMU Common Options > Fixed SOC PState: P0


The Preferred I/O setting is eluding me; it does not seem to be available in the ASUS RS500A-E10 BIOS.


This resulted in the following speeds:
Bash:
 s1:  Connecting to host 10.61.4.101, port 5101
s3:  Connecting to host 10.61.4.101, port 5103
s2:  Connecting to host 10.61.4.101, port 5102
s1:  [  5] local 10.61.4.102 port 60346 connected to 10.61.4.101 port 5101
s3:  [  5] local 10.61.4.102 port 50850 connected to 10.61.4.101 port 5103
s2:  [  5] local 10.61.4.102 port 34372 connected to 10.61.4.101 port 5102
s1:  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
s1:  [  5]   0.00-1.00   sec  2.19 GBytes  18.8 Gbits/sec  130   2.21 MBytes
s2:  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
s2:  [  5]   0.00-1.00   sec  2.07 GBytes  17.8 Gbits/sec  677   2.10 MBytes
s3:  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
s3:  [  5]   0.00-1.00   sec  2.11 GBytes  18.2 Gbits/sec  675   1.61 MBytes
s1:  [  5]   1.00-2.00   sec  2.18 GBytes  18.7 Gbits/sec    3   2.50 MBytes
s3:  [  5]   1.00-2.00   sec  2.09 GBytes  18.0 Gbits/sec   37   1.55 MBytes
s2:  [  5]   1.00-2.00   sec  2.15 GBytes  18.5 Gbits/sec  143   2.35 MBytes
s1:  [  5]   2.00-3.00   sec  2.16 GBytes  18.6 Gbits/sec  215   1.77 MBytes
s2:  [  5]   2.00-3.00   sec  2.17 GBytes  18.6 Gbits/sec    4   2.81 MBytes
s3:  [  5]   2.00-3.00   sec  2.09 GBytes  18.0 Gbits/sec   55   1.87 MBytes
s1:  [  5]   3.00-4.00   sec  2.20 GBytes  18.9 Gbits/sec    9   2.36 MBytes
s2:  [  5]   3.00-4.00   sec  2.02 GBytes  17.4 Gbits/sec   87   2.81 MBytes
s3:  [  5]   3.00-4.00   sec  2.20 GBytes  18.9 Gbits/sec    0   2.50 MBytes
s1:  [  5]   4.00-5.00   sec  2.85 GBytes  24.5 Gbits/sec    2   2.72 MBytes
s3:  [  5]   4.00-5.00   sec  1.78 GBytes  15.3 Gbits/sec   29   1.26 MBytes
s2:  [  5]   4.00-5.00   sec  1.80 GBytes  15.5 Gbits/sec    5   1.24 MBytes
s1:  [  5]   5.00-6.00   sec  3.05 GBytes  26.2 Gbits/sec    0   2.95 MBytes
s2:  [  5]   5.00-6.00   sec  1.69 GBytes  14.5 Gbits/sec    0    961 KBytes
s3:  [  5]   5.00-6.00   sec  1.69 GBytes  14.5 Gbits/sec    0   1.42 MBytes
s1:  [  5]   6.00-7.00   sec  2.66 GBytes  22.9 Gbits/sec    1   2.97 MBytes
s2:  [  5]   6.00-7.00   sec  1.71 GBytes  14.7 Gbits/sec  135   1.60 MBytes
s3:  [  5]   6.00-7.00   sec  2.00 GBytes  17.2 Gbits/sec  329   1.29 MBytes
s1:  [  5]   7.00-8.00   sec  2.17 GBytes  18.6 Gbits/sec  977    594 KBytes
s2:  [  5]   7.00-8.00   sec  2.25 GBytes  19.3 Gbits/sec    1   2.41 MBytes
s3:  [  5]   7.00-8.00   sec  1.98 GBytes  17.0 Gbits/sec  233   2.15 MBytes
s1:  [  5]   8.00-9.00   sec  2.54 GBytes  21.8 Gbits/sec   81   2.28 MBytes
s3:  [  5]   8.00-9.00   sec  1.92 GBytes  16.5 Gbits/sec    1   2.19 MBytes
s2:  [  5]   8.00-9.00   sec  1.95 GBytes  16.8 Gbits/sec    3   1.13 MBytes
s1:  [  5]   9.00-10.00  sec  2.28 GBytes  19.6 Gbits/sec  484   2.30 MBytes
s1:  - - - - - - - - - - - - - - - - - - - - - - - - -
s1:  [ ID] Interval           Transfer     Bitrate         Retr
s1:  [  5]   0.00-10.00  sec  24.3 GBytes  20.9 Gbits/sec  1902             sender
s1:  [  5]   0.00-10.00  sec  24.3 GBytes  20.9 Gbits/sec                  receiver
s1:
s1:  iperf Done.
s2:  [  5]   9.00-10.00  sec  2.11 GBytes  18.2 Gbits/sec    1   1.59 MBytes
s2:  - - - - - - - - - - - - - - - - - - - - - - - - -
s2:  [ ID] Interval           Transfer     Bitrate         Retr
s2:  [  5]   0.00-10.00  sec  19.9 GBytes  17.1 Gbits/sec  1056             sender
s2:  [  5]   0.00-10.00  sec  19.9 GBytes  17.1 Gbits/sec                  receiver
s2:
s2:  iperf Done.
s3:  [  5]   9.00-10.00  sec  2.02 GBytes  17.3 Gbits/sec  163   1.87 MBytes
s3:  - - - - - - - - - - - - - - - - - - - - - - - - -
s3:  [ ID] Interval           Transfer     Bitrate         Retr
s3:  [  5]   0.00-10.00  sec  19.9 GBytes  17.1 Gbits/sec  1522             sender
s3:  [  5]   0.00-10.00  sec  19.9 GBytes  17.1 Gbits/sec                  receiver
s3:
s3:  iperf Done.

Combined speed: ~55 Gbit/s of unidirectional transfer.

That is roughly a 30 Gbit/s increase, but still a far cry from AMD's own test results with a ConnectX-5.

Bash:
root@PRX02:~# iperf -c 10.61.4.101 -e -P 8
[  3] local 10.61.4.102%bond4 port 59286 connected with 10.61.4.101 port 5001 (MSS=8948) (ct=0.21 ms)
[  5] local 10.61.4.102%bond4 port 59292 connected with 10.61.4.101 port 5001 (MSS=8948) (ct=0.16 ms)
[ 12] local 10.61.4.102%bond4 port 59298 connected with 10.61.4.101 port 5001 (MSS=8948) (ct=0.13 ms)
[  8] local 10.61.4.102%bond4 port 59296 connected with 10.61.4.101 port 5001 (MSS=8948) (ct=0.23 ms)
[ 15] local 10.61.4.102%bond4 port 59304 connected with 10.61.4.101 port 5001 (MSS=8948) (ct=0.11 ms)
------------------------------------------------------------
Client connecting to 10.61.4.101, TCP port 5001 with pid 5086 (8 flows)
Write buffer size:  128 KByte
TCP window size: 1.30 MByte (default)
------------------------------------------------------------
[  6] local 10.61.4.102%bond4 port 59290 connected with 10.61.4.101 port 5001 (MSS=8948) (ct=0.23 ms)
[  4] local 10.61.4.102%bond4 port 59288 connected with 10.61.4.101 port 5001 (MSS=8948) (ct=0.20 ms)
[ 13] local 10.61.4.102%bond4 port 59300 connected with 10.61.4.101 port 5001 (MSS=8948) (ct=0.13 ms)
[ ID] Interval            Transfer    Bandwidth       Write/Err  Rtry     Cwnd/RTT        NetPwr
[ 13] 0.0000-10.0011 sec  4.77 GBytes  4.10 Gbits/sec  39088/0       1206      550K/260 us  1970258.23
[  5] 0.0000-10.0011 sec  9.24 GBytes  7.93 Gbits/sec  75665/0        507      821K/1588 us  624452.85
[  4] 0.0000-10.0012 sec  9.26 GBytes  7.95 Gbits/sec  75841/0        476      865K/519 us  1915084.38
[  8] 0.0000-10.0010 sec  8.23 GBytes  7.07 Gbits/sec  67395/0        917     1118K/473 us  1867352.24
[ 15] 0.0000-10.0013 sec  9.23 GBytes  7.93 Gbits/sec  75604/0        278      882K/305 us  3248560.58
[  3] 0.0000-10.0015 sec  9.25 GBytes  7.94 Gbits/sec  75762/0        517      637K/196 us  5065630.97
[ 12] 0.0000-10.0004 sec  6.82 GBytes  5.86 Gbits/sec  55878/0        532     1057K/282 us  2597024.02
[  6] 0.0000-10.0005 sec  7.48 GBytes  6.42 Gbits/sec  61254/0       1456      681K/361 us  2223857.22
[ ID] Interval       Transfer     Bandwidth
[SUM] 0.0000-10.0004 sec  64.3 GBytes  55.2 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) = 0.105/0.173/0.227/0.075 ms (tot/err) = 8/0

Also at 55 Gbit/s unidirectionally, but with a bunch of retries.

Bash:
cpupower idle-info
CPUidle driver: acpi_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 3
Available idle states: POLL C1 C2
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 546
Duration: 93108
C1:
Flags/Description: ACPI FFH MWAIT 0x0
Latency: 1
Usage: 9892
Duration: 4549089
C2:
Flags/Description: ACPI IOPORT 0x814
Latency: 400
Usage: 15079
Duration: 972196824

If I am reading this correctly, the CPU is still entering the different idle states and ignoring the BIOS settings?



Edit 10-ish:

Followed page 7 of this guide:
http://developer.amd.com/wp-content/resources/56420.pdf

For a HPC cluster with a high performance low latency interconnect such as Mellanox disable the C2 idle state.

Bash:
apt install linux-cpupower
cpupower  idle-set -d 2

Set the CPU governor to 'performance':
Bash:
cpupower frequency-set -g performance
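
A quick way to confirm both settings took effect (standard cpuidle/cpufreq sysfs attributes; note that cpupower changes are runtime-only and have to be reapplied after a reboot):

Bash:
# 1 = idle state index 2 (C2 on this box, see the cpupower idle-info output above) is disabled for CPU 0
cat /sys/devices/system/cpu/cpu0/cpuidle/state2/disable
# Should print "performance"
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor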

The speed seems to be stable at 55 Gbit/s now. No more retries. And I can finally measure the CPU usage via htop.
But a single flow will no longer exceed 6 Gbit/s.


Sidenote: using 64 iperf threads from Node 2 and 32 iperf threads from Node 3, I can get Node 1 to utilize 78% of its 32C/64T. Node 2 still gets a combined 55 Gbit/s and Node 3 a combined 32 Gbit/s.

Pretty sure that means...


RESULT: Yes, I am CPU-limited when using iperf and running the Mellanox card in a Linux "broadcast" bond.

Question: Would one of the Open vSwitch options be less CPU-intensive?
 