Mellanox ConnectX-6 100G is limited to a bitrate of ~34 Gbit/s

dan.ger

We have 3 nodes (Proxmox 6.4-13, latest version) with Mellanox dual-port ConnectX-6 100G cards, connected as a switchless full-mesh network in Ethernet mode with RoCEv2, driver OFED-5.4-1.0.3. The cards sit in PCIe Gen 3.0 x16 slots (8 GT/s per lane, roughly 15.75 GB/s or ~126 Gbit/s per direction), so the slot itself should not be the bottleneck for a single 100G port. MTU is configured to 9000, so they should reach much higher throughput.

Code:
3b:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
        Subsystem: Mellanox Technologies MT28908 Family [ConnectX-6]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 922
        NUMA node: 0
        Region 0: Memory at ae000000 (64-bit, prefetchable) [size=32M]
        Expansion ROM at ab000000 [disabled] [size=1M]
        Capabilities: [60] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed unknown, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 <4us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [48] Vital Product Data
                Product Name: ConnectX-6 VPI adapter card, 100Gb/s (HDR100, EDR IB and 100GbE), dual-port QSFP56
                Read-only fields:
                        [PN] Part number: MCX653106A-ECAT
                        [EC] Engineering changes: AD
                        [V2] Vendor specific: MCX653106A-ECAT
                        [SN] Serial number: xxxxxxxxxxxx
                        [V3] Vendor specific: d8bf42051f87eb118000b8cef65d458e
                        [VA] Vendor specific: MLX:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0:MODL=CX653106A
                        [V0] Vendor specific: PCIeGen4 x16
                        [RV] Reserved: checksum good, 1 byte(s) reserved
                End
        Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
                Vector table: BAR=0 offset=00002000
                PBA: BAR=0 offset=00003000
        Capabilities: [c0] Vendor Specific Information: Len=18 <?>
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
                AERCap: First Error Pointer: 04, GenCap+ CGenEn+ ChkCap+ ChkEn+
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [1c0 v1] #19
        Capabilities: [230 v1] Access Control Services
                ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Capabilities: [320 v1] #27
        Capabilities: [370 v1] #26
        Capabilities: [420 v1] #25
        Kernel driver in use: mlx5_core
        Kernel modules: mlx5_core
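
As a sanity check, the negotiated link speed, the MTU and the PCIe link can be verified directly on each node; the interface name enp59s0f0 below is only an example and has to be adjusted to your system:

Code:
ethtool enp59s0f0 | grep -i speed          # expect: Speed: 100000Mb/s
ip link show enp59s0f0 | grep mtu          # expect: mtu 9000
lspci -vv -s 3b:00.0 | grep -i 'LnkSta:'   # expect: Speed 8GT/s, Width x16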

If I start an iperf3 measurement I only reach ~30 Gbit/s within the cluster, regardless of which node is the server:

Code:
Connecting to host 10.10.10.3, port 5201
[  5] local 10.10.10.1 port 59730 connected to 10.10.10.3 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  3.78 GBytes  32.4 Gbits/sec    0   1.80 MBytes
[  5]   1.00-2.00   sec  4.67 GBytes  40.1 Gbits/sec    0   2.74 MBytes
[  5]   2.00-3.00   sec  3.27 GBytes  28.1 Gbits/sec    0   2.74 MBytes
[  5]   3.00-4.00   sec  3.61 GBytes  31.0 Gbits/sec    0   2.74 MBytes
[  5]   4.00-5.00   sec  3.55 GBytes  30.5 Gbits/sec    0   2.74 MBytes
[  5]   5.00-6.00   sec  4.81 GBytes  41.3 Gbits/sec    0   2.74 MBytes
[  5]   6.00-7.00   sec  4.72 GBytes  40.5 Gbits/sec    0   2.74 MBytes
[  5]   7.00-8.00   sec  3.92 GBytes  33.7 Gbits/sec    0   3.04 MBytes
[  5]   8.00-9.00   sec  3.53 GBytes  30.3 Gbits/sec    0   3.04 MBytes
[  5]   9.00-10.00  sec  3.52 GBytes  30.2 Gbits/sec    0   3.04 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  39.4 GBytes  33.8 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  39.4 GBytes  33.8 Gbits/sec                  receiver

So what can I do to increase the throughput for the Ceph cluster network?
 
Are you sure that iperf3 is not CPU-limited? If I remember correctly, iperf3 handles all streams in a single thread (so one core), whereas iperf2 can use multiple cores.

Have you tried the -P option with iperf2? Or launch multiple iperf3 instances in parallel.

Also, try increasing the window size with -w; if you send a lot of small packets, you can be core-limited with only one stream as well.
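
To illustrate the "multiple iperf3 instances" idea, a minimal sketch (the port numbers and the core pinning are just an example, not taken from this thread):

Code:
# server node: one iperf3 daemon per port
for p in 5201 5202 5203 5204; do iperf3 -s -D -p $p; done
# client node: one client per port, pinned to separate cores
for i in 0 1 2 3; do taskset -c $i iperf3 -c 10.10.10.3 -p $((5201+i)) -P 4 -t 30 & done
wait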
 
Thanks for the hint, now I get roughly 54 Gbit/s with iperf3 -P 24 -l 64K -w 256K. As I understand it, the 24 streams all belong to a single process (and thus one core), so to get the full speed I need to run multiple iperf servers on different ports.
 
Yes. I have some big streaming servers at work with 2x 100Gb; it's really not easy to reach that much bandwidth. You need to be careful about memory too (the number of memory channels is really important), and about PCIe bandwidth (PCIe 3 can be a limit; try to get PCIe 4 with EPYC, for example :)
 
So I did some research. What I did was apply the Mellanox TCP tuning settings:

Code:
sysctl -w net.ipv4.tcp_timestamps=0
sysctl -w net.ipv4.tcp_sack=1
sysctl -w net.core.netdev_max_backlog=250000
sysctl -w net.core.rmem_max=4194304
sysctl -w net.core.wmem_max=4194304
sysctl -w net.core.rmem_default=4194304
sysctl -w net.core.wmem_default=4194304
sysctl -w net.core.optmem_max=4194304
sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
sysctl -w net.ipv4.tcp_low_latency=1
sysctl -w net.ipv4.tcp_adv_win_scale=1
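
These sysctl -w calls do not survive a reboot; one way to keep them (my addition, the file name is just an example) is a drop-in file under /etc/sysctl.d/:

Code:
# /etc/sysctl.d/90-net-tuning.conf  (example name)
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 1
net.core.netdev_max_backlog = 250000
net.core.rmem_max = 4194304
net.core.wmem_max = 4194304
net.core.rmem_default = 4194304
net.core.wmem_default = 4194304
net.core.optmem_max = 4194304
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 65536 4194304
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_adv_win_scale = 1
# reload without reboot:
# sysctl --system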

Start the server on 10.10.10.2:
Code:
iperf -s

Do a measurement on the other node (10.10.10.3):
Code:
iperf -c 10.10.10.2 -P 4

And I only get:
Code:
iperf -c 10.10.10.2 -P 4
------------------------------------------------------------
Client connecting to 10.10.10.2, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local 10.10.10.3 port 51610 connected with 10.10.10.2 port 5001
[  5] local 10.10.10.3 port 51614 connected with 10.10.10.2 port 5001
[  6] local 10.10.10.3 port 51616 connected with 10.10.10.2 port 5001
[  4] local 10.10.10.3 port 51612 connected with 10.10.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  19.8 GBytes  17.0 Gbits/sec
[  5]  0.0-10.0 sec  19.9 GBytes  17.1 Gbits/sec
[  6]  0.0-10.0 sec  11.5 GBytes  9.88 Gbits/sec
[  4]  0.0-10.0 sec  11.4 GBytes  9.81 Gbits/sec
[SUM]  0.0-10.0 sec  62.6 GBytes  53.8 Gbits/sec

So what am I doing wrong?

pveversion:
Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.140-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-6
pve-kernel-helper: 6.4-6
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.128-1-pve: 5.4.128-2
ceph: 15.2.14-pve1~bpo10
ceph-fuse: 15.2.14-pve1~bpo10
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-1
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.5-pve1~bpo10+1

Bare metal hardware:
Code:
Dell R740XD
2x Xeon Gold 6138
12x 32GB RAM
Dual-port 100Gb Mellanox ConnectX-6 (Ethernet mode, connected as a switchless mesh network), placed in a dedicated (non-shared) PCIe Gen 3 x16 slot
8x 1.0TB NVMe drives
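
One thing worth checking while iperf is running (my suggestion, not something raised in the thread so far) is whether single cores are saturated or the CPUs are sitting at a low clock:

Code:
# per-core utilisation during the test (mpstat is part of the sysstat package)
mpstat -P ALL 1
# current scaling governor and clock speeds
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
grep MHz /proc/cpuinfo | sort | uniq -c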
 
Hello,

it was the CPU C-states. I had to set the CPU to performance by adding the following kernel parameters for the Intel Xeon Gold in /etc/default/grub:

Code:
intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll intel_pstate=disable
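
For completeness, a minimal sketch of where those parameters go (assuming the stock GRUB setup of a default Proxmox install; the existing "quiet" option is just the Proxmox default):

Code:
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll intel_pstate=disable"
# regenerate grub.cfg, then reboot
update-grub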

Now I get nearly 100 Gbit/s.

And the VMs get a throughput of 2.5 GB/s write and 6.3 GB/s read instead of 1.2 GB/s write and 3.1 GB/s read (because the Intel P4500 are read-intensive NVMes).
 
I'm using the same GRUB config on Xeon Gold too, to always have the maximum CPU frequency on my Ceph cluster.
(I didn't know your use case here, as you had only asked about iperf ;)
 
