25GbE network capped at 10GbE

Good day,
I have two Proxmox nodes, both connected via SFP28 modules and fibre optic cable on a 25 Gbit network. I can confirm that on both nodes the link is up and running at 25 GbE:

First node:

Code:
root@pve0:~# ethtool ens2f0np0
Settings for ens2f0np0:
    Supported ports: [ FIBRE ]
    Supported link modes:   25000baseSR/Full
                            10000baseSR/Full
    Supported pause frame use: Symmetric Receive-only
    Supports auto-negotiation: Yes
    Supported FEC modes: RS     BASER
    Advertised link modes:  25000baseSR/Full
                            10000baseSR/Full
    Advertised pause frame use: Symmetric
    Advertised auto-negotiation: Yes
    Advertised FEC modes: Not reported
    Speed: 25000Mb/s
    Lanes: 1
    Duplex: Full
    Auto-negotiation: on
    Port: FIBRE
    PHYAD: 1
    Transceiver: internal
    Supports Wake-on: d
    Wake-on: d
    Link detected: yes

Second node:

Code:
root@pve1:~# ethtool ens5f0np0
Settings for ens5f0np0:
    Supported ports: [ FIBRE ]
    Supported link modes:   1000baseKX/Full
                            10000baseKR/Full
                            25000baseCR/Full
                            25000baseKR/Full
                            25000baseSR/Full
    Supported pause frame use: Symmetric
    Supports auto-negotiation: Yes
    Supported FEC modes: None     RS     BASER
    Advertised link modes:  1000baseKX/Full
                            10000baseKR/Full
                            25000baseCR/Full
                            25000baseKR/Full
                            25000baseSR/Full
    Advertised pause frame use: Symmetric
    Advertised auto-negotiation: Yes
    Advertised FEC modes: RS
    Speed: 25000Mb/s
    Duplex: Full
    Auto-negotiation: on
    Port: FIBRE
    PHYAD: 0
    Transceiver: internal
    Supports Wake-on: d
    Wake-on: d
    Link detected: yes

So according to the above output, I would assume that both of my nodes have negotiated a 25 Gbit link speed.

However, when I run iperf, I barely see 11 Gbit/s:

Code:
[  5]   0.00-1.00   sec  1.26 GBytes  10.8 Gbits/sec    0   3.30 MBytes       
[  5]   1.00-2.00   sec  1.27 GBytes  10.9 Gbits/sec    0   3.30 MBytes       
[  5]   2.00-3.00   sec  1.30 GBytes  11.2 Gbits/sec    0   3.30 MBytes       
[  5]   3.00-4.00   sec  1.24 GBytes  10.6 Gbits/sec    0   3.30 MBytes       
[  5]   4.00-5.00   sec  1.17 GBytes  10.1 Gbits/sec    0   3.30 MBytes       
[  5]   5.00-6.00   sec  1.20 GBytes  10.3 Gbits/sec    0   3.30 MBytes       
[  5]   6.00-7.00   sec  1.22 GBytes  10.5 Gbits/sec    0   3.30 MBytes       
[  5]   7.00-8.00   sec  1.33 GBytes  11.4 Gbits/sec    0   3.30 MBytes       
[  5]   8.00-9.00   sec  1.24 GBytes  10.6 Gbits/sec    0   3.30 MBytes       
[  5]   9.00-10.00  sec  1.35 GBytes  11.6 Gbits/sec    0   3.30 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  12.6 GBytes  10.8 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  12.6 GBytes  10.8 Gbits/sec                  receiver

So I wonder what is going on here?

I found a Linux 25 Gbit network tuning guide by Broadcom and adjusted the RX and TX ring buffers of my network cards:

Code:
root@pve1:~# ethtool -g ens5f0np0
Ring parameters for ens5f0np0:
Pre-set maximums:
RX:        8192
RX Mini:    n/a
RX Jumbo:    n/a
TX:        8192
Current hardware settings:
RX:        8192
RX Mini:    n/a
RX Jumbo:    n/a
TX:        8192
RX Buf Len:        n/a
CQE Size:        n/a
TX Push:    off
TCP data split:    off



root@pve0:~# ethtool -g ens2f0np0
Ring parameters for ens2f0np0:
Pre-set maximums:
RX:        2047
RX Mini:    n/a
RX Jumbo:    8191
TX:        2047
Current hardware settings:
RX:        2047
RX Mini:    n/a
RX Jumbo:    8188
TX:        2047
RX Buf Len:        n/a
CQE Size:        n/a
TX Push:    off
TCP data split:    on
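
The "Current hardware settings" above correspond to the reported pre-set maximums; they can be set with ethtool -G, roughly like this (the ens2f0np0 command also appears further down in this thread, the ens5f0np0 values are simply the reported maximums):

Code:
ethtool -G ens5f0np0 rx 8192 tx 8192   # pve1: raise to the pre-set maximum of 8192
ethtool -G ens2f0np0 rx 2047 tx 2047   # pve0: raise to the pre-set maximum of 2047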

I also think that the PCIe speed should not be the problem. On one node I have

Code:
06:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
    Subsystem: Hewlett Packard Enterprise MT27710 Family [ConnectX-4 Lx]
    Physical Slot: 5
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 32 bytes
    Interrupt: pin B routed to IRQ 113
    IOMMU group: 58
    Region 0: Memory at f4000000 (64-bit, prefetchable) [size=32M]
    Expansion ROM at fb600000 [disabled] [size=1M]
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap:    MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25W
        DevCtl:    CorrErr- NonFatalErr- FatalErr- UnsupReq-
            RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta:    CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
        LnkCap:    Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <4us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl:    ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 8GT/s, Width x8
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
             10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS- TPHComp- ExtTPHComp-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
             EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [48] Vital Product Data
        Product Name: HPE Eth 10/25Gb 2p 640SFP28 Adptr
        Read-only fields:
            [PN] Part number: 817751-001
            [EC] Engineering changes: E-5727
            [SN] Serial number: IL28170175
            [V0] Vendor specific: PCIe GEN3 x8 10/25Gb 15W
            [V2] Vendor specific: 5817
            [V4] Vendor specific: 040973E45F80
            [V5] Vendor specific: 0E
            [VA] Vendor specific: HP:V2=MFG:V3=FW_VER:V4=MAC:V5=PCAR
            [VB] Vendor specific: HPE ConnectX-4 Lx SFP28
            [V1] Vendor specific: 14.24.00.13     
            [YA] Asset tag: N/A                   
            [V3] Vendor specific: 14.31.12.00     
            [V6] Vendor specific: 03.06.04.03     
            [RV] Reserved: checksum good, 0 byte(s) reserved
        End
    Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
        Vector table: BAR=0 offset=00002000
        PBA: BAR=0 offset=00003000
    Capabilities: [c0] Vendor Specific Information: Len=18 <?>
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [100 v1] Advanced Error Reporting
        UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt:    DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap:    First Error Pointer: 04, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap:    MFVC- ACS-, Next Function: 0
        ARICtl:    MFVC- ACS-, Function Group: 0
    Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
        IOVCap:    Migration- 10BitTagReq- Interrupt Message Number: 000
        IOVCtl:    Enable- Migration- Interrupt- MSE- ARIHierarchy- 10BitTagReq-
        IOVSta:    Migration-
        Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 01
        VF offset: 9, stride: 1, Device ID: 1016
        Supported Page Size: 000007ff, System Page Size: 00000001
        Region 0: Memory at 00000000f8000000 (64-bit, prefetchable)
        VF Migration: offset: 00000000, BIR: 0
    Capabilities: [230 v1] Access Control Services
        ACSCap:    SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl:    SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
    Kernel driver in use: mlx5_core
    Kernel modules: mlx5_core

and on the other node I have

Code:
51:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
        Subsystem: Broadcom Inc. and subsidiaries BCM957414A4142CC 10Gb/25Gb Ethernet PCIe
        Physical Slot: 2
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 18
        NUMA node: 0
        IOMMU group: 7
        Region 0: Memory at 202fffe10000 (64-bit, prefetchable) [size=64K]
        Region 2: Memory at 202fffd00000 (64-bit, prefetchable) [size=1M]
        Region 4: Memory at 202fffe22000 (64-bit, prefetchable) [size=8K]
        Expansion ROM at d0e80000 [disabled] [size=512K]
        Capabilities: [48] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
                Product Name: Broadcom P225p NetXtreme-E Dual-port 10Gb/25Gb Ethernet PCIe Adapter
                Read-only fields:
                        [PN] Part number: BCM957414A4142CC
                        [MN] Manufacture ID: 14E4
                        [V0] Vendor specific: 228.1.111.0
                        [V1] Vendor specific: 228.0.128.0
                        [V3] Vendor specific: 228.0.116.0
                        [V6] Vendor specific: 228.0.128.0
                        [V7] Vendor specific: 0.0.0
                        [V8] Vendor specific: 228.0.116.0
                        [V9] Vendor specific: 0.0.0
                        [VA] Vendor specific: 228.0.116.0
                        [SN] Serial number: A414223020023HFG
                        [VB] Vendor specific: REV021DEV000
                        [RV] Reserved: checksum good, 161 byte(s) reserved
                End
        Capabilities: [a0] MSI-X: Enable+ Count=148 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00000940
        Capabilities: [ac] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75W
                DevCtl: CorrErr- NonFatalErr- FatalErr+ UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr+ NoSnoop+ FLReset-
                        MaxPayload 512 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x8
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 260ms to 900ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
                         AtomicOpsCtl: ReqEn+
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 04000001 fd00000f 51020000 00000000
        Capabilities: [13c v1] Device Serial Number 14-23-f2-ff-fe-58-41-f0
        Capabilities: [150 v1] Power Budgeting <?>
        Capabilities: [160 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
        Capabilities: [180 v1] Vendor Specific Information: ID=0000 Rev=0 Len=020 <?>
        Capabilities: [1b0 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [1b8 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [230 v1] Transaction Processing Hints
                Interrupt vector mode supported
                Device specific mode supported
                Steering table in MSI-X table
        Capabilities: [300 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [200 v1] Precision Time Measurement
                PTMCap: Requester:+ Responder:- Root:-
                PTMClockGranularity: Unimplemented
                PTMControl: Enabled:- RootSelected:-
                PTMEffectiveGranularity: Unknown
        Kernel driver in use: bnxt_en
        Kernel modules: bnxt_en

So I think
a) the PCIe link is fast enough,
b) the buffers are set large enough,
c) the 25 Gbit speed is negotiated,

but I still do not get the expected iperf performance. Why is this?
 
Hi,

However, when I run iperf, I barely see 11 Gbit/s:
How exactly are you running iperf3? Are you using the -P option, e.g. iperf3 -P 8 ...?
This enables parallel streams, as your machine might simply be capped by the CPU in this regard.
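
For example, something along these lines (a sketch, assuming pve0 runs the server and is reachable under that name):

Code:
# on pve0: start the server
iperf3 -s
# on pve1: run the test with 8 parallel streams
iperf3 -c pve0 -P 8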
 
Hi,


How exactly are you running iperf3? Are you using the -P option, e.g. iperf3 -P 8 ...?
This enables parallel streams, as your machine might simply be capped by the CPU in this regard.

Yes, exactly, I had the same idea.
However, I tested iperf with all of these variants:

iperf3 (with no options)
iperf3 -P 4
iperf3 -P 8
iperf3 -P 16

but the difference between the variants was somewhere between nonexistent and negligible.

My Proxmox host is connected to a bridge, the bridge sits on a bond, and the bond contains the 25G interface as well as a 1G "emergency failover" interface in case the 25G link fails. But I think the speed issue has nothing to do with the bond, as removing the bond does not improve the speed at all.
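
For reference, a minimal sketch of what such a setup can look like in /etc/network/interfaces (ifupdown2 syntax as used by Proxmox); the 1G interface name eno1, the active-backup mode and the /24 prefix are assumptions, only the 25G interface name and the host address come from this thread:

Code:
auto bond0
iface bond0 inet manual
    bond-slaves ens2f0np0 eno1    # 25G primary plus assumed 1G failover NIC
    bond-mode active-backup       # assumed failover mode
    bond-primary ens2f0np0
    bond-miimon 100

auto vmbr0
iface vmbr0 inet static
    address 130.92.113.5/24       # pve0's address from the iperf output below; /24 assumed
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0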
 
Do you have a 25 gigabit link all the way?

Also:

I had an Intel X520 with which I struggled to achieve 1 gigabit from inside a VM. The card is a bit old. I upgraded to an Intel E810. With this I could achieve 10 gigabit VM to VM through a pfSense firewall (routed), while reading the data from one disk and writing it to another. That was even before I passed one port of that card through to the pfSense firewall.

I'm just saying that not all cards behave the same. Also, I put the NIC in the primary PCIe slot that is mainly intended for GPUs on these desktop motherboards.

I remember reading some publications a few years ago that revealed the extent of tweaking involved in getting multi-gigabit cards to their maximum. You need deep buffers and preferably jumbo frames to keep the interrupt rate down.
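
For example, jumbo frames could be enabled roughly like this (a sketch; MTU 9000 must be supported on every hop, and in a bond/bridge setup the MTU has to be raised on the bond and bridge members as well):

Code:
# assumed example, using the 25G interface name from above
ip link set dev ens2f0np0 mtu 9000
# verify
ip link show ens2f0np0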
 
Hi,

yes, I have checked with lspci that my NICs sit on PCIe links with sufficient bandwidth, and dmesg confirms it:

Code:
root@pve1:~# dmesg | grep PCIe
...
[    2.661103] mlx5_core 0000:06:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[    3.179777] mlx5_core 0000:06:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
...


root@pve0:~# dmesg | grep PCIe
...
[    2.312191] bnxt_en 0000:51:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[    2.356801] bnxt_en 0000:51:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
...

So to my understanding, both NICs are connected to PCIe links that support a 63 Gbit/s transfer rate, which is much faster than 25 GbE. In fact, the NICs are dual port, so I would expect to be able to aggregate both ports and reach 50 Gbit/s without saturating the PCIe link. However, I am far from that, as I cannot even get close to 20 Gbit/s.

Also, I increased the buffers a bit:

ethtool -G ens2f0np0 rx 2047 tx 2047

but it did not help much.
 
https://software.es.net/iperf/faq.html

Code:
iperf3 parallel stream performance is much less than iperf2. Why?
Versions of iperf3 before version 3.16 were all single threaded, and iperf2 is multi-threaded. This could result in a performance gap because iperf3 was only able to use one CPU core on a host, which turned into a bottleneck when trying to do high bitrate tests (faster than about 25 Gbps).

Beginning with version 3.16, iperf3 is multi-threaded, which allows it to take advantage of multiple CPU cores during a test (one thread per stream). iperf3 has been observed to send and receive approximately 160Gbps on a 200Gbps path in a test involving multiple TCP flows, with little or no tuning.

Prior to multi-threading support in iperf3, one might need to use the method described here to achieve faster speeds.

Code:
apt show iperf3
Package: iperf3
Version: 3.12-1+deb12u1
Priority: optional
Section: net
Maintainer: Roberto Lumbreras <rover@debian.org>
Installed-Size: 88.1 kB
Pre-Depends: init-system-helpers (>= 1.54~)
Depends: debconf, adduser, libc6 (>= 2.34), libiperf0 (>= 3.1.3), debconf (>= 0.5) | debconf-2.0
Homepage: http://software.es.net/iperf/
Download-Size: 33.9 kB
APT-Sources: http://ftp.fi.debian.org/debian bookworm/main amd64 Packages
Description: Internet Protocol bandwidth measuring tool
 Iperf3 is a tool for performing network throughput measurements. It can
 test either TCP or UDP throughput.
 .
 This is a new implementation that shares no code with the original
 iperf from NLANR/DAST and also is not backwards compatible.
 .
 This package contains the command line utility.
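
Since the packaged iperf3 here is 3.12, i.e. older than 3.16 and therefore single-threaded, one common workaround (not necessarily the exact method the FAQ links to) is to run several iperf3 processes on different ports in parallel and add up the results; a sketch, using pve0 as the server:

Code:
# on the server (pve0): one listener per port
iperf3 -s -p 5201 &
iperf3 -s -p 5202 &

# on the client (pve1): one process per port, run in parallel
iperf3 -c pve0 -p 5201 -t 10 &
iperf3 -c pve0 -p 5202 -t 10 &
wait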
 
Good day,

OK, I finally solved the issue and now get close to 25 Gbit/s!

Code:
root@pve1:~# iperf3 -c pve0.mw.iap.unibe.ch
Connecting to host pve0.mw.iap.unibe.ch, port 5201
[  5] local 130.92.113.6 port 46716 connected to 130.92.113.5 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.68 GBytes  23.0 Gbits/sec    0   2.31 MBytes       
[  5]   1.00-2.00   sec  2.68 GBytes  23.1 Gbits/sec    0   2.45 MBytes       
[  5]   2.00-3.00   sec  2.62 GBytes  22.5 Gbits/sec    0   3.21 MBytes       
[  5]   3.00-4.00   sec  2.69 GBytes  23.1 Gbits/sec    0   3.21 MBytes       
[  5]   4.00-5.00   sec  2.68 GBytes  23.0 Gbits/sec    0   3.21 MBytes       
[  5]   5.00-6.00   sec  2.69 GBytes  23.1 Gbits/sec    0   3.21 MBytes       
[  5]   6.00-7.00   sec  2.69 GBytes  23.1 Gbits/sec    0   3.21 MBytes       
[  5]   7.00-8.00   sec  2.69 GBytes  23.1 Gbits/sec    0   3.21 MBytes       
[  5]   8.00-9.00   sec  2.68 GBytes  23.0 Gbits/sec    0   3.21 MBytes       
[  5]   9.00-10.00  sec  2.68 GBytes  23.0 Gbits/sec    0   3.21 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  26.8 GBytes  23.0 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  26.8 GBytes  23.0 Gbits/sec                  receiver

iperf Done.

I could resolve the issue, but in the end I am not sure which of the following actually solved the problem:
a) ASPM was enabled in the BIOS, and the Mellanox NIC supports ASPM. It appears the NIC was put into a lower power state due to ASPM, which added latency and thus hurt throughput.
b) I had forgotten that I once set the CPU scaling governor to "balanced". That improved the temperatures a bit, but I have now set it back to "performance".

So one or both of the above fixed the issue, and my NICs now run at 25 Gbit/s, or at least close to it. I am happy with the 23 Gbit/s I achieved. In case I ever need more, I still have one spare port at the moment.
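
For anyone hitting the same wall, these are roughly the checks involved (a sketch; BIOS option names vary by vendor, and 06:00.1 is the Mellanox port from the lspci output above):

Code:
# current system-wide ASPM policy and what the NIC link negotiated
cat /sys/module/pcie_aspm/parameters/policy
lspci -s 06:00.1 -vv | grep -i aspm
# ASPM can also be disabled with the kernel parameter pcie_aspm=off (or in the BIOS, as done here)

# set the CPU frequency scaling governor to "performance" on all cores
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor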
 
I should also mention that Broadcom publishes a Linux tuning guide for 25 and 40 GbE.
As I have a Broadcom NIC on one side, I tried the recommended tunings, and they seem to improve throughput slightly.

I have the following commands in a script that is automatically called after booting:

Code:
ip link set dev ens2f0np0 txqueuelen 10000
ethtool -G ens2f0np0 rx 2047 tx 2047
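
Instead of a separate boot script, the same two commands could also be attached to the interface via post-up hooks in /etc/network/interfaces; a rough sketch (the stanza layout is an assumption, only the commands are from the script above):

Code:
iface ens2f0np0 inet manual
    post-up ip link set dev ens2f0np0 txqueuelen 10000
    post-up ethtool -G ens2f0np0 rx 2047 tx 2047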

The optimal size for the ring buffers (the "rx 2047" and "tx 2047" above) can be determined as follows:


Code:
root@pve0:~# ethtool -g ens2f0np0
Ring parameters for ens2f0np0:
Pre-set maximums:
RX:        2047
RX Mini:    n/a
RX Jumbo:    8191
TX:        2047
Current hardware settings:
RX:        2047
RX Mini:    n/a
RX Jumbo:    8188
TX:        2047
RX Buf Len:        n/a
CQE Size:        n/a
TX Push:    off
TCP data split:    on


and then I just used the pre-set maximums. Usually, Linux chooses a somewhat lower value for the "Current hardware settings". Increasing the buffer size did give me some performance improvement.
 
I tried with both iperf and iperf3, i.e. the following versions:

Code:
iperf version 2.1.8 (12 August 2022) pthreads
iperf 3.12 (cJSON 1.7.15)

I also tried with parallel threads; the differences are negligible.
 
