Good day,
I have two Proxmox nodes connected to each other over a 25 Gbit/s link using SFP28 modules and fibre optic cable. On both nodes, ethtool reports the link as up and running at 25 GbE:
First node:
Code:
root@pve0:~# ethtool ens2f0np0
Settings for ens2f0np0:
Supported ports: [ FIBRE ]
Supported link modes: 25000baseSR/Full
10000baseSR/Full
Supported pause frame use: Symmetric Receive-only
Supports auto-negotiation: Yes
Supported FEC modes: RS BASER
Advertised link modes: 25000baseSR/Full
10000baseSR/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 25000Mb/s
Lanes: 1
Duplex: Full
Auto-negotiation: on
Port: FIBRE
PHYAD: 1
Transceiver: internal
Supports Wake-on: d
Wake-on: d
Link detected: yes
Second node:
Code:
root@pve1:~# ethtool ens5f0np0
Settings for ens5f0np0:
Supported ports: [ FIBRE ]
Supported link modes: 1000baseKX/Full
10000baseKR/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: None RS BASER
Advertised link modes: 1000baseKX/Full
10000baseKR/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: RS
Speed: 25000Mb/s
Duplex: Full
Auto-negotiation: on
Port: FIBRE
PHYAD: 0
Transceiver: internal
Supports Wake-on: d
Wake-on: d
Link detected: yes
According to the output above, both nodes have negotiated a 25 Gbit/s link speed.
However, when I run iperf, I see only about 11 Gbit/s and never reach 12 Gbit/s:
Code:
[ 5] 0.00-1.00 sec 1.26 GBytes 10.8 Gbits/sec 0 3.30 MBytes
[ 5] 1.00-2.00 sec 1.27 GBytes 10.9 Gbits/sec 0 3.30 MBytes
[ 5] 2.00-3.00 sec 1.30 GBytes 11.2 Gbits/sec 0 3.30 MBytes
[ 5] 3.00-4.00 sec 1.24 GBytes 10.6 Gbits/sec 0 3.30 MBytes
[ 5] 4.00-5.00 sec 1.17 GBytes 10.1 Gbits/sec 0 3.30 MBytes
[ 5] 5.00-6.00 sec 1.20 GBytes 10.3 Gbits/sec 0 3.30 MBytes
[ 5] 6.00-7.00 sec 1.22 GBytes 10.5 Gbits/sec 0 3.30 MBytes
[ 5] 7.00-8.00 sec 1.33 GBytes 11.4 Gbits/sec 0 3.30 MBytes
[ 5] 8.00-9.00 sec 1.24 GBytes 10.6 Gbits/sec 0 3.30 MBytes
[ 5] 9.00-10.00 sec 1.35 GBytes 11.6 Gbits/sec 0 3.30 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 12.6 GBytes 10.8 Gbits/sec 0 sender
[ 5] 0.00-10.00 sec 12.6 GBytes 10.8 Gbits/sec receiver
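To put a number on the shortfall, the per-second samples above can be averaged and compared with the nominal link speed. This is only a rough Python sketch: the sample values are copied from the iperf output, and `LINE_RATE_GBPS` is the 25000Mb/s reported by ethtool, with framing and protocol overhead ignored.

```python
# Per-second bitrates in Gbit/s, copied from the iperf output above.
samples_gbps = [10.8, 10.9, 11.2, 10.6, 10.1, 10.3, 10.5, 11.4, 10.6, 11.6]

# Nominal link speed as reported by ethtool (25000Mb/s).
# Assumption: Ethernet/IP/TCP overhead is ignored for this rough comparison.
LINE_RATE_GBPS = 25.0

mean_gbps = sum(samples_gbps) / len(samples_gbps)
utilisation = mean_gbps / LINE_RATE_GBPS

print(f"mean throughput:    {mean_gbps:.1f} Gbit/s")
print(f"share of line rate: {utilisation:.0%}")
```

The mean comes out at 10.8 Gbit/s, matching the iperf summary line, which is roughly 43% of the nominal 25 Gbit/s.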
What is going on here?
I found a Linux 25 Gbit network tuning guide by Broadcom and raised the RX and TX ring buffers of both network cards to their maximums:
Code:
root@pve1:~# ethtool -g ens5f0np0
Ring parameters for ens5f0np0:
Pre-set maximums:
RX: 8192
RX Mini: n/a
RX Jumbo: n/a
TX: 8192
Current hardware settings:
RX: 8192
RX Mini: n/a
RX Jumbo: n/a
TX: 8192
RX Buf Len: n/a
CQE Size: n/a
TX Push: off
TCP data split: off
root@pve0:~# ethtool -g ens2f0np0
Ring parameters for ens2f0np0:
Pre-set maximums:
RX: 2047
RX Mini: n/a
RX Jumbo: 8191
TX: 2047
Current hardware settings:
RX: 2047
RX Mini: n/a
RX Jumbo: 8188
TX: 2047
RX Buf Len: n/a
CQE Size: n/a
TX Push: off
TCP data split: on
I also do not think PCIe is the bottleneck, since both cards report a link of 8 GT/s at width x8 (LnkSta). On one node I have
Code:
06:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
Subsystem: Hewlett Packard Enterprise MT27710 Family [ConnectX-4 Lx]
Physical Slot: 5
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin B routed to IRQ 113
IOMMU group: 58
Region 0: Memory at f4000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at fb600000 [disabled] [size=1M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <4us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [48] Vital Product Data
Product Name: HPE Eth 10/25Gb 2p 640SFP28 Adptr
Read-only fields:
[PN] Part number: 817751-001
[EC] Engineering changes: E-5727
[SN] Serial number: IL28170175
[V0] Vendor specific: PCIe GEN3 x8 10/25Gb 15W
[V2] Vendor specific: 5817
[V4] Vendor specific: 040973E45F80
[V5] Vendor specific: 0E
[VA] Vendor specific: HP:V2=MFG:V3=FW_VER:V4=MAC:V5=PCAR
[VB] Vendor specific: HPE ConnectX-4 Lx SFP28
[V1] Vendor specific: 14.24.00.13
[YA] Asset tag: N/A
[V3] Vendor specific: 14.31.12.00
[V6] Vendor specific: 03.06.04.03
[RV] Reserved: checksum good, 0 byte(s) reserved
End
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 04, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration- 10BitTagReq- Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy- 10BitTagReq-
IOVSta: Migration-
Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 01
VF offset: 9, stride: 1, Device ID: 1016
Supported Page Size: 000007ff, System Page Size: 00000001
Region 0: Memory at 00000000f8000000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [230 v1] Access Control Services
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
and on the other node I have
Code:
51:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
Subsystem: Broadcom Inc. and subsidiaries BCM957414A4142CC 10Gb/25Gb Ethernet PCIe
Physical Slot: 2
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 18
NUMA node: 0
IOMMU group: 7
Region 0: Memory at 202fffe10000 (64-bit, prefetchable) [size=64K]
Region 2: Memory at 202fffd00000 (64-bit, prefetchable) [size=1M]
Region 4: Memory at 202fffe22000 (64-bit, prefetchable) [size=8K]
Expansion ROM at d0e80000 [disabled] [size=512K]
Capabilities: [48] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [50] Vital Product Data
Product Name: Broadcom P225p NetXtreme-E Dual-port 10Gb/25Gb Ethernet PCIe Adapter
Read-only fields:
[PN] Part number: BCM957414A4142CC
[MN] Manufacture ID: 14E4
[V0] Vendor specific: 228.1.111.0
[V1] Vendor specific: 228.0.128.0
[V3] Vendor specific: 228.0.116.0
[V6] Vendor specific: 228.0.128.0
[V7] Vendor specific: 0.0.0
[V8] Vendor specific: 228.0.116.0
[V9] Vendor specific: 0.0.0
[VA] Vendor specific: 228.0.116.0
[SN] Serial number: A414223020023HFG
[VB] Vendor specific: REV021DEV000
[RV] Reserved: checksum good, 161 byte(s) reserved
End
Capabilities: [a0] MSI-X: Enable+ Count=148 Masked-
Vector table: BAR=4 offset=00000000
PBA: BAR=4 offset=00000940
Capabilities: [ac] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75W
DevCtl: CorrErr- NonFatalErr- FatalErr+ UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr+ NoSnoop+ FLReset-
MaxPayload 512 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 260ms to 900ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
AtomicOpsCtl: ReqEn+
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 04000001 fd00000f 51020000 00000000
Capabilities: [13c v1] Device Serial Number 14-23-f2-ff-fe-58-41-f0
Capabilities: [150 v1] Power Budgeting <?>
Capabilities: [160 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [180 v1] Vendor Specific Information: ID=0000 Rev=0 Len=020 <?>
Capabilities: [1b0 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [1b8 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [230 v1] Transaction Processing Hints
Interrupt vector mode supported
Device specific mode supported
Steering table in MSI-X table
Capabilities: [300 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [200 v1] Precision Time Measurement
PTMCap: Requester:+ Responder:- Root:-
PTMClockGranularity: Unimplemented
PTMControl: Enabled:- RootSelected:-
PTMEffectiveGranularity: Unknown
Kernel driver in use: bnxt_en
Kernel modules: bnxt_en
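As a rough check on the PCIe side, the link rate implied by "Speed 8GT/s, Width x8" can be worked out by hand. The sketch below assumes PCIe 3.0 with its 128b/130b line encoding and ignores TLP/DLLP protocol overhead, so the real usable figure is somewhat lower. The function name `pcie_gen3_gbps` is made up for this example.

```python
def pcie_gen3_gbps(lanes: int, gt_per_s: float = 8.0) -> float:
    """Line rate in Gbit/s after 128b/130b encoding, before TLP/DLLP overhead."""
    return lanes * gt_per_s * 128 / 130


# LnkSta on both cards: Speed 8GT/s, Width x8.
slot_gbps = pcie_gen3_gbps(lanes=8)

# Worst case for a dual-port 25 GbE card: both ports at line rate in one direction.
needed_gbps = 2 * 25.0

print(f"PCIe 3.0 x8: {slot_gbps:.1f} Gbit/s per direction")
print(f"needed:      {needed_gbps:.1f} Gbit/s")
print("PCIe headroom OK" if slot_gbps > needed_gbps else "PCIe may be the bottleneck")
```

Under these assumptions the slot offers about 63 Gbit/s per direction, well above what a single 25 Gbit/s port needs.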
So I believe that
a) PCIe is fast enough,
b) the ring buffers are large enough, and
c) the 25 Gbit/s link speed is negotiated,
but iperf still falls well short of the expected throughput. Why is this?