We have 3 nodes (Proxmox 6.4-13, the latest version) with Mellanox dual-port ConnectX-6 100G cards, connected as a mesh network in Ethernet mode with RoCEv2, driver OFED 5.4-1.0.3. The cards sit in PCIe Gen 3.0 x16 slots (8 GT/s per lane, roughly 15.75 GB/s, i.e. about 126 Gbit/s per direction), so the slot should not be the bottleneck for a single 100G port. MTU is configured to 9000, so I would expect noticeably more throughput.
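For reference, this is roughly how I check the link and the interface on each node (a minimal sketch; the interface name enp59s0f0 is only an example, the PCI address 3b:00.0 matches the lspci output below):

Code:
# negotiated PCIe speed/width of the NIC
lspci -vv -s 3b:00.0 | grep -E 'LnkCap:|LnkSta:'

# MTU and link state of the 100G interface (interface name is an example)
ip -d link show dev enp59s0f0

# link speed reported by the driver
ethtool enp59s0f0 | grep Speed

# NUMA node the card hangs off (iperf3 should run on the same node)
cat /sys/class/net/enp59s0f0/device/numa_node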
Here is the lspci output for one of the cards:
Code:
3b:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
Subsystem: Mellanox Technologies MT28908 Family [ConnectX-6]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 922
NUMA node: 0
Region 0: Memory at ae000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at ab000000 [disabled] [size=1M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed unknown, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 <4us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [48] Vital Product Data
Product Name: ConnectX-6 VPI adapter card, 100Gb/s (HDR100, EDR IB and 100GbE), dual-port QSFP56
Read-only fields:
[PN] Part number: MCX653106A-ECAT
[EC] Engineering changes: AD
[V2] Vendor specific: MCX653106A-ECAT
[SN] Serial number: xxxxxxxxxxxx
[V3] Vendor specific: d8bf42051f87eb118000b8cef65d458e
[VA] Vendor specific: MLX:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0:MODL=CX653106A
[V0] Vendor specific: PCIeGen4 x16
[RV] Reserved: checksum good, 1 byte(s) reserved
End
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
AERCap: First Error Pointer: 04, GenCap+ CGenEn+ ChkCap+ ChkEn+
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [1c0 v1] #19
Capabilities: [230 v1] Access Control Services
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Capabilities: [320 v1] #27
Capabilities: [370 v1] #26
Capabilities: [420 v1] #25
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
If I start an iperf3 measurement (a single TCP stream), I can only reach ~30 Gbit/s within the cluster, regardless of which node acts as the server:
Code:
Connecting to host 10.10.10.3, port 5201
[ 5] local 10.10.10.1 port 59730 connected to 10.10.10.3 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  3.78 GBytes  32.4 Gbits/sec    0   1.80 MBytes
[  5]   1.00-2.00   sec  4.67 GBytes  40.1 Gbits/sec    0   2.74 MBytes
[  5]   2.00-3.00   sec  3.27 GBytes  28.1 Gbits/sec    0   2.74 MBytes
[  5]   3.00-4.00   sec  3.61 GBytes  31.0 Gbits/sec    0   2.74 MBytes
[  5]   4.00-5.00   sec  3.55 GBytes  30.5 Gbits/sec    0   2.74 MBytes
[  5]   5.00-6.00   sec  4.81 GBytes  41.3 Gbits/sec    0   2.74 MBytes
[  5]   6.00-7.00   sec  4.72 GBytes  40.5 Gbits/sec    0   2.74 MBytes
[  5]   7.00-8.00   sec  3.92 GBytes  33.7 Gbits/sec    0   3.04 MBytes
[  5]   8.00-9.00   sec  3.53 GBytes  30.3 Gbits/sec    0   3.04 MBytes
[  5]   9.00-10.00  sec  3.52 GBytes  30.2 Gbits/sec    0   3.04 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  39.4 GBytes  33.8 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  39.4 GBytes  33.8 Gbits/sec                  receiver
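A quick way to see whether one TCP stream (and with it one CPU core) is the limit, rather than the card or the mesh, is a parallel-stream run. The commands below are a minimal sketch using standard iperf3 options; 8 streams is just an example value:

Code:
# on the receiving node
iperf3 -s

# on the sending node: 8 parallel TCP streams for 10 seconds
iperf3 -c 10.10.10.3 -P 8 -t 10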
So what can I do to increase throughput for the Ceph cluster network?