Hi there,
I have three Proxmox nodes (Supermicro SYS-120C-TN10R) connected in a full mesh via Mellanox ConnectX-6 Dx 100GbE cards, cross-connected with MCP1600-C00AE30N 0.5 m QSFP28 100GbE DAC cables:
Bash:
# lspci -vv -s 98:00.0
98:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
Subsystem: Super Micro Computer Inc MT2892 Family [ConnectX-6 Dx]
Physical Slot: 0-2
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 18
NUMA node: 1
Region 0: Memory at 206ffc000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at dba00000 [disabled] [size=1M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 512 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s (ok), Width x16 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
AtomicOpsCtl: ReqEn+
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [48] Vital Product Data
Product Name: Supermicro Network Adapter
Read-only fields:
[PN] Part number: AOC-A100G-m2CM
[V0] Vendor specific: 22.31.1014
[V1] Vendor specific: 1.00
[SN] Serial number: OA221S052953
[VA] Vendor specific: 2
[V2] Vendor specific: 3CECEF5C7DB2
[V3] Vendor specific: 3CECEF5C7DB3
[V4] Vendor specific:
[V5] Vendor specific:
[RV] Reserved: checksum good, 0 byte(s) reserved
End
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
AERCap: First Error Pointer: 04, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
IOVSta: Migration-
Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 00
VF offset: 2, stride: 1, Device ID: 101e
Supported Page Size: 000007ff, System Page Size: 00000001
Region 0: Memory at 0000206ffe800000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [1c0 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [230 v1] Access Control Services
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Capabilities: [320 v1] Lane Margining at the Receiver <?>
Capabilities: [370 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [420 v1] Data Link Feature <?>
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
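As a sanity check on the slot itself: the LnkCap/LnkSta lines above show the card negotiated PCIe Gen4 x16 (16GT/s), roughly 252 Gbit/s usable, so the PCIe link is not a bottleneck for 100GbE. To re-check just those fields:
Bash:
# lspci -vv -s 98:00.0 | grep -E 'LnkCap:|LnkSta:'
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
LnkSta: Speed 16GT/s (ok), Width x16 (ok)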
I followed the guide Full Mesh Network for Ceph Server and in particular used Open vSwitch to configure the network. This is the configuration of node 2 (node 1 is 10.15.15.1, node 3 is 10.15.15.3):
Bash:
# cat /etc/network/interfaces
## ceph public network ##
##
auto enp152s0f0np0
iface enp152s0f0np0 inet manual
    ovs_type OVSPort
    ovs_bridge vmbr1
    ovs_mtu 9000
    ovs_options other_config:rstp-enable=true other_config:rstp-path-cost=150 other_config:rstp-port-admin-edge=false other_config:rstp-port-auto-edge=false other_config:rstp-port-mcheck=true vlan_mode=native-untagged

auto enp152s0f1np1
iface enp152s0f1np1 inet manual
    ovs_type OVSPort
    ovs_bridge vmbr1
    ovs_mtu 9000
    ovs_options other_config:rstp-enable=true other_config:rstp-path-cost=150 other_config:rstp-port-admin-edge=false other_config:rstp-port-auto-edge=false other_config:rstp-port-mcheck=true vlan_mode=native-untagged

auto vmbr1
iface vmbr1 inet static
    address 10.15.15.2/24
    ovs_type OVSBridge
    ovs_ports enp152s0f0np0 enp152s0f1np1
    ovs_mtu 9000
    up ovs-vsctl set Bridge ${IFACE} rstp_enable=true other_config:rstp-priority=32768 other_config:rstp-forward-delay=4 other_config:rstp-max-age=6
    post-up sleep 10
##
## ceph public network ##
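To verify that the bridge actually came up with RSTP enabled and the 9000-byte MTU applied, these checks should be enough (output varies per node; in a three-node ring, RSTP blocks exactly one link somewhere in the mesh to break the loop):
Bash:
# ovs-vsctl get bridge vmbr1 rstp_enable     ## should print: true
# ovs-appctl rstp/show vmbr1                 ## RSTP port roles and states
# ip link show vmbr1 | grep -o 'mtu [0-9]*'  ## should print: mtu 9000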
The card's speed is correctly recognized as 100000Mb/s on the Proxmox node:
Bash:
# ethtool enp152s0f0np0
Settings for enp152s0f0np0:
Supported ports: [ Backplane ]
Supported link modes: 1000baseT/Full
[omitted]
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: None RS BASER
Link partner advertised link modes: Not reported
Link partner advertised pause frame use: No
Link partner advertised auto-negotiation: Yes
Link partner advertised FEC modes: Not reported
Speed: 100000Mb/s
Duplex: Full
Auto-negotiation: on
Port: Direct Attach Copper
PHYAD: 0
Transceiver: internal
Supports Wake-on: g
Wake-on: g
Current message level: 0x00000004 (4)
link
Link detected: yes
If I run a speed test with iperf between two Proxmox nodes:
Server
Bash:
# iperf -s -p 9999
------------------------------------------------------------
Server listening on TCP port 9999
TCP window size: 128 KByte (default)
------------------------------------------------------------
Client
Bash:
# iperf -e -c 10.15.15.1 -P 4 -p 9999
[ 4] local 10.15.15.2%vmbr1 port 57286 connected with 10.15.15.1 port 9999 (MSS=8948) (ct=0.07 ms)
------------------------------------------------------------
Client connecting to 10.15.15.1, TCP port 9999 with pid 1931055 (4 flows)
Write buffer size: 128 KByte
TCP window size: 325 KByte (default)
------------------------------------------------------------
[ 5] local 10.15.15.2%vmbr1 port 57288 connected with 10.15.15.1 port 9999 (MSS=8948) (ct=0.08 ms)
[ 3] local 10.15.15.2%vmbr1 port 57284 connected with 10.15.15.1 port 9999 (MSS=8948) (ct=0.10 ms)
[ 6] local 10.15.15.2%vmbr1 port 57290 connected with 10.15.15.1 port 9999 (MSS=8948) (ct=0.06 ms)
[ ID] Interval Transfer Bandwidth Write/Err Rtry Cwnd/RTT NetPwr
[ 6] 0.0000-10.0001 sec 27.1 GBytes 23.3 Gbits/sec 222352/0 0 3163K/1077 us 2706020.04
[ 5] 0.0000-10.0000 sec 27.1 GBytes 23.3 Gbits/sec 222267/0 0 3180K/1001 us 2910365.82
[ 3] 0.0000-10.0001 sec 26.9 GBytes 23.1 Gbits/sec 220141/0 0 3224K/1248 us 2312000.94
[ 4] 0.0000-10.0000 sec 26.9 GBytes 23.1 Gbits/sec 220147/0 0 3259K/1162 us 2483206.39
[ ID] Interval Transfer Bandwidth
[SUM] 0.0000-10.0000 sec 108 GBytes 92.8 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) = 0.058/0.078/0.102/0.050 ms (tot/err) = 4/0
The speed of 92.8 Gbits/sec is very good.
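Note the (MSS=8948) in the connect lines above: that confirms the 9000-byte MTU is in effect between the nodes. A simple way to prove jumbo frames end-to-end from any endpoint (including, later, from inside a VM) is a don't-fragment ping with an 8972-byte payload, since 8972 bytes of data + 8 bytes of ICMP header + 20 bytes of IP header = 9000:
Bash:
# ping -M do -s 8972 -c 3 10.15.15.1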
I created three VMs, one per Proxmox node, and attached a second NIC on vmbr1 so they can reach the Ceph public network and mount the shared storage via CephFS:
Bash:
# cat /etc/pve/nodes/server01px/qemu-server/101.conf
agent: 1
boot: order=virtio0;ide2;net0
sockets: 2
cores: 10
memory: 81920
meta: creation-qemu=6.1.1,ctime=1648914236
name: docker101
net0: virtio=6E:CB:93:71:07:09,bridge=vmbr0,firewall=1
net1: virtio=DE:D9:D4:A8:4F:3A,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=e54cff81-eb71-4321-adfe-219de8e5f258
ide2: none,media=cdrom
virtio0: CEPH-NVME-3:vm-101-disk-0,cache=writeback,size=20G
virtio1: CEPH-NVME-3:vm-101-disk-1,cache=writeback,size=20G
virtio2: CEPH-NVME-3:vm-101-disk-2,cache=writeback,size=40G
vmgenid: b4b6bb91-931a-41cd-bbb8-57bd6cd92f07
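For completeness: net1 here is a plain virtio NIC, i.e. a single queue and the default 1500-byte MTU, while the mesh underneath runs at MTU 9000. Proxmox can set both per NIC; a sketch only (queues=10 just mirrors the 10 cores, and the MAC is the one from the config above):
Bash:
# qm set 101 --net1 virtio=DE:D9:D4:A8:4F:3A,bridge=vmbr1,mtu=9000,queues=10
Depending on the guest OS, the MTU may still need to be raised inside the VM as well.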
If I run the same iperf test between the VMs I get this result:
Server
Bash:
# iperf -s -p 9999
------------------------------------------------------------
Server listening on TCP port 9999
TCP window size: 128 KByte (default)
------------------------------------------------------------
Client
Bash:
# iperf -e -c 10.15.15.101 -P 5 -p 9999
------------------------------------------------------------
Client connecting to 10.15.15.101, TCP port 9999 with pid 3261831
Write buffer size: 128 KByte
TCP window size: 4.00 MByte (default)
------------------------------------------------------------
[ 4] local 10.15.15.102 port 42560 connected with 10.15.15.101 port 9999 (ct=1.02 ms)
[ 5] local 10.15.15.102 port 42558 connected with 10.15.15.101 port 9999 (ct=1.09 ms)
[ 3] local 10.15.15.102 port 42556 connected with 10.15.15.101 port 9999 (ct=1.15 ms)
[ 7] local 10.15.15.102 port 42564 connected with 10.15.15.101 port 9999 (ct=1.50 ms)
[ 6] local 10.15.15.102 port 42562 connected with 10.15.15.101 port 9999 (ct=0.91 ms)
[ ID] Interval Transfer Bandwidth Write/Err Rtry Cwnd/RTT NetPwr
[ 4] 0.0000-10.0008 sec 2.12 GBytes 1.82 Gbits/sec 17344/0 0 -1K/1160 us 195959.07
[ 5] 0.0000-10.0053 sec 2.21 GBytes 1.90 Gbits/sec 18121/0 0 -1K/823 us 288445.02
[ 7] 0.0000-10.0018 sec 2.21 GBytes 1.90 Gbits/sec 18078/0 0 -1K/382 us 620183.51
[ 6] 0.0000-10.0048 sec 2.22 GBytes 1.90 Gbits/sec 18148/0 0 -1K/340 us 699280.42
[ 3] 0.0000-10.0059 sec 2.15 GBytes 1.84 Gbits/sec 17572/0 0 -1K/452 us 509254.25
[SUM] 0.0000-10.0059 sec 10.9 GBytes 9.35 Gbits/sec 89263/0 0
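Compared with the host-to-host run, the connect times here are roughly 1 ms instead of ~0.08 ms, and no MSS is shown. Inside the guests I would double-check the virtio NIC like this (assuming the vmbr1 NIC appears as eth1 in the guest, which is a guess):
Bash:
# ip link show eth1 | grep -o 'mtu [0-9]*'   ## 1500 here would mean no jumbo frames in the guest
# ethtool -l eth1                            ## "Combined" shows the virtio queue count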
Why is the total resulting speed only 9.35 Gbits/sec?