Benchmark: 3 nodes, each AMD EPYC 7742 64-core, 512 GB RAM, 3x 6.4 TB Micron 9300 MAX NVMe

Why is there less network traffic when doing sequential reads compared to the Ceph read bandwidth?
You are conducting the benchmark on the Ceph OSD node itself. Some of those reads will be local and will not traverse the network.
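To measure the pure network path, run the benchmark from a machine that hosts no OSDs. A rough sketch with rados bench (the pool name is just an example):
Code:
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup   # write test objects first
rados bench -p testpool 60 seq -t 16                        # sequential read benchmark
rados -p testpool cleanup                                   # remove the benchmark objects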
 
So I purged the complete Ceph installation and recreated it with 1 OSD per NVMe, with encrypted OSDs.
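For reference, a sketch of how such an OSD can be created on the CLI (device name is an example; assuming the --encrypted flag of the installed pveceph version):
Code:
pveceph osd create /dev/nvme0n1 --encrypted 1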

The first rados bench run did not show high write performance - after the reboot, the "cpupower frequency-set -g performance" setting was missing. I ran that at 16:40.
[Attached screenshot: 1602691730578.png]
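So the governor does not get lost again on the next reboot, a small systemd unit could do it (unit name and path are my choice, not tested here):
Code:
cat > /etc/systemd/system/cpupower-performance.service <<'EOF'
[Unit]
Description=Set CPU frequency governor to performance

[Service]
Type=oneshot
ExecStart=/usr/bin/cpupower frequency-set -g performance

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now cpupower-performance.service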

The write performance is not as good as with the 4 unencrypted OSDs per NVMe, but the read performance is OK. CPU usage is higher.

Next test is without encryption on the OSD, using sedutil instead. We only need at-rest ("cold") encryption - e.g. for when we have to replace a broken drive. No encryption on the OS disks, only for the NVMes...
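The sedutil side would look roughly like this (assuming OPAL-capable drives; password and device are placeholders):
Code:
sedutil-cli --scan                                  # list drives and their OPAL support
sedutil-cli --initialsetup <password> /dev/nvme0    # take ownership and enable locking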
 
So here we have one unencrypted OSD per NVMe....
[Attached screenshot: 1602748965690.png]

The result is not as good as with 4 OSDs per NVMe. It seems the OSD is CPU-bound.

[Attached screenshot: 1602749036498.png]

Might be a good idea to repeat the test and use taskset to pin the relevant processes to a set of CPU cores, to identify the one CPU-hungry process.

In other news:
- Since we have Mellanox network cards and a Mellanox switch in place, should I try out RDMA (https://community.mellanox.com/s/article/bring-up-ceph-rdma---developer-s-guide)? Is that supported now? Is it stable enough for production use?
- And why do I miss out on that paying customer badge???
 
- And why do I miss out on that paying customer badge???
The key is missing in your account details.

- Since we have Mellanox network cards and a Mellanox switch in place, should I try out RDMA (https://community.mellanox.com/s/article/bring-up-ceph-rdma---developer-s-guide)? Is that supported now? Is it stable enough for production use?
Ceph is built with upstream defaults. You may try to use RDMA, but then we will not be able to provide enterprise support to you.
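For orientation only: the Mellanox guide boils down to a ceph.conf change along these lines (untested and unsupported here; the device name is an example):
Code:
[global]
ms_type = async+rdma
ms_async_rdma_device_name = mlx5_0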

Might be a good idea to repeat the test and use taskset to pin the relevant processes to a set of CPU cores, to identify the one CPU-hungry process.
You can also have a look with atop; it may give you greater detail than htop.
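A sketch of the pinning idea (OSD id and core range are purely illustrative):
Code:
# pin the ceph-osd process of OSD 0 to cores 0-7
taskset -cp 0-7 $(pgrep -f 'ceph-osd .*--id 0 ')
# then watch per-process CPU usage while the bench runs
atop 2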

The result is not as good as with 4 OSDs per NVMe. It seems the OSD is CPU-bound.
The CPU architecture probably comes into play here. Since our EPYC only has 16 cores across 8 CCDs, a difference was not visible.
 
First test is to see if the network configuration is in order. A 100 GBit link is 4x 25 GBit lanes combined, so at least 4 parallel streams are needed to test for the maximum.
Code:
root@proxmox04:~# iperf -c 10.33.0.15 -P 4
------------------------------------------------------------
Client connecting to 10.33.0.15, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  5] local 10.33.0.14 port 49772 connected with 10.33.0.15 port 5001
[  4] local 10.33.0.14 port 49770 connected with 10.33.0.15 port 5001
[  3] local 10.33.0.14 port 49768 connected with 10.33.0.15 port 5001
[  6] local 10.33.0.14 port 49774 connected with 10.33.0.15 port 5001
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  29.0 GBytes  24.9 Gbits/sec
[  4]  0.0-10.0 sec  28.9 GBytes  24.8 Gbits/sec
[  3]  0.0-10.0 sec  28.1 GBytes  24.1 Gbits/sec
[  6]  0.0-10.0 sec  28.9 GBytes  24.9 Gbits/sec
[SUM]  0.0-10.0 sec   115 GBytes  98.7 Gbits/sec
root@proxmox04:~#
Hi, I have nearly the same hardware.

Switch SN2100:
Code:
Date and Time:        2020/10/15 16:00:25
Hostname:        switch-a492f4
Uptime:        54m 24s
Software Version:        X86_64 3.9.0300 2020-02-26 19:25:24 x86_64
Model:        x86onie
Host ID:        0C42A1A492F4
System memory:        2193 MB used / 5610 MB free / 7803 MB total
CPU load averages:        3.02 / 3.06 / 3.01
System UUID        97828b5e-d1f2-11ea-8000-1c34daef9500
Performance is not as good as yours :(
Code:
Client connecting to 10.101.200.131, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  4] local 10.101.200.132 port 54998 connected with 10.101.200.131 port 5001
[  6] local 10.101.200.132 port 55002 connected with 10.101.200.131 port 5001
[  3] local 10.101.200.132 port 54996 connected with 10.101.200.131 port 5001
[  5] local 10.101.200.132 port 55000 connected with 10.101.200.131 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  19.9 GBytes  17.1 Gbits/sec
[  6]  0.0-10.0 sec  14.7 GBytes  12.6 Gbits/sec
[  3]  0.0-10.0 sec  18.9 GBytes  16.3 Gbits/sec
[  5]  0.0-10.0 sec  14.4 GBytes  12.4 Gbits/sec
[SUM]  0.0-10.0 sec  68.0 GBytes  58.4 Gbits/sec
root@pve02:~#

How did you install the latest ConnectX-5 firmware?
Do you use the Mellanox OFED drivers? I just downloaded them but got this:

Code:
root@pve01:/usr/local/src/MLNX_OFED_LINUX-5.1-2.3.7.1-debian10.0-x86_64# ls -latrh
total 192K
drwxr-xr-x 2 root root    3 Sep 15 15:08 src
-rwxr-xr-x 1 root root  12K Sep 15 15:08 uninstall.sh
-rwxr-xr-x 1 root root 158K Sep 15 15:08 mlnxofedinstall
-rwxr-xr-x 1 root root  26K Sep 15 15:08 mlnx_add_kernel_support.sh
-rw-r--r-- 1 root root   12 Sep 15 15:08 .mlnx
-rw-r--r-- 1 root root  956 Sep 15 15:08 LICENSE
drwxr-xr-x 8 root root   12 Sep 15 15:08 docs
-rw-r--r-- 1 root root   11 Sep 15 15:08 distro
-rwxr-xr-x 1 root root  25K Sep 15 15:08 create_mlnx_ofed_installers.pl
-rwxr-xr-x 1 root root 8.1K Sep 15 15:08 common.pl
-rwxr-xr-x 1 root root 2.9K Sep 15 15:08 common_installers.pl
-rw-r--r-- 1 root root    7 Sep 15 15:08 .arch
-rw-r--r-- 1 root root 1.8K Sep 15 15:09 RPM-GPG-KEY-Mellanox
drwxr-xr-x 5 root root   16 Sep 15 15:09 .
drwxr-xr-x 2 root root  144 Sep 15 15:10 DEBS
drwxr-xr-x 3 root root    4 Oct 15 15:38 ..
root@pve01:/usr/local/src/MLNX_OFED_LINUX-5.1-2.3.7.1-debian10.0-x86_64# ./mlnxofedinstall
Error: The current MLNX_OFED_LINUX is intended for debian10.0

Would be nice to get some tips from your side.
 
@Gerhard W. Recher , I documented that in German....
Code:
- Install the Mellanox MFT tools and let them compile the driver
- Query and document the old version (mlxburn -query -d /dev/mst/mt<TAB><TAB>)
- Flash the new version (https://www.mellanox.com/support/firmware/connectx5en - OPN: MCX516A-CCAT <-- found via the PSID...)
- A server reboot is needed to upgrade the firmware cleanly!!!

Old version

Code:
root@proxmox06:~# mlxburn -d /dev/mst/mt4119_pciconf0 -query
-I- Image type:            FS4
-I- FW Version:            16.26.1040
-I- FW Release Date:       26.9.2019
-I- Product Version:       16.26.1040
-I- Rom Info:              type=UEFI version=14.19.14 cpu=AMD64
-I-                        type=PXE version=3.5.803 cpu=AMD64
-I- Description:           UID                GuidsNumber
-I- Base GUID:             0c42a103002b02c4        8
-I- Base MAC:              0c42a12b02c4            8
-I- Image VSD:             N/A
-I- Device VSD:            N/A
-I- PSID:                  MT_0000000012
-I- Security Attributes:   N/A
root@proxmox06:~#

New version

Code:
root@proxmox06:~# flint -i fw-ConnectX5-rel-16_28_2006-MCX516A-CCA_Ax-UEFI-14.21.17-FlexBoot-3.6.102.bin -d /dev/mst/mt4119_pciconf0 burn

    Current FW version on flash:  16.26.1040
    New FW version:               16.28.2006

Initializing image partition -   OK
Writing Boot image component -   OK
-I- To load new FW run mlxfwreset or reboot machine.
root@proxmox06:~# flint -d /dev/mst/mt4119_pciconf0 query
Image type:            FS4
FW Version:            16.28.2006
FW Version(Running):   16.26.1040
FW Release Date:       15.9.2020
Product Version:       16.26.1040
Rom Info:              type=UEFI version=14.19.14 cpu=AMD64
                       type=PXE version=3.5.803 cpu=AMD64
Description:           UID                GuidsNumber
Base GUID:             0c42a103002b02c4        8
Base MAC:              0c42a12b02c4            8
Image VSD:             N/A
Device VSD:            N/A
PSID:                  MT_0000000012
Security Attributes:   N/A
root@proxmox06:~#

Switch details (HPE SN2100M):
Code:
Date and Time:        2020/10/15 16:12:14
Hostname:             dc1-switch07
Uptime:             66d 16h 29m 47s
Software Version:     X86_64 3.9.0612 2020-05-08 10:52:22 x86_64
Model:                 x86onie
Host ID:             98039BAE4C68
System memory:         2647 MB used / 5156 MB free / 7803 MB total
CPU load averages:     0.04 / 0.05 / 0.00
System UUID         355b9ae4-4051-11e9-8000-b8599f5c0f80

The connection between switch and NICs is done with DAC cables from FS.com.

I did not install the Mellanox drivers, I am just using what comes with Proxmox. Any recommendations on this, @Alwin ?
 
Could you please elaborate? I suppose it is not the one in your signature.
yep, signature is another cluster...

this one:

Code:
3 nodes, PVE 6.2-1 ISO install with all patches applied

Supermicro 2113S-WN24RT
AMD EPYC 7502P 2.5 GHz 32c/64t
512 GB DDR4-3200 CL22 memory
2x Samsung PM981 NVMe M.2 (RAID-1) system on ZFS
4x Samsung PM1725b 3.2 TB NVMe 2.5" U.2
dual-port Broadcom BCM57416 10GBase-T
dual-port Mellanox ConnectX-5 QSFP28
100 GbE switch: Mellanox SN2100-CB2F
10 GbE switch: D-Link DXS-1210-16TC
 
I just found a way to accomplish the firmware update without messing with drivers that don't come from the Proxmox repo!

This is much more straightforward :)

Code:
wget -qO - http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox | apt-key add -

# download the package from Mellanox and unpack it under /usr/local/src:
# http://content.mellanox.com/ofed/MLNX_OFED-5.1-2.3.7.1/MLNX_OFED_LINUX-5.1-2.3.7.1-debian10.0-x86_64.tgz

# /etc/apt/sources.list.d/mlnx_ofed.list:
deb file:/usr/local/src/MLNX_OFED_LINUX-5.1-2.3.7.1-debian10.0-x86_64/DEBS ./

apt-get update
apt-get install mlnx-fw-updater   # <-- this will run the firmware update instantly!

Initializing...
Attempting to perform Firmware update...
Querying Mellanox devices firmware ...

Device #1:
----------

  Device Type:      ConnectX5
  Part Number:      MCX516A-CCA_Ax
  Description:      ConnectX-5 EN network interface card; 100GbE dual-port QSFP28; PCIe3.0 x16; tall bracket; ROHS R6
  PSID:             MT_0000000012
  PCI Device Name:  45:00.0
  Base GUID:        506b4b0300f37ea0
  Base MAC:         506b4bf37ea0
  Versions:         Current        Available
     FW             16.22.1002     16.28.2006
     PXE            3.5.0403       3.6.0102
     UEFI           14.15.0019     14.21.0017

  Status:           Update required

---------
Found 1 device(s) requiring firmware update...

Device #1: Updating FW 

Restart needed for updates to take effect.
Log File: /tmp/mlnx_fw_update.log
 
Is NUMA enabled in the BIOS? Certain BIOS settings will alter the node count.


Since the Supermicro system uses riser cards, is everything seated/configured correctly?
NUMA is on,
the riser card is properly connected....
Code:
lscpu 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       43 bits physical, 48 bits virtual
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7502P 32-Core Processor
Stepping:            0
CPU MHz:             1733.730
CPU max MHz:         2500.0000
CPU min MHz:         1500.0000
BogoMIPS:            5000.28
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-63
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
 
After the firmware update of the Mellanox cards, still not near 100 Gbit/s :(


Code:
iperf -c 10.101.200.131 -P 4 -e
------------------------------------------------------------
Client connecting to 10.101.200.131, TCP port 5001 with pid 5645
Write buffer size:  128 KByte
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 10.101.200.132 port 48008 connected with 10.101.200.131 port 5001
[  4] local 10.101.200.132 port 48010 connected with 10.101.200.131 port 5001
[  5] local 10.101.200.132 port 48012 connected with 10.101.200.131 port 5001
[  6] local 10.101.200.132 port 48014 connected with 10.101.200.131 port 5001
[ ID] Interval        Transfer    Bandwidth       Write/Err  Rtry    Cwnd/RTT
[  3] 0.00-10.00 sec  18.5 GBytes  15.9 Gbits/sec  151493/0      45533     1314K/327 us
[  4] 0.00-10.00 sec  18.1 GBytes  15.6 Gbits/sec  148536/0      47807      203K/84 us
[  5] 0.00-10.00 sec  17.7 GBytes  15.2 Gbits/sec  144772/0      38680      246K/123 us
[  6] 0.00-10.00 sec  17.8 GBytes  15.3 Gbits/sec  145577/0      38102      223K/84 us
[SUM] 0.00-10.00 sec  72.1 GBytes  61.9 Gbits/sec  590378/0    170122
 
My fault ... found it: the MTU was 1512; set it to 9000.

Code:
iperf -c 10.101.200.131 -P 4 -e
------------------------------------------------------------
Client connecting to 10.101.200.131, TCP port 5001 with pid 18556
Write buffer size:  128 KByte
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local 10.101.200.132 port 48054 connected with 10.101.200.131 port 5001
[  4] local 10.101.200.132 port 48056 connected with 10.101.200.131 port 5001
[  5] local 10.101.200.132 port 48058 connected with 10.101.200.131 port 5001
[  6] local 10.101.200.132 port 48060 connected with 10.101.200.131 port 5001
[ ID] Interval        Transfer    Bandwidth       Write/Err  Rtry    Cwnd/RTT
[  3] 0.00-10.00 sec  20.4 GBytes  17.5 Gbits/sec  166775/0      13650      428K/155 us
[  4] 0.00-10.00 sec  22.7 GBytes  19.5 Gbits/sec  186223/0      14580      455K/190 us
[  5] 0.00-10.00 sec  22.2 GBytes  19.1 Gbits/sec  182100/0      15059      376K/194 us
[  6] 0.00-10.00 sec  21.4 GBytes  18.4 Gbits/sec  175001/0      14000      323K/136 us
[SUM] 0.00-10.00 sec  86.7 GBytes  74.5 Gbits/sec  710099/0     57289
root@pve02:~#
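To keep the MTU across reboots on Proxmox/Debian, something like this (interface name is an example; stanza abbreviated):
Code:
ip link set enp69s0f0 mtu 9000      # immediate change
# persistent: add to the interface stanza in /etc/network/interfaces
iface enp69s0f0 inet manual
    mtu 9000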
 
74.5 Gbits/sec... - let's see your /proc/cmdline!
Check if "amd_iommu=on iommu=pt pcie_aspm=off" is already active!
Nope, it is not ... how do I fix this?

Code:
BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-5.4.65-1-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
 
Looks like you are booting using grub...
-> Adjust GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
-> Run "update-grub"
-> Reboot
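A sketch of the change (appending the suggested parameters to the existing defaults):
Code:
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pcie_aspm=off"

update-grub
reboot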
 
This was a match winner :) Thanks for your responses!

Code:
iperf -c 10.101.200.131 -P 4 -e
------------------------------------------------------------
Client connecting to 10.101.200.131, TCP port 5001 with pid 3252
Write buffer size:  128 KByte
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  6] local 10.101.200.132 port 48396 connected with 10.101.200.131 port 5001
[  4] local 10.101.200.132 port 48392 connected with 10.101.200.131 port 5001
[  3] local 10.101.200.132 port 48390 connected with 10.101.200.131 port 5001
[  5] local 10.101.200.132 port 48394 connected with 10.101.200.131 port 5001
[ ID] Interval        Transfer    Bandwidth       Write/Err  Rtry    Cwnd/RTT
[  6] 0.00-10.00 sec  21.1 GBytes  18.1 Gbits/sec  172987/0        569     1960K/352 us
[  4] 0.00-10.00 sec  38.1 GBytes  32.7 Gbits/sec  312323/0        751     1951K/268 us
[  3] 0.00-10.00 sec  17.8 GBytes  15.3 Gbits/sec  146109/0        671     1137K/306 us
[  5] 0.00-10.00 sec  38.2 GBytes  32.8 Gbits/sec  312783/0        837     1846K/155 us
[SUM] 0.00-10.00 sec   115 GBytes  99.0 Gbits/sec  944202/0      2828
root@pve02:~#

TCP settings:

Code:
#1. Disable the TCP timestamps option for better CPU utilization:
net.ipv4.tcp_timestamps=0
#2. Enable the TCP selective acks option for better throughput:
net.ipv4.tcp_sack=1
#3. Increase the maximum length of processor input queues:
net.core.netdev_max_backlog=250000
#4. Increase the TCP maximum and default buffer sizes using setsockopt():
net.core.rmem_max=4194304
net.core.wmem_max=4194304
net.core.rmem_default=4194304
net.core.wmem_default=4194304
net.core.optmem_max=4194304
#5. Increase memory thresholds to prevent packet dropping:
net.ipv4.tcp_rmem="4096 87380 4194304"
net.ipv4.tcp_wmem="4096 65536 4194304"
#6. Enable low latency mode for TCP:
net.ipv4.tcp_low_latency=1
#The following variable is used to tell the kernel how much of the socket buffer space should be used for TCP window size, and how much to save for an application buffer.
net.ipv4.tcp_adv_win_scale=1
#A value of 1 means the socket buffer will be divided evenly between TCP windows size and application.

and:
Code:
ethtool -G enp69s0f0 rx 8192 tx 8192
ethtool -G enp69s0f1 rx 8192 tx 8192
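Note that ring sizes set with ethtool -G do not survive a reboot; one way to persist them with ifupdown (a sketch, stanza abbreviated):
Code:
# /etc/network/interfaces
iface enp69s0f0 inet manual
    post-up /sbin/ethtool -G enp69s0f0 rx 8192 tx 8192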
 
@Gerhard W. Recher , I tried your sysctl tuning settings yesterday and today, but they performed worse than the ones I initially got from https://fasterdata.es.net/host-tuning/linux/ .

My current and final sysctl network tuning:
Code:
# https://fasterdata.es.net/host-tuning/linux/100g-tuning/
# allow testing with buffers up to 512MB
net.core.rmem_max = 536870912
net.core.wmem_max = 536870912
# increase Linux autotuning TCP buffer limit to 256MB
net.ipv4.tcp_rmem = 4096 87380 268435456
net.ipv4.tcp_wmem = 4096 65536 268435456
# recommended default congestion control is htcp
#net.ipv4.tcp_congestion_control=htcp
# recommended for hosts with jumbo frames enabled
net.ipv4.tcp_mtu_probing=1
# recommended for CentOS7+/Debian8+ hosts
#net.core.default_qdisc = fq

The rest is default.
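The settings live in a drop-in file under /etc/sysctl.d/ (filename is my choice) and are reloaded with:
Code:
sysctl --system    # reloads /etc/sysctl.d/*.conf and /etc/sysctl.conf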
 
So here is the IOPS test with 4K blocks
[Attached screenshot: 1602888423045.png]
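For reference, the kind of command used for such a run (pool name and thread count are examples):
Code:
rados bench -p testpool 60 write -b 4096 -t 16 --no-cleanup   # 4K writes
rados bench -p testpool 60 rand -t 16                         # 4K random reads
rados -p testpool cleanup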

So I believe there is nothing left to change in the configuration that would further improve performance.

Next up: tests from within some VMs.
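They will probably look something like this fio run (paths and sizes are examples, not final):
Code:
fio --name=randwrite --filename=/root/fio-testfile --size=4G \
    --bs=4k --rw=randwrite --ioengine=libaio --iodepth=32 \
    --numjobs=4 --direct=1 --runtime=60 --time_based --group_reporting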
 