Benchmark: 3-node AMD EPYC 7742 64-Core, 512 GB RAM, 3x3 6.4 TB Micron 9300 MAX NVMe

Rainerle

Hi everybody,

we are currently in the process of replacing our VMware ESXi NFS NetApp setup with a Proxmox Ceph configuration.
We purchased 8 nodes with the following configuration:
- ThomasKrenn 1HE AMD Single-CPU RA1112
- AMD EPYC 7742 (2.25 GHz, 64-Core, 256 MB)
- 512 GB RAM
- 2x 240 GB SATA III Intel SSD (ZFS mirror boot/OS disk)
- 3x 6.4 TB NVMe Micron 9300 MAX (Ceph OSD)
- Dual-port 1 GBit/s Intel i350-AM2 (onboard, two Corosync networks)
- Dual-port 100 GBit/s Mellanox ConnectX-5 QSFP28 (Ceph Cluster, Ceph Public, Proxmox Migration and Proxmox Access networks)

The eight nodes are connected via two 100G HPE SN2100M Mellanox switches in an MLAG configuration (active-passive) with jumbo frames, plus two Huawei CE6810-32T16S4Q-LI switches.
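The node side of that MLAG setup is essentially an active-backup bond over the two ConnectX-5 ports. A rough /etc/network/interfaces sketch is below; the interface names, MTU and netmask are placeholders rather than our exact configuration.
Code:
# /etc/network/interfaces (excerpt) - illustrative sketch only
auto enp65s0f0
iface enp65s0f0 inet manual
        mtu 9000

auto enp65s0f1
iface enp65s0f1 inet manual
        mtu 9000

# active-passive bond spanning the two SN2100M switches
auto bond0
iface bond0 inet static
        address 10.33.0.14/24
        bond-slaves enp65s0f0 enp65s0f1
        bond-mode active-backup
        bond-miimon 100
        mtu 9000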

A three-node cluster is planned for test, development and infrastructure KVM VMs; a five-node cluster is planned for production KVM VMs.

All systems are already monitored using a Zabbix installation. Switch, network card and other firmware as well as Proxmox are at the latest versions.

We are currently running benchmarks and comparing our results to the 2020 benchmark on the three node cluster.

This thread is our log book of the problems and results we encounter.
 

Attachments

  • hw-info.txt
The first test is to check that the network configuration is in order. A 100 GBit link is four 25 GBit lanes combined, so at least four parallel streams are required to test for the maximum.
Code:
root@proxmox04:~# iperf -c 10.33.0.15 -P 4
------------------------------------------------------------
Client connecting to 10.33.0.15, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  5] local 10.33.0.14 port 49772 connected with 10.33.0.15 port 5001
[  4] local 10.33.0.14 port 49770 connected with 10.33.0.15 port 5001
[  3] local 10.33.0.14 port 49768 connected with 10.33.0.15 port 5001
[  6] local 10.33.0.14 port 49774 connected with 10.33.0.15 port 5001
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  29.0 GBytes  24.9 Gbits/sec
[  4]  0.0-10.0 sec  28.9 GBytes  24.8 Gbits/sec
[  3]  0.0-10.0 sec  28.1 GBytes  24.1 Gbits/sec
[  6]  0.0-10.0 sec  28.9 GBytes  24.9 Gbits/sec
[SUM]  0.0-10.0 sec   115 GBytes  98.7 Gbits/sec
root@proxmox04:~#
 
Adjusted the network settings according to the AMD Network Tuning Guide.
We did not use NUMA adjustments as we only have single socket CPUs.

[Attached Zabbix screenshot: network throughput on the three nodes]

This is a Zabbix view across the three nodes. Details for the time ranges:
- From 07:00 - 10:25: Performance test running over the weekend. proxmox04 performs better than the other two nodes. It took some time to identify that error: the other two nodes used systemd-boot, while proxmox04 booted via GRUB. We had only configured "amd_iommu=on iommu=pt pcie_aspm=off" on the first node...
- It took some time to switch proxmox04 to systemd-boot as well and then to configure /etc/kernel/cmdline properly (see the command sketch after this list).
- From 12:25 - 13:30: Ran the same iperf test again over the lunch break. Throughput is much better now and it still looks stable.
- From 14:40 to 14:45: Adjusted the TX and RX ring sizes to the hardware limits and ran a test.
- Took some time to work the additional settings into our configuration (we use Ansible here...).
- From 15:45 to 16:15: Ran a test where 04 connects to 05, 05 connects to 06 and 06 connects to 04.
- Disabled C-states and cpuidle.
- From 17:00: Started the overnight test.
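For reference, the boot loader and NIC tuning steps above boiled down to commands along these lines. The root= part of the command line, the interface name and the ring sizes are placeholders, and on older Proxmox VE releases the refresh tool is called pve-efiboot-tool instead of proxmox-boot-tool:
Code:
# systemd-boot reads the kernel command line from /etc/kernel/cmdline (a single line)
root@proxmox04:~# cat /etc/kernel/cmdline
root=ZFS=rpool/ROOT/pve-1 boot=zfs amd_iommu=on iommu=pt pcie_aspm=off
# write the updated command line to the ESPs
root@proxmox04:~# proxmox-boot-tool refresh

# check the hardware maximum and raise the RX/TX ring sizes accordingly
root@proxmox04:~# ethtool -g enp65s0f0
root@proxmox04:~# ethtool -G enp65s0f0 rx 8192 tx 8192

# disable all cpuidle states with a wakeup latency above 0 (i.e. the deeper C-states)
root@proxmox04:~# cpupower idle-set -D 0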

Any feedback on further tuning measures is welcome!

To be continued...
 
Result from the overnight run:
[Attached Zabbix screenshot: overnight iperf run]

proxmox05 seems flaky and proxmox06 seems to use less CPU than the rest. We need to check for configuration mismatches; it seems the nodes still differ in their configuration.
 
We did not use NUMA adjustments as we only have single socket CPUs.
Depending on the CPU settings you change, the NUMA layout is adjusted as well. You might see the CPU exposing NUMA to the system.

root@proxmox04:~# iperf -c 10.33.0.15 -P 4
I also recommend qperf for testing. It will also display the latency of the network.
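For example (server address taken from the iperf run above, the tests are just the basic bandwidth and latency probes):
Code:
# on the server side, simply start qperf without arguments
root@proxmox05:~# qperf

# on the client, measure TCP bandwidth and latency in one run
root@proxmox04:~# qperf 10.33.0.15 tcp_bw tcp_lat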

We are currently running benchmarks and comparing our results to the 2020 benchmark on the three node cluster.
And would you please be so kind as to cross-reference this thread in the forum thread of the benchmark paper?
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2020-09-hyper-converged-with-nvme.76516/
 
It took some time, but we found the difference between the hosts: on the first two nodes, some /etc/sysctl.d/ parameter files had been left behind from previous tests...
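A quick way to spot such leftovers is to dump the effective sysctl values on every node and diff them; a rough sketch (run from any machine with SSH access to the nodes):
Code:
# collect the effective sysctl values and the /etc/sysctl.d/ contents per node
for h in proxmox04 proxmox05 proxmox06; do
    ssh "$h" 'sysctl -a 2>/dev/null | sort' > "/tmp/sysctl.$h"
    ssh "$h" 'ls -l /etc/sysctl.d/'         > "/tmp/sysctld.$h"
done
# then diff the nodes pairwise
diff /tmp/sysctl.proxmox04 /tmp/sysctl.proxmox05
diff /tmp/sysctl.proxmox05 /tmp/sysctl.proxmox06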
 
Hi Alwin,
Depending on the CPU setting changed, the NUMA is adjusted as well. You might see the CPU exposing NUMA to the system.
The attachment to the initial post contains the output of lscpu showing "NUMA node(s): 1".

qperf was indeed new to me - will have a look at it.
 
Applied
Code:
root@proxmox04:~# cat /etc/sysctl.d/90-network-tuning.conf
# https://fasterdata.es.net/host-tuning/linux/100g-tuning/
# allow testing with buffers up to 512MB
net.core.rmem_max = 536870912
net.core.wmem_max = 536870912
# increase Linux autotuning TCP buffer limit to 256MB
net.ipv4.tcp_rmem = 4096 87380 268435456
net.ipv4.tcp_wmem = 4096 65536 268435456
# recommended for hosts with jumbo frames enabled
net.ipv4.tcp_mtu_probing=1
to all three nodes and ran "cpupower frequency-set -g performance" before this test:
[Attached Zabbix screenshot: iperf test after sysctl tuning and performance governor]

Looking much better now.
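Note that "cpupower frequency-set -g performance" does not survive a reboot; one way to make it stick would be a small systemd unit along these lines (just a sketch, not something we have rolled out yet):
Code:
# /etc/systemd/system/cpupower-performance.service
[Unit]
Description=Set the performance CPU frequency governor
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/cpupower frequency-set -g performance

[Install]
WantedBy=multi-user.target
It would then be enabled with "systemctl daemon-reload && systemctl enable --now cpupower-performance.service".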
 
Looking much better now.
What looks better? I seem to be missing it. Comparing the two graphs, I only see that the first graphs use 15-minute intervals, while the last ones use 2-minute intervals.
 
I compared it to the one from the overnight run (post #4).

Ran the first rados bench (rados bench 60 --pool ceph-proxmox-VMs write -b 4M -t 16 --no-cleanup)
Code:
Total time run:         60.0131
Total writes made:      83597
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     5571.92
Stddev Bandwidth:       113.588
Max bandwidth (MB/sec): 5748
Min bandwidth (MB/sec): 5008
Average IOPS:           1392
Stddev IOPS:            28.3971
Max IOPS:               1437
Min IOPS:               1252
Average Latency(s):     0.0114852
Stddev Latency(s):      0.00977799
Max latency(s):         1.03765
Min latency(s):         0.00483123

How did you run three clients against one pool like in your PDF???

Currently using 4 OSDs per NVMe though...
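(A split like that is usually created with ceph-volume's batch mode, roughly as below; the device name is just an example.)
Code:
# create 4 OSDs on a single NVMe device (repeat per device and node)
ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1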
 
How did you run three clients against one pool like in your PDF???
Hehe. :) I used tmux with synchronized panes. But you could just run the jobs with cron or at.

Currently using 4 OSDs per NVMe though...
It didn't help in my benchmarks; the write performance was just split among the OSDs on the U.2 drive. You may see that in your monitoring (atop in my case).

Ran the first rados bench (rados bench 60 --pool ceph-proxmox-VMs write -b 4M -t 16 --no-cleanup)
I suppose you will do it later on anyway, but let the benchmark run for longer. With more usage the performance may decline a little, but it should be closer to what you will see in production.
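To tie that together: instead of tmux you could also kick off a longer write bench on all three nodes at once over SSH, something like this (hostnames and pool from this thread, the rest is just a sketch):
Code:
# start a 600 s write bench on every node at (roughly) the same time
for h in proxmox04 proxmox05 proxmox06; do
    ssh "$h" "rados bench 600 --pool ceph-proxmox-VMs write -b 4M -t 16 --no-cleanup" \
        > "/tmp/radosbench.$h.log" 2>&1 &
done
wait
# results end up in /tmp/radosbench.<host>.log and can be summed up afterwards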
 
So how would you add the three of them up then???
Code:
Total time run:         600.009
Total writes made:      369420
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     2462.76
Stddev Bandwidth:       172.924
Max bandwidth (MB/sec): 4932
Min bandwidth (MB/sec): 1856
Average IOPS:           615
Stddev IOPS:            43.231
Max IOPS:               1233
Min IOPS:               464
Average Latency(s):     0.0259866
Stddev Latency(s):      0.0310216
Max latency(s):         1.31802
Min latency(s):         0.00567992

Code:
Total time run:         600.085
Total writes made:      362959
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     2419.38
Stddev Bandwidth:       143.084
Max bandwidth (MB/sec): 3164
Min bandwidth (MB/sec): 1640
Average IOPS:           604
Stddev IOPS:            35.771
Max IOPS:               791
Min IOPS:               410
Average Latency(s):     0.0264505
Stddev Latency(s):      0.0319855
Max latency(s):         1.39796
Min latency(s):         0.00599677

Code:
Total time run:         600.159
Total writes made:      364969
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     2432.48
Stddev Bandwidth:       192.811
Max bandwidth (MB/sec): 5184
Min bandwidth (MB/sec): 1808
Average IOPS:           608
Stddev IOPS:            48.2027
Max IOPS:               1296
Min IOPS:               452
Average Latency(s):     0.0263094
Stddev Latency(s):      0.0311469
Max latency(s):         0.783012
Min latency(s):         0.00582972

The graph looks ok except for the latency bit...

[Attached Zabbix screenshot: rados bench write run (throughput and latency)]
 
So how would you add the three of them up then???
Since they run in parallel, I just summed them up: roughly 7.3 GB/s in your post (2462.76 + 2419.38 + 2432.48 MB/s). Just to note that I used a separate bench instance on each client.

But you may also use fio and run directly against the Ceph pool. This allows for greater flexibility in conducting the benchmark.
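A minimal fio job file for that could look like the one below; it needs fio built with the rbd engine and an existing RBD image in the pool (the image name and job parameters here are just an example, e.g. created with "rbd create --size 100G ceph-proxmox-VMs/fio-test"):
Code:
; 4M sequential writes directly against an RBD image via librbd
[global]
ioengine=rbd
clientname=admin
pool=ceph-proxmox-VMs
rbdname=fio-test
direct=1

[write-4m]
rw=write
bs=4M
iodepth=16
runtime=600
time_based=1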
 
@Alwin : How did you run three rados bench seq read clients???

The first node starts fine, but the next node gives:
Code:
root@proxmox05:~# rados bench 28800 --pool ceph-proxmox-VMs seq -t 16
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
benchmark_data_proxmox05_288534_object2 is not correct!
read got -2
error during benchmark: (2) No such file or directory
error 2: (2) No such file or directory
root@proxmox05
 
I would say it is enough write testing for now...
:) Roughly 60 Gb/s. You might be able to get a little more by tuning Ceph. Sadly io_uring is not a "stable" engine yet; that would help as well.

@Alwin : How did you run three rados bench seq read clients???
You need to use a unique --run-name on each bench. This will store the metadata object with that name.
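So the pattern per node would be roughly (run name chosen per host, everything else as in your earlier benches):
Code:
# write phase, keeping the objects and tagging the benchmark metadata per host
root@proxmox05:~# rados bench 600 --pool ceph-proxmox-VMs write -b 4M -t 16 --no-cleanup --run-name bench_proxmox05
# sequential read phase against the objects written above, same run name
root@proxmox05:~# rados bench 28800 --pool ceph-proxmox-VMs seq -t 16 --run-name bench_proxmox05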
 
So here is something I do not understand:
Why is there less network traffic during sequential reads than the Ceph read bandwidth would suggest?

[Attached Zabbix screenshot: network traffic vs. Ceph read bandwidth during sequential reads]

We had to take care of scheduling the RAM replacement during the tests. Unfortunately all eight modules have to be replaced, since support is not able to identify the broken one from these messages. It is always good to stress new hardware...
Code:
Message from syslogd@proxmox04 at Oct 14 00:56:34 ...
 kernel:[39849.540520] [Hardware Error]: Corrected error, no action required.
Message from syslogd@proxmox04 at Oct 14 00:56:34 ...
 kernel:[39849.541313] [Hardware Error]: CPU:3 (17:31:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Message from syslogd@proxmox04 at Oct 14 00:56:34 ...
 kernel:[39849.541939] [Hardware Error]: Error Addr: 0x0000000e13c41f40
Message from syslogd@proxmox04 at Oct 14 00:56:34 ...
 kernel:[39849.542552] [Hardware Error]: IPID: 0x0000009600650f00, Syndrome: 0x1b6de27d0a800503
Message from syslogd@proxmox04 at Oct 14 00:56:34 ...
 kernel:[39849.543204] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Message from syslogd@proxmox04 at Oct 14 00:56:34 ...
 kernel:[39849.543820] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
 
