Proxmox VE Ceph Benchmark 2020/09 - hyper-converged with NVMe

Alwin · Oct 15, 2020

Regarding encryption, here are some numbers that weren't included in the paper.

Ceph uses aes-xts for its LUKS encrypted device.

cryptsetup benchmark

Algorithm	Key	Encryption	Decryption
aes-xts	512b	2201.4 MiB/s	2180.1 MiB/s

And the results of three simultaneous rados bench runs.

rados bench 600 write -b 4M -t 16 --no-cleanup

	Single Namespace	Two Namespaces	Four Namespaces
Total time run	600.04	600.02	600.03
Total writes made	426,318.00	426,762.00	426,444.00
Write size	4,194,304.00	4,194,304.00	4,194,304.00
Object size	4,194,304.00	4,194,304.00	4,194,304.00
Bandwidth (MB/sec)	2,841.95	2,844.97	2,842.83
Stddev Bandwidth	19.18	23.57	23.95
Max bandwidth (MB/sec)	3,012.00	3,048.00	3,032.00
Min bandwidth (MB/sec)	2,600.00	2,584.00	2,588.00
Average IOPS	708.00	710.00	710.00
Stddev IOPS	4.80	5.89	5.99
Max IOPS	753.00	762.00	758.00
Min IOPS	650.00	646.00	647.00
Average Latency(s)	0.0676	0.0675	0.0675
Stddev Latency(s)	0.0185	0.0180	0.0180
Max latency(s)	0.2529	0.2586	0.2136
Min latency(s)	0.0149	0.0166	0.0155

rados bench 600 seq -t 16 (uses 4M from write)

	Single Namespace	Two Namespaces	Four Namespaces
Total time run	240.91	241.83	240.33
Total reads made	426,318.00	426,762.00	426,444.00
Read size	4,194,304.00	4,194,304.00	4,194,304.00
Object size	4,194,304.00	4,194,304.00	4,194,304.00
Bandwidth (MB/sec)	7,087.03	7,059.69	7,098.41
Average IOPS	1,771.00	1,763.00	1,773.00
Stddev IOPS	29.51	21.96	19.70
Max IOPS	2,132.00	2,066.00	2,070.00
Min IOPS	1,595.00	1,636.00	1,645.00
Average Latency(s)	0.0266	0.0266	0.0265
Max latency(s)	0.1713	0.1280	0.1211
Min latency(s)	0.0056	0.0056	0.0056

It is very likely that the bigger Epyc CPUs perform better under encryption.
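As a rough cross-check of the write table, the average bandwidth and IOPS can be recomputed from the total writes, the object size and the run time. This is only a sketch: the function names are made up for illustration, and it assumes that 1 MB in the rados bench output means 1024 * 1024 bytes, which is what makes the reported figures line up.

```python
MIB = 1024 * 1024  # assumption: rados bench "MB" is 2**20 bytes


def avg_bandwidth_mb_s(ops, object_bytes, seconds):
    """Average bandwidth in MB/s from the operation count and run time."""
    return ops * object_bytes / MIB / seconds


def avg_iops(ops, seconds):
    """Average operations per second over the whole run."""
    return ops / seconds


# "Single Namespace" write column from the table above.
write_bw = avg_bandwidth_mb_s(426_318, 4_194_304, 600.04)
write_iops = avg_iops(426_318, 600.04)

print(f"write bandwidth: {write_bw:.1f} MB/s, average IOPS: {write_iops:.1f}")
```

The recomputed bandwidth lands within rounding distance of the reported 2,841.95 MB/s. The IOPS come out slightly above the reported 708, which would be consistent with the table summing per-client averages, though that is a guess.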

velocity08 · Oct 18, 2020

Alwin said:
Stable performance. Further see the AMD Epyc tuning guides, they contain the BIOS settings in more detail.
https://developer.amd.com/resources/epyc-resources/epyc-tuning-guides/

Thanks @Alwin

Which of the tuning guides do you recommend for this use case?

Cheers
G

Alwin · Oct 19, 2020

velocity08 said:
Which of the tuning guides do you recommend for this use case?

Go through all of them.

Each will have some information that might be useful to you.

velocity08 · Nov 14, 2020

Alwin said:
Go through all of them. Each will have some information that might be useful to you.

Thanks Alwin

There is a lot of documentation to go through; it's not a small task.

Did you notice any specific pointers for the single-socket Epycs, by any chance? We've opted for the 7502P, which is a single-socket model.

I'll try to slowly start going through the docs.

Any pointers or comments would be greatly appreciated.

Cheers
G

Byron · Dec 4, 2020

Alwin said:
This did not yield any benefit on that system. In atop you could observe that the write performance was divided by the namespaces. And for encryption, the Microns are faster than the aes-xts engine (with that CPU version). The rados bench tests maxed out at ~2.8 GB/s with any number of namespaces. I suspect it is the way the engine works on Epyc vs Xeon.

Does this depend on the type of drive or the CPU architecture or something else?

It would be super helpful to see the CPU load during benchmarking. I'm considering using a 64C/128T single socket for 24 U.2 drives; at 5.3 threads per drive, would it be able to keep up?

Alwin · Dec 4, 2020

Byron said:
at 5.3 threads per drive

What do you mean by that?

Byron · Dec 4, 2020

Alwin said:
What do you mean by that?

We'd put 24 drives in a chassis with 128 CPU threads, giving a ratio of 5.33 CPU threads per NVMe. I've read before that 4 threads per NVMe is the recommended minimum, with 8+ seeming standard practice.

It looks like we'll run into a CPU bottleneck as far as I can find; however, it would not really be an issue to limit our servers to 16-20 drives if we start running into 100% CPU usage.
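The ratio above is simple division, sketched here for the drive counts mentioned. The helper names are hypothetical, and the 4 and 8 threads per NVMe are only the rule-of-thumb figures quoted in this post, not measured requirements.

```python
def threads_per_drive(cpu_threads, drives):
    """CPU threads available per NVMe if nothing else used the CPU."""
    return cpu_threads / drives


def max_drives(cpu_threads, threads_per_nvme):
    """Largest drive count that still leaves threads_per_nvme threads per drive."""
    return cpu_threads // threads_per_nvme


CPU_THREADS = 128  # 64C/128T single socket

for drives in (16, 20, 24):
    ratio = threads_per_drive(CPU_THREADS, drives)
    print(f"{drives} drives: {ratio:.2f} threads per drive")

print("drives at 8 threads each:", max_drives(CPU_THREADS, 8))
print("drives at 4 threads each:", max_drives(CPU_THREADS, 4))
```

Note that this ignores the threads consumed by VMs, networking and other services, so the real per-drive budget on a hyper-converged node will be lower.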

Any idea what the CPU load numbers were like?

Alwin · Dec 4, 2020

Byron said:
We'd put 24 drives in a chassis with 128 CPU threads, giving a ratio of 5.33 CPU threads per NVMe. I've read before that 4 threads per NVMe is the recommended minimum, with 8+ seeming standard practice.

It's not only the NVMes that need IO; with all the VMs/CTs, networking and other services/hardware, the threads will be used up quickly. It will take quite a bit of trial and error until you come close to the optimum.

Byron said:
It looks like we'll run into a CPU bottleneck as far as I can find; however, it would not really be an issue to limit our servers to 16-20 drives if we start running into 100% CPU usage.

Memory bandwidth will probably be the first deciding factor. And then, with regard to Ceph, the network bandwidth will put an upper limit on the system.

jsterr · Apr 29, 2021

How can I do multiple rados bench read tests from different hosts? It seems like it only works from one host at a time. Or maybe reading only works for the benchmark_data written by the last host. Any idea?

Code:

root@pve02:~# rados bench 600 seq -t 16 -p vm_nvme
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
benchmark_data_pve02_2004918_object1 is not correct!
read got -2
error during benchmark: (2) No such file or directory
error 2: (2) No such file or directory
root@pve02:~# ^C
root@pve02:~#

Alwin Antreich · Apr 29, 2021

jsterr said:
How can I do multiple rados bench read tests from different hosts? It seems like it only works from one host at a time. Or maybe reading only works for the benchmark_data written by the last host. Any idea?

Use --run-name, as otherwise every rados bench tries to use the same name.
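A minimal sketch of how per-host run names could be kept consistent, assuming the write and the later seq read on a host must be given the same --run-name. The helper name and the bench_<host> naming scheme are made up for illustration; the pool name and the other options are taken from the commands earlier in this thread.

```python
import shlex


def rados_bench_commands(pool, host, seconds=600, threads=16):
    """Build matching write and seq-read argument lists for one host."""
    run_name = f"bench_{host}"  # hypothetical naming scheme, one name per host
    write_cmd = [
        "rados", "bench", str(seconds), "write",
        "-b", "4M", "-t", str(threads), "-p", pool,
        "--run-name", run_name,
        "--no-cleanup",  # keep the objects so the seq run has data to read
    ]
    read_cmd = [
        "rados", "bench", str(seconds), "seq",
        "-t", str(threads), "-p", pool,
        "--run-name", run_name,
    ]
    return write_cmd, read_cmd


for host in ("pve01", "pve02", "pve03"):
    write_cmd, read_cmd = rados_bench_commands("vm_nvme", host)
    print(shlex.join(write_cmd))
    print(shlex.join(read_cmd))
```

The printed lines are meant to be run on the respective host; the host names other than pve02 are placeholders.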

jsterr · May 3, 2021

Thanks. I'm currently doing the single-VM performance test on Windows, but I'm not really sure which parameters exactly were used to achieve the results (3458 IOPS) shown below. I saw the appendix, but it's not clear how all parameters were set exactly. I'm asking because I only reach 1887 IOPS, although my SN640 has about the same performance in a single-disk 4k IOPS test as your Micron 9300 MAX.

Are my parameters correct? It seems the threads parameter is kinda useless - it also says

fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.

Code:

[global]
ioengine=windowsaio
group_reporting
direct=1
sync=1
threads=4
numjobs=4
iodepth=1
directory=C\:\fio
size=9G
time_based
name=fio-win-seq-io-write
runtime=600

[seq-write]
rw=write
bs=4K
stonewall

I also have a question about the Linux commands. It seems you map the RBD with rbd map and then run fio directly against the raw device? So there's no filesystem on it? Using the command will erase the content of the VM disk, right?

fio --ioengine=psync --filename=/dev/mapper/test_fio --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1

Alwin Antreich · May 4, 2021

jsterr said:
Thanks. I'm currently doing the single-VM performance test on Windows, but I'm not really sure which parameters exactly were used to achieve the results (3458 IOPS) shown below. I saw the appendix, but it's not clear how all parameters were set exactly. I'm asking because I only reach 1887 IOPS, although my SN640 has about the same performance in a single-disk 4k IOPS test as your Micron 9300 MAX.

This depends on the VM config and how powerful CPU & memory are. See the last pages in the PDF.

jsterr said:
Are my parameters correct? It seems the threads parameter is kinda useless - it also says fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.

As the message says, process-shared mutexes aren't supported on Windows, so fio switches to threads instead of processes.

jsterr said:
I also have a question about the Linux commands. It seems you map the RBD with rbd map and then run fio directly against the raw device? So there's no filesystem on it? Using the command will erase the content of the VM disk, right?

It's still in a VM, just that I used LVM instead of an empty partition. For Windows I couldn't find any difference when using the partition directly compared to the filesystem, so I made the setup easier and used a file instead of the partition.

chenjie · Jun 22, 2021

Is "second" suitable for hyper-convergence?

Rainerle · Jun 24, 2021

chenjie said:
Is "second" suitable for hyper-convergence?

Which second?

sigmarb · Jul 11, 2021

Before going for expensive Microns, I wanted to get some numbers:

NVMEs (cheap KINGSTON SNVS1000GB):

fio --ioengine=libaio --filename=test12345 --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio --size=10G
write: IOPS=434, BW=1739KiB/s (1781kB/s)(102MiB/60007msec); 0 zone resets

Is the performance (even for consumer NVMEs) that poor? I'm just a bit shocked.

Ryzen 7, 16 Cores
128 GB DDR4
NVME Gen3 in M2 Slots on board
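To put the fio result in perspective: with --numjobs=1 and --iodepth=1 only one write is in flight at a time, so the average time per sync write is roughly the inverse of the IOPS. A minimal sketch of that arithmetic follows; the function name is made up for illustration, and it assumes each write waits for the previous one to complete.

```python
def sync_write_summary(iops, block_bytes):
    """Approximate per-write latency (ms) and bandwidth (KiB/s) at queue depth 1."""
    latency_ms = 1000 / iops               # one write in flight at a time
    bandwidth_kib_s = iops * block_bytes / 1024
    return latency_ms, bandwidth_kib_s


latency_ms, bandwidth_kib_s = sync_write_summary(434, 4096)
print(f"about {latency_ms:.2f} ms per sync write, {bandwidth_kib_s:.0f} KiB/s")
```

The bandwidth recomputed from the rounded IOPS is close to the 1739 KiB/s that fio reports, and the resulting latency of a bit over 2 ms per write is what limits the drive in this test.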

spirit · Jul 11, 2021

sigmarb said:
Is the performance (even for consumer NVMEs) that poor? I'm just a bit shocked.

Yes, consumer SSDs/NVMes suck because of their poor (really poor) synchronous write performance, which is needed for the Ceph journal (same for ZFS).

Datacenter SSDs/NVMes have a supercapacitor, so sync writes can be buffered in the disk cache, and in case of a power failure the data is not lost.

sigmarb · Jul 11, 2021

spirit said:
Yes, consumer SSDs/NVMes suck because of their poor (really poor) synchronous write performance, which is needed for the Ceph journal (same for ZFS).

Datacenter SSDs/NVMes have a supercapacitor, so sync writes can be buffered in the disk cache, and in case of a power failure the data is not lost.

Thanks for being that clear. How can I make sure that an SSD/NVMe has a supercapacitor on it?

jasonsansone · Jul 13, 2021

sigmarb said:
Thanks for being that clear. How can I make sure that an SSD/NVMe has a supercapacitor on it?

It will be advertised as having PLP / Power Loss Protection. Check spec sheets.

Jesus Blanco · Jul 27, 2021

Hi,

If we create 4 OSDs on each NVMe disk, will we get more performance, or is this irrelevant?

Thanks in advance.

Rainerle · Jul 28, 2021

Jesus Blanco said:
Hi,

If we create 4 OSDs on each NVMe disk, will we get more performance, or is this irrelevant?

Thanks in advance.

I performance-tested from 1 to 4 OSDs per NVMe. It really depends on the system configuration - to drive more OSDs you need more CPU threads.
See this thread and the posts around there.

From my experience so far, I would just create one OSD per device. As Ceph uses a 4M "block size", I would rather test changing the NVMe's block size from 512K to 4M.
