Proxmox VE Ceph Benchmark 2020/09 - hyper-converged with NVMe

Regarding encryption, here are some numbers that weren't included in the paper.

Ceph uses aes-xts for its LUKS encrypted device.
cryptsetup benchmark
Algorithm  Key   Encryption    Decryption
aes-xts    512b  2201.4 MiB/s  2180.1 MiB/s

And the results of 3x simultaneous rados bench.
rados bench 600 write -b 4M -t 16 --no-cleanup
                        Single Namespace  Two Namespaces  Four Namespaces
Total time run                    600.04          600.02           600.03
Total writes made             426,318.00      426,762.00       426,444.00
Write size                  4,194,304.00    4,194,304.00     4,194,304.00
Object size                 4,194,304.00    4,194,304.00     4,194,304.00
Bandwidth (MB/sec)              2,841.95        2,844.97         2,842.83
Stddev Bandwidth                   19.18           23.57            23.95
Max bandwidth (MB/sec)          3,012.00        3,048.00         3,032.00
Min bandwidth (MB/sec)          2,600.00        2,584.00         2,588.00
Average IOPS                      708.00          710.00           710.00
Stddev IOPS                         4.80            5.89             5.99
Max IOPS                          753.00          762.00           758.00
Min IOPS                          650.00          646.00           647.00
Average Latency(s)                0.0676          0.0675           0.0675
Stddev Latency(s)                 0.0185          0.0180           0.0180
Max latency(s)                    0.2529          0.2586           0.2136
Min latency(s)                    0.0149          0.0166           0.0155
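
As a rough sanity check on the write numbers (not part of the original post), the reported bandwidth can be reproduced from the totals: objects written times object size, divided by the runtime. The sketch assumes that the "MB/sec" figure counts 2^20 bytes, which is what the numbers are consistent with. The function name bandwidth_mb_s is made up for this example.

```shell
#!/bin/sh
# Sanity check: bandwidth ~= objects * object size / runtime.
# Assumption: "MB/sec" in the rados bench output means 2^20 bytes per second.
bandwidth_mb_s() {
    # $1 = objects written, $2 = object size in bytes, $3 = runtime in seconds
    LC_ALL=C awk -v n="$1" -v size="$2" -v t="$3" \
        'BEGIN { printf "%.0f\n", n * size / 1048576 / t }'
}

# Single Namespace column: 426,318 writes of 4,194,304 bytes in 600.04 s
bandwidth_mb_s 426318 4194304 600.04   # prints 2842 (reported: 2,841.95)
```

The same call with 426762 writes and 600.02 s gives 2845 for the Two Namespaces column, again in line with the reported 2,844.97.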

rados bench 600 seq -t 16 (uses 4M from write)
                        Single Namespace  Two Namespaces  Four Namespaces
Total time run                    240.91          241.83           240.33
Total reads made              426,318.00      426,762.00       426,444.00
Read size                   4,194,304.00    4,194,304.00     4,194,304.00
Object size                 4,194,304.00    4,194,304.00     4,194,304.00
Bandwidth (MB/sec)              7,087.03        7,059.69         7,098.41
Average IOPS                    1,771.00        1,763.00         1,773.00
Stddev IOPS                        29.51           21.96            19.70
Max IOPS                        2,132.00        2,066.00         2,070.00
Min IOPS                        1,595.00        1,636.00         1,645.00
Average Latency(s)                0.0266          0.0266           0.0265
Max latency(s)                    0.1713          0.1280           0.1211
Min latency(s)                    0.0056          0.0056           0.0056

It is very likely that the bigger Epyc CPUs perform better under encryption.
 
Which of the tuning guides do you recommend for this use case?
Go through all of them. ;) Each will have some information that might be useful to you.
 
Thanks Alwin

There is a lot of documentation to go through. It's not a small task :(

Did you notice any specific pointers for the single-socket Epycs, by any chance? We've opted for the 7502P, which is a single-socket model.

I'll slowly start going through the docs.

Any pointers or comments would be greatly appreciated.

Cheers
G
 
This did not yield any benefit on that system. In atop you could observe that the write performance was split across the namespaces. And for encryption, the Microns are faster than the aes-xts engine (with that CPU version). The rados bench tests maxed out at ~2.8 GB/s with any number of namespaces. I suspect it is the way the engine works on Epyc vs Xeon.
Does this depend on the type of drive or the CPU architecture or something else?

It would be super helpful to see the CPU load during benchmarking. I'm considering a 64C/128T single-socket CPU for 24 U.2 drives. At 5.3 threads per drive, would it be able to keep up?
 
What do you mean by that?
We'd put 24 drives in a chassis with 128 CPU threads, giving a ratio of 5.33 CPU threads per NVMe. I've read before that 4 threads per NVMe is the recommended minimum, with 8+ seeming standard practice.

As far as I can tell, we'll run into a CPU bottleneck. However, it would not really be an issue to limit our servers to 16-20 drives if we start running into 100% CPU usage.

Any idea what the CPU load numbers were like?
 
We'd put 24 drives in a chassis with 128 CPU threads, giving a ratio of 5.33 CPU threads per NVMe. I've read before that 4 threads per NVMe is the recommended minimum, with 8+ seeming standard practice.
It's not only the NVMes that need IO. With all the VMs/CTs, networking and other services/hardware, the threads will be used up quickly. It will take quite some trial and error until you come close to the optimum.

As far as I can tell, we'll run into a CPU bottleneck. However, it would not really be an issue to limit our servers to 16-20 drives if we start running into 100% CPU usage.
Memory bandwidth will probably be the first deciding factor. After that, as far as Ceph is concerned, the network bandwidth will put an upper limit on the system.
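
For reference, the threads-per-drive ratio from the question is easy to recompute for the drive counts mentioned (16, 20 and 24 drives with 128 threads). This is only a small sketch of that arithmetic. It says nothing about the actual CPU load, and the function name threads_per_drive is made up for this example.

```shell
#!/bin/sh
# CPU threads available per NVMe drive for a given chassis layout.
# This only covers the ratio; VMs/CTs, networking and other services
# also consume threads, as noted in the reply above.
threads_per_drive() {
    # $1 = CPU threads, $2 = number of NVMe drives
    LC_ALL=C awk -v t="$1" -v d="$2" 'BEGIN { printf "%.2f\n", t / d }'
}

for drives in 16 20 24; do
    printf '%s drives: %s threads per drive\n' \
        "$drives" "$(threads_per_drive 128 "$drives")"
done
```

With 128 threads this prints 8.00, 6.40 and 5.33 threads per drive. Only the 16-drive layout reaches the "8+" figure mentioned in the question.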
 
How can I run multiple rados bench read tests from different hosts? It seems to work from only one host at a time. Or maybe reading only works for the benchmark_data written by the last host. Any idea? :-)

Code:
root@pve02:~# rados bench 600 seq -t 16 -p vm_nvme
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
benchmark_data_pve02_2004918_object1 is not correct!
read got -2
error during benchmark: (2) No such file or directory
error 2: (2) No such file or directory
root@pve02:~# ^C
root@pve02:~#
 
Use --run-name; otherwise every rados bench tries to use the same name.
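
A minimal sketch of how that could look, assuming one run name per host: write with a given --run-name and --no-cleanup, then read back with the same --run-name. The helper bench_cmd and the naming scheme bench_<hostname> are made up for this example. The pool name vm_nvme and the other options are taken from the commands earlier in the thread. The script only prints the commands, so it is safe to run anywhere.

```shell
#!/bin/sh
# Print (not execute) rados bench commands that use a per-host run name,
# so that several hosts do not collide on the same benchmark objects.
# bench_cmd and the "bench_<hostname>" scheme are illustrative assumptions.
bench_cmd() {
    mode=$1 pool=$2 run_name=$3
    case "$mode" in
        write) echo "rados bench 600 write -b 4M -t 16 --no-cleanup --run-name $run_name -p $pool" ;;
        seq)   echo "rados bench 600 seq -t 16 --run-name $run_name -p $pool" ;;
        *)     echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

host_run="bench_$(uname -n)"
bench_cmd write vm_nvme "$host_run"   # run first, on every host
bench_cmd seq vm_nvme "$host_run"     # then read back with the same run name
```

The seq run has to use the run name from the write run on the same host, since it reads back the objects that write left behind.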
 
Thanks. I'm currently running the single-VM performance test on Windows, but I'm not really sure exactly which parameters were used to achieve the results (3458 IOPS) shown below. I saw the appendix, but it's not clear how all parameters were set. I'm asking because I only reach 1887 IOPS, although my SN640 performs about the same as your Micron 9300 MAX in a single-disk 4k IOPS test.

Are my parameters correct? It seems the threads parameter is kind of useless. fio also says: "fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning."

Code:
[global]
ioengine=windowsaio
group_reporting
direct=1
sync=1
threads=4
numjobs=4
iodepth=1
directory=C\:\fio
size=9G
time_based
name=fio-win-seq-io-write
runtime=600

[seq-write]
rw=write
bs=4K
stonewall



I also have a question about the Linux commands. It seems you map the RBD image with rbd map and then run fio directly against the raw device? So there's no filesystem on it? Running the command will erase the content of the VM disk, right?

fio --ioengine=psync --filename=/dev/mapper/test_fio --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1
 
Thanks. I'm currently running the single-VM performance test on Windows, but I'm not really sure exactly which parameters were used to achieve the results (3458 IOPS) shown below. I saw the appendix, but it's not clear how all parameters were set. I'm asking because I only reach 1887 IOPS, although my SN640 performs about the same as your Micron 9300 MAX in a single-disk 4k IOPS test.
This depends on the VM config and how powerful the CPU and memory are. See the last pages in the PDF.

Are my parameters correct? It seems the threads parameter is kind of useless. fio also says: "fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning."
As the message says, process-shared mutexes aren't supported under Windows, so fio switches to threads instead of processes.

I also have a question about the Linux commands. It seems you map the RBD image with rbd map and then run fio directly against the raw device? So there's no filesystem on it? Running the command will erase the content of the VM disk, right?
It's still in a VM, just that I used LVM instead of an empty partition. For Windows, I couldn't find any difference between using the partition directly and using the filesystem. So I made the setup easier and used a file instead of the partition.
 
Before going for expensive Microns, I wanted to get some numbers:

NVMEs (cheap KINGSTON SNVS1000GB):

fio --ioengine=libaio --filename=test12345 --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio --size=10G
write: IOPS=434, BW=1739KiB/s (1781kB/s)(102MiB/60007msec); 0 zone resets

Is the performance (even for consumer NVMEs) that poor? I'm just a bit shocked.

Ryzen 7, 16 Cores
128 GB DDR4
NVME Gen3 in M2 Slots on board
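
A quick cross-check of the fio line above, as a sketch: bandwidth should be roughly IOPS times block size, and with iodepth=1 and numjobs=1 only one write is in flight at a time, so the mean time per sync write is about 1000 / IOPS milliseconds. The function names bw_kib_s and latency_ms are made up for this example.

```shell
#!/bin/sh
# Cross-check of "IOPS=434, BW=1739KiB/s" at bs=4K, iodepth=1, numjobs=1.
bw_kib_s() {
    # $1 = IOPS, $2 = block size in KiB
    echo $(( $1 * $2 ))
}

latency_ms() {
    # Mean time per write in ms when only one write is in flight: 1000 / IOPS
    LC_ALL=C awk -v iops="$1" 'BEGIN { printf "%.2f\n", 1000 / iops }'
}

bw_kib_s 434 4    # prints 1736, close to the reported 1739 KiB/s
latency_ms 434    # prints 2.30
```

So each 4K sync write takes a little over 2 ms on this drive. The small gap between 1736 and 1739 KiB/s presumably comes from rounding in the displayed figures.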
 
Yes, consumer SSDs/NVMes suck here because of their poor (really poor) synchronous write performance, which Ceph needs for its journal (the same goes for ZFS).

Datacenter SSDs/NVMes have a supercapacitor, so sync writes can be buffered in the disk cache, and the data is not lost in case of a power failure.
 
Thanks for being that clear. How can I make sure that an SSD/NVMe has a supercapacitor on it?
 
Hi,

If we create 4 OSDs on each NVMe disk, will we get more performance, or is this irrelevant?

Thanks in advance.
I performance-tested from 1 to 4 OSDs per NVMe. It really depends on the system configuration - to drive more OSDs you need more CPU threads.
See this thread and the posts around there.

From my experience so far, I would just create one OSD per device. As Ceph uses a 4M "block size", I would rather test changing the NVMe's block size from 512K to 4M.
 