Proxmox VE Ceph Benchmark 2020/09 - hyper-converged with NVMe

martin

Proxmox Staff Member
To optimize performance in hyper-converged deployments with Proxmox VE and Ceph storage, the appropriate hardware setup is essential. This benchmark presents possible setups and their performance outcomes, with the intention of helping Proxmox users make better decisions.

[Image: rados-bench.png]

Download PDF
Proxmox VE Ceph Benchmark 2020/09

Benchmarks from 2018
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

__________________
Best regards,

Martin Maurer
Proxmox VE project leader
 
@martin thanks to you and your staff for producing the report. I just have a couple of questions:

1. I was under the impression that Ceph Octopus had significantly addressed the read performance penalty with "writeback" caching, but your results (for Linux VMs) indicate otherwise. Do you have any comment on that? Is there some additional tweaking of rbd_cache_policy and/or rbd_io_scheduler needed, perhaps?

2. The data charts for "MULTI-VM WORKLOAD (LINUX) --> RANDOM IO/S BY NUMBER OF JOBS" on page 11/22 do not reflect the statistics in the SUMMARY. Was this a cut/paste error (from page 10/22) maybe?
 
2. The data charts for "MULTI-VM WORKLOAD (LINUX) --> RANDOM IO/S BY NUMBER OF JOBS" on page 11/22 do not reflect the statistics in the SUMMARY. Was this a cut/paste error (from page 10/22) maybe?
I noticed that as well. My copy & pasta. ;)
But the charts are at least obvious, nothing special there.

1. I was under the impression that Ceph Octopus had significantly addressed the read performance penalty with "writeback" caching, but your results (for Linux VMs) indicate otherwise. Do you have any comment on that? Is there some additional tweaking of rbd_cache_policy and/or rbd_io_scheduler needed, perhaps?
It's the way I conducted the tests. As seen in Appendix 5.3, I used fio with a 9 GB size to read/write. The target device was an LV on the OS disk. With the default 32 MiB rbd cache size, there is not much that can be kept in memory. You can see the difference on page 14, where I added results from the run using a 128 MiB rbd cache size.
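For anyone who wants to repeat that comparison, the rbd cache size is a client-side option in ceph.conf. A minimal sketch; the 128 MiB value is the one from page 14, and the dirty limit is just my own guess at a matching value:

    # client-side rbd cache settings, e.g. in /etc/pve/ceph.conf
    [client]
        rbd cache = true
        rbd cache size = 134217728       # 128 MiB instead of the 32 MiB default
        rbd cache max dirty = 100663296  # assumed value, scales the dirty limit with the cache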

But in general there are still things to examine. :)

EDIT: We uploaded a new PDF version, containing some fixes.
Changelog: https://forum.proxmox.com/threads/p...9-hyper-converged-with-nvme.76516/post-342581

So the last graphs changed, since I accidentally included the wrong numbers for 128 MiB librbd caching. Therefore my statement above is not correct.
 
Dear Proxmox Team,
thanks for the great work. I'm interested in the availability of the "stable" version of Ceph Octopus. Do you have an ETA? I saw the packages in the test repo, but I prefer to upgrade when it is in "stable".
Thanks
 
thanks for the great work. I'm interested in the availability of the "stable" version of Ceph Octopus. Do you have an ETA? I saw the packages in the test repo, but I prefer to upgrade when it is in "stable".
Ceph Octopus on its own is as stable as it can get from upstream. We will push Octopus to the main repository once we believe the integration in Proxmox VE has all the features you have come to expect and no major bugs. In short, no ETA yet.
 
Nice, we are currently building a system based on AMD EPYC and U.2 NVMes. I will definitely use this paper for inspiration, thanks!

Have you thought about using two namespaces for each NVMe, like nvme0n1 and nvme0n2 (same NVMe but two namespaces), so you get two OSD disks per NVMe? There is some discussion on the web that this should increase performance.
 
Have you thought about using two namespaces for each NVMe, like nvme0n1 and nvme0n2 (same NVMe but two namespaces), so you get two OSD disks per NVMe? There is some discussion on the web that this should increase performance.
This did not yield any benefit on that system. In atop you could observe that the write performance was divided between the namespaces. And for encryption, the Microns are faster than the aes-xts engine (with that CPU version). The rados bench tests maxed out at ~2.8 GB/s with any number of namespaces. I suspect it is the way the engine works on EPYC vs. Xeon.
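For reference, the two namespaces can be set up with nvme-cli along these lines. A rough sketch, assuming the controller supports namespace management; the sizes are placeholders (roughly half of a 7.68 TB drive in 512-byte blocks) and the controller ID is an assumption:

    # remove the single default namespace, then create and attach two halves
    nvme delete-ns /dev/nvme0 -n 1
    nvme create-ns /dev/nvme0 --nsze=7500000000 --ncap=7500000000 --flbas=0
    nvme create-ns /dev/nvme0 --nsze=7500000000 --ncap=7500000000 --flbas=0
    nvme attach-ns /dev/nvme0 -n 1 -c 0   # controller id 0 is an assumption, check 'nvme id-ctrl'
    nvme attach-ns /dev/nvme0 -n 2 -c 0
    nvme reset /dev/nvme0                 # rescan; nvme0n1 and nvme0n2 should show up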
 
Thanks for this second benchmark! It gives a clear impression of what should be achievable with current hardware.

I am currently trying to run that exact benchmark setup on a three-node cluster and have problems running three rados bench clients simultaneously.

Can you adjust the PDF and give instructions on how to run the three clients in parallel and how you combine the resulting numbers, please?

An even better way would be to have standard Proxmox script(s) within the Proxmox distribution to run these benchmarks and get comparable numbers.
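In the meantime, this is roughly what I have been trying: start one rados bench client per node at the same time via ssh and add up the three reported bandwidths afterwards. Pool name, runtime and thread count are my own guesses, not necessarily what the paper used:

    # start one write client per node in parallel, keep each node's output locally
    for host in node1 node2 node3; do
        ssh "$host" "rados bench -p bench 60 write -b 4M -t 16 --no-cleanup" \
            > "rados-$host.log" &
    done
    wait

    # sum the average bandwidth over the three clients
    grep -h "Bandwidth (MB/sec)" rados-node*.log \
        | awk '{ sum += $3 } END { print sum " MB/s combined" }'

For read tests, the same loop with 'seq' or 'rand' instead of 'write' should work once the write objects are in place.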
 
I'm building a similar cluster, but with this hardware:
- 6x Dell PowerEdge R6515 with AMD EPYC 7402P 24-core processor
- 1x BOSS card with 2x M.2 240 GB in RAID 1 for the OS per host
- 4x Kingston DC1000M U.2 7.68 TB disks per host
- 8x Kingston KTD-PE432/64G RAM 3200 MHz
- 6x 10 Gbps SFP+ per host (1x Broadcom Adv. Dual 10G SFP+ Ethernet and 1x Intel XL710-BM1 Quad-Port 10G SFP+ PCIe)
- 2x 1 Gbps per host (Broadcom Dual-Port Gigabit Ethernet BCM5720)
- 4x Cisco SG550XG-24F switches (1 dual stack isolated for Ceph/cluster and 1 dual stack for VM traffic)

If Martin or Alwin authorize me, I can "replicate" your document Proxmox VE Ceph Benchmark 2020/09 with my hardware, but I have two questions:

Assuming you have 4x 10 Gbps network ports available for Ceph and cluster traffic, which of the following is the best solution?
1) 1 NIC for cluster, 1 NIC for cluster RRP, 1 bond (2 NICs) for Ceph public and cluster
2) 1 "big bond" with 4 NICs to have 4 vmbr (cluster, cluster RRP, Ceph public and Ceph cluster)

Assuming this solution must go into production in max 30 days: Ceph Nautilus or Octopus?

In the attachment you can see a fio benchmark from my system for one Kingston DC1000M U.2 7.68 TB.

Thank you for your work
 

@latosec, in --bs=4K--numjobs=1 there seems to be a space missing. And try a run with the psync engine to see if it yields different results.

Kingston DC1000M U.2 7.68 TB
Kingston writes that the 7.68 TB model should do 210,000 IO/s, ~4x more than your fio result. So they run fio in a different manner. But try to run it for 600 seconds to see if the ~58k IO/s are kept up during longer runs; then you can see whether there are any cache effects.
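For a longer run, something along these lines should do; /dev/nvme0n1 is a placeholder, and the write test destroys data on that device, so only run it on an empty disk:

    fio --ioengine=psync --filename=/dev/nvme0n1 --direct=1 --sync=1 \
        --rw=write --bs=4K --numjobs=1 --iodepth=1 \
        --runtime=600 --time_based --name=write_4k_600s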

- 6x 10 Gbps SFP+ per host (1x Broadcom Adv. Dual 10G SFP+ Ethernet and 1x Intel XL710-BM1 Quad-Port 10G SFP+ PCIe)
See the FAQ section of the paper: with 4x OSDs per node, 60 Gb/s could be maxed out. Also, the latency (an important factor) is higher than with 100 GbE cards. I can only recommend the use of 100 GbE if multiple U.2 SSDs are in use. PCIe 3.0 does roughly ~15 GB/s; 4x SSDs at 2.5 GB/s are already using 2/3 of the available PCIe 3.0 bandwidth.

If Martin or Alwin authorize me, I can "replicate" your document Proxmox VE Ceph Benchmark 2020/09 with my hardware, but I have two questions:
Not the document itself. But we are happy to hear about your test results. Best post them here for comparison and discussion.

1) 1 NIC for cluster, 1 NIC for cluster RRP, 1 bond (2 NICs) for Ceph public and cluster
2) 1 "big bond" with 4 NICs to have 4 vmbr (cluster, cluster RRP, Ceph public and Ceph cluster)
I am not certain what you mean by cluster RRP, but I assume it's for Corosync. Never run Corosync on shared media. While bandwidth is not an issue, latency certainly will be. Corosync needs low and stable latency.
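To illustrate option 1 without putting Corosync on shared media: give Corosync its own port(s) and bond only the Ceph links. A sketch for /etc/network/interfaces; interface names and addresses are examples only:

    # dedicated corosync link (a second corosync link on another port looks the same)
    auto enp1s0f0
    iface enp1s0f0 inet static
        address 10.10.10.1/24

    # LACP bond for Ceph public/cluster traffic
    auto bond0
    iface bond0 inet static
        address 10.10.30.1/24
        bond-slaves enp65s0f0 enp65s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4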
 
Before any other test, I read your benchmark again to optimize the BIOS settings.

in --bs=4K--numjobs=1 there seems to be a space missing. And try a run with the psync engine to see if it yields different results.
Copy/paste error from PuTTY to Notepad.
Using psync, the results are a little lower (less than 5%).

Kingston writes that the 7.68 TB model should do 210,000 IO/s, ~4x more than your fio result. So they run fio in a different manner. But try to run it for 600 seconds to see if the ~58k IO/s are kept up during longer runs; then you can see whether there are any cache effects.

No cache effects with 600 seconds; bandwidth and IOPS are stable from the start to the end of the fio test.
I noticed that bandwidth and IOPS grow when I change --numjobs=*:

Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=260MiB/s][w=66.4k IOPS][eta 00m:00s]
Starting 2 processes
Jobs: 2 (f=2): [W(2)][100.0%][w=470MiB/s][w=120k IOPS][eta 00m:00s]
Starting 4 processes
Jobs: 4 (f=4): [W(4)][100.0%][w=859MiB/s][w=220k IOPS][eta 00m:00s]
Starting 8 processes
Jobs: 8 (f=8): [W(8)][100.0%][w=1444MiB/s][w=370k IOPS][eta 00m:00s]
Starting 16 processes
Jobs: 16 (f=16): [W(16)][100.0%][w=1817MiB/s][w=465k IOPS][eta 00m:00s]
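A sweep like the following reproduces this kind of scaling; ioengine, device path and runtime are placeholders, and again the write test is destructive:

    for jobs in 1 2 4 8 16; do
        fio --ioengine=libaio --filename=/dev/nvme0n1 --direct=1 \
            --rw=write --bs=4K --numjobs=$jobs --iodepth=1 --group_reporting \
            --runtime=60 --time_based --name=numjobs_$jobs
    done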

I'm worried that what Kingston writes is the result with multiple jobs or a different iodepth. How many "workers" does Ceph use per OSD? (I'm pretty sure this is the wrong question.)

See the FAQ section of the paper: with 4x OSDs per node, 60 Gb/s could be maxed out. Also, the latency (an important factor) is higher than with 100 GbE cards. I can only recommend the use of 100 GbE if multiple U.2 SSDs are in use. PCIe 3.0 does roughly ~15 GB/s; 4x SSDs at 2.5 GB/s are already using 2/3 of the available PCIe 3.0 bandwidth.

Budget is also an important factor :). Reaching 20/30 Gbps will be a good goal for me.
Before production, I'll run tests with only 2 OSDs per node and 4x 10 Gb NICs.

I am not certain what you mean by cluster RRP, but I assume it's for Corosync. Never run Corosync on shared media. While bandwidth is not an issue, latency certainly will be. Corosync needs low and stable latency.
RRP is the redundant ring protocol for Corosync. I think everything must be redundant in a virtualization environment (servers, switches, power, etc.).
Bonds with LACP should help me with this, but I'll take a look at the latency differences.
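If I follow your advice, the Corosync rings would get their own links instead of riding on the bond, roughly like this in corosync.conf; addresses are examples and only the relevant parts are shown:

    totem {
        ...
        interface {
            linknumber: 0
        }
        interface {
            linknumber: 1
        }
    }
    nodelist {
        node {
            name: pve1
            nodeid: 1
            quorum_votes: 1
            ring0_addr: 10.10.10.1   # dedicated corosync link
            ring1_addr: 10.10.20.1   # second, independent link
        }
        ...
    }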
 
Another question: on page 4 you say:
The Rados benchmark shows the read/write bandwidth of three rados bench clients, running simultaneously.
So are the values added together for the bars? Or do you mean the bench is running on all three nodes and each gives the same (displayed) values?

Thanks.
 
We just uploaded a new PDF version, containing some minor fixes.

Changes in rev1:
  • page 3, fio command had a long dash instead of a short one (--)
  • page 11, summary was a duplicate of page 10
  • page 14 & 15, subtitle had the word 'random', but the graphs are not only showing random IO
  • page 20, fio command had a long dash instead of a short one
 
And for encryption, the Microns are faster than the aes-xts engine (with that CPU version).
@Alwin, when you use encryption on the Microns, you are talking about the BIOS user-password-activated encryption, correct?

And the aes-xts engine you refer to is the "Encrypt OSD" checkbox / "ceph osd create --encrypted" CLI switch, correct?
 
@Alwin, when you use encryption on the Microns, you are talking about the BIOS user-password-activated encryption, correct?

And the aes-xts engine you refer to is the "Encrypt OSD" checkbox / "ceph osd create --encrypted" CLI switch, correct?
No, in both cases I meant the OSD encryption provided by Ceph. Since the EPYC used in our tests only has 16 cores, it could well be a limitation of the CPU.
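Concretely, it is the dm-crypt layer that ceph-volume sets up underneath the OSD. On Proxmox VE it can be enabled at OSD creation time, for example (the device path is a placeholder):

    # Proxmox VE CLI, same as ticking 'Encrypt OSD' in the GUI
    pveceph osd create /dev/nvme0n1 --encrypted 1

    # the equivalent with plain Ceph tooling
    ceph-volume lvm create --data /dev/nvme0n1 --dmcrypt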
 
