Proxmox VE Ceph Benchmark 2020/09 - hyper-converged with NVMe

martin · Sep 25, 2020

To optimize performance in hyper-converged deployments with Proxmox VE and Ceph storage, the appropriate hardware setup is essential. This benchmark presents possible setups and their performance outcomes, with the intention of supporting Proxmox users in making better decisions.

Download PDF
Proxmox VE Ceph Benchmark 2020/09

Benchmarks from 2018
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

__________________
Best regards,

Martin Maurer
Proxmox VE project leader

RokaKen · Sep 25, 2020

@martin thanks to you and your staff for producing the report. I just have a couple of questions:

1. I was under the impression that Ceph Octopus had significantly addressed the read performance penalty with "writeback" caching, but your results (for Linux VMs) indicate otherwise. Do you have any comment on that? Is there some additional tweaking of rbd_cache_policy and/or rbd_io_scheduler needed, perhaps?

2. The data charts for "MULTI-VM WORKLOAD (LINUX) --> RANDOM IO/S BY NUMBER OF JOBS" on page 11/22 do not reflect the statistics in the SUMMARY. Was this a cut/paste error (from page 10/22) maybe?

Alwin · Sep 25, 2020

RokaKen said:
2. The data charts for "MULTI-VM WORKLOAD (LINUX) --> RANDOM IO/S BY NUMBER OF JOBS" on page 11/22 do not reflect the statistics in the SUMMARY. Was this a cut/paste error (from page 10/22) maybe?

I noticed that as well. My copy & pasta.

But the charts are at least obvious, nothing special there.

RokaKen said:
1. I was under the impression that Ceph Octopus had significantly addressed the read performance penalty with "writeback" caching, but your results (for Linux VMs) indicate otherwise. Do you have any comment on that? Is there some additional tweaking of rbd_cache_policy and/or rbd_io_scheduler needed, perhaps?

It's the way I conducted the tests. As seen in appendix 5.3, I used fio with a 9 GB size to read/write. The target device was an LV on the OS disk. With the default 32 MiB rbd cache size, there is not much that can be kept in memory. You can see the difference on page 14. There I added results from the run using a 128 MiB rbd cache size.

But in general there are still things to examine.

EDIT: We uploaded a new PDF version, containing some fixes.
Changelog: https://forum.proxmox.com/threads/p...9-hyper-converged-with-nvme.76516/post-342581

So the last graphs changed, since I accidentally included the wrong numbers for 128 MiB librbd caching. Therefore my above statement is not correct.

dimi · Sep 28, 2020

Dear Proxmox Team,
thanks for the great work. I'm interested in the availability of the "stable" version of Ceph Octopus. Do you have an ETA? I saw the packages in the test repo, but I prefer to upgrade once it is in "stable".
Thanks

Alwin · Sep 28, 2020

dimi said:
thanks for the great work. I'm interested in the availability of the "stable" version of Ceph Octopus. Do you have an ETA? I saw the packages in the test repo, but I prefer to upgrade once it is in "stable".

Ceph Octopus on its own is as stable as it can get from upstream. We will push Octopus to main once we believe the integration in Proxmox VE has all the features you have come to expect and no major bugs. In short, no ETA yet.

dimi · Sep 28, 2020

OK, thanks so much, I will wait.

nivadis · Sep 29, 2020

Can you please explain in a little more detail why you changed the BIOS values on page 19?
Thanks.

Alwin · Sep 29, 2020

nivadis said:
Can you please explain in a little more detail why you changed the BIOS values on page 19?

For stable performance. See also the AMD Epyc tuning guides; they describe the BIOS settings in more detail.
https://developer.amd.com/resources/epyc-resources/epyc-tuning-guides/

jsterr · Sep 30, 2020

Nice, we are currently building a system based on AMD Epyc and U.2 NVMe drives. I will definitely use this paper for inspiration, thanks!

Have you thought about using 2 namespaces for each NVMe, like nvme0n1 and nvme0n2 (same NVMe but two namespaces), so you get two OSD disks for each NVMe? There is some discussion on the web that this should increase performance.

Alwin · Sep 30, 2020

jsterr said:
Have you thought about using 2 namespaces for each NVMe, like nvme0n1 and nvme0n2 (same NVMe but two namespaces), so you get two OSD disks for each NVMe? There is some discussion on the web that this should increase performance.

This did not yield any benefit on that system. In atop you could observe that the write performance was divided among the namespaces. And for encryption, the Microns are faster than the aes-xts engine (with that CPU version). The rados bench tests maxed out at ~2.8 GB/s with any number of namespaces. I suspect it is the way the engine works on Epyc vs Xeon.

Rainerle · Oct 2, 2020

Thanks for this second benchmark! It gives a clear impression of what should be achievable with current hardware.

I am currently trying to run that exact benchmark setup on a three node cluster and have problems running three rados bench clients simultaneously.

Can you adjust the PDF and give instructions on how to run the three clients in parallel and how you combine the resulting numbers please?

An even better way would be to have standard Proxmox script(s) within the Proxmox distribution to run these benchmarks and get comparable numbers out.

latosec · Oct 3, 2020

I'm building a similar cluster but with this hardware:
-6x Dell PowerEdge R6515 with AMD EPYC 7402P 24-core processor
-1 BOSS card with 2x M.2 240 GB in RAID 1 for the OS per host
-4x Kingston DC1000M U.2 7.68 TB disks per host
-8x Kingston KTD-PE432/64G RAM, 3200 MHz
-6x 10 Gbps SFP+ per host (1 Broadcom Adv. Dual 10G SFP+ Ethernet and 1 Intel XL710-BM1 Quad-Port 10G SFP+ PCIe)
-2x 1 Gbps per host (Broadcom dual-port Gigabit Ethernet BCM5720)
-4 Cisco SG550XG-24F switches (1 dual stack isolated for Ceph/cluster and 1 dual stack for VM traffic)

If Martin or Alwin authorize me, I can "replicate" your document Proxmox VE Ceph Benchmark 2020/09 with my hardware, but I have 2 questions:

Assuming you have four 10 Gbps network cards available for Ceph and cluster, which of the following is the best solution?
1) 1 nic for cluster, 1 nic for cluster rrp, 1 bond (2 nic) for ceph public and cluster
2) 1 "big bond" with 4 nic to have 4 vmbr (cluster, cluster rrp, ceph public and ceph cluster)

Assuming this solution must go into production within 30 days at most: Ceph Nautilus or Octopus?

In the attachment you can see the fio benchmark from my system for one Kingston DC1000M U.2 7.68 TB.

Thank you for your work

Alwin · Oct 4, 2020

@latosec, there seems to be a space missing in --bs=4K--numjobs=1. And try a run with the psync engine, to see if it yields different results.

latosec said:
Kingston DC1000M U.2 7.68 TB

Kingston writes that the 7.68 TB model should do 210,000 IO/s, ~4x more than your fio result. So they run fio in a different manner. But try to run it for 600 sec to see if the ~58k IO/s are sustained during longer runs; that will show whether cache effects are involved.

latosec said:
-6x 10 Gbps SFP+ per host (1 Broadcom Adv. Dual 10G SFP+ Ethernet and 1 Intel XL710-BM1 Quad-Port 10G SFP+ PCIe)

See the FAQ section of the paper: with 4x OSDs per node, 60 Gb/s could be maxed out. Also, the latency (an important factor) is higher than with 100 GbE cards. I can only recommend the use of 100 GbE if multiple U.2 SSDs are in use. PCIe 3.0 does roughly 15 GB/s; 4x SSDs at 2.5 GB/s each already use 2/3 of the available PCIe 3.0 bandwidth.
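The arithmetic behind these figures can be written out as a short Python sketch. It uses only the round numbers quoted in this post (not measurements) and ignores protocol overhead and Ceph replication traffic, so treat it as a rough sanity check rather than a sizing tool.

```python
# Rough check of the bandwidth figures quoted in the post above.
# All inputs are the round numbers from the thread, not measurements.
BITS_PER_BYTE = 8

ssd_count = 4
ssd_gbyte_per_s = 2.5      # per SSD
pcie3_gbyte_per_s = 15.0   # "roughly" figure for PCIe 3.0 from the post
nic_count = 6
nic_gbit_per_s = 10        # per SFP+ port

# Combined SSD throughput and its share of the PCIe 3.0 budget.
ssd_total_gbyte_per_s = ssd_count * ssd_gbyte_per_s
pcie_share = ssd_total_gbyte_per_s / pcie3_gbyte_per_s

# Aggregate line rate of all 10 GbE ports, converted to GB/s.
network_gbit_per_s = nic_count * nic_gbit_per_s
network_gbyte_per_s = network_gbit_per_s / BITS_PER_BYTE

print(f"SSDs combined:   {ssd_total_gbyte_per_s:.1f} GB/s")
print(f"Share of PCIe:   {pcie_share:.0%}")
print(f"6x 10 GbE total: {network_gbit_per_s} Gb/s = {network_gbyte_per_s:.1f} GB/s")
```

With these inputs, the four SSDs together (10 GB/s) can deliver more than six 10 GbE ports can carry at line rate (7.5 GB/s), which is why 60 Gb/s could be maxed out.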

latosec said:
If Martin or Alwin authorize me, I can "replicate" your document Proxmox VE Ceph Benchmark 2020/09 with my hardware, but I have 2 questions:

Not the document itself. But we are happy to hear about your test results. Best post them here for comparison and discussion.

latosec said:
1) 1 nic for cluster, 1 nic for cluster rrp, 1 bond (2 nic) for ceph public and cluster
2) 1 "big bond" with 4 nic to have 4 vmbr (cluster, cluster rrp, ceph public and ceph cluster)

I am not certain what you mean by cluster rrp, but I assume it's for corosync. Never run corosync on a shared medium. While bandwidth is not an issue, latency certainly will be. Corosync needs low and stable latency.

latosec · Oct 5, 2020

Before any other test, I read your benchmark again to optimize the BIOS settings.

Alwin said:
There seems to be a space missing in --bs=4K--numjobs=1. And try a run with the psync engine, to see if it yields different results.

Copy/paste error from PuTTY to Notepad.
With psync, the results are slightly lower (by less than 5%).

Alwin said:
Kingston writes that the 7.68 TB model should do 210,000 IO/s, ~4x more than your fio result. So they run fio in a different manner. But try to run it for 600 sec to see if the ~58k IO/s are sustained during longer runs; that will show whether cache effects are involved.

No cache effect with 600 seconds; bandwidth and IOPS are stable from start to end of the fio test.
I noticed that bandwidth and IOPS grow when I increase --numjobs=*:

Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=260MiB/s][w=66.4k IOPS][eta 00m:00s]
Starting 2 processes
Jobs: 2 (f=2): [W(2)][100.0%][w=470MiB/s][w=120k IOPS][eta 00m:00s]
Starting 4 processes
Jobs: 4 (f=4): [W(4)][100.0%][w=859MiB/s][w=220k IOPS][eta 00m:00s]
Starting 8 processes
Jobs: 8 (f=8): [W(8)][100.0%][w=1444MiB/s][w=370k IOPS][eta 00m:00s]
Starting 16 processes
Jobs: 16 (f=16): [W(16)][100.0%][w=1817MiB/s][w=465k IOPS][eta 00m:00s]

I'm worried that what Kingston writes is the result with multiple jobs or a different iodepth. How many "workers" does Ceph use per OSD? (I'm pretty sure this is the wrong question.)
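To make the scaling in the fio status lines above easier to read, here is a minimal Python sketch that parses those lines and reports IOPS per job and how far each run is from linear scaling. The regular expression only targets the write-only status format shown in this post; the function and variable names are illustrative.

```python
import re

# Status lines copied from the fio runs in this post.
FIO_STATUS_LINES = """\
Jobs: 1 (f=1): [W(1)][100.0%][w=260MiB/s][w=66.4k IOPS][eta 00m:00s]
Jobs: 2 (f=2): [W(2)][100.0%][w=470MiB/s][w=120k IOPS][eta 00m:00s]
Jobs: 4 (f=4): [W(4)][100.0%][w=859MiB/s][w=220k IOPS][eta 00m:00s]
Jobs: 8 (f=8): [W(8)][100.0%][w=1444MiB/s][w=370k IOPS][eta 00m:00s]
Jobs: 16 (f=16): [W(16)][100.0%][w=1817MiB/s][w=465k IOPS][eta 00m:00s]
"""

# Captures: number of jobs, write bandwidth in MiB/s, write IOPS in thousands.
STATUS_RE = re.compile(r"Jobs: (\d+) .*\[w=([\d.]+)MiB/s\]\[w=([\d.]+)k IOPS\]")


def parse_status(text):
    """Return a (jobs, MiB/s, k IOPS) tuple for each matching status line."""
    rows = []
    for line in text.splitlines():
        match = STATUS_RE.search(line)
        if match:
            rows.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
    return rows


rows = parse_status(FIO_STATUS_LINES)
base_kiops = rows[0][2]  # the single-job run serves as the baseline

for jobs, mib_s, kiops in rows:
    per_job = kiops / jobs
    efficiency = kiops / (jobs * base_kiops)
    print(f"{jobs:2d} jobs: {kiops:6.1f}k IOPS total, "
          f"{per_job:5.1f}k per job, {efficiency:.0%} of linear scaling")
```

Total IOPS keep rising up to 16 jobs, while IOPS per job fall from 66.4k to about 29k. The 4-job result (220k) is in the range of the 210,000 IO/s quoted for the drive, which would fit a vendor figure measured with more parallelism, though the thread does not confirm how Kingston tests.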

Alwin said:
See the FAQ section of the paper: with 4x OSDs per node, 60 Gb/s could be maxed out. Also, the latency (an important factor) is higher than with 100 GbE cards. I can only recommend the use of 100 GbE if multiple U.2 SSDs are in use. PCIe 3.0 does roughly 15 GB/s; 4x SSDs at 2.5 GB/s each already use 2/3 of the available PCIe 3.0 bandwidth.

Budget is also an important factor. Reaching 20/30 Gbps will be a good goal for me.
Before production, I'll run tests with only 2 OSDs per node and four 10 Gb NICs.

Alwin said:
I am not certain what you mean by cluster rrp, but I assume it's for corosync. Never run corosync on a shared medium. While bandwidth is not an issue, latency certainly will be. Corosync needs low and stable latency.

RRP is the redundant ring protocol for corosync. I think everything must be redundant in a virtualization environment (server, switch, power, etc.).
Bonds with LACP should help me with this, but I'll take a look at the difference in latency.

nivadis · Oct 5, 2020

Another question: on page 4 you say:
The Rados benchmark shows the read/write bandwidth of three rados bench clients, running simultaneously.
So are the values added together for the bars? Or do you mean the bench is running on all three nodes and each gives the same (displayed) values?

Thanks.

Alwin · Oct 6, 2020

nivadis said:
so are the values added together for the bars?

Yes.
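Since the bars are the sum of the three clients, combining the per-client results could be sketched in Python as follows. This is a minimal sketch, not the method used for the paper: it assumes each client's report contains a summary line of the form "Bandwidth (MB/sec): <value>", the helper names are hypothetical, and the sample reports hold placeholder values rather than measurements.

```python
import re

# Assumed format of the bandwidth summary line in a rados bench report.
BANDWIDTH_RE = re.compile(r"^Bandwidth \(MB/sec\):\s*([\d.]+)", re.MULTILINE)


def client_bandwidth(report):
    """Extract the average bandwidth in MB/s from one client's report."""
    match = BANDWIDTH_RE.search(report)
    if match is None:
        raise ValueError("no 'Bandwidth (MB/sec):' line found")
    return float(match.group(1))


def combined_bandwidth(reports):
    """Sum the average bandwidth over all simultaneously running clients."""
    return sum(client_bandwidth(report) for report in reports)


# Placeholder reports, one per client; the values are made up for illustration.
example_reports = [
    "Total time run: 60.0\nBandwidth (MB/sec):   1000.0\n",
    "Total time run: 60.0\nBandwidth (MB/sec):   1100.0\n",
    "Total time run: 60.0\nBandwidth (MB/sec):   900.0\n",
]

print(f"Combined: {combined_bandwidth(example_reports):.1f} MB/s")
```

The sum is only meaningful if the three clients really ran over the same time window; how to start them in parallel is not described in this thread.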

martin · Oct 7, 2020

We just uploaded a new PDF version, containing some minor fixes.

Changes on rev1

page 3, fio command had long dash instead of short (--)
page 11, summary was a duplicate of page 10
page 14 & 15, subtitle had the word 'random', but graphs aren't only showing random IO
page 20, fio command had a long dash instead of a short

Rainerle · Oct 13, 2020

Hi,

I am currently running a three-node Ceph benchmark here and trying to follow your PDF.

Feedback and recommendations in my thread are welcome...

Rainerle · Oct 14, 2020

Alwin said:
And for encryption, the Microns are faster than the aes-xts engine (with that CPU version).

@Alwin , when you use encryption on the Microns, you are talking about the BIOS user password activated encryption, correct?

And the aes-xts engine you refer to is the "Encrypt OSD"-Checkbox/"ceph osd create --encrypted"-CLI switch, correct?

Alwin · Oct 14, 2020

Rainerle said:
@Alwin , when you use encryption on the Microns, you are talking about the BIOS user password activated encryption, correct?

And the aes-xts engine you refer to is the "Encrypt OSD"-Checkbox/"ceph osd create --encrypted"-CLI switch, correct?

No, in both cases I meant the OSD encryption provided by Ceph. Since the Epyc used in our tests only has 16 cores, it could well be a limitation of the CPU.
