Proxmox VE Ceph Benchmark 2023/12 - Fast SSDs and network speeds in a Proxmox VE Ceph Reef cluster

martin

Proxmox Staff Member
Current fast SSD disks provide great performance, and fast network cards are becoming more affordable. Hence, this is a good point to reevaluate how quickly different network setups for Ceph can be saturated depending on how many OSDs are present in each node.

This benchmark presents possible setups and their performance outcomes, with the intention of supporting Proxmox VE users in making better hardware purchasing decisions.

The hardware used for the benchmarks was a Proxmox VE Ceph HCI (RI2112) 3-node cluster assembled by Thomas Krenn, a leading European manufacturer of customized server and storage systems.

Download PDF
Proxmox VE Ceph Benchmark 2023/12

Benchmarks from 2020
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2020-09-hyper-converged-with-nvme.76516/

Benchmarks from 2018
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

__________________
Best regards,

Martin Maurer
Proxmox VE project leader
 
Thanks for your trust and your work with Proxmox VE and Proxmox Backup Server :)
Nice paper, it's good to have that many comparisons regarding the network in Ceph.
 
Thanks for this. Can you test if you see improvements when you run four OSDs on a single disk?
 
Thanks for this. Any chance you'll publish the BIOS tweaks like you did for the ZFS paper?
 
I would love to see more latency numbers in the benchmark.
 
Thanks for this. Can you test if you see improvements when you run four OSDs on a single disk?
Will do that in the next few days.

Update 2023-12-21: Did a benchmark run with 4 OSDs on one NVMe with the 100 Gbit FRR network variant. The write results are in the same range of ~3100 MiB/s.
Reading does improve though: we measured just a little over 7000 MiB/s.
Keep in mind, though, that if you decide on such a setup, the additional OSD services will consume additional CPU and memory resources!

Thanks for this. Any chance you'll publish the BIOS tweaks like you did for the ZFS paper?
We didn't do much to the default BIOS settings. But I can document them if there is enough interest.

I would love to see more latency numbers in the benchmark.
I assume you mean the per-second latencies seen during the benchmarks? Let me see how we can integrate that data into a new revision of the paper.
 
What I'm always missing are some not-recommended disks: all the people who run consumer SSDs, QLC SSDs, or SMR HDDs and wonder why the performance is terrible.
It would be great to also include those cases as a bad example, so people see what they will get when cheaping out on storage.
 
Is it possible to show CPU and possibly memory usage on the other Ceph nodes during the tests? It would be interesting to know the overhead for hyper-converged deployments.
 
What I'm always missing are some not-recommended disks: all the people who run consumer SSDs, QLC SSDs, or SMR HDDs and wonder why the performance is terrible.
It would be great to also include those cases as a bad example, so people see what they will get when cheaping out on storage.
I do understand the wish for that, but currently we do not plan to spend resources on bad test results. We would rather focus on showcasing what can be achieved with good hardware.

Is it possible to show CPU and possibly memory usage on the other Ceph nodes during the tests? It would be interesting to know the overhead for hyper-converged deployments.
We did not collect that information initially. I could offer to run a single benchmark and monitor the performance on the other nodes and upload the log graphs. I would run a benchmark with 1 OSD/Node and 4 OSDs/Node.
 
Hey @aaron, would it be possible to test RDMA? This is a super popular and controversial topic that can be found in so many different forums, but without any meaningful benchmarks. Would it be possible for you to benchmark RDMA in mesh mode and just briefly report whether it is worth it or not?
 
Is it possible to show CPU and possibly memory usage on the other Ceph nodes during the tests? It would be interesting to know the overhead for hyper-converged deployments.
I got around to running the benchmarks again. Once with 1 OSD / node, and a second time with 4 OSDs / node on the 100 Gbit FRR network. The system performance data was gathered with Netdata and then exported to a CSV file.

Keep in mind that the nodes have 64 threads when interpreting the overall CPU usage.

The resulting plots:
ceph-bench-2023-system-usage-single-osd.png
ceph-bench-2023-system-usage-4-osds.png
The spikes in CPU usage after each test run are when the pool is deleted and the mclock profile gets set to high_recovery_ops to clean up the old data quickly.
The memory graph in the 4 OSDs/node setup rises overall. I did not check it, but I assume that the ZFS ARC is growing as well, since the system uses ZFS for the OS. Because of this, take the memory usage in the first runs with a grain of salt!
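If you want to recreate such plots from your own Netdata CSV export, a minimal sketch could look like the one below. This is not the exact script used here; the file name and the column names ("time", "cpu_percent", "ram_used_percent") are assumptions, so adapt them to whatever your export actually contains.

# Minimal sketch: plot CPU and memory usage from a Netdata CSV export.
# File name and column names are assumptions, not the exact export used here.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("ceph-bench-system-usage.csv", parse_dates=["time"])
fig, (ax_cpu, ax_mem) = plt.subplots(2, 1, sharex=True, figsize=(10, 6))
ax_cpu.plot(df["time"], df["cpu_percent"])
ax_cpu.set_ylabel("CPU usage [%] (all threads = 100%)")
ax_mem.plot(df["time"], df["ram_used_percent"])
ax_mem.set_ylabel("RAM used [%]")
ax_mem.set_xlabel("time")
fig.tight_layout()
fig.savefig("ceph-bench-system-usage.png")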
 
Hey @aaron, would it be possible to test RDMA? This is a super popular and controversial topic that can be found in so many different forums, but without any meaningful benchmarks. Would it be possible for you to benchmark RDMA in mesh mode and just briefly report whether it is worth it or not?
I noted it down in my list of ideas for benchmarks :)
 
Keep in mind that the nodes have 64 threads when interpreting the overall CPU usage.

So in these charts, 100% would mean all CPU cores and threads are being used? So for the 4 OSD test it looks like it used about 10% CPU, or a little over 6 threads.

Thanks! Really appreciate the data.
 
So in these charts, 100% would mean all CPU cores and threads are being used? So for the 4 OSD test it looks like it used about 10% CPU, or a little over 6 threads.

Thanks! Really appreciate the data.
Yep. If I had fetched the data per thread, we would have seen which ones get highly utilized. It also doesn't seem to scale exactly linearly when you compare 4 OSDs/node with 1 OSD/node.
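As a back-of-the-envelope check, assuming 100% in the charts means all 64 hardware threads of a node are busy, the conversion is simply:

threads_total = 64      # hardware threads per node in this cluster
overall_usage = 0.10    # roughly 10% overall CPU during the 4 OSDs/node run
print(overall_usage * threads_total)  # -> 6.4, i.e. "a little over 6 threads"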
 
Nice benchmark report. One question... Any benchmarks on random writes with small block sizes, as in the 2020 paper? That tends to be where a traditional SAN outperforms SDS such as Ceph, and it is important for transactional database workloads; I didn't see any info for the 2023 run.
 
Thanks for the paper, I appreciate it. A couple of questions, if you allow:

1. Can you give more insight into the network config? I guess you used the mentioned NICs only for the Ceph traffic. Please describe your network config in more detail (as you have multiple NICs).
2. Did you switch off logging, for example, or do any special performance tweaks? (I read a while ago that logging takes up to 10% of the performance.)
3. Could you do a comparison, speed- and performance-wise, between Proxmox, Xen and ESXi, ideally with the same hardware?
4. Did you fix the PG count? As I understand it, you did not use the auto function. Could you test with "auto PGs" too, just to see the difference between fixed and auto?

Is the size of the SSDs important? For example: does a server with 4x 15 TB NVMe and 4 OSDs perform worse, better, or about the same as with 1.6 TB drives?
 
1. Can you give more insight into the network config? I guess you used the mentioned NICs only for the Ceph traffic. Please describe your network config in more detail (as you have multiple NICs).
The paper links to the full-mesh variants. They were set up exactly as described in those links, except for the IP addresses.

2. Did you switch off logging, for example, or do any special performance tweaks? (I read a while ago that logging takes up to 10% of the performance.)
As in section 2.4:
No special configurations were applied.
So no, everything was left at the default values you get when you install Ceph via the Proxmox VE tooling.

3. Could you do a comparison, speed- and performance-wise, between Proxmox, Xen and ESXi, ideally with the same hardware?
This is not planned from our side. We are not experts on the other products, so it would not be a fair comparison. Plus, does Xen currently even have an HCI storage option that could use the hardware in the same configuration?

4. Did you fix the PG count? As I understand it, you did not use the auto function. Could you test with "auto PGs" too, just to see the difference between fixed and auto?
The target ratio was set for the pool. Since it is the only pool (besides .mgr, which can be ignored here with its 1 PG), the autoscaler assigned the pool the correct number of PGs for the number of OSDs. If you already know which PG num is the correct one, you can define it when creating the pool.
By setting any target ratio (a weight) on the pool, and it being the only pool with one, you tell the autoscaler that this pool is expected to take up all the space in the cluster. The autoscaler is just doing what needed to be done manually before it existed: calculating the right number of PGs for the pool(s) according to the space you estimate they will consume in the cluster.
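As a rough illustration only (this is the old manual rule of thumb that the autoscaler replaces, not its actual code), assuming for example 12 OSDs and a replica size of 3:

import math

def suggested_pg_num(num_osds, pool_size, pgs_per_osd=100):
    # Classic rule of thumb: ~100 PGs per OSD, divided by the replica size,
    # rounded to the nearest power of two.
    raw = num_osds * pgs_per_osd / pool_size
    return 2 ** round(math.log2(raw))

print(suggested_pg_num(num_osds=12, pool_size=3))  # -> 512 for this example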

Is the size of the SSDs important? For example: does a server with 4x 15 TB NVMe and 4 OSDs perform worse, better, or about the same as with 1.6 TB drives?
If you have more but smaller resources, the load can be balanced better, and if one fails, less data needs to be recovered. The downside is that each OSD service takes up CPU and memory resources, and you might be limited by how many disks you can physically put into a server. With these and other constraints (budget, ...), you have to weigh what works best for you. There is no easy answer.
 
This is not planned from our side. We are not experts on the other products, so it would not be a fair comparison. Plus, does Xen currently even have an HCI storage option that could use the hardware in the same configuration?

For Xen (which is called XCP-ng these days), it's called XOSTOR.

You could buy those resources from outside; a pretty standard install should be done in 3 hours max. The point is that many people consider a switch based on performance. This is a very strong argument and you should answer it. I am pretty sure your customer numbers would increase.

A performance guide only has value (at least I think so) if you have, under the same circumstances, a comparison between different products. There are some comparisons that state Proxmox is up to 30% faster than ESXi, but they did not test with Ceph, vSAN and XOSTOR (as far as I know; if somebody knows more, please share). I would be pretty interested in this (and I guess a lot of others too). The only drawback: Proxmox is often described as "not an enterprise" solution. This is based on missing enterprise tools like continuous logging (it can be done with some other solutions). Azure is based on KVM, as Proxmox is, and it hosts hundreds of thousands of VMs. My guess: they chose it because of performance.

Anyway: keep up the good work. I love Proxmox and the performance is outstanding.
 
