Proxmox VE Ceph Benchmark 2023/12 - Fast SSDs and network speeds in a Proxmox VE Ceph Reef cluster

martin

Proxmox Staff Member
Current fast SSD disks provide great performance, and fast network cards are becoming more affordable. Hence, this is a good point to reevaluate how quickly different network setups for Ceph can be saturated depending on how many OSDs are present in each node.

This benchmark presents possible setups and their performance outcomes, with the intention of supporting Proxmox VE users in making better hardware purchasing decisions.

The hardware used for the benchmarks was a Proxmox VE Ceph HCI (RI2112) 3-node cluster assembled by Thomas Krenn, a leading European manufacturer of customized server and storage systems.

Download PDF
Proxmox VE Ceph Benchmark 2023/12

Benchmarks from 2020
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2020-09-hyper-converged-with-nvme.76516/

Benchmarks from 2018
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

__________________
Best regards,

Martin Maurer
Proxmox VE project leader
 
Thanks for your trust and your work with Proxmox VE and Proxmox Backup Server :)
Nice paper, it's good to have that many comparisons regarding the network in Ceph.
 
Thanks for this. Can you test if you see improvements when you run four OSDs on a single disk?
 
Thanks for this. Any chance you'll publish the BIOS tweaks like you did for the ZFS paper?
 
I would love to see more latency numbers in the benchmark.
 
Thanks for this. Can you test if you see improvements when you run four OSDs on a single disk?
Will do that in the next few days.

Update 2023-12-21: Did a benchmark run with 4 OSDs on one NVMe with the 100 Gbit FRR network variant. The write results are in the same range of ~3100 MiB/s.
Reading does improve though: we measured just a little over 7000 MiB/s.
Keep in mind, though, that if you decide on such a setup, the additional OSD services will consume additional CPU and memory resources!

Thanks for this. Any chance you'll publish the BIOS tweaks like you did for the ZFS paper?
We didn't do much to the default BIOS settings. But I can document them if there is enough interest.

I would love to see more latency numbers in the benchmark.
I assume you mean the per-second latencies seen during the benchmarks? Let me see how we can integrate that data into a new revision of the paper.
 
What I'm always missing are some not-recommended disks: all the people who run consumer SSDs, QLC SSDs, or SMR HDDs and wonder why the performance is terrible.
It would be great to also include those cases as a bad example, so people see what they will get when cheaping out on storage.
 
Is it possible to show CPU and possibly memory usage on the other Ceph nodes during the tests? It would be interesting to know the overhead for hyper-converged deployments.
 
What I'm always missing are some not-recommended disks: all the people who run consumer SSDs, QLC SSDs, or SMR HDDs and wonder why the performance is terrible.
It would be great to also include those cases as a bad example, so people see what they will get when cheaping out on storage.
I do understand the wish for that, but currently we do not plan to spend resources on bad test results. We would rather focus on showcasing what can be achieved with good hardware.

Is it possible to show CPU and possibly memory usage on the other Ceph nodes during the tests? It would be interesting to know the overhead for hyper-converged deployments.
We did not collect that information initially. I could offer to run a single benchmark and monitor the performance on the other nodes and upload the log graphs. I would run a benchmark with 1 OSD/Node and 4 OSDs/Node.
 
Hey @aaron, would it be possible to test RDMA? This is a super popular and controversial topic that can be found in so many different forums, but without any meaningful benchmarks. Would it be possible for you to benchmark RDMA in mesh mode and just briefly report whether it is worth it or not?
 
Is it possible to show CPU and possibly memory usage on the other Ceph nodes during the tests? It would be interesting to know the overhead for hyper-converged deployments.
I got around to running the benchmarks again. Once with 1 OSD / node, and a second time with 4 OSDs / node on the 100 Gbit FRR network. The system performance data was gathered with Netdata and then exported to a CSV file.

Keep in mind that the nodes have 64 threads when interpreting the overall CPU usage.

The resulting plots:
ceph-bench-2023-system-usage-single-osd.png
ceph-bench-2023-system-usage-4-osds.png
The spikes in CPU usage after each test run are when the pool is deleted and the mclock profile gets set to high_recovery_ops to clean up the old data quickly.
The memory graph in the 4 OSDs/node setup rises overall. I did not check it, but I assume that the ZFS ARC is growing as well, since the system uses ZFS for the OS. Because of this, take the memory usage in the first runs with a grain of salt!
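If you want to recreate such plots from your own Netdata CSV export, a minimal sketch could look like the one below. This is not the exact script used here; the file name and the column names ("time", "cpu_percent", "ram_used_percent") are assumptions, so adapt them to whatever your export actually contains.

# Minimal sketch: plot CPU and memory usage from a Netdata CSV export.
# File name and column names are assumptions, not the exact export used here.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("ceph-bench-system-usage.csv", parse_dates=["time"])
fig, (ax_cpu, ax_mem) = plt.subplots(2, 1, sharex=True, figsize=(10, 6))
ax_cpu.plot(df["time"], df["cpu_percent"])
ax_cpu.set_ylabel("CPU usage [%] (all threads = 100%)")
ax_mem.plot(df["time"], df["ram_used_percent"])
ax_mem.set_ylabel("RAM used [%]")
ax_mem.set_xlabel("time")
fig.tight_layout()
fig.savefig("ceph-bench-system-usage.png")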
 
Hey @aaron, would it be possible to test RDMA? This is a super popular and controversial topic that can be found in so many different forums, but without any meaningful benchmarks. Would it be possible for you to benchmark RDMA in mesh mode and just briefly report whether it is worth it or not?
I noted it down in my list of ideas for benchmarks :)
 
Keep in mind that the nodes have 64 threads when interpreting the overall CPU usage.

So in these charts, 100% would mean all CPU cores and threads are being used? So for the 4 OSD test it looks like it used about 10% CPU, or a little over 6 threads.

Thanks! Really appreciate the data.
 
So in these charts, 100% would mean all CPU cores and threads are being used? So for the 4 OSD test it looks like it used about 10% CPU, or a little over 6 threads.

Thanks! Really appreciate the data.
Yep. If I had fetched the data per thread, we would have seen which ones get highly utilized. It also doesn't seem to scale exactly linearly when you compare 4 OSDs/node with 1 OSD/node.
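As a back-of-the-envelope check, assuming 100% in the charts means all 64 hardware threads of a node are busy, the conversion is simply:

threads_total = 64      # hardware threads per node in this cluster
overall_usage = 0.10    # roughly 10% overall CPU during the 4 OSDs/node run
print(overall_usage * threads_total)  # -> 6.4, i.e. "a little over 6 threads"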
 
Nice benchmark report. One question... Any benchmarks on random writes with small block sizes, as in the 2020 paper? That tends to be where a traditional SAN outperforms SDS such as Ceph, and it is important for transactional database workloads; I didn't see any info for the 2023 run.
 
Thanks for the paper, I appreciate it. A couple of questions, if you allow:

1. Can you give more insight into the network config? I guess you used the mentioned NICs only for the Ceph traffic. Please describe your network config in more detail (as you have multiple NICs).
2. Did you switch off logging, for example, or do any special performance tweaks? (I read a while ago that logging takes up to 10% of the performance.)
3. Could you do a comparison, speed- and performance-wise, between Proxmox, Xen and ESXi, ideally with the same hardware?
4. Did you fix the PG count? As I understand it, you did not use the auto function. Could you test with "auto PGs" too, just to see the difference between fixed and auto?

Is the size of the SSDs important? For example: does a server with 4x 15 TB NVMe and 4 OSDs perform worse, better, or about the same as with 1.6 TB drives?
 
1. Can you give more insight into the network config? I guess you used the mentioned NICs only for the Ceph traffic. Please describe your network config in more detail (as you have multiple NICs).
The paper links to the full-mesh variants. They were set up exactly as described in those links, except for the IP addresses.

2. Did you switch off logging, for example, or do any special performance tweaks? (I read a while ago that logging takes up to 10% of the performance.)
As in section 2.4:
No special configurations were applied.
So no, everything was left at the default values you get when you install Ceph via the Proxmox VE tooling.

3. Could you do a comparison, speed- and performance-wise, between Proxmox, Xen and ESXi, ideally with the same hardware?
This is not planned from our side. We are not experts on the other products, so it would not be a fair comparison. Plus, does Xen currently even have an HCI storage option that could use the hardware in the same configuration?

4. Did you fix the PG count? As I understand it, you did not use the auto function. Could you test with "auto PGs" too, just to see the difference between fixed and auto?
The target ratio was set for the pool. Since it is the only pool (besides .mgr, which can be ignored here with its 1 PG), the autoscaler assigned the pool the correct number of PGs for the number of OSDs. If you already know which PG num is the correct one, you can define it when creating the pool.
By setting any target ratio (a weight) on the pool, and it being the only pool with one, you tell the autoscaler that this pool is expected to take up all the space in the cluster. The autoscaler is just doing what needed to be done manually before it existed: calculating the right number of PGs for the pool(s) according to the space you estimate they will consume in the cluster.
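As a rough illustration only (this is the old manual rule of thumb that the autoscaler replaces, not its actual code), assuming for example 12 OSDs and a replica size of 3:

import math

def suggested_pg_num(num_osds, pool_size, pgs_per_osd=100):
    # Classic rule of thumb: ~100 PGs per OSD, divided by the replica size,
    # rounded to the nearest power of two.
    raw = num_osds * pgs_per_osd / pool_size
    return 2 ** round(math.log2(raw))

print(suggested_pg_num(num_osds=12, pool_size=3))  # -> 512 for this example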

Is the size of the SSDs important? For example: does a server with 4x 15 TB NVMe and 4 OSDs perform worse, better, or about the same as with 1.6 TB drives?
If you have more but smaller resources, the load can be balanced better, and if one fails, less data needs to be recovered. The downside is that each OSD service takes up CPU and memory resources, and you might be limited by how many disks you can physically put into a server. With these and other constraints (budget, ...), you have to weigh what works best for you. There is no easy answer.
 
This is not planned from our side. We are not experts on the other products, so it would not be a fair comparison. Plus, does Xen currently even have an HCI storage option that could use the hardware in the same configuration?

For Xen (which is called XCP-ng these days), it's called XOSTOR.

You could buy those resources from outside; a pretty standard install should be done in 3 hours max. The point is that many people consider a switch based on performance. This is a very strong argument and you should answer it. I am pretty sure your customer numbers would increase.

A performance guide only has value (at least I think so) if you have, under the same circumstances, a comparison between different products. There are some comparisons that state Proxmox is up to 30% faster than ESXi, but they did not test with Ceph, vSAN and XOSTOR (as far as I know; if somebody knows more, please share). I would be pretty interested in this (and I guess a lot of others too). The only drawback: Proxmox is often described as "not an enterprise" solution. This is based on missing enterprise tools like continuous logging (it can be done with some other solutions). Azure is based on KVM, as Proxmox is, and it hosts hundreds of thousands of VMs. My guess: they chose it because of performance.

Anyway: keep up the good work. I love Proxmox and the performance is outstanding.
 
