Proxmox VE Ceph Benchmark 2018/02

Hi,
you read the data you wrote before (from this node) to the pool - once all available data has been read, the benchmark stops.
Since reading is faster than writing, the job is done in 32 seconds.
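For anyone reproducing this, a rough sketch of the write-then-read sequence with rados bench (the pool name "testpool" is just a placeholder); the seq phase ends as soon as every object written in the first step has been read back:

Code:
# write for 60s with 4 MiB objects and 16 threads, keep the objects for the read test
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup
# sequential read test; it stops early once all previously written objects have been read
rados bench -p testpool 60 seq -t 16
# remove the benchmark objects afterwards
rados -p testpool cleanup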

Udo

Got it.

Is there any way to figure out what the bottleneck is in the above (e.g. network, storage drives, RAM, etc.)? Or whether we've hit some hard limitation in Ceph at this scale?
 
You reached your network limit; compare the results with our benchmark paper. To really get the IO/s out of your NVMe drives, you should consider upgrading to 40GbE or even 100GbE (3 nodes, no switch needed).
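A quick way to confirm that the network really is the ceiling (the hostname is a placeholder) is a plain iperf3 run between two Ceph nodes, comparing the reported throughput with the rados bench results:

Code:
# on the receiving node
iperf3 -s
# on the sending node, 30 seconds with 4 parallel streams
iperf3 -c node2 -t 30 -P 4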

Any ideas on why the original transfer from LVM to Ceph stalled at 371 MiB/s?
Possibly due to the read limitation of your LVM storage, but this is just a shot in the dark.
 
To optimize performance in hyper-converged deployments with Proxmox VE and Ceph storage, the hardware setup is an important factor.

This Ceph benchmark shows some examples. We are also curious to hear about your setups and performance outcomes, so please post and discuss them here.

Download PDF
Proxmox VE Ceph Benchmark 2018/02
__________________
Best regards,

Martin Maurer
Proxmox VE project leader

Did anybody compare the SM883 with the SM863?
Seems like the SM863 is not available on the market anymore!

I guess performance is approximately the same because it is just a newer model?

Thank you in advance

BR
 
And what about the Samsung PM883 - any experience with this one?

regards
Ronny
 
Here are the results of my last benchmarks:

Code:
Model             Size     TBW      BW           IOPS
Intel DC S4500    480GB    900TB    62.4 MB/s    15.0k
Samsung PM883     240GB    341TB    67.2 MB/s    17.2k
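For context, single-disk numbers like these are usually taken with a 4k sync write at queue depth 1 directly on the raw device, similar to the fio job from the benchmark paper; this is only a sketch, /dev/sdX is a placeholder, and the test destroys the data on that device:

Code:
fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --name=ssd-test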
 
Is the following limitation of Ceph true:
~10k IOPS per OSD?

I've got 15k-16k random write IOPS with 16 threads and 4k blocks per BlueStore OSD on a 40G IB net, but that is a good result; 10k per OSD is not bad. With io-thread=1 I can get only 1400-1600 IOPS of random writes.
The problem here is the OSD code latency and the write amplification produced on every operation. The OSD itself can take 700 μs (0.7 ms) just to execute one IO operation, so even on a RAM disk, where kernel IO operations complete in <10 μs, you can barely reach 3k IOPS with the best high-frequency CPU.

P.S. Random read is about 35-50k IOPS on the same system, but all of that is just a measure of OSD performance (the data was read from the OSD cache, so no disk IO was done during the test).
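A hedged sketch of this kind of 4k random-write test, using fio's rbd engine against a test image (pool, image, and client names are placeholders):

Code:
fio --ioengine=rbd --clientname=admin --pool=testpool --rbdname=bench-img \
    --rw=randwrite --bs=4k --iodepth=16 --numjobs=1 \
    --direct=1 --runtime=60 --time_based --name=rbd-randwrite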
 
I got lower than expected results from the SM883 (I thought they would be at least as good as the SM863), so I ended up using the PM963 and even more so the PM983. I can't really tell whether it's NVMe or the SSDs themselves; these are also considered read-intensive disks, but I get much better performance and lower latency compared to the SM883. I have not studied it in depth, but I suppose the SM883 and SM863 are MLC and the PM983 is TLC; nevertheless, they work much better for me with Ceph. I am using the 1TB models with one OSD on each, no separate DB or WAL.
/Hans

 
I've got 15k-16k random write IOPS with 16 threads and 4k blocks per BlueStore OSD on a 40G IB net, but that is a good result; 10k per OSD is not bad. With io-thread=1 I can get only 1400-1600 IOPS of random writes.
The problem here is the OSD code latency and the write amplification produced on every operation. The OSD itself can take 700 μs (0.7 ms) just to execute one IO operation, so even on a RAM disk, where kernel IO operations complete in <10 μs, you can barely reach 3k IOPS with the best high-frequency CPU.

P.S. Random read is about 35-50k IOPS on the same system, but all of that is just a measure of OSD performance (the data was read from the OSD cache, so no disk IO was done during the test).

Thanks, but no, this is not good... Currently the only way to maximize utilization of an NVMe device (for example Optane, with ~500k random write IOPS at 4k blocks) is to split it into more OSDs - 10-50 OSDs per device. And of course we need to wait for the io_uring implementation: https://github.com/ceph/ceph/pull/27392
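If you go the multiple-OSDs-per-NVMe route, newer ceph-volume releases can carve a device into several OSDs in one step; a sketch (the device path is a placeholder and the exact syntax depends on the Ceph release):

Code:
# create 4 BlueStore OSDs on a single NVMe device
ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1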
 
And Proxmox needs OVS-DPDK too.
Software built with DPDK in mind needs to be recompiled for the software version used and is therefore not a stock solution.
 
Software built with DPDK in mind needs to be recompiled for the software version used and is therefore not a stock solution.
https://software.intel.com/en-us/articles/open-vswitch-with-dpdk-overview
[Image: performance comparison of native Open vSwitch (OVS) and OVS with the Data Plane Development Kit (DPDK)]
 
Hello,

I am a little confused by the IOPS results in the benchmark PDF: around 200 write IOPS on a 10Gb net and around 300 IOPS on a 100Gb/s net?

Then in this thread people talk about reaching thousands of write IOPS on Ceph - what am I missing?

If I have 30 LXC containers that work with a large number of small files, should I rather consider local SSDs instead of hyper-converged Ceph?

Thank you.
P.
 
I am a little confused by the IOPS results in the benchmark PDF: around 200 write IOPS on a 10Gb net and around 300 IOPS on a 100Gb/s net?
The IOPS are for 4 MB objects and a runtime of 60 sec.

Then in this thread people talk about reaching thousands of write IOPS on Ceph - what am I missing?
The rados bench (in these tests) is a single client with 16 threads, while a cluster usually has multiple concurrent clients accessing Ceph.
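To get closer to the aggregate cluster performance, the same bench can be started from several nodes at once; a rough sketch (pool and run names are placeholders, --run-name keeps the clients' objects apart):

Code:
# start one of these on each node in parallel
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup --run-name bench_node1
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup --run-name bench_node2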

If I have 30 LXC containers that work with a large number of small files, should I rather consider local SSDs instead of hyper-converged Ceph?
Possibly; this depends on the IOPS and latency requirements of the workload. For Ceph, the containers use KRBD and have the page cache available to them. This can greatly increase performance. But all in all, you need to test and see which solution works better for the workload.
 
Hi all,
we have been benchmarking CephFS with the following configuration:
4 x Dell PowerEdge R730xd, each with 2 x CPU, 96GB RAM, and 12 x 4TB HDD, on retail 10Gbps networking.

Then we built a container (*) with CentOS and ran our benchmarks on its local disk, which was a virtual disk within CephFS, and we already got impressive results (we are already outperforming Nutanix for files up to 8MB, and are three times faster than a classic VMware 6.5 setup with a Dell/EMC SAN).

We are now procuring SSDs to replace all 48 of our HDDs; we will then run our final tests and publish the results here.

Our goal is to reach at least 600MB/s sustained writes for files up to 1TB, to outperform our GPFS systems.

Question I have is:
We would like to mount CephFS (using the kernel module) within our CentOS 6.10 container and then run the benchmark on that mounted CephFS.
Could you please point us to the documentation describing how to mount CephFS from Proxmox in a CentOS machine?

Thank you
Rosario

(*) Side note: after restarting the cluster a couple of times, we are no longer able to access the console of the container. Any suggestions?
 
We would like to mount CephFS (using the kernel module) within our CentOS 6.10 container and then run the benchmark on that mounted CephFS.
Could you please point us to the documentation describing how to mount CephFS from Proxmox in a CentOS machine?
I don't think a direct mount into a container will work, as the mount happens more than once. Better to mount the CephFS on the node and use a mount point pointing to the CephFS storage.
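A minimal sketch of that approach, assuming the CephFS storage is already mounted on the node under /mnt/pve/cephfs; container ID and paths are placeholders:

Code:
# bind-mount a CephFS subdirectory into container 101
pct set 101 -mp0 /mnt/pve/cephfs/data,mp=/mnt/data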

(*) Side note: after restarting the cluster a couple of times, we are no longer able to access the console of the container. Any suggestions?
Better to open up a new thread for this; it makes it more visible.
 
I don't think a direct mount into a container will work, as the mount happens more than once. Better to mount the CephFS on the node and use a mount point pointing to the CephFS storage.

Hi Alwin, thanks for your prompt reply. Could you please elaborate a bit more on this? In our future scenario we would like to have CephFS mounted by several containers, probably running on different nodes, each accessing CephFS (different paths, though). How would you recommend we do it?

EDIT
We are now procuring the SSDs you chose for your benchmark. If you could give us a bit more direction on how you recommend accessing CephFS from containers, that would be awesome. Thank you.
 
Still working on benchmarking CephFS, we noticed the following difference in behavior:

a. A container with a virtual disk stored on CephFS, benchmark running on its local /tmp: bandwidth of approx. 450MB/s
b. The same container with a bind mount exported by the host, benchmark running on that shared folder: bandwidth of approx. 70MB/s

Any idea why case b. is over 6 times slower than case a.?
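One thing worth ruling out is caching: in case a. the writes land on a filesystem inside the container and can be buffered in the page cache, while the bind-mounted CephFS path goes through the Ceph filesystem client, whose write-back caching behaves differently. A hedged way to compare both paths with caching bypassed (file paths are placeholders):

Code:
# case a: file on the container's virtual disk
fio --name=case-a --filename=/tmp/fio-test --size=4G --bs=1M \
    --rw=write --direct=1 --ioengine=libaio --iodepth=8
# case b: file on the bind-mounted CephFS folder
fio --name=case-b --filename=/shared/fio-test --size=4G --bs=1M \
    --rw=write --direct=1 --ioengine=libaio --iodepth=8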
 
