Proxmox VE Ceph Benchmark 2018/02

Hi,
you read the data you wrote before (from this node) to the pool - once all available data has been read, the benchmark stops.
Since reading is faster than writing, the job is done in 32 seconds.
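For anyone reproducing this, a rough sketch of the write-then-read sequence with rados bench (the pool name "testpool" is just a placeholder); the seq phase ends as soon as every object written in the first step has been read back:

Code:
# write for 60s with 4 MiB objects and 16 threads, keep the objects for the read test
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup
# sequential read test; it stops early once all previously written objects have been read
rados bench -p testpool 60 seq -t 16
# remove the benchmark objects afterwards
rados -p testpool cleanup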

Udo

Got it.

Is there any way to figure out what the bottleneck is in the above (e.g. network, storage drives, RAM, etc.)? Or whether we've hit some hard limitation in Ceph at this scale?
 
You reached your network limit; compare the results with our benchmark paper. To really get the IO/s out of your NVMe drives, you should consider upgrading to 40GbE or even 100GbE (3 nodes, no switch needed).
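A quick way to confirm that the network really is the ceiling (the hostname is a placeholder) is a plain iperf3 run between two Ceph nodes, comparing the reported throughput with the rados bench results:

Code:
# on the receiving node
iperf3 -s
# on the sending node, 30 seconds with 4 parallel streams
iperf3 -c node2 -t 30 -P 4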

Any ideas on why the original transfer from LVM to Ceph stalled at 371 MiB/s?
Possibly due to the read limitation of your LVM storage, but this is just a shot in the dark.
 
To optimize performance in hyper-converged deployments with Proxmox VE and Ceph storage, the hardware setup is an important factor.

This Ceph benchmark shows some examples. We are also curious to hear about your setups and performance outcomes, so please post and discuss them here.

Download PDF
Proxmox VE Ceph Benchmark 2018/02
__________________
Best regards,

Martin Maurer
Proxmox VE project leader

Did anybody compare the SM883 with the SM863?
Seems like the SM863 is not available on the market anymore!

I guess performance is approximately the same because it is just a newer model?

Thank you in advance

BR
 
And what about the Samsung PM883 - any experience with this one?

regards
Ronny
 
Here are the results of my last benchmarks:

Code:
Model             Size     TBW      BW           IOPS
Intel DC S4500    480GB    900TB    62.4 MB/s    15.0k
Samsung PM883     240GB    341TB    67.2 MB/s    17.2k
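For context, single-disk numbers like these are usually taken with a 4k sync write at queue depth 1 directly on the raw device, similar to the fio job from the benchmark paper; this is only a sketch, /dev/sdX is a placeholder, and the test destroys the data on that device:

Code:
fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --name=ssd-test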
 
Is the following limitation of Ceph true:
~10k IOPS per OSD?

I've got 15k-16k random write IOPS with 16 threads and 4k blocks per BlueStore OSD on a 40G IB net, but that is a good result; 10k per OSD is not bad. With io-thread=1 I can get only 1400-1600 IOPS of random writes.
The problem here is the OSD code latency and the write amplification produced on every operation. The OSD itself can take 700 μs (0.7 ms) just to execute one IO operation, so even on a RAM disk, where kernel IO operations complete in <10 μs, you can barely reach 3k IOPS with the best high-frequency CPU.

P.S. Random read is about 35-50k IOPS on the same system, but all of that is just a measure of OSD performance (the data was read from the OSD cache, so no disk IO was done during the test).
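A hedged sketch of this kind of 4k random-write test, using fio's rbd engine against a test image (pool, image, and client names are placeholders):

Code:
fio --ioengine=rbd --clientname=admin --pool=testpool --rbdname=bench-img \
    --rw=randwrite --bs=4k --iodepth=16 --numjobs=1 \
    --direct=1 --runtime=60 --time_based --name=rbd-randwrite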
 
I got lower than expected results from the SM883 (I thought they would be at least as good as the SM863), so I ended up using the PM963 and even more so the PM983. I can't really tell whether it's NVMe or the SSDs themselves; these are also considered read-intensive disks, but I get much better performance and lower latency compared to the SM883. I have not studied it in depth, but I suppose the SM883 and SM863 are MLC and the PM983 is TLC; nevertheless, they work much better for me with Ceph. I am using the 1TB models with one OSD on each, no separate DB or WAL.
/Hans

 
I've got 15k-16k random write IOPS with 16 threads and 4k blocks per BlueStore OSD on a 40G IB net, but that is a good result; 10k per OSD is not bad. With io-thread=1 I can get only 1400-1600 IOPS of random writes.
The problem here is the OSD code latency and the write amplification produced on every operation. The OSD itself can take 700 μs (0.7 ms) just to execute one IO operation, so even on a RAM disk, where kernel IO operations complete in <10 μs, you can barely reach 3k IOPS with the best high-frequency CPU.

P.S. Random read is about 35-50k IOPS on the same system, but all of that is just a measure of OSD performance (the data was read from the OSD cache, so no disk IO was done during the test).

Thanks, but no, this is not good... Currently the only way to maximize utilization of an NVMe device (for example Optane, with ~500k random write IOPS at 4k blocks) is to split it into more OSDs - 10-50 OSDs per device. And of course we need to wait for the io_uring implementation: https://github.com/ceph/ceph/pull/27392
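If you go the multiple-OSDs-per-NVMe route, newer ceph-volume releases can carve a device into several OSDs in one step; a sketch (the device path is a placeholder and the exact syntax depends on the Ceph release):

Code:
# create 4 BlueStore OSDs on a single NVMe device
ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1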
 
And Proxmox needs OVS-DPDK too.
Software built with DPDK in mind needs to be recompiled for the software version used and is therefore not a stock solution.
 
Software built with DPDK in mind needs to be recompiled for the software version used and is therefore not a stock solution.
https://software.intel.com/en-us/articles/open-vswitch-with-dpdk-overview
[Image: performance comparison of native Open vSwitch (OVS) and OVS with the Data Plane Development Kit (DPDK)]
 
Hello,

I am a little confused by the IOPS results in the benchmark PDF: around 200 write IOPS on a 10Gb net and around 300 IOPS on a 100Gb/s net?

Then in this thread people talk about reaching thousands of write IOPS on Ceph - what am I missing?

If I have 30 LXC containers that work with a large number of small files, should I rather consider local SSDs instead of hyper-converged Ceph?

Thank you.
P.
 
I am a little confused by the IOPS results in the benchmark PDF: around 200 write IOPS on a 10Gb net and around 300 IOPS on a 100Gb/s net?
The IOPS are for 4 MB objects and a runtime of 60 sec.

Then in this thread people talk about reaching thousands of write IOPS on Ceph - what am I missing?
The rados bench (in these tests) is a single client with 16 threads, while a cluster usually has multiple concurrent clients accessing Ceph.
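To get closer to the aggregate cluster performance, the same bench can be started from several nodes at once; a rough sketch (pool and run names are placeholders, --run-name keeps the clients' objects apart):

Code:
# start one of these on each node in parallel
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup --run-name bench_node1
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup --run-name bench_node2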

If I have 30 LXC containers that work with a large number of small files, should I rather consider local SSDs instead of hyper-converged Ceph?
Possibly; this depends on the IOPS and latency requirements of the workload. For Ceph, the containers use KRBD and have the page cache available to them. This can greatly increase performance. But all in all, you need to test and see which solution works better for the workload.
 
Hi all,
we have been benchmarking CephFS with the following configuration:
4 x Dell PowerEdge R730xd, each with 2 x CPU, 96GB RAM, and 12 x 4TB HDD, on retail 10Gbps networking.

Then we built a container (*) with CentOS and ran our benchmarks on its local disk, which was a virtual disk within CephFS, and we already got impressive results (we are already outperforming Nutanix for files up to 8MB, and are three times faster than a classic VMware 6.5 setup with a Dell/EMC SAN).

We are now procuring SSDs to replace all 48 of our HDDs; we will then run our final tests and publish the results here.

Our goal is to reach at least 600MB/s sustained writes for files up to 1TB, to outperform our GPFS systems.

Question I have is:
We would like to mount CephFS (using the kernel module) within our CentOS 6.10 container and then run the benchmark on that mounted CephFS.
Could you please point us to the documentation describing how to mount CephFS from Proxmox in a CentOS machine?

Thank you
Rosario

(*) Side note: after restarting the cluster a couple of times, we are no longer able to access the console of the container. Any suggestions?
 
We would like to mount CephFS (using the kernel module) within our CentOS 6.10 container and then run the benchmark on that mounted CephFS.
Could you please point us to the documentation describing how to mount CephFS from Proxmox in a CentOS machine?
I don't think a direct mount into a container will work, as the mount happens more than once. Better to mount the CephFS on the node and use a mount point pointing to the CephFS storage.
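A minimal sketch of that approach, assuming the CephFS storage is already mounted on the node under /mnt/pve/cephfs; container ID and paths are placeholders:

Code:
# bind-mount a CephFS subdirectory into container 101
pct set 101 -mp0 /mnt/pve/cephfs/data,mp=/mnt/data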

(*) Side note: after restarting the cluster a couple of times, we are no longer able to access the console of the container. Any suggestions?
Better to open up a new thread for this; it makes it more visible.
 
I don't think a direct mount into a container will work, as the mount happens more than once. Better to mount the CephFS on the node and use a mount point pointing to the CephFS storage.

Hi Alwin, thanks for your prompt reply. Could you please elaborate a bit more on this? In our future scenario we would like to have CephFS mounted by several containers, probably running on different nodes, each accessing CephFS (different paths, though). How would you recommend we do it?

EDIT
We are now procuring the SSDs you chose for your benchmark. If you could give us a bit more direction on how you recommend accessing CephFS from containers, that would be awesome. Thank you.
 
Still working on benchmarking CephFS, we noticed the following difference in behavior:

a. A container with a virtual disk stored on CephFS, benchmark running on its local /tmp: bandwidth of approx. 450MB/s
b. The same container with a bind mount exported by the host, benchmark running on that shared folder: bandwidth of approx. 70MB/s

Any idea why case b. is over 6 times slower than case a.?
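One thing worth ruling out is caching: in case a. the writes land on a filesystem inside the container and can be buffered in the page cache, while the bind-mounted CephFS path goes through the Ceph filesystem client, whose write-back caching behaves differently. A hedged way to compare both paths with caching bypassed (file paths are placeholders):

Code:
# case a: file on the container's virtual disk
fio --name=case-a --filename=/tmp/fio-test --size=4G --bs=1M \
    --rw=write --direct=1 --ioengine=libaio --iodepth=8
# case b: file on the bind-mounted CephFS folder
fio --name=case-b --filename=/shared/fio-test --size=4G --bs=1M \
    --rw=write --direct=1 --ioengine=libaio --iodepth=8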
 
