Proxmox VE Ceph Benchmark 2018/02

Discussion in 'Proxmox VE: Installation and configuration' started by martin, Feb 27, 2018.

  1. martin

    martin Proxmox Staff Member
    Staff Member

    Joined:
    Apr 28, 2005
    Messages:
    625
    Likes Received:
    289
    To optimize performance in hyper-converged deployments with Proxmox VE and Ceph storage,
    the hardware setup is an important factor.

    This Ceph benchmark shows some example setups and results. We are also curious to hear about your setups and performance outcomes, so please post and discuss them here.

    Download PDF
    Proxmox VE Ceph Benchmark 2018/02
    __________________
    Best regards,

    Martin Maurer
    Proxmox VE project leader
     
    ddhulla, El Tebe, r.jochum and 8 others like this.
  2. PigLover

    PigLover Active Member

    Joined:
    Apr 8, 2013
    Messages:
    100
    Likes Received:
    32
    Interesting and useful write-up. The results are presented in a fairly summarized form (thin on details), but it is still quite useful.

    I was surprised to see the large read performance gain with the 100GbE network vs 10GbE, especially given the close race between them on the write side. Some more digging into this - and into potential optimization strategies - would be warranted.

    Also, the introduction discusses the possibility of using a 3-node cluster with the comment that in a three-node cluster "the data is still available after the loss of a node". While true, this is sorely incomplete and misleading. If you are going to make this statement, you really owe your readers at least a slightly more detailed treatment of failure modes, showing why it takes "replication count"+1 nodes (four in this case) to maintain fully stable operation with a failed node, and some treatment of why odd numbers of nodes create more resilient outcomes.
     
  3. tschanness

    tschanness Member

    Joined:
    Oct 30, 2016
    Messages:
    275
    Likes Received:
    19
    Hi,
    interesting, thanks for posting. Do you have 40GBit/s NICs as well? Were your interfaces bonded? I guess a lot of people would be interested in LAGs with 2-4 GBit NICs.
    I would be interested in a benchmark with 8 OSDs (or even more).

    Thanks,
    Jonas
     
  4. tom

    tom Proxmox Staff Member
    Staff Member

    Joined:
    Aug 29, 2006
    Messages:
    13,159
    Likes Received:
    352
    We did not use any bonds. Anything slower than 10 Gbit is a big limiting factor for SSD Ceph clusters.

    We did not test 40 Gbit.

    If you can improve your network throughput (bonds) and latency, it will help.

    The performance with more OSDs (e.g. 8 OSDs per node) is quite similar to our cluster with 4 OSDs per node. Just add more if you need more capacity. This is valid for SSD-only clusters in these tests.
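
    If you want to experiment with bonds anyway, a minimal LACP (802.3ad) bond for a dedicated Ceph network in /etc/network/interfaces could look roughly like the sketch below. Interface names and the address are placeholders, and the switch has to support LACP:
    Code:
    auto bond0
    iface bond0 inet static
            address 10.10.10.1
            netmask 255.255.255.0
            bond-slaves enp1s0f0 enp1s0f1
            bond-miimon 100
            bond-mode 802.3ad
            # layer3+4 hashing spreads the Ceph connections over both links
            bond-xmit-hash-policy layer3+4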
     
    tschanness likes this.
  5. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    @PigLover,
    The benchmark has been done with 'rados bench'; this is a single client with 16 threads writing to Ceph. On a cluster in a production environment you will have concurrent writes (and reads, of course) from different clients, where the write gap will be even clearer.
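
    For reference, such a run boils down to something like the following (the pool name is a placeholder; 16 threads and 4 MB objects are the defaults):
    Code:
    # 60 second write test, keep the objects for the read test afterwards
    rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup
    # sequential read test over the objects written above
    rados bench -p testpool 60 seq -t 16
    # remove the benchmark objects again
    rados -p testpool cleanup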

    You are talking about a technical white paper, not a benchmark paper. I agree, in a technical whitepaper, this would need more clarification.
     
  6. PigLover

    PigLover Active Member

    Joined:
    Apr 8, 2013
    Messages:
    100
    Likes Received:
    32
    Agreed.

    But you make the claim about being able to run a 3-node cluster and still access the data with a node out of service. While it is "true", it is also dangerous guidance and shouldn't be given without a caution - even in a benchmarking note.
     
    ddhulla, LorenTedford and fibo_fr like this.
  7. alexskysilk

    alexskysilk Active Member
    Proxmox VE Subscriber

    Joined:
    Oct 16, 2015
    Messages:
    433
    Likes Received:
    48
    Martin, how many OSDs did you place on each NVMe device? I'm getting slightly better numbers than you're showing, but with a single OSD per drive; according to best practices (http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments) the optimal results are with 4 OSDs per drive.
     
  8. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    @alexskysilk, we did not use any separate DB/WAL device.
    Depending on your hardware and configuration (default config in our tests), you might achieve better (or worse) results.
     
  9. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    @PigLover, I am still not seeing your point here. How would you like to see the section phrased?
     
  10. Cha0s

    Cha0s New Member

    Joined:
    Feb 9, 2018
    Messages:
    8
    Likes Received:
    0
    Thanks for posting this benchmark!

    While I appreciate having the test results as a reference point when implementing PVE with Ceph, I am not sure they are representative of what people actually need/care about in real-life workloads.

    High throughput numbers are cool eye-catchers when you want to boost your marketing material (over 9000 Mbps! lol), but real performance is in the IOPS and latency numbers IMHO.
    Typical real-life workloads rarely need 1GB/s write speeds. But high IOPS certainly make a difference - especially during night hours when guest backups for multiple VMs run at the same time (no control over it, as these are client VMs with no access to the guest OS).

    I tried the rados bench tests in my lab using a 4K block size to measure IOPS performance, and while reads reach up to 20k IOPS, writes can barely go over 5k IOPS.
    Though I am not sure if either result is any good, since each disk on its own can do well over 25k IOPS in writes and 70k IOPS in reads (just a little lower than the advertised specs for the disks used for testing) according to storagereview.com benchmarks (can't post a direct link due to being a new user).
    Or, according to your benchmark, your SM863 240GB disks can do 17k write IOPS using fio with 4K bs.
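
    (For reference, the 4K runs were plain rados bench tests along these lines; the pool name and thread count shown here are illustrative, not necessarily my exact invocation:)
    Code:
    rados bench -p test 60 write -b 4K -t 16 --no-cleanup
    rados bench -p test 60 seq -t 16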

    My lab setup is a 3-node cluster consisting of 3x HP DL360p G8, each with 4x Samsung SM863 960GB (1 OSD per physical drive), a Xeon E5-2640 and 32GB ECC RAM.

    The HP SmartArray P420i onboard controllers are set to HBA mode so the disks are presented directly to PVE without any RAID handling/overhead.

    The networking is based on InfiniBand (40G) in 'connected' mode with a 65520-byte MTU and active/passive bonding, and I get a maximum of 23Gbit/s raw network transfer speed (measured with iperf) between the 3 nodes with IPoIB, which is good enough for testing (or at least more than twice as fast as 10GbE).
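
    (The raw network figure comes from an ordinary multi-stream iperf run between two of the nodes, roughly like this; the address and stream count are just examples:)
    Code:
    # on node A
    iperf -s
    # on node B, 4 parallel streams for 30 seconds
    iperf -c 10.15.15.51 -P 4 -t 30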

    Here are my test PVE nodes packages versions:
    Code:
    # pveversion -v
    proxmox-ve: 5.1-41 (running kernel: 4.13.13-6-pve)
    pve-manager: 5.1-46 (running version: 5.1-46/ae8241d4)
    pve-kernel-4.13.13-6-pve: 4.13.13-41
    pve-kernel-4.13.13-5-pve: 4.13.13-38
    ceph: 12.2.2-pve1
    corosync: 2.4.2-pve3
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: not correctly installed
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.0-8
    libpve-common-perl: 5.0-28
    libpve-guest-common-perl: 2.0-14
    libpve-http-server-perl: 2.0-8
    libpve-storage-perl: 5.0-17
    libqb0: 1.0.1-1
    lvm2: 2.02.168-pve6
    lxc-pve: 2.1.1-2
    lxcfs: 2.0.8-2
    novnc-pve: 0.6-4
    openvswitch-switch: 2.7.0-2
    proxmox-widget-toolkit: 1.0-11
    pve-cluster: 5.0-20
    pve-container: 2.0-19
    pve-docs: 5.1-16
    pve-firewall: 3.0-5
    pve-firmware: 2.0-3
    pve-ha-manager: 2.0-5
    pve-i18n: 1.0-4
    pve-libspice-server1: 0.12.8-3
    pve-qemu-kvm: 2.9.1-9
    pve-xtermjs: 1.0-2
    qemu-server: 5.0-22
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    And here are my rados bench results:

    Throughput test (4MB block size) - WRITES
    Code:
    Total time run:         60.041188
    Total writes made:      14215
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     947.017
    Stddev Bandwidth:       122.447
    Max bandwidth (MB/sec): 1060
    Min bandwidth (MB/sec): 368
    Average IOPS:           236
    Stddev IOPS:            30
    Max IOPS:               265
    Min IOPS:               92
    Average Latency(s):     0.0675757
    Stddev Latency(s):      0.0502804
    Max latency(s):         0.966638
    Min latency(s):         0.0166262
    Throughput test (4MB block size) - READS
    Code:
    Total time run:       21.595730
    Total reads made:     14215
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   2632.93
    Average IOPS:         658
    Stddev IOPS:          10
    Max IOPS:             685
    Min IOPS:             646
    Average Latency(s):   0.0233569
    Max latency(s):       0.158183
    Min latency(s):       0.0123441
    IOPs test (4K block size) - WRITES
    Code:
    Total time run:         60.002736
    Total writes made:      315615
    Write size:             4096
    Object size:            4096
    Bandwidth (MB/sec):     20.5469
    Stddev Bandwidth:       0.847211
    Max bandwidth (MB/sec): 23.3555
    Min bandwidth (MB/sec): 16.7188
    Average IOPS:           5260
    Stddev IOPS:            216
    Max IOPS:               5979
    Min IOPS:               4280
    Average Latency(s):     0.00304033
    Stddev Latency(s):      0.000755765
    Max latency(s):         0.0208767
    Min latency(s):         0.00156849
    IOPs test (4K block size) - READS
    Code:
    Total time run:       15.658241
    Total reads made:     315615
    Read size:            4096
    Object size:          4096
    Bandwidth (MB/sec):   78.7362
    Average IOPS:         20156
    Stddev IOPS:          223
    Max IOPS:             20623
    Min IOPS:             19686
    Average Latency(s):   0.000779536
    Max latency(s):       0.00826032
    Min latency(s):       0.000374155
    Any ideas why the IOPS performance is so low in the 4K bs tests compared to using the disks standalone without Ceph?
    I understand that there will definitely be a slowdown due to the nature/overhead of any software-defined storage solution, but are there any suggestions to make these results better, since there are so many spare resources left unutilized?

    Or to put it another way, how can I find the bottleneck in my tests (since the network and the disks can handle way more than what I am currently getting)?

    Thanks and apologies for the long post :)
     
  11. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    3x 5260 = 15,780 IO/s, assuming a replica count of 3. That is close to our 4K fio benchmark. Ceph syncs its objects onto three disks and only then gets an ACK back. This is also why reads perform significantly better than writes.

    Code:
    fio --filename=/dev/sdx --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest
    Taken from storagereview.com; compare their test with ours. I suppose the SM863 960GB will show similar results when run with our fio benchmark.

    Code:
    4ktest: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
    ...
    fio-2.16
    Starting 16 processes
    Jobs: 16 (f=16): [r(16)] [100.0% done] [374.5MB/0KB/0KB /s] [95.9K/0/0 iops] [eta 00m:00s]
    4ktest: (groupid=0, jobs=16): err= 0: pid=14394: Fri Mar  2 10:42:04 2018
      read : io=22466MB, bw=383401KB/s, iops=95850, runt= 60002msec
        slat (usec): min=2, max=23007, avg=107.02, stdev=560.04
        clat (usec): min=67, max=28002, avg=2562.47, stdev=1688.18
         lat (usec): min=110, max=30041, avg=2669.49, stdev=1736.84
        clat percentiles (usec):
         |  1.00th=[  354],  5.00th=[  470], 10.00th=[  604], 20.00th=[  884],
         | 30.00th=[ 1208], 40.00th=[ 1576], 50.00th=[ 2024], 60.00th=[ 3376],
         | 70.00th=[ 3760], 80.00th=[ 4128], 90.00th=[ 4768], 95.00th=[ 5216],
         | 99.00th=[ 5856], 99.50th=[ 7264], 99.90th=[10688], 99.95th=[11328],
         | 99.99th=[15168]
        lat (usec) : 100=0.01%, 250=0.02%, 500=6.11%, 750=9.20%, 1000=8.40%
        lat (msec) : 2=25.86%, 4=27.18%, 10=23.09%, 20=0.15%, 50=0.01%
      cpu          : usr=0.98%, sys=7.25%, ctx=2890416, majf=0, minf=155
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued    : total=r=5751202/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
         latency   : target=0, window=0, percentile=100.00%, depth=16
    
    Run status group 0 (all jobs):
       READ: io=22466MB, aggrb=383400KB/s, minb=383400KB/s, maxb=383400KB/s, mint=60002msec, maxt=60002msec
    
    Disk stats (read/write):
      sdb: ios=5750566/223, merge=34/2, ticks=8329028/216, in_queue=8412756, util=100.00%
    Made with the above fio line on one of our SM863 240GB drives.
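
    For the write side of a single drive, the same line can be reused with the read/write mix flipped. This is just the obvious variation of the command above, not necessarily the exact line behind the 17k write IOPS figure from the benchmark paper:
    Code:
    # WARNING: writes directly to the device and destroys the data on it
    fio --filename=/dev/sdx --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=0 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest-write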
     
  12. Cha0s

    Cha0s New Member

    Joined:
    Feb 9, 2018
    Messages:
    8
    Likes Received:
    0
    I understand that there are 3 writes to be acknowledged and that this will cause a decrease in total IO/s, but it does not fully explain the huge difference in write IO/s when writing directly to a single disk compared to writing to a Ceph pool.
    We are not talking about a 5-10% performance loss here. From 17k IO/s to 5k IO/s is about a 70% reduction (if I am not screwing up the percentage calculation)!

    Also, your calculation of 3x 5260 = 15,780 IO/s doesn't sound like a correct methodology for measuring IO/s; or at least it seems irrelevant to multiply by the number of replicas.
    If that's how we should calculate the total IO/s, then the theoretical maximum should be 17k IO/s x 3 = 51k IOPS, and 15.7k IO/s is still ~69% less than 51k IO/s.

    So, what can be done (in terms of configuration) to improve this number? (regardless of the actual number and how to measure it)

    Obviously the hardware can handle way more. The bottleneck seems to be somewhere in Ceph, and the '3x write ACKs' doesn't sound like a valid reason for this.
    The CPU usage during these tests is ~50% so there's plenty of room there.

    I don't have any test results at hand, but I don't think that when doing RAID5 or RAID10 you get a 70% loss in IO/s just because of the 3-4 write ACKs. Of course, comparing Ceph to RAID is like comparing oranges to apples. But still... a 70% performance drop in IO/s does not seem normal...
     
  13. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    From my understanding, the ACK is returned when the third copy has been written. So in the worst case a write takes 3x longer (first the write to the primary, then in parallel to the secondary and tertiary). And I guess your cluster is not empty, so the OSDs are already busy serving other clients.
    Code:
    Total time run:         60.001525
    Total writes made:      544276
    Write size:             4096
    Object size:            4096
    Bandwidth (MB/sec):     35.4337
    Stddev Bandwidth:       1.00231
    Max bandwidth (MB/sec): 37.0352
    Min bandwidth (MB/sec): 33.7891
    Average IOPS:           9071
    Stddev IOPS:            256
    Max IOPS:               9481
    Min IOPS:               8650
    Average Latency(s):     0.00176275
    Stddev Latency(s):      0.000345305
    Max latency(s):         0.0103017
    Min latency(s):         0.00107972
    3 PVE hosts with 4x Bluestore OSDs each (a total of 12). We achieve around 9k IO/s with 10 GbE (MTU 9000), and the cluster (and the switch) was idle. Our latency is lower than in your figures, and I presume that you already have a workload running on your cluster. Further, our server and network hardware differs from yours. The encapsulation of Ethernet packets on InfiniBand will certainly add to the latency that you observe.

    To make a comparison, you would need your setup empty (not in production) and do baseline benchmarks and testing.

    Ceph uses 4 MB objects by default and is optimized to handle those; while a test with 4 KB makes sense for comparing a single drive, it says less about Ceph.
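
    As a rough baseline checklist for such a comparison (device, address and pool names are placeholders): raw disk, raw network, then Ceph itself on the otherwise idle cluster.
    Code:
    # single disk, 4K sync writes (destroys the data on /dev/sdx!)
    fio --filename=/dev/sdx --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --name=disk-baseline
    # raw network throughput between two Ceph nodes
    iperf -c 10.15.15.51 -P 4
    # Ceph itself: a single OSD bench and a rados bench with the default 4M objects
    ceph tell osd.0 bench
    rados bench -p testpool 60 write -t 16 --no-cleanup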
     
  14. Cha0s

    Cha0s New Member

    Joined:
    Feb 9, 2018
    Messages:
    8
    Likes Received:
    0
    All my posted results are from a lab cluster that is completely empty and set up just for benchmarks and tests.
    The rados benchmarks were done without a single VM running at the time and networking wise all nodes can do >20Gbps without breaking a sweat.
    All the CPUs on all nodes during the rados benchmarks never went over 50%.
    So there is no apparent bottleneck anywhere on the physical hardware.

    The bottom line is: how do you make Ceph perform faster than this? As I said, high throughput numbers are useless when it comes to real-life workloads. And getting only 5k IO/s when each drive can do WAY more is just bad, however you look at it.

    I repeat: we are talking about a 70% decrease in IO performance. That cannot be caused just by waiting 3 times longer for the replica ACKs. It's just preposterous for a technology that's supposed to be highly performant for the cloud.
    Even at 9k IO/s it's still a 47% decrease in performance! Something is wrong with these results.
     
  15. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    Look at the latency: the 10 GbE has lower latency than the 40 Gb IPoIB. Ceph doesn't work with IB natively; this costs a lot of CPU, and the packets go through the IP stack, so some features of IB aren't used at all (this adds latency).
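
    A simple way to compare the two links is to run a fast ping over each of them and look at the average round-trip time, e.g. (address is a placeholder):
    Code:
    # 1000 small pings at 10 ms intervals over the Ceph network
    ping -c 1000 -i 0.01 -q 10.15.15.51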
     
  16. dcsapak

    dcsapak Proxmox Staff Member
    Staff Member

    Joined:
    Feb 1, 2016
    Messages:
    2,924
    Likes Received:
    266
    Why not? When you get 17k IOPS from a disk, you have to wait 0.05 ms (or 50 µs) for a write; then add the latency of the network, e.g. 10-50 µs for 10 Gbit, and add the latency of the disk again, another 50 µs.
    Now we are at 110-150 µs of latency for one write (not factoring in CPU/NIC/kernel/Ceph latency), which is up to 3 times as slow, and you have a maximum of ~6500-9000 IOPS.

    edit: I also did not account for the network latency of the ACK, so this would further reduce the IOPS.
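
    As a back-of-the-envelope calculation (all numbers rounded, taken from above):
    Code:
    disk write latency (~17k IOPS)        ~  50 us
    + network latency (10 Gbit)             10-50 us
    + replica write on the other OSDs     ~  50 us   (secondary/tertiary in parallel)
    -------------------------------------------------
    per acknowledged 4K write              110-150 us
    => 1 s / 150 us ~ 6,700/s  up to  1 s / 110 us ~ 9,100/s, i.e. roughly the 6500-9000 IOPS range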
     
  17. Cha0s

    Cha0s New Member

    Joined:
    Feb 9, 2018
    Messages:
    8
    Likes Received:
    0
    Does that mean that the most IOPS you can attain on any 3-replica Ceph installation on 10GbE with these Samsung SSDs is at best 6500-9000, due to the latencies you just described?

    Sorry for insisting on the same stuff, I am just trying to make sense of the results.
     
  18. NewDude

    NewDude Member
    Proxmox VE Subscriber

    Joined:
    Feb 24, 2018
    Messages:
    58
    Likes Received:
    5
    Let me try to help Cha0s out:

    What's the best documented ceph performance on proxmox to date?

    :)
     
  19. alexskysilk

    alexskysilk Active Member
    Proxmox VE Subscriber

    Joined:
    Oct 16, 2015
    Messages:
    433
    Likes Received:
    48
    No separate DB/WAL (not much point if the data resides on NVMe). Multiple OSDs per NVMe. I'm attempting to follow the document here: https://software.intel.com/en-us/articles/accelerating-your-nvme-drives-with-spdk which should yield an order-of-magnitude improvement in IOPS.

    Building SPDK is simple enough, but I'm having trouble figuring out how to do it and keep the Proxmox Ceph functionality happy. For now, I'm stuck trying to adapt the setup script here: https://github.com/spdk/spdk/blob/master/scripts/ceph/start.sh but it's slow going; the alternative is to manually map devices and use ceph-disk to create the OSDs. If anyone wants to pitch in...
     
  20. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    @alexskysilk, SPDK increases performance because Ceph doesn't need to go through the kernel to access the NVMe drive, but it will not take away the latency of the network. I hope you can share some benchmarks with us. ;)

    Note: SPDK needs its own build packages and is not a straightforward setup for anyone starting out with Ceph.
     