Proxmox VE Ceph Benchmark 2018/02

Discussion in 'Proxmox VE: Installation and configuration' started by martin, Feb 27, 2018.

  1. martin

    martin Proxmox Staff Member
    Staff Member

    Joined:
    Apr 28, 2005
    Messages:
    625
    Likes Received:
    289
    To optimize performance in hyper-converged deployments with Proxmox VE and Ceph storage,
    the hardware setup is an important factor.

    This Ceph benchmark shows some example setups and results. We are also curious to hear about your setups and performance outcomes, so please post and discuss them here.

    Download PDF
    Proxmox VE Ceph Benchmark 2018/02
    __________________
    Best regards,

    Martin Maurer
    Proxmox VE project leader
     
    ddhulla, El Tebe, r.jochum and 8 others like this.
  2. PigLover

    PigLover Active Member

    Joined:
    Apr 8, 2013
    Messages:
    100
    Likes Received:
    32
    Interesting and useful write-up. The results are presented in a fairly summarized form (thin on details), but it is still quite useful.

    I was surprised to see the large read performance gain with the 100GbE network vs 10GbE, especially given the close race between them on the write side. Some more digging into this - and into potential optimization strategies - would be warranted.

    Also, the introduction discusses the possibility of using a 3-node cluster with the comment that in a three-node cluster "the data is still available after the loss of a node". While true, this is sorely incomplete and misleading. If you are going to make this statement, you really owe your readers at least a slightly more detailed treatment of failure modes, showing why it takes "replication count"+1 nodes (four in this case) to maintain fully stable operation with a failed node, and some treatment of why odd numbers of nodes create more resilient outcomes.
     
  3. tschanness

    tschanness Member

    Joined:
    Oct 30, 2016
    Messages:
    275
    Likes Received:
    19
    Hi,
    interesting, thanks for posting. Do you have 40GBit/s NICs as well? Were your interfaces bonded? I guess a lot of people would be interested in LAGs with 2-4 GBit NICs.
    I would be interested in a benchmark with 8 OSDs (or even more).

    Thanks,
    Jonas
     
  4. tom

    tom Proxmox Staff Member
    Staff Member

    Joined:
    Aug 29, 2006
    Messages:
    13,159
    Likes Received:
    352
    We did not use any bonds. Anything slower than 10 Gbit is a big limiting factor for SSD Ceph clusters.

    We did not test 40 Gbit.

    If you can improve your network throughput (bonds) and latency, it will help.

    The performance with more OSDs (e.g. 8 OSDs per node) is quite similar to our cluster with 4 OSDs per node. Just add more if you need more capacity. This is valid for SSD-only clusters in these tests.
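
    If you want to experiment with bonds anyway, a minimal LACP (802.3ad) bond for a dedicated Ceph network in /etc/network/interfaces could look roughly like the sketch below. Interface names and the address are placeholders, and the switch has to support LACP:
    Code:
    auto bond0
    iface bond0 inet static
            address 10.10.10.1
            netmask 255.255.255.0
            bond-slaves enp1s0f0 enp1s0f1
            bond-miimon 100
            bond-mode 802.3ad
            # layer3+4 hashing spreads the Ceph connections over both links
            bond-xmit-hash-policy layer3+4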
     
    tschanness likes this.
  5. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    @PigLover,
    The benchmark has been done with 'rados bench'; this is a single client with 16 threads writing to Ceph. On a cluster in a production environment you will have concurrent writes (and reads, of course) from different clients, where the write gap will be even clearer.
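
    For reference, such a run boils down to something like the following (the pool name is a placeholder; 16 threads and 4 MB objects are the defaults):
    Code:
    # 60 second write test, keep the objects for the read test afterwards
    rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup
    # sequential read test over the objects written above
    rados bench -p testpool 60 seq -t 16
    # remove the benchmark objects again
    rados -p testpool cleanup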

    You are talking about a technical white paper, not a benchmark paper. I agree, in a technical whitepaper, this would need more clarification.
     
  6. PigLover

    PigLover Active Member

    Joined:
    Apr 8, 2013
    Messages:
    100
    Likes Received:
    32
    Agreed.

    But you make the claim about being able to run a 3-node cluster and still access the data with a node out of service. While it is "true", it is also dangerous guidance and shouldn't be given without a caution - even in a benchmarking note.
     
    ddhulla, LorenTedford and fibo_fr like this.
  7. alexskysilk

    alexskysilk Active Member
    Proxmox VE Subscriber

    Joined:
    Oct 16, 2015
    Messages:
    433
    Likes Received:
    48
    Martin, how many OSDs did you place on each NVMe device? I'm getting slightly better numbers than you're showing, but with a single OSD per drive; according to best practices (http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments) the optimal results are with 4 OSDs per drive.
     
  8. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    @alexskysilk, we did not use any separate DB/WAL device.
    Depending on your hardware and configuration (default config in our tests), you might achieve better (or worse) results.
     
  9. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    @PigLover, I am still not seeing your point here. How would you like to see the section phrased?
     
  10. Cha0s

    Cha0s New Member

    Joined:
    Feb 9, 2018
    Messages:
    8
    Likes Received:
    0
    Thanks for posting this benchmark!

    While I appreciate having the test results as a reference point when implementing PVE with Ceph, I am not sure they are representative of what people actually need/care about in real-life workloads.

    High throughput numbers are cool eye-catchers when you want to boost your marketing material (over 9000 Mbps! lol), but real performance is in the IOPS and latency numbers IMHO.
    Typical real-life workloads rarely need 1GB/s write speeds. But high IOPS certainly make a difference - especially during night hours when guest backups for multiple VMs run at the same time (no control over it, as these are client VMs with no access to the guest OS).

    I tried the rados bench tests in my lab using a 4K block size to measure IOPS performance, and while reads reach up to 20k IOPS, writes can barely go over 5k IOPS.
    Though I am not sure if either result is any good, since each disk on its own can do well over 25k IOPS in writes and 70k IOPS in reads (just a little lower than the advertised specs for the disks used for testing) according to storagereview.com benchmarks (can't post a direct link due to being a new user).
    Or, according to your benchmark, your SM863 240GB disks can do 17k write IOPS using fio with 4K bs.
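
    (For reference, the 4K runs were plain rados bench tests along these lines; the pool name and thread count shown here are illustrative, not necessarily my exact invocation:)
    Code:
    rados bench -p test 60 write -b 4K -t 16 --no-cleanup
    rados bench -p test 60 seq -t 16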

    My lab setup is a 3-node cluster consisting of 3x HP DL360p G8, each with 4x Samsung SM863 960GB (1 OSD per physical drive), a Xeon E5-2640 and 32GB ECC RAM.

    The HP SmartArray P420i onboard controllers are set to HBA mode so the disks are presented directly to PVE without any RAID handling/overhead.

    The networking is based on InfiniBand (40G) in 'connected' mode with a 65520-byte MTU and active/passive bonding, and I get a maximum of 23Gbit/s raw network transfer speed (measured with iperf) between the 3 nodes with IPoIB, which is good enough for testing (or at least more than twice as fast as 10GbE).
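
    (The raw network figure comes from an ordinary multi-stream iperf run between two of the nodes, roughly like this; the address and stream count are just examples:)
    Code:
    # on node A
    iperf -s
    # on node B, 4 parallel streams for 30 seconds
    iperf -c 10.15.15.51 -P 4 -t 30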

    Here are my test PVE nodes packages versions:
    Code:
    # pveversion -v
    proxmox-ve: 5.1-41 (running kernel: 4.13.13-6-pve)
    pve-manager: 5.1-46 (running version: 5.1-46/ae8241d4)
    pve-kernel-4.13.13-6-pve: 4.13.13-41
    pve-kernel-4.13.13-5-pve: 4.13.13-38
    ceph: 12.2.2-pve1
    corosync: 2.4.2-pve3
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: not correctly installed
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.0-8
    libpve-common-perl: 5.0-28
    libpve-guest-common-perl: 2.0-14
    libpve-http-server-perl: 2.0-8
    libpve-storage-perl: 5.0-17
    libqb0: 1.0.1-1
    lvm2: 2.02.168-pve6
    lxc-pve: 2.1.1-2
    lxcfs: 2.0.8-2
    novnc-pve: 0.6-4
    openvswitch-switch: 2.7.0-2
    proxmox-widget-toolkit: 1.0-11
    pve-cluster: 5.0-20
    pve-container: 2.0-19
    pve-docs: 5.1-16
    pve-firewall: 3.0-5
    pve-firmware: 2.0-3
    pve-ha-manager: 2.0-5
    pve-i18n: 1.0-4
    pve-libspice-server1: 0.12.8-3
    pve-qemu-kvm: 2.9.1-9
    pve-xtermjs: 1.0-2
    qemu-server: 5.0-22
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    And here are my rados bench results:

    Throughput test (4MB block size) - WRITES
    Code:
    Total time run:         60.041188
    Total writes made:      14215
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     947.017
    Stddev Bandwidth:       122.447
    Max bandwidth (MB/sec): 1060
    Min bandwidth (MB/sec): 368
    Average IOPS:           236
    Stddev IOPS:            30
    Max IOPS:               265
    Min IOPS:               92
    Average Latency(s):     0.0675757
    Stddev Latency(s):      0.0502804
    Max latency(s):         0.966638
    Min latency(s):         0.0166262
    Throughput test (4MB block size) - READS
    Code:
    Total time run:       21.595730
    Total reads made:     14215
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   2632.93
    Average IOPS:         658
    Stddev IOPS:          10
    Max IOPS:             685
    Min IOPS:             646
    Average Latency(s):   0.0233569
    Max latency(s):       0.158183
    Min latency(s):       0.0123441
    IOPs test (4K block size) - WRITES
    Code:
    Total time run:         60.002736
    Total writes made:      315615
    Write size:             4096
    Object size:            4096
    Bandwidth (MB/sec):     20.5469
    Stddev Bandwidth:       0.847211
    Max bandwidth (MB/sec): 23.3555
    Min bandwidth (MB/sec): 16.7188
    Average IOPS:           5260
    Stddev IOPS:            216
    Max IOPS:               5979
    Min IOPS:               4280
    Average Latency(s):     0.00304033
    Stddev Latency(s):      0.000755765
    Max latency(s):         0.0208767
    Min latency(s):         0.00156849
    IOPs test (4K block size) - READS
    Code:
    Total time run:       15.658241
    Total reads made:     315615
    Read size:            4096
    Object size:          4096
    Bandwidth (MB/sec):   78.7362
    Average IOPS:         20156
    Stddev IOPS:          223
    Max IOPS:             20623
    Min IOPS:             19686
    Average Latency(s):   0.000779536
    Max latency(s):       0.00826032
    Min latency(s):       0.000374155
    Any ideas why the IOPS performance is so low in the 4K bs tests compared to using the disks standalone without Ceph?
    I understand that there will definitely be a slowdown due to the nature/overhead of any software-defined storage solution, but are there any suggestions to make these results better, since there are so many spare resources left unutilized?

    Or to put it another way, how can I find the bottleneck in my tests (since the network and the disks can handle way more than what I am currently getting)?

    Thanks and apologies for the long post :)
     
  11. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    3x 5260 = 15,780 IO/s, assuming a replica count of 3. That is close to our 4K fio benchmark. Ceph syncs its objects onto three disks and only then gets an ACK back. This is also why reads perform significantly better than writes.

    Code:
    fio --filename=/dev/sdx --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest
    Taken from storagereview.com; compare their test with ours. I suppose the SM863 960GB will show similar results when run with our fio benchmark.

    Code:
    4ktest: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
    ...
    fio-2.16
    Starting 16 processes
    Jobs: 16 (f=16): [r(16)] [100.0% done] [374.5MB/0KB/0KB /s] [95.9K/0/0 iops] [eta 00m:00s]
    4ktest: (groupid=0, jobs=16): err= 0: pid=14394: Fri Mar  2 10:42:04 2018
      read : io=22466MB, bw=383401KB/s, iops=95850, runt= 60002msec
        slat (usec): min=2, max=23007, avg=107.02, stdev=560.04
        clat (usec): min=67, max=28002, avg=2562.47, stdev=1688.18
         lat (usec): min=110, max=30041, avg=2669.49, stdev=1736.84
        clat percentiles (usec):
         |  1.00th=[  354],  5.00th=[  470], 10.00th=[  604], 20.00th=[  884],
         | 30.00th=[ 1208], 40.00th=[ 1576], 50.00th=[ 2024], 60.00th=[ 3376],
         | 70.00th=[ 3760], 80.00th=[ 4128], 90.00th=[ 4768], 95.00th=[ 5216],
         | 99.00th=[ 5856], 99.50th=[ 7264], 99.90th=[10688], 99.95th=[11328],
         | 99.99th=[15168]
        lat (usec) : 100=0.01%, 250=0.02%, 500=6.11%, 750=9.20%, 1000=8.40%
        lat (msec) : 2=25.86%, 4=27.18%, 10=23.09%, 20=0.15%, 50=0.01%
      cpu          : usr=0.98%, sys=7.25%, ctx=2890416, majf=0, minf=155
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued    : total=r=5751202/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
         latency   : target=0, window=0, percentile=100.00%, depth=16
    
    Run status group 0 (all jobs):
       READ: io=22466MB, aggrb=383400KB/s, minb=383400KB/s, maxb=383400KB/s, mint=60002msec, maxt=60002msec
    
    Disk stats (read/write):
      sdb: ios=5750566/223, merge=34/2, ticks=8329028/216, in_queue=8412756, util=100.00%
    Made with the above fio line on one of our SM863 240GB drives.
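
    For the write side of a single drive, the same line can be reused with the read/write mix flipped. This is just the obvious variation of the command above, not necessarily the exact line behind the 17k write IOPS figure from the benchmark paper:
    Code:
    # WARNING: writes directly to the device and destroys the data on it
    fio --filename=/dev/sdx --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=0 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest-write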
     
  12. Cha0s

    Cha0s New Member

    Joined:
    Feb 9, 2018
    Messages:
    8
    Likes Received:
    0
    I understand that there are 3 writes to be acknowledged and that this will cause a decrease in total IO/s, but it does not fully explain the huge difference in write IO/s when writing directly to a single disk compared to writing to a Ceph pool.
    We are not talking about a 5-10% performance loss here. From 17k IO/s to 5k IO/s is about a 70% reduction (if I am not screwing up the percentage calculation)!

    Also, your calculation of 3x 5260 = 15,780 IO/s doesn't sound like a correct methodology for measuring IO/s; or at least it seems irrelevant to multiply by the number of replicas.
    If that's how we should calculate the total IO/s, then the theoretical maximum should be 17k IO/s x 3 = 51k IOPS, and 15.7k IO/s is still ~69% less than 51k IO/s.

    So, what can be done (in terms of configuration) to improve this number? (regardless of the actual number and how to measure it)

    Obviously the hardware can handle way more. The bottleneck seems to be somewhere in Ceph, and the '3x write ACKs' doesn't sound like a valid reason for this.
    The CPU usage during these tests is ~50% so there's plenty of room there.

    I don't have any test results at hand, but I don't think that when doing RAID5 or RAID10 you get a 70% loss in IO/s just because of the 3-4 write ACKs. Of course, comparing Ceph to RAID is like comparing oranges to apples. But still... a 70% performance drop in IO/s does not seem normal...
     
  13. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    From my understanding, the ACK is returned when the third copy has been written. So in the worst case a write takes 3x longer (first the write to the primary, then in parallel to the secondary and tertiary). And I guess your cluster is not empty, so the OSDs are already busy serving other clients.
    Code:
    Total time run:         60.001525
    Total writes made:      544276
    Write size:             4096
    Object size:            4096
    Bandwidth (MB/sec):     35.4337
    Stddev Bandwidth:       1.00231
    Max bandwidth (MB/sec): 37.0352
    Min bandwidth (MB/sec): 33.7891
    Average IOPS:           9071
    Stddev IOPS:            256
    Max IOPS:               9481
    Min IOPS:               8650
    Average Latency(s):     0.00176275
    Stddev Latency(s):      0.000345305
    Max latency(s):         0.0103017
    Min latency(s):         0.00107972
    3 PVE hosts with 4x Bluestore OSDs each (a total of 12). We achieve around 9k IO/s with 10 GbE (MTU 9000), and the cluster (and the switch) was idle. Our latency is lower than in your figures, and I presume that you already have a workload running on your cluster. Further, our server and network hardware differs from yours. The encapsulation of Ethernet packets on InfiniBand will certainly add to the latency that you observe.

    To make a comparison, you would need your setup empty (not in production) and do baseline benchmarks and testing.

    Ceph uses 4 MB objects by default and is optimized to handle those; while a test with 4 KB makes sense for comparing a single drive, it says less about Ceph.
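
    As a rough baseline checklist for such a comparison (device, address and pool names are placeholders): raw disk, raw network, then Ceph itself on the otherwise idle cluster.
    Code:
    # single disk, 4K sync writes (destroys the data on /dev/sdx!)
    fio --filename=/dev/sdx --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --name=disk-baseline
    # raw network throughput between two Ceph nodes
    iperf -c 10.15.15.51 -P 4
    # Ceph itself: a single OSD bench and a rados bench with the default 4M objects
    ceph tell osd.0 bench
    rados bench -p testpool 60 write -t 16 --no-cleanup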
     
  14. Cha0s

    Cha0s New Member

    Joined:
    Feb 9, 2018
    Messages:
    8
    Likes Received:
    0
    All my posted results are from a lab cluster that is completely empty and set up just for benchmarks and tests.
    The rados benchmarks were done without a single VM running at the time and networking wise all nodes can do >20Gbps without breaking a sweat.
    All the CPUs on all nodes during the rados benchmarks never went over 50%.
    So there is no apparent bottleneck anywhere on the physical hardware.

    The bottom line is: how do you make Ceph perform faster than this? As I said, high throughput numbers are useless when it comes to real-life workloads. And getting only 5k IO/s when each drive can do WAY more is just bad, however you look at it.

    I repeat: we are talking about a 70% decrease in IO performance. That cannot be caused just by waiting 3 times longer for the replica ACKs. It's just preposterous for a technology that's supposed to be highly performant for the cloud.
    Even at 9k IO/s it's still a 47% decrease in performance! Something is wrong with these results.
     
  15. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    Look at the latency: the 10 GbE has lower latency than the 40 Gb IPoIB. Ceph doesn't work with IB natively; this costs a lot of CPU, and the packets go through the IP stack, so some features of IB aren't used at all (this adds latency).
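
    A simple way to compare the two links is to run a fast ping over each of them and look at the average round-trip time, e.g. (address is a placeholder):
    Code:
    # 1000 small pings at 10 ms intervals over the Ceph network
    ping -c 1000 -i 0.01 -q 10.15.15.51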
     
  16. dcsapak

    dcsapak Proxmox Staff Member
    Staff Member

    Joined:
    Feb 1, 2016
    Messages:
    2,924
    Likes Received:
    266
    Why not? When you get 17k IOPS from a disk, you have to wait 0.05 ms (or 50 µs) for a write; then add the latency of the network, e.g. 10-50 µs for 10 Gbit, and add the latency of the disk again, another 50 µs.
    Now we are at 110-150 µs of latency for one write (not factoring in CPU/NIC/kernel/Ceph latency), which is up to 3 times as slow, and you have a maximum of ~6500-9000 IOPS.

    edit: I also did not account for the network latency of the ACK, so this would further reduce the IOPS.
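
    As a back-of-the-envelope calculation (all numbers rounded, taken from above):
    Code:
    disk write latency (~17k IOPS)        ~  50 us
    + network latency (10 Gbit)             10-50 us
    + replica write on the other OSDs     ~  50 us   (secondary/tertiary in parallel)
    -------------------------------------------------
    per acknowledged 4K write              110-150 us
    => 1 s / 150 us ~ 6,700/s  up to  1 s / 110 us ~ 9,100/s, i.e. roughly the 6500-9000 IOPS range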
     
  17. Cha0s

    Cha0s New Member

    Joined:
    Feb 9, 2018
    Messages:
    8
    Likes Received:
    0
    Does that mean that the most IOPS you can attain on any 3-replica Ceph installation on 10GbE with these Samsung SSDs is at best 6500-9000, due to the latencies you just described?

    Sorry for insisting on the same stuff, I am just trying to make sense of the results.
     
  18. NewDude

    NewDude Member
    Proxmox VE Subscriber

    Joined:
    Feb 24, 2018
    Messages:
    58
    Likes Received:
    5
    Let me try to help Cha0s out:

    What's the best documented ceph performance on proxmox to date?

    :)
     
  19. alexskysilk

    alexskysilk Active Member
    Proxmox VE Subscriber

    Joined:
    Oct 16, 2015
    Messages:
    433
    Likes Received:
    48
    No separate DB/WAL (not much point if the data resides on NVMe). Multiple OSDs per NVMe. I'm attempting to follow the document here: https://software.intel.com/en-us/articles/accelerating-your-nvme-drives-with-spdk which should yield an order-of-magnitude improvement in IOPS.

    Building SPDK is simple enough, but I'm having trouble figuring out how to do it and keep the Proxmox Ceph functionality happy. For now, I'm stuck trying to adapt the setup script here: https://github.com/spdk/spdk/blob/master/scripts/ceph/start.sh but it's slow going; the alternative is to manually map devices and use ceph-disk to create the OSDs. If anyone wants to pitch in...
     
  20. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    @alexskysilk, SPDK increases performance because Ceph doesn't need to go through the kernel to access the NVMe drive, but it will not take away the latency of the network. I hope you can share some benchmarks with us. ;)

    Note: SPDK needs its own build packages and is not a straightforward setup for anyone starting out with Ceph.
     