Proxmox VE Ceph Benchmark 2018/02

Discussion in 'Proxmox VE: Installation and configuration' started by martin, Feb 27, 2018.

  1. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,093
    Likes Received:
    184
    With our hardware, it's a yes. But you can certainly use different hardware or, depending on your hardware, different settings, e.g. fibre cables instead of copper for Ethernet, NVMe instead of SSD, or network tuning settings. We are using DAC cables for our 100 GbE.
     
  2. alexskysilk

    alexskysilk Active Member

    Joined:
    Oct 16, 2015
    Messages:
    536
    Likes Received:
    57
    Working on it. As a point of order, the parent Ceph benchmark document describes the test methodology as "fio --ioengine=libaio --filename=/dev/sdx --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=fio --output-format=terse,json,normal --output=fio.log --bandwidth-log", but the results of this test are nowhere in the document (not that they would be of much use in this context, since you're writing directly to the disk instead of to the file system).
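    For readability, here is the quoted command again as a shell one-liner (/dev/sdx is a placeholder for the device under test):
    Code:
    fio --ioengine=libaio --filename=/dev/sdx --direct=1 --sync=1 --rw=write --bs=4K \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
        --name=fio --output-format=terse,json,normal --output=fio.log --bandwidth-log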

    stay tuned.
     
  3. alexskysilk

    alexskysilk Active Member

    Joined:
    Oct 16, 2015
    Messages:
    536
    Likes Received:
    57
    Testbed: 3 nodes consisting of:

    CPU: 2x Intel Xeon E5-2673 v4
    RAM: 8x 32GB DDR4, 2400MHz
    NIC: ConnectX-4, dual port, 100 GbE operating mode
    OSD: 12x Hynix HFS960GD0MEE-5410A, FW 40033A00

    Comments: 4M write is not really a useful indicator for a hypervisor workload, but I understand people want to see the ooh-ahh MB/s numbers. So, without further ado:

    rados bench -p rbd 60 write -b 4M -t 16 --no-cleanup
    Code:
    Total time run:         60.026194
    Total writes made:      40743
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     2715.01
    Stddev Bandwidth:       58.9678
    Max bandwidth (MB/sec): 2832
    Min bandwidth (MB/sec): 2512
    Average IOPS:           678
    Stddev IOPS:            14
    Max IOPS:               708
    Min IOPS:               628
    Average Latency(s):     0.023569
    Stddev Latency(s):      0.00984879
    Max latency(s):         0.242294
    Min latency(s):         0.0117974
    
    rados bench -p rbd -t 16 60 seq
    Code:
    Total time run:       34.092660
    Total reads made:     40495
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   4751.17
    Average IOPS:         1187
    Stddev IOPS:          24
    Max IOPS:             1224
    Min IOPS:             1100
    Average Latency(s):   0.0127535
    Max latency(s):       0.230566
    Min latency(s):       0.00275464
    
    rados bench -p rbd -t 16 60 rand
    Code:
    Total time run:       60.015327
    Total reads made:     76553
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   5102.23
    Average IOPS:         1275
    Stddev IOPS:          33
    Max IOPS:             1334
    Min IOPS:             1191
    Average Latency(s):   0.0118615
    Max latency(s):       0.27716
    Min latency(s):       0.00185301
    
     
  4. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,093
    Likes Received:
    184
    In the benchmark document, the chart above the fio command shows a subset of the results of the test.
     
  5. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,093
    Likes Received:
    184
    What does the direct fio output look like (the fio test taken from the benchmark paper)? That would give us a comparison of the NVMe, especially the latency.

    And would you mind sharing a rados bench run without DPDK?
    Code:
    Total time run:         60.028328
    
    Total writes made:      25675
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     1710.86
    Stddev Bandwidth:       48.2301
    Max bandwidth (MB/sec): 1768
    Min bandwidth (MB/sec): 1492
    Average IOPS:           427
    Stddev IOPS:            12
    Max IOPS:               442
    Min IOPS:               373
    Average Latency(s):     0.0374052
    Stddev Latency(s):      0.00804253
    Max latency(s):         0.104199
    Min latency(s):         0.0119991
    This is not a really comparable test, as we only have 3x Intel P3700 800GB (no DPDK), run with 'rados bench -p rbd 60 write -b 4M -t 16 --no-cleanup'. But I think the Hynix has a higher latency itself, and that's why the rados latency is close to ours. That's why I am interested in your results without DPDK, and in those of the NVMe itself.
     
  6. fips

    fips Member

    Joined:
    May 5, 2014
    Messages:
    134
    Likes Received:
    5
    Hi,
    because the SSD is such an essential part and, on the other hand, it should be cost-efficient (well, at least for me), I did some benchmarking on several consumer SSDs.
    All tests were made with fio on the same system, with the write cache disabled.
    Maybe it can be useful for somebody else too.
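    As a side note on the write cache: one common way to check and disable the volatile write cache of a SATA drive is hdparm; the device name below is only an example.
    Code:
    # show the current write-cache setting (example device)
    hdparm -W /dev/sda
    # disable the volatile write cache
    hdparm -W 0 /dev/sda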
    Command:
    Code:
    fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
    Results:

    Code:
    SSD                              BW            IOPS
    Transcend SSD370S 32GB            2143 KB/s      535
    Samsung 750 EVO 500GB             2071 KB/s      517
    Kingston SV300 240GB             93987 KB/s    23496
    Sandisk SDSSDH3250G 250GB         8925 KB/s     2231
    Toshiba TL100 240GB               3513 KB/s      878
    Sandisk SDSSDA240G 240GB          6957 KB/s     1739
    Transcend SSD320 256GB            5524 KB/s     1381
    Intensio 240GB                    3445 KB/s      861
    Teamgroup L5 240GB                5034 KB/s     1258
    Toshiba TR200 240GB               3919 KB/s      979
    Micron 1100 256GB                 5195 KB/s     1298
    Adata SX950 240GB                 5917 KB/s     1479
    Sandisk SD8SB8U2561122 256GB      6936 KB/s     1734
    Kingston SUV400S37480G 480GB      2615 KB/s      653
    Corsair Force LE200 240GB         2970 KB/s      742
    PNY CS900 240GB                   3910 KB/s      977
    Samsung 860 Pro 256GB             1883 KB/s      470
    Crucial MX500 250GB               9878 KB/s     2469
    Kingston SA400 240GB              2822 KB/s      705
    
     
    chrone likes this.
  7. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,093
    Likes Received:
    184
  8. LightKnight

    LightKnight New Member

    Joined:
    Mar 7, 2018
    Messages:
    3
    Likes Received:
    0
    I'm currently piecing together the parts for a 40 GbE production Ceph setup. Since we have the rack space, I'm separating the Proxmox Ceph servers from the Proxmox VM servers. I think management will be easier for me that way and the load will be split up. To be honest, I'm not sure how much of a CPU and RAM hog Ceph gets, so it's better to be safe than sorry. I'll post some benchmarks when it's complete.

    I have a question though: does the Ceph backend really need a separate switch, or can the backend and frontend (with separate NICs) connect to the same switch and be separated with VLANs? I'm planning on having one Arista 7050QX-32S, which has plenty of 40 GbE ports for the whole setup (plus two 1 GbE switches for the cluster network).
     
  9. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,093
    Likes Received:
    184
    If you use our default setup, then the cluster and public networks are on the same IP range. A separation only makes sense if you really separate them physically (it can be the same switch); a simple separation by VLAN on the same NIC port will not bring any benefit.
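    For illustration, a physically separated setup ends up with two subnets in ceph.conf, roughly like this (the networks below are only example values):
    Code:
    [global]
        # client and monitor traffic
        public network  = 10.10.10.0/24
        # OSD replication and heartbeat traffic
        cluster network = 10.10.20.0/24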
     
  10. Mario Hosse

    Mario Hosse New Member

    Joined:
    Oct 25, 2017
    Messages:
    19
    Likes Received:
    3
    @LightKnight
    RAM = 1 GB per 1 TB of disk space
    CPU cores = 1x 64-bit AMD-64 per OSD + 1x 64-bit AMD-64 per MON + 1x 64-bit AMD-64 per MDS

    Why don't you take a dual-port 40 Gb network card for the nodes? Then you can make a 2x 40 GBit bond per node and separate the networks via VLANs. I would take two switches for failure safety and interconnect them with MLAG. That way you could use smaller switches and still get the same number of ports.
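    A minimal sketch of such a bond on one node, assuming ifupdown-style /etc/network/interfaces, LACP towards the MLAG pair, and example interface names, VLAN ID and address:
    Code:
    auto bond0
    iface bond0 inet manual
        bond-slaves enp5s0f0 enp5s0f1    # the two 40 GbE ports (example names)
        bond-mode 802.3ad                # LACP, one leg to each MLAG switch
        bond-xmit-hash-policy layer3+4

    # Ceph traffic on its own VLAN of the bond (example VLAN 100)
    auto bond0.100
    iface bond0.100 inet static
        address 10.10.10.11
        netmask 255.255.255.0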

    @all
    Here is my test.
    Node: Supermicro 2028BT-HNR+ (4-node chassis)

    Per node:
    CPU: 2x E5-2650 v4 @ 2.20GHz
    RAM: 512GB, 2400MHz
    Network: dual-port 40Gb network card
    OSD per node: 6x Intel SSD P4500 4TB

    rados bench -p test 60 write -b 4M -t 16 --no-cleanup
    Code:
    Total time run:         60.019578
    Total writes made:      53479
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     3564.1
    Stddev Bandwidth:       77.0995
    Max bandwidth (MB/sec): 3732
    Min bandwidth (MB/sec): 3384
    Average IOPS:           891
    Stddev IOPS:            19
    Max IOPS:               933
    Min IOPS:               846
    Average Latency(s):     0.0179543
    Stddev Latency(s):      0.0118561
    Max latency(s):         0.250379
    Min latency(s):         0.00742251
    
    rados bench -p test -t 16 60 seq
    Code:
    Total time run:       58.703606
    Total reads made:     53479
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   3644
    Average IOPS:         911
    Stddev IOPS:          33
    Max IOPS:             970
    Min IOPS:             820
    Average Latency(s):   0.0168323
    Max latency(s):       0.425657
    Min latency(s):       0.00464903
    
    Best regards
    Mario
     
  11. LightKnight

    LightKnight New Member

    Joined:
    Mar 7, 2018
    Messages:
    3
    Likes Received:
    0
    @Alwin
    Maybe I didn't explain it well enough. The Ceph servers will have two 40 GbE NICs, one for the Ceph network and one for the public VM network. The VM servers will have one 40 GbE card for the VM network. Everything will be on one switch, but the Ceph traffic will be on its own VLAN, separate from the VM VLANs. That's probably going to change with the idea below, though. I'm guessing a separate VLAN on a bonded link would still be beneficial?

    @Mario Hosse
    I know the minimum requirements; I'm not sure of the real-world requirements. Take the recent Micron MySQL Ceph RBD article, for example (can't post links yet, look up "Micron Ceph MySQL"). I know it's a benchmark aimed at maxing out resources, but with only 5 MySQL client servers the 44-core storage servers were at 30% CPU. It made me consider using E5s for my storage nodes, but I think I'm going to save some of the budget and go with 8-core Xeon-Ds instead. We're running SATA SSDs, not NVMe, so the load should be smaller since they're slower. That limits me to one dual-port 40 GbE NIC for the Ceph nodes, which is fine, I suppose. A dedicated dual-port 40 GbE card for a handful of SATA SSDs is overkill anyway.

    Thanks for the MLAG idea, it's so obvious I can't believe I didn't think of it myself!
     
    #31 LightKnight, Mar 7, 2018
    Last edited: Mar 7, 2018
  12. Mario Hosse

    Mario Hosse New Member

    Joined:
    Oct 25, 2017
    Messages:
    19
    Likes Received:
    3
    @LightKnight
    The RAM requirements of Ceph are pretty close to the real-world values; see this example of top without load:
    Code:
       PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
       3256 ceph      20   0 5711404 4.664g  27048 S   5.9  0.9 170:31.73 ceph-osd
       3383 ceph      20   0 5602124 4.561g  26728 S   5.9  0.9 154:00.58 ceph-osd
       4041 ceph      20   0 5551536 4.499g  26732 S   5.9  0.9 144:38.02 ceph-osd
       3383 ceph      20   0 5602124 4.561g  26728 S   1.7  0.9 154:00.63 ceph-osd
       3256 ceph      20   0 5711404 4.664g  27048 S   1.3  0.9 170:31.77 ceph-osd
       3508 ceph      20   0 5545372 4.506g  26820 S   1.0  0.9 146:08.76 ceph-osd
       3630 ceph      20   0 5616528 4.567g  26592 S   1.0  0.9 140:29.78 ceph-osd
       3917 ceph      20   0 5518576 4.480g  26548 S   1.0  0.9 140:16.43 ceph-osd
       4041 ceph      20   0 5551536 4.499g  26732 S   1.0  0.9 144:38.05 ceph-osd
       3508 ceph      20   0 5545372 4.506g  26820 S   1.7  0.9 146:08.81 ceph-osd
       3630 ceph      20   0 5616528 4.567g  26592 S   1.7  0.9 140:29.83 ceph-osd
       3256 ceph      20   0 5711404 4.664g  27048 S   1.3  0.9 170:31.81 ceph-osd
       3383 ceph      20   0 5602124 4.561g  26728 S   1.3  0.9 154:00.67 ceph-osd
       4041 ceph      20   0 5551536 4.498g  26732 S   1.0  0.9 144:38.08 ceph-osd
       3917 ceph      20   0 5518576 4.480g  26548 S   0.7  0.9 140:16.45 ceph-osd
       2843 ceph      20   0 3240136 541728  22464 S   0.3  0.1  50:46.19 ceph-mon
    
    In normal operation the CPU load is between 3 and 15% across 48 cores.
    One dual-port 40 GbE NIC in MLAG mode is perfect.
     
  13. Gerhard W. Recher

    Proxmox Subscriber

    Joined:
    Mar 10, 2017
    Messages:
    144
    Likes Received:
    7
    My 2 cents ...
    This is on a 56 Gbit/s network; for the configuration, see my signature.
    Code:
    Total time run:         60.022982
    Total writes made:      41366
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     2756.68
    Stddev Bandwidth:       174.339
    Max bandwidth (MB/sec): 2976
    Min bandwidth (MB/sec): 2228
    Average IOPS:           689
    Stddev IOPS:            43
    Max IOPS:               744
    Min IOPS:               557
    Average Latency(s):     0.0232139
    Stddev Latency(s):      0.00889217
    Max latency(s):         0.246315
    Min latency(s):         0.00900196
    
     
  14. LightKnight

    LightKnight New Member

    Joined:
    Mar 7, 2018
    Messages:
    3
    Likes Received:
    0
    I wonder if forcing lz4 compression in BlueStore helps real-world performance like it does with ZFS. It would be interesting if someone could run a database benchmark in a VM with and without lz4.
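    For anyone who wants to test it, BlueStore compression can be switched per pool; a minimal sketch, assuming an example pool named 'rbd' and lz4 support in the build:
    Code:
    # enable lz4 compression on the example pool
    ceph osd pool set rbd compression_algorithm lz4
    ceph osd pool set rbd compression_mode aggressive
    # verify the setting
    ceph osd pool get rbd compression_algorithm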
     
  15. markmarkmia

    markmarkmia New Member

    Joined:
    Feb 5, 2018
    Messages:
    23
    Likes Received:
    0
    Is there a way to hack in support for direct-write EC pools in Ceph? I think the barrier at present is that we can't specify the data pool (since direct-write EC pools still need to use a replicated pool for metadata). I feel that for smaller networks this might help with throughput (halving the amount of data for replication in some cases). I'm using SSDs with Ceph only for database volumes where I need the IOPS (but not really the throughput), and I'd love to be able to reduce network throughput a bit using EC (while saving on disk space). The extra CPU isn't really a concern, because my hosts tend to run short on RAM well before CPU (but I may also have a very different use case than most).
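    At the Ceph level, the data-pool mechanism referred to looks roughly like this on a Luminous-era cluster; pool names, PG counts and sizes are only examples, and whether the Proxmox tooling lets you set the data pool is the separate question raised above:
    Code:
    # create an erasure-coded pool and allow partial overwrites (required for RBD)
    ceph osd pool create ecpool 128 128 erasure
    ceph osd pool set ecpool allow_ec_overwrites true
    # image metadata stays in the replicated 'rbd' pool, data objects go to the EC pool
    rbd create --size 100G --data-pool ecpool rbd/vm-100-disk-1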
     
  16. alexskysilk

    alexskysilk Active Member

    Joined:
    Oct 16, 2015
    Messages:
    536
    Likes Received:
    57
    bw=170912KB/s, iops=42728,
    lat (usec): min=19, max=557, avg=23.09, stdev= 2.97

    I'm not using dpdk yet. I've been busy putting out fires :p
     
  17. RonnySharma

    RonnySharma New Member

    Joined:
    Mar 11, 2018
    Messages:
    3
    Likes Received:
    0
    The Aprox app has been constantly crashing on Android Oreo. I would appreciate it if you could fix it. Thanks.
     
  18. tom

    tom Proxmox Staff Member
    Staff Member

    Joined:
    Aug 29, 2006
    Messages:
    13,448
    Likes Received:
    387
    This is off-topic; please contact the Aprox app developer for help.
     
  19. Chicken76

    Chicken76 Member

    Joined:
    Jun 26, 2017
    Messages:
    34
    Likes Received:
    1
    Why is that? Can you explain? I was contemplating a 3-node cluster and started doing the necessary reading when I stumbled upon your post. Why do you need "replication count" + 1?
     
  20. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,829
    Likes Received:
    158
    Hi,
    from my point of view that's not true - you can build a 3-node Ceph cluster without issues.
    One node can fail without data loss.

    But the downtime of the failed node should not be too long, because Ceph can't remap the data to other OSDs to reach the replica count of three again.
    But this depends on the amount of data. Often, in much bigger Ceph setups, it doesn't really make sense to remap all data to other nodes, because it is faster to bring the failed node back (spare server...). E.g. if one node has 10x 4 TB OSDs, you need a long time to rebalance the data across the other nodes.
    And you need the free space on the other nodes, of course!

    But Ceph wins with more nodes (more speed, less trouble during rebalance).

    Udo
     
    RokaKen, Dan Nicolae and Chicken76 like this.