Proxmox VE Ceph Benchmark 2018/02

Discussion in 'Proxmox VE: Installation and configuration' started by martin, Feb 27, 2018.

  1. flaf

    flaf New Member

    Joined:
    May 20, 2018
    Messages:
    3
    Likes Received:
    0
    Hi Alwin and thanks for your advice concerning the principle of a baseline comparison.

    In fact, I don't know exactly what performance I can expect, but I have run some more fio benchmarks (after reading some fio documentation) and I have the feeling that my new results are not really good with regard to my underlying hardware (see below).

    So, as you suggested, I have tested a single Intel S3520 SSD (an "OSD" disk) with this fio command (the SSD was formatted with ext4):

    Code:
    fio --bs=4k --size=4G --numjobs=3 --time_based --runtime=40 --group_reporting --name myjob \
        --rw=randwrite --ioengine=libaio --iodepth=8 --direct=1 --fsync=1
    
    and I obtained approximately 6261 IOPS.

    After that, I ran exactly the same fio benchmark in a Debian Stretch VM (which of course uses the Ceph storage, with a VirtIO disk and 512 MB of RAM) and I obtained approximately 362 IOPS.

    1. This does not seem very good for my hardware configuration (I have 24 SSD OSDs!). Am I wrong?
    2. What other tests/benchmarks could I try?

    For reference, my configuration and the complete output of the fio benchmarks are in the attached file promox-5.2-ceph-luminous-conf.txt.


    Thx in advance for your help.
     

    Attached Files:

  2. mada

    mada Member

    Joined:
    Aug 16, 2017
    Messages:
    98
    Likes Received:
    2
    Is it possible to do drive recovery over a separate private network, instead of killing the performance while in production?
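
    What I mean is something along these lines in ceph.conf (the subnets and values are just placeholders), i.e. moving replication/recovery traffic to its own cluster network and throttling recovery:

    Code:
    [global]
        # client traffic (VMs, rados bench, ...)
        public network  = 10.10.10.0/24
        # replication, backfill and recovery traffic
        cluster network = 10.10.20.0/24

    [osd]
        # throttle recovery/backfill so client I/O is not starved (example values)
        osd max backfills = 1
        osd recovery max active = 1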

    I'm testing Ceph with an Intel 900P as journal; I will post an update later with the results.
     
  3. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    @flaf, what are you trying to measure? The filesystem on the device introduces a layer in between that doesn't represent the capabilities of the underlying disk (especially since ext4 does badly with sync writes). Further, you should revise the fio options, e.g. why use randwrite or that iodepth/numjobs combination, and what to expect from them. See the fio tests in the benchmark PDF for comparison.
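
    The raw-device test from the benchmark paper is roughly along these lines (check the paper for the exact options; the device name is a placeholder and the test overwrites data on the device):

    Code:
    # WARNING: writes directly to the raw device and destroys data on it
    fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=write --bs=4k --numjobs=1 --iodepth=1 \
        --runtime=60 --time_based --group_reporting --name=journal-test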

    For your benchmarks, you need to look at the big picture and go top-down or bottom-up through the whole stack. This means that you test both the hardware and the software of your cluster. The different layers also work differently, e.g. Ceph uses 4 MB objects (leaving the network aside), while disks may use 512 KB (depending on the disk's block size). So your tests of these layers need to be different too.
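
    To illustrate the point about the layers, the Ceph object layer itself can for example be exercised with different object sizes (pool name is just an example):

    Code:
    # 4 MB objects, ceph's default object size
    rados bench -p rbd 60 write -b 4M -t 16 --no-cleanup
    # 4 KB objects, to see how the cluster handles small writes
    rados bench -p rbd 60 write -b 4K -t 16 --no-cleanup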

    In this thread, you will find different types of hardware to compare your results against. Use the benchmark paper as guideline for a good comparison of the results.
     
  4. flaf

    flaf New Member

    Joined:
    May 20, 2018
    Messages:
    3
    Likes Received:
    0
    Hi,

    Thx for your answer @Alwin.

    In fact, 1) I just want to know what random write IOPS I can reach in the VMs of my Proxmox cluster, and 2) I would like to understand why the option --fsync=1 lowers the performance so much.

    I admit I probably need to learn more about fio (and about I/O on Linux). I'm trying, but it's difficult to find good documentation. ;)

    I am a little "annoyed" because a colleague has built a VMware/vSAN cluster with exactly the same hardware configuration. With vSAN, the storage mechanism is different from Ceph (it's based on a kind of network RAID1 with dedicated SSD cache disks, the S3710 SSDs), but the performance is better in a VM (~1200 IOPS in an identical Debian VM with the same fio test).

    I will dig... ;)
     
    #64 flaf, May 28, 2018
    Last edited: May 28, 2018
  5. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    fsync makes sure the data is written to disk and doesn't land in any cache. It is up to the underlying system to honor the fsync or ignore it. As the vSAN is using a dedicated cache disk and works differently than Ceph, the starting points are not equal. Also, you are using Intel S3520s, which are slower than the S3710. To compete against your colleague, you may need to learn both storage technologies. ;)
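
    To see that effect in isolation, you can run the same fio job with and without the per-write fsync; a minimal sketch (not the exact commands from this thread; file path and size are placeholders):

    Code:
    # every write is followed by an fsync -> measures what actually reaches stable storage
    fio --name=sync-test --filename=/tmp/fio-test --size=1G --ioengine=libaio \
        --rw=randwrite --bs=4k --iodepth=1 --direct=1 --fsync=1

    # same job without fsync -> caches along the path can absorb the writes
    fio --name=cache-test --filename=/tmp/fio-test --size=1G --ioengine=libaio \
        --rw=randwrite --bs=4k --iodepth=1 --direct=1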
     
  6. mada

    mada Member

    Joined:
    Aug 16, 2017
    Messages:
    98
    Likes Received:
    2
    Did some tests on the following setup:

    Dual E5-2660
    75 GB RAM
    SM863 for the host OS
    Dual-port Mellanox 56 Gb/s
    3x 5 TB hard drive OSDs per server (9 OSDs total)
    1x P3700 journal per node (3 total)

    Code:
    osd commit_latency(ms) apply_latency(ms)
      8                 65                65
      7                 74                74
      6                 52                52
      3                  0                 0
      5                214               214
      0                 70                70
      1                 85                85
      2                 76                76
      4                196               196

    The test result with rados bench -p test 60 write --no-cleanup:

    Code:
    Total time run:         60.319902
    Total writes made:      3802
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     252.122
    Stddev Bandwidth:       19.7516
    Max bandwidth (MB/sec): 284
    Min bandwidth (MB/sec): 212
    Average IOPS:           63
    Stddev IOPS:            4
    Max IOPS:               71
    Min IOPS:               53
    Average Latency(s):     0.253834
    Stddev Latency(s):      0.131711
    Max latency(s):         1.10938
    Min latency(s):         0.0352605
    rados bench -p rbd -t 16 60 seq

    Code:
    Total time run:       14.515122
    Total reads made:     3802
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   1047.73
    Average IOPS:         261
    Stddev IOPS:          10
    Max IOPS:             277
    Min IOPS:             239
    Average Latency(s):   0.0603855
    Max latency(s):       0.374936
    Min latency(s):       0.0161585
    rados bench -p rbd -t 16 60 rand

    Code:
    Total time run:       60.076015
    Total reads made:     19447
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   1294.83
    Average IOPS:         323
    Stddev IOPS:          20
    Max IOPS:             364
    Min IOPS:             259
    Average Latency(s):   0.0488371
    Max latency(s):       0.468844
    Min latency(s):       0.00179505

    iperf -c

    Code:
     iperf -c 10.1.1.17
    ------------------------------------------------------------
    Client connecting to 10.1.1.17, TCP port 5001
    TCP window size: 2.50 MByte (default)
    ------------------------------------------------------------
    [  3] local 10.1.1.16 port 54442 connected with 10.1.1.17 port 5001
    [ ID] Interval       Transfer     Bandwidth
    [  3]  0.0-10.0 sec  24.5 GBytes  21.1 Gbits/sec
     
  7. Datanat

    Datanat New Member
    Proxmox VE Subscriber

    Joined:
    Apr 11, 2018
    Messages:
    16
    Likes Received:
    3
    Hi guys,

    Here's a short summary of some tests conducted in our lab.

    Hyperconverged setup.

    Server platform is 4x :

    Lenovo SR650
    2x Intel Silver 4114 10 cores
    256 GB RAM @2666Mhz
    1X Embedded 2x10Gbps Base-T LOM (x722 Intel) #CEPH
    1X PCI-E 2x10Gbps Base-T adapter (x550 Intel) #VMBR0
    For each server, the disk subsystem is:
    2x32Gb NVMe on M2 RAID 1 Adapter for OS
    3x 430-8i HBA's (8 HDD/HBA)
    8x 480GB Micron 5100 Pro SSD
    16x 2.4 TB 10K RPM Seagate Exos

    Switches are 2x Netgear M4300-24x10Gbps Base-T (4x10Gbps Stacking)

    SETUP SPECIFICS

    CEPH network is a 2x10Gbps LACP
    CEPHX is disabled (see the ceph.conf sketch after this list)
    DEBUG is disabled
    Mon on nodes 1,2,3

    SSDs gathered in a pool used as writeback cache for the SAS pool.

    No specific TCP Tuning, No JUMBO FRAMES
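
    For reference, disabling cephx and the debug logging is typically done in ceph.conf along these lines (a sketch, not our full config):

    Code:
    [global]
        auth cluster required = none
        auth service required = none
        auth client required = none
        # silence the most verbose debug logging
        debug ms = 0/0
        debug osd = 0/0
        debug bluestore = 0/0
        debug filestore = 0/0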


    TEST PURPOSE : test the VM disk subsystem itself with a quick and dirty VM (there are tons of them out there).

    Sample VM : Worst case scenario
    8 vCPU
    4096 MB RAM
    Ubuntu Linux 16.04.03 updated
    No LVM
    HDD with NO CACHE, Size 200GB
    All in one partition ext4


    TEST ENGINE : fio

    For baselining :

    One VM issuing I/O without any other VMs running.

    BLOCK SIZE 4K

    60K 4K IOPS randread, sub-millisecond latency
    30K 4K IOPS randwrite, sub-millisecond latency

    THROUGHPUT TEST, 1M block size write
    Approximately 985 MB/s steady on a 120 GB test file


    For reference:
    4K writes are an uncommon pattern; usually we see applications writing much larger blocks, so we also tested with an I/O size of 32K against a 64 GB file:
    30K IOPS random read (a single 10 Gbps link is the bottleneck; LACP will not help with just one VM issuing I/O over a single pipeline)
    20K IOPS random write, i.e. 620 MB/s
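
    The exact fio command lines are not included here; a job matching the description above would look roughly like this (file path, iodepth and numjobs are placeholders):

    Code:
    # 4K random read against a large test file inside the VM
    fio --name=randread-4k --filename=/root/fio-testfile --size=64G --ioengine=libaio \
        --rw=randread --bs=4k --iodepth=32 --numjobs=4 --direct=1 \
        --runtime=60 --time_based --group_reporting

    # 32K random write variant
    fio --name=randwrite-32k --filename=/root/fio-testfile --size=64G --ioengine=libaio \
        --rw=randwrite --bs=32k --iodepth=32 --numjobs=4 --direct=1 \
        --runtime=60 --time_based --group_reporting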



    12 CLONES TEST :
    At this time LACP kicked in to break the 10Gbps single link speed

    3x VM on each host issuing the same fio test

    So with 12 VMs issuing I/O concurrently, and because of caching, there is a small ramp-up time of less than a minute:

    I/O is issued against a 64 GB file on each VM

    rapidly 300K 4K randread IOPS, peaking at 466K, 344K average
    instantly 40K 4K randwrite IOPS, peaking at 70K, 66K average

    Again, we tested 32K random writes and had an average of 44K IOPS, i.e. 1.31 GB/s of throughput.

    We are currently gathering more data and tweaking the platform.

    Regards,
     
    DerDanilo likes this.
  8. Knuuut

    Knuuut Member

    Joined:
    Jun 7, 2018
    Messages:
    54
    Likes Received:
    3
    How did you do that?

    I'm confused: have the Micron SSDs been set up as separate WAL/DB devices for the Seagate OSDs?

    - or -

    Have the Micron SSDs been set up as OSDs in a separate pool?

    Regards
     
  9. Datanat

    Datanat New Member
    Proxmox VE Subscriber

    Joined:
    Apr 11, 2018
    Messages:
    16
    Likes Received:
    3
    I did it like this :

    one pool with all the SSDs, one pool with all the HDDs

    Then assign the SSD pool (named "cache") as a cache tier of the HDD pool (named "data"):

    Code:
    ceph osd tier add data cache
    Assign cache policy :

    Code:
    ceph osd tier cache-mode cache writeback
    To have client I/O go through the SSD pool, set the overlay:

    Code:
    ceph osd tier set-overlay data cache
    And set the hit set type:

    Code:
    ceph osd pool set cache hit_set_type bloom
    The SSDs are not set up as WAL/DB devices for the spinning drives; maybe that would help, I should try it. The HDDs are hybrid drives, so I don't know if it would help much. Anyway, it is worth a try I guess.

    Regards,
     
    Knuuut likes this.
  10. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    @Datanat, do you have any rados bench results?
     
  11. Datanat

    Datanat New Member
    Proxmox VE Subscriber

    Joined:
    Apr 11, 2018
    Messages:
    16
    Likes Received:
    3
    @Alwin no, I just tested the VM. I will use the same commands you used in the official benchmark and post the results.
     
  12. Datanat

    Datanat New Member
    Proxmox VE Subscriber

    Joined:
    Apr 11, 2018
    Messages:
    16
    Likes Received:
    3
    So here are the results :

    First, I/O is issued directly to the SSD pool:

    Code:
    rados bench -p cache 60 write -b 4M -t 16
    
     sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
       60      16     16251     16235   1082.18      1084   0.0415562   0.0590952
    Total time run:         60.048231
    Total writes made:      16251
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     1082.53
    Stddev Bandwidth:       29.4887
    Max bandwidth (MB/sec): 1156
    Min bandwidth (MB/sec): 988
    Average IOPS:           270
    Stddev IOPS:            7
    Max IOPS:               289
    Min IOPS:               247
    Average Latency(s):     0.0591144
    Stddev Latency(s):      0.0183519
    Max latency(s):         0.29716
    Min latency(s):         0.02596
    Cleaning up (deleting benchmark objects)
    Removed 16251 objects
    Clean up completed and total clean up time :2.970829
    Now we issue I/O directly to the data pool

    Code:
    rados bench -p data  60 write -b 4M -t 16
    As intended, the I/O is redirected to the writeback cache pool:

    Code:
    ceph osd pool stats
    pool cache id 3
      client io 1050 MB/s wr, 0 op/s rd, 525 op/s wr
      cache tier io 262 op/s promote
    
    pool data id 4
      nothing is going on
    So same results here :


    Code:
     2018-07-06 08:25:37.224862 min lat: 0.0261885 max lat: 0.293642 avg lat: 0.0585165
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
       60      16     16414     16398   1093.04      1084   0.0575738   0.0585165
    Total time run:         60.069431
    Total writes made:      16415
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     1093.07
    Stddev Bandwidth:       28.9973
    Max bandwidth (MB/sec): 1156
    Min bandwidth (MB/sec): 1016
    Average IOPS:           273
    Stddev IOPS:            7
    Max IOPS:               289
    Min IOPS:               254
    Average Latency(s):     0.0585415
    Stddev Latency(s):      0.0181863
    Max latency(s):         0.293642
    Min latency(s):         0.0261885
    Cleaning up (deleting benchmark objects)
    Removed 16415 objects
    Clean up completed and total clean up time :3.163250

    It seems like a single client can only saturate one link, whereas multiple VMs issuing I/O concurrently can break this barrier.

    Any experience with that?
     
  13. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    The caching pool has good performance. The real write throughput will show once there is I/O on the cluster. With VMs, most data tends to be hot data that never leaves the caching pool, hence space is not freed up for caching new I/O.
     
  14. Datanat

    Datanat New Member
    Proxmox VE Subscriber

    Joined:
    Apr 11, 2018
    Messages:
    16
    Likes Received:
    3
    Yes @Alwin you are right,

    We will need to tweak this to get a more 'real life' scenario.

    In fact, Ceph's documentation calls it a 'cache', but it looks more like a tiering system.

    By default it seems that there is no dirty object eviction until the cache pool is full. So eventually, with the SSDs completely filled, the performance drop would be dramatic.

    I will toy around with this : http://docs.ceph.com/docs/jewel/rados/operations/cache-tiering/
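
    A sketch of the pool settings I intend to play with so that flushing/eviction starts well before the cache pool is full (the values are only examples, pool name as above):

    Code:
    # give the cache agent a size to work against (example: ~2 TB)
    ceph osd pool set cache target_max_bytes 2000000000000
    # start flushing dirty objects at 40% and evicting at 80% of that target
    ceph osd pool set cache cache_target_dirty_ratio 0.4
    ceph osd pool set cache cache_target_full_ratio 0.8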


    Best regards,
     
  15. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
  16. Datanat

    Datanat New Member
    Proxmox VE Subscriber

    Joined:
    Apr 11, 2018
    Messages:
    16
    Likes Received:
    3
    Thanks @Alwin.
    I will read this carefully
     
  17. Datanat

    Datanat New Member
    Proxmox VE Subscriber

    Joined:
    Apr 11, 2018
    Messages:
    16
    Likes Received:
    3
    @Alwin here are the raw performance numbers of the 64x SAS pool.

    Destroyed the pools, created only an HDD pool

    Issued a lot of write threads :

    Code:
     rados bench -p testsas  180 write -b 4M -t 1024 --no-cleanup
    
    Code:
    2018-07-06 14:51:45.695910 min lat: 3.33414 max lat: 4.06629 avg lat: 3.67097
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
      180     928     50156     49228   1093.79      1488     3.33424     3.67097
    Total time run:         180.169509
    Total writes made:      50156
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     1113.53
    Stddev Bandwidth:       173.51
    Max bandwidth (MB/sec): 1488
    Min bandwidth (MB/sec): 0
    Average IOPS:           278
    Stddev IOPS:            43
    Max IOPS:               372
    Min IOPS:               0
    Average Latency(s):     3.63559
    Stddev Latency(s):      0.301269
    Max latency(s):         4.06629
    Min latency(s):         0.170923
    

    Otherwise it is near SSD performance.

    Code:
    rados bench -p testsas  60 write -b 4M -t 16 --no-cleanup
    
    Code:
    2018-07-06 15:05:22.944918 min lat: 0.0238251 max lat: 0.305691 avg lat: 0.0673414
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
       60      16     14261     14245   949.533       932   0.0568388   0.0673414
    Total time run:         60.075153
    Total writes made:      14261
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     949.544
    Stddev Bandwidth:       28.6206
    Max bandwidth (MB/sec): 1004
    Min bandwidth (MB/sec): 876
    Average IOPS:           237
    Stddev IOPS:            7
    Max IOPS:               251
    Min IOPS:               219
    Average Latency(s):     0.0673869
    Stddev Latency(s):      0.0214755
    Max latency(s):         0.305691
    Min latency(s):         0.0238251


    Anyway, we have only put a few terabytes of data on the platform, so we are still short-stroking the disks, and this is not the same as the many concurrent I/O patterns issued by VMs.

    I filled an HDD to 80% to test the performance degradation.

    With the hybrid drives' built-in cache helping, we see around 625-1800 randwrite IOPS per drive, and with larger I/Os we can peak at 270 MB/s per drive.

    Filled to 80%, it provides only 144 MB/s and a maximum of 320 IOPS, and latency rises a lot.

    Regards,
     
  18. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    You may try to use the SSDs as DB/WAL devices, so that multiple HDDs can share one SSD. This will benefit the small writes, as they go to the SSD.
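
    For example, with ceph-volume a BlueStore OSD with its DB on an SSD partition can be created roughly like this (device names are placeholders; pveceph offers a corresponding option as well, see its man page):

    Code:
    # example only: BlueStore OSD on the HDD, RocksDB DB (and WAL) on an SSD partition
    ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/sdY1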
     
  19. tuonoazzurro

    tuonoazzurro Member

    Joined:
    Oct 28, 2017
    Messages:
    54
    Likes Received:
    1

    If you put your P420 in HBA mode, what are you booting from? Where did you install Proxmox (you cannot boot from drives on a P420 in HBA mode)?
     
  20. Cha0s

    Cha0s New Member

    Joined:
    Feb 9, 2018
    Messages:
    8
    Likes Received:
    0
    I installed Proxmox on the first drive and then manually installed the bootloader onto a USB stick.
    Then I configured the server to boot from the USB stick since it cannot boot from any drive on the controller when in HBA mode.

    This was a lab setup, so I didn't bother with software RAID1 for the Proxmox installation.
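
    Roughly what that looked like on the Debian-based install (the USB stick's device name is just a placeholder, double-check it before running anything like this):

    Code:
    # install GRUB onto the USB stick instead of the (non-bootable) controller disks
    grub-install --recheck /dev/sdX
    update-grub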
     