Proxmox VE Ceph Benchmark 2018/02

Discussion in 'Proxmox VE: Installation and configuration' started by martin, Feb 27, 2018.

  1. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,845
    Likes Received:
    159
    Hi,
     you read the data that you wrote before (from this node) to the pool - once all available data has been read, the benchmark stops.
     Because reading is faster than writing, the job is done in 32 seconds.
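     A minimal sketch of that sequence, assuming a test pool name and the default 16 threads (the --no-cleanup on the write is what leaves data behind for the read test):

     Code:
     # write 4M objects for 60s and keep them in the pool (pool name is a placeholder)
     rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup
     # sequential read of the objects written above; stops early once everything has been read
     rados bench -p testpool 60 seq -t 16
     # remove the benchmark objects afterwards
     rados -p testpool cleanup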

    Udo
     
  2. victorhooi

    victorhooi Member

    Joined:
    Apr 3, 2018
    Messages:
    138
    Likes Received:
    7
    Got it.

     Is there any way to figure out what the bottleneck is in the above (e.g. network, storage drives, or RAM)? Or whether we've hit some hard limitation of Ceph at this scale?
     
  3. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,538
    Likes Received:
    221
     You have reached your network limit; compare the results from our benchmark paper. To really get the IO/s out of your NVMe drives, you should consider upgrading to 40GbE or even 100GbE (with 3 nodes, no switch is needed).

     Possibly due to the read limitation of your LVM storage, but this is just a shot in the dark.
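     One quick way to check whether the network really is the ceiling is an iperf run between two nodes on the Ceph network (the IP and options below are only placeholders):

     Code:
     # on the first node
     iperf3 -s
     # on a second node, against the first node's Ceph network address (placeholder IP)
     iperf3 -c 10.10.10.1 -P 4 -t 30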
     
  4. Alexander Marek

    Alexander Marek New Member
    Proxmox Subscriber

    Joined:
    Apr 6, 2018
    Messages:
    8
    Likes Received:
    0
     Did anybody compare the SM883 with the SM863?
     It seems the SM863 is no longer available on the market!

     I guess performance is approximately the same, since it is just a newer model?

    Thank you in advance

    BR
     
  5. Ronny

    Ronny Member

    Joined:
    Sep 12, 2017
    Messages:
    37
    Likes Received:
    0
     And what about the Samsung PM883 - any experience with this one?

    regards
    Ronny
     
  6. fips

    fips Member

    Joined:
    May 5, 2014
    Messages:
    141
    Likes Received:
    5
     Here are the results of my latest benchmarks:

    Code:
     Model             Size     TBW      BW           IOPS
     Intel DC S4500    480GB    900TB    62.4 MB/s    15.0k
     Samsung PM883     240GB    341TB    67.2 MB/s    17.2k
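     These look like single-job 4k sync write results, so presumably something along the lines of the fio run from the benchmark paper (the device path is a placeholder - this writes to the raw disk and destroys its data):

     Code:
     # destructive raw-device test, only run against an empty disk (placeholder /dev/sdX)
     fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 --rw=write \
         --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=ssd-4k-sync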
     
  7. Alibek

    Alibek Member

    Joined:
    Jan 13, 2017
    Messages:
    66
    Likes Received:
    5
     Is the following limitation of Ceph true:
     ~10k IOPS per OSD?
     
  8. Tacid

    Tacid New Member

    Joined:
    Aug 30, 2018
    Messages:
    3
    Likes Received:
    2
     I've got 15k-16k random write IOPS with 16 threads and 4k blocks per BlueStore OSD on a 40G IB network, but that is a good result; 10k per OSD is not bad. With io-thread=1 I can get only 1,400-1,600 random write IOPS.
     The problem here is the latency of the OSD code and the write amplification (WA) produced on every operation. The OSD itself can take 700 μs (0.7 ms) just to execute one IO operation, so even on a RAM disk, where kernel IO operations complete in <10 μs, you can barely reach 3k IOPS even with the best high-frequency CPU.

     P.S. Random read is about 35-50k IOPS on the same system, but that is really just a measure of OSD performance (the data was read from the OSD cache, so no disk IO happens during the test).
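     The 700 μs per operation also lines up with the single-threaded figure: 1 s / 700 μs ≈ 1,400 IOPS. For anyone who wants to reproduce the 16-thread numbers, a fio run against an RBD image looks roughly like this (pool, image and client names are placeholders; fio must be built with the rbd engine):

     Code:
     # 4k random writes with 16 outstanding IOs against an existing RBD image (placeholders)
     fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=bench-img \
         --rw=randwrite --bs=4k --iodepth=16 --numjobs=1 \
         --runtime=60 --time_based --name=rbd-4k-randwrite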
     
  9. oversite

    oversite New Member

    Joined:
    Jul 13, 2011
    Messages:
    3
    Likes Received:
    0
     I got worse results than expected from the SM883 (I thought they would be at least as good as the SM863), so I ended up using the PM963 and, even more so, the PM983. I cannot really tell whether it is the NVMe interface or the SSDs themselves, and these are also considered read-intensive disks, but I get much better performance and lower latency compared to the SM883. I have not looked into it very deeply, but I suppose the SM883 and SM863 are MLC and the PM983 is TLC; nevertheless, they work much better for me with Ceph. I am using the 1TB models with one OSD on each, no separate DB or WAL.
     /Hans

     
  10. Alibek

    Alibek Member

    Joined:
    Jan 13, 2017
    Messages:
    66
    Likes Received:
    5
     Thanks, but no, this is not good... Currently the only way to maximize utilization of an NVMe device (for example an Optane with ~500k random write IOPS at 4k blocks) is to split it into multiple OSDs, 10-50 per device. And of course we need to wait for the io_uring implementation: https://github.com/ceph/ceph/pull/27392
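     For anyone trying the splitting approach: ceph-volume can create several OSDs on one device in a single step; a sketch with example values (this wipes the device):

     Code:
     # create 4 OSDs on a single NVMe device (example count and device, destroys existing data)
     ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1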
     
  11. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,538
    Likes Received:
    221
    Or using dpdk to access the NVMe devices directly.
    https://github.com/ceph/dpdk
     
    Alibek likes this.
  12. Alibek

    Alibek Member

    Joined:
    Jan 13, 2017
    Messages:
    66
    Likes Received:
    5
    MikeWebb likes this.
  13. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,538
    Likes Received:
    221
     Software built with DPDK in mind needs to be recompiled for the specific software version in use, and is therefore not a stock solution.
     
  14. Alibek

    Alibek Member

    Joined:
    Jan 13, 2017
    Messages:
    66
    Likes Received:
    5
  15. Paspao

    Paspao Member

    Joined:
    Aug 1, 2017
    Messages:
    51
    Likes Received:
    1
    Hello,

     I am a little confused by the IOPS results in the benchmark PDF: around 200 write IOPS on a 10Gb network and around 300 write IOPS on a 100Gb/s network?

     Then in this thread people talk about reaching thousands of write IOPS on Ceph - what am I missing?

     If I have 30 LXC containers that work with a large number of small files, should I rather consider local SSDs instead of hyperconverged Ceph?

    Thank you.
    P.
     
  16. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,538
    Likes Received:
    221
     The IOPS are for 4 MB objects and a runtime of 60 sec.

     The rados bench (in these tests) is a single client with 16 threads, while a cluster usually has multiple concurrent clients accessing Ceph.

     Possibly; this depends on the IOPS and latency requirements of the workload. With Ceph, the containers use KRBD and have the page cache available to them, which can greatly increase performance. But all in all, you need to test and see which solution works better for the workload.
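     Rough arithmetic of why the 4 MB numbers look so low next to 4k figures - with 4 MB objects the bandwidth, not the IOPS, is what hits the ceiling (values rounded, link speeds theoretical):

     Code:
     # throughput = IOPS x object size
     # 200 IOPS x 4 MB = ~800 MB/s   -> close to the ~1.25 GB/s of a single 10GbE link
     # 300 IOPS x 4 MB = ~1200 MB/s  -> already needs more than a single 10GbE link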
     
  17. Rosario Contarino

    Rosario Contarino New Member

    Joined:
    Jul 16, 2019
    Messages:
    24
    Likes Received:
    0
    Hi all,
     we have been benchmarking CephFS with the following configuration:
     4 x Dell PowerEdge R730xd, each with 2 CPUs, 96GB RAM and 12 x 4TB HDDs, on a 10Gbps network.

     Then we built a container (*) with CentOS and ran our benchmarks on its local disk, which was a virtual disk on CephFS, and we already got impressive results (we are already outperforming Nutanix for files up to 8MB, and are three times faster than a classic VMware 6.5 setup with a Dell/EMC SAN).

     We are now procuring SSDs to replace all 48 HDDs; we will then run our final tests and publish them here.

     Our goal is to reach at least 600MB/s sustained write throughput for files up to 1TB, to outperform our GPFS systems.

     The question I have is:
     We would like to mount CephFS (using the kernel module) inside our CentOS 6.10 container and then run the benchmark on that mounted CephFS.
     Could you please point us to the documentation describing how to mount CephFS from Proxmox into a CentOS machine?

     Thank you
     Rosario

     (*) Side note: after restarting the cluster a couple of times we are no longer able to access the console of the container. Any suggestions?
     
  18. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,538
    Likes Received:
    221
     I don't think a direct mount inside the container will work, as the mount would happen more than once. Better to mount the CephFS on the node and use a mount point pointing to the CephFS storage.

     Better to open a new thread for this; it makes the question more visible.
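     A minimal sketch of that setup, assuming a CephFS storage with the ID 'cephfs' is already defined in Proxmox (container ID, subdirectory and target path are placeholders):

     Code:
     # the node mounts the CephFS storage under /mnt/pve/<storage-id>
     # bind a subdirectory of it into container 101 as /mnt/cephfs
     pct set 101 -mp0 /mnt/pve/cephfs/mydata,mp=/mnt/cephfs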
     
  19. Rosario Contarino

    Rosario Contarino New Member

    Joined:
    Jul 16, 2019
    Messages:
    24
    Likes Received:
    0
     Hi Alwin, thanks for your prompt reply. Could you please elaborate a bit more on this? In our future scenario we would like to have CephFS mounted by several containers, each probably running on a different node and each accessing CephFS (different paths, though). How would you recommend we do it?

     EDIT
     We are now procuring the SSDs you chose for your benchmark. If you could give us a bit more direction on how you recommend accessing CephFS from containers, that would be awesome. Thank you.
     
    #119 Rosario Contarino, Aug 14, 2019 at 14:18
    Last edited: Aug 15, 2019 at 13:29