Proxmox VE Ceph Benchmark 2018/02

Discussion in 'Proxmox VE: Installation and configuration' started by martin, Feb 27, 2018.

  1. Chicken76

    Chicken76 Member

    Joined:
    Jun 26, 2017
    Messages:
    34
    Likes Received:
    1
    So what happens when one of the three machines with ceph goes down? Can't it function with just two copies, or does it try to create a third copy on the existing free space?
     
  2. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    It will stay in a degraded state, while the VM/CT can still access the storage read/write. All copies are separated at the host level, hence there is no recovery, as no "4th" host is available to recover onto.
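
    For reference, a quick way to watch this (assuming a standard 3/2 replicated pool) is the cluster status output:
    Code:
    # PGs show up as active+undersized+degraded while the node is down;
    # client I/O keeps working because min_size=2 is still satisfied
    ceph -s
    ceph health detail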
     
    Chicken76 likes this.
  3. Chicken76

    Chicken76 Member

    Joined:
    Jun 26, 2017
    Messages:
    34
    Likes Received:
    1
    I see. So no data loss and no disruption to running VM/CT, right? Can you still create new VM/CT on the degraded cluster? And is migrating existing VM/CT between hosts possible on a degraded (3-host with 1 host down) cluster?

    And another question, if you'll permit: if a second host goes down, the cluster will stop working (lack of quorum) but the data is not lost, as there still is a healthy copy of it, and as soon as a second ceph server is spun up, it will start replicating, right?
     
  4. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    Besides those on the failed node.

    Yes, between the remaining hosts.

    From the Ceph point of view, as long as there is one copy left and a working monitor, recovery should happen. The Ceph pool(s) stay read-only as long as min_size (default = 2) is not met. But you cannot join a new node to the Proxmox cluster either, as the cluster is out of quorum and goes read-only. And because of that read-only state, the VM/CT will not run either.

    So in short, never let it get there. That saves you a lot of trouble and sleepless nights.
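
    If you want to double-check what your pools are actually set to (assuming the default replicated setup), the values can be read per pool:
    Code:
    # replica count, and the minimum number of replicas needed for client I/O
    ceph osd pool get <poolname> size
    ceph osd pool get <poolname> min_size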
     
    Chicken76 likes this.
  5. Chicken76

    Chicken76 Member

    Joined:
    Jun 26, 2017
    Messages:
    34
    Likes Received:
    1
    Is there no manual method to add another machine into a cluster after 2 out of 3 go down? It is not impossible for a lightning storm to cause a power spike that destroys electronic equipment. In case you have unfixable damage to 2 servers out of 3, and you have spare machines (that were not plugged in) or can get new ones fast, is there no way to restore the cluster?
     
  6. alexskysilk

    alexskysilk Active Member
    Proxmox VE Subscriber

    Joined:
    Oct 16, 2015
    Messages:
    433
    Likes Received:
    48
    Data integrity in a replicated environment is verified through a "democratic" process, which is to say that a piece of data is considered correct if the majority of "votes" agree that it is so. As a consequence, the minimum number of "votes" for quorum is 2. With a 3-node cluster, a node outage leaves the survivors unable to form a quorum, which means that if a piece of data has different values on the surviving PGs, that data will be considered dirty. And no one wants that ;)
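
    For the cluster-level side of the "voting", both the Ceph monitors and the Proxmox cluster expose their quorum state directly (assuming a stock Proxmox/Ceph node):
    Code:
    # monitor quorum as Ceph sees it
    ceph quorum_status --format json-pretty
    # corosync/Proxmox cluster quorum
    pvecm status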
     
    Chicken76 likes this.
  7. Chicken76

    Chicken76 Member

    Joined:
    Jun 26, 2017
    Messages:
    34
    Likes Received:
    1
    So if a block differs between the 2 remaining copies, there are two possibilities:
    1. one of them is corrupted, and this should be sorted out by the checksums (there are checksums for every block, a la ZFS, right?)
    2. one of them has newer data than the other one. Are there timestamps or write logs that could show which is the most recent copy?
     
  8. PigLover

    PigLover Active Member

    Joined:
    Apr 8, 2013
    Messages:
    100
    Likes Received:
    32
    @udo nailed it. It's not a matter of 3-node clusters not working - it's a matter of how you want them to work when there is a failure. You need the "+1" node in order to bring the cluster back to a stable operating state. It should continue to work without it, but you don't want it to stay that way for very long.

    This is just a statement about minimal configs. In larger clusters you almost always have many more nodes than your replica set - so no worries.
     
    Chicken76 likes this.
  9. Chicken76

    Chicken76 Member

    Joined:
    Jun 26, 2017
    Messages:
    34
    Likes Received:
    1
    OK guys, thank you very much for your explanations. The conclusion I take from this is that it doesn't function like a ZFS mirror with 1 drive down out of 3. While with ZFS you would be very much OK in this scenario, with ceph being distributed among separate machines things are much more complicated and consensus is much harder to achieve.
     
    #49 Chicken76, Mar 26, 2018
    Last edited: Mar 26, 2018
  10. alexskysilk

    alexskysilk Active Member
    Proxmox VE Subscriber

    Joined:
    Oct 16, 2015
    Messages:
    433
    Likes Received:
    48
    A ZFS mirror is very much subject to the same limitations; when down to only one copy you have no redundancy left, and any disk fault can and will cause corruption or worse. The reason a ZFS mirror is acceptable while a Ceph cluster is not is that, in the scenario you're describing, Ceph has a node failure domain while ZFS has a disk failure domain. Plainly put, operating your ZFS filesystem with a third of its disks missing is never part of its normal duty cycle.
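
    The failure domain in question is visible in the CRUSH rule itself; a quick way to check it on a default Luminous setup (rule name assumed to be the default replicated_rule):
    Code:
    # look for the chooseleaf step with "type": "host" -- that is what makes
    # the whole node, not the single disk, the unit of failure for the pool
    ceph osd crush rule dump replicated_rule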
     
  11. dmulk

    dmulk Member

    Joined:
    Jan 24, 2017
    Messages:
    48
    Likes Received:
    2
    Just want to confirm that in the benchmark document the latency numbers are given in seconds, and that we would need to multiply by 1000 to get ms...

    Example: 0,704943 = approx 70ms latency

    This seems a bit high for SSDs...
     
  12. tom

    tom Proxmox Staff Member
    Staff Member

    Joined:
    Aug 29, 2006
    Messages:
    13,159
    Likes Received:
    352
    Do you get better numbers?

    A fully replicated network storage depends a lot on the network latency. Compare the numbers of the 10 Gbit and 100 Gbit examples in the PDF.
    And a fast CPU will also help here.
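
    A rough way to get a feeling for the raw network latency between two Ceph nodes (addresses are placeholders) is a plain ping on the cluster network, with and without a jumbo-frame-sized payload:
    Code:
    # round-trip latency of small packets on the Ceph network
    ping -c 100 <ceph-node-ip>
    # same with a ~9000 byte frame (8972 + 28 header bytes), fragmentation prohibited
    ping -c 100 -M do -s 8972 <ceph-node-ip>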
     
    dmulk likes this.
  13. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    Seconds to milliseconds: 0.7 s is 700 ms.
     
    dmulk and tom like this.
  14. dmulk

    dmulk Member

    Joined:
    Jan 24, 2017
    Messages:
    48
    Likes Received:
    2

    Sorry, this is what I'm trying to confirm: the numbers in the report with commas are throwing me off... what are these in ms?

    700ms? 70ms?

    Thanks,
    <D>
     
  15. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    The decimal and thousands separators are "reversed":
    Code:
    German      4.333.222.111,00
    US-English  4,333,222,111.00
    rados bench reports the latency in seconds, hence the ~700 ms on the chart.
     
    dmulk likes this.
  16. dmulk

    dmulk Member

    Joined:
    Jan 24, 2017
    Messages:
    48
    Likes Received:
    2
    Ah... thank you Alwin. Yeah... this seems worse now. 700 ms seems extremely high; if that were ns, that would be fantastic. I'll need to run the tests in my environment and compare. I suspect I'm seeing better numbers, because if I were at 700 ms latency I would certainly be "hearing about it" from my devs.
     
  17. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    Code:
    1 Gb/s   = 1 s / 1 Gb   = 0.000000001 s for 1 bit
    10 Gb/s  = 1 s / 10 Gb  = 0.0000000001 s for 1 bit
    100 Gb/s = 1 s / 100 Gb = 0.00000000001 s for 1 bit
    
    This is only the theoretical difference in how long 1 bit takes to travel. It doesn't account for any hardware, media, hops in between, or the software stack.
    https://en.wikipedia.org/wiki/OSI_model
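
    To put a purely illustrative number on it, the serialization time of a single 4 KiB write is still tiny compared to the round-trip latency added by switches and the software stack:
    Code:
    4 KiB = 4096 * 8 = 32768 bit
    at  1 Gb/s: 32768 / 10^9  s ≈ 33 µs on the wire
    at 10 Gb/s: 32768 / 10^10 s ≈ 3.3 µs on the wire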
     
    chrone and dmulk like this.
  18. fips

    fips Member

    Joined:
    May 5, 2014
    Messages:
    123
    Likes Received:
    5
    Second round of my SSD benchmark testing:

    Code:
    SSD                                        Controller                         BW           IOPS
    HP SSDF S700 240GB                         -                                  5171 KB/s    1292
    Hynix Canvas SL300 240GB                   -                                  5166 KB/s    1291
    Intel DC S3520 240GB                       Intel                              83085 KB/s   20771
    Plextor PX-256S3C 256GB                    Silicon Motion SM2254 / TLC        5045 KB/s    1261
    Seagate XF1230 240GB                       LAMD eMLC                          86084 KB/s   21521
    Intel 520 240GB                            -                                  400 KB/s     97
    Kingston DC400 480GB                       Phison PS3110-S10 / MLC            1733 KB/s    433
    Intel DC S3520 1,6TB                       Intel                              80301 KB/s   20075
    Kingston Fury SHFS37A480G 480GB            SandForce SF-2281 / MLC            78618 KB/s   19654
    Sandisk Cloudspeed 2 ECO SDLF1DAR 960GB    Toshiba / MLC                      52193 KB/s   13048
    Kingston SHSS37A480G 480GB                 Phison PS3110-S10 / MLC            1745 KB/s    436
    SP Velox V70 480GB                         SandForce SF-2281 / MLC            1713 KB/s    428
    Sandisk Cloudspeed Eco SDLFNDAR 480GB      Marvell 88SS9187 8-channel / MLC   81467 KB/s   20366
    Sandisk X600 1TB                           Toshiba / TLC                      8623 KB/s    2155
    Sandisk Cloudspeed 2 ECO SDLF1DAR 480GB    Toshiba / MLC                      50846 KB/s   12711
    Sandisk Cloudspeed SDLFODAR 240GB          Marvell 88SS9187 8-channel / MLC   42968 KB/s   10741
    
     
    chrone likes this.
  19. flaf

    flaf New Member

    Joined:
    May 20, 2018
    Messages:
    3
    Likes Received:
    0
    Hi,

    I have built a Proxmox cluster, version 5.2, with Ceph storage (internal to Proxmox), and I have run a benchmark in a VM with fio. My problem is:

    1. I'm not an expert at all in benchmarks (or in fio) and I'm very cautious about benchmarks; maybe my benchmark is not relevant at all.
    2. I have absolutely no idea if my results are good or not (for my configuration, see below).

    So I would appreciate some help and explanations, 1. to make a relevant benchmark, and 2. to know whether the results are good or not for my configuration.

    Thanks a lot in advance for your help. :)
    Regards.

    PS: message edited (MTU in the ceph cluster network added).

    Here is my configuration:

    Code:
    - I'm using Proxmox 5.2 with internal Ceph storage (Luminous) on 3 physical nodes.
    - Each node is strictly identical.
    - It's a Dell PowerEdge 1U R639 server: 2x Intel Xeon E5-2650 CPUs, 256GB RAM, 10 disks
      (see below for the storage conf).
    - A PERC H730P RAID controller which is set to HBA mode (no RAID).
    - Among the 10 disks, there are 2 SSDs (200GB Intel S3710 2.5") in a ZFS RAID1 dedicated to the OS.
    - Among the 10 disks, there are 8 SSDs (800GB Intel S3520 2.5") dedicated to the Ceph storage
      (one disk = one OSD, so 8x3=24 OSDs in all).
    - There is one 2x10Gbps SFP+ network card strictly dedicated to the Ceph cluster network.
      I have set up bonding on the two interfaces in active-backup mode with MTU=9000, like this:
    
    auto bond1
    iface bond1 inet static
        address      10.123.123.21/24
        slaves       ex0 ex1
        bond_miimon  100
        bond_mode    active-backup
        bond_primary ex0
        mtu          9000
     
    In the cluster, I have set up only one Ceph pool, "ceph-vm", with (I haven't changed the CRUSH map):

    Code:
    - size = 3
    - min_size = 2
    - pg_num = 1024
    - rule_name = replicated_rule
    
    When I created the _Proxmox_ storage "ceph-vm" (the Proxmox storage which uses the Ceph pool "ceph-vm"), I set "--krbd=false". After that, I installed a small VM, a Debian Stretch with 512MB RAM, and I ran a fio benchmark in the VM:

    Code:
    ### TL;DR => read iops ~ 7700 and write iops ~ 2550 ###
    
    root@test:~# cat fio
    [rwjob]
    readwrite=randrw
    rwmixread=75
    gtod_reduce=1
    bs=4k
    size=4G
    ioengine=libaio
    iodepth=16
    direct=1
    filename_format=$jobname.$jobnum.$filenum
    numjobs=4
    
    
    root@test:~# fio fio
    rwjob: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
    ...
    fio-2.16
    Starting 4 processes
    Jobs: 4 (f=4): [m(4)] [100.0% done] [103.9MB/35007KB/0KB /s] [26.6K/8751/0 iops] [eta 00m:00s]
    rwjob: (groupid=0, jobs=1): err= 0: pid=1275: Thu May 17 15:39:01 2018
      read : io=3070.4MB, bw=30821KB/s, iops=7705, runt=102007msec
      write: io=1025.8MB, bw=10297KB/s, iops=2574, runt=102007msec
      cpu          : usr=5.03%, sys=13.49%, ctx=555568, majf=0, minf=9
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued    : total=r=785996/w=262580/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
         latency   : target=0, window=0, percentile=100.00%, depth=16
    rwjob: (groupid=0, jobs=1): err= 0: pid=1276: Thu May 17 15:39:01 2018
      read : io=3072.5MB, bw=30767KB/s, iops=7691, runt=102257msec
      write: io=1023.7MB, bw=10250KB/s, iops=2562, runt=102257msec
      cpu          : usr=5.00%, sys=13.52%, ctx=550809, majf=0, minf=8
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued    : total=r=786533/w=262043/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
         latency   : target=0, window=0, percentile=100.00%, depth=16
    rwjob: (groupid=0, jobs=1): err= 0: pid=1277: Thu May 17 15:39:01 2018
      read : io=3073.2MB, bw=30869KB/s, iops=7717, runt=101942msec
      write: io=1022.1MB, bw=10275KB/s, iops=2568, runt=101942msec
      cpu          : usr=4.94%, sys=13.58%, ctx=556677, majf=0, minf=7
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued    : total=r=786716/w=261860/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
         latency   : target=0, window=0, percentile=100.00%, depth=16
    rwjob: (groupid=0, jobs=1): err= 0: pid=1278: Thu May 17 15:39:01 2018
      read : io=3071.6MB, bw=30763KB/s, iops=7690, runt=102243msec
      write: io=1024.5MB, bw=10260KB/s, iops=2565, runt=102243msec
      cpu          : usr=4.94%, sys=13.60%, ctx=552381, majf=0, minf=7
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued    : total=r=786320/w=262256/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
         latency   : target=0, window=0, percentile=100.00%, depth=16
    
    Run status group 0 (all jobs):
       READ: io=12287MB, aggrb=123045KB/s, minb=30762KB/s, maxb=30869KB/s, mint=101942msec, maxt=102257msec
      WRITE: io=4096.7MB, aggrb=41023KB/s, minb=10250KB/s, maxb=10296KB/s, mint=101942msec, maxt=102257msec
    
    Disk stats (read/write):
      vda: ios=3144433/1048390, merge=0/28, ticks=3463440/2746060, in_queue=6209176, util=100.00%
    
     
    #59 flaf, May 20, 2018
    Last edited: May 20, 2018
  20. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    What do you expect to be the outcome? And what do you want to test?

    In general, you need a baseline to compare your tests against. A usual starting point is a test of the capabilities of the underlying hardware on its own (e.g. fio or iperf). In your case, e.g. a fio test against the Intel S3520.
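
    A minimal sketch of such a baseline, assuming a spare (or wipeable) S3520 at a placeholder device path and iperf3 installed on two nodes; note that the fio run is destructive for that device:
    Code:
    # raw 4k sync write capability of a single OSD disk
    fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 --rw=write \
        --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=osd-baseline

    # raw throughput of the Ceph network between two nodes
    iperf3 -s                  # on the first node
    iperf3 -c <first-node-ip>  # on the second node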
     