Proxmox VE Ceph Server released (beta)

Discussion in 'Proxmox VE: Installation and configuration' started by martin, Jan 24, 2014.

  1. mir

    mir Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 14, 2012
    Messages:
    3,481
    Likes Received:
    96
An ATTO benchmark and an IOMeter run using the test files from the Open Performance Test (available here: http://vmktree.org/iometer/) from a Windows guest should provide the requested info.
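(For readers benchmarking from a Linux guest instead of Windows, a minimal sketch of a roughly comparable 4K random I/O test with fio; the device path and all parameters are illustrative assumptions, not taken from the thread:)

Code:
# rough Linux-side equivalent of a 4K random read/write test
# WARNING: writing directly to /dev/vdb (placeholder device) is destructive
fio --name=rand4k --filename=/dev/vdb --rw=randrw --bs=4k \
    --iodepth=32 --ioengine=libaio --direct=1 --runtime=60 --time_based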
     
  2. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,323
    Likes Received:
    135
  3. Oer2001

    Oer2001 Member

    Joined:
    Jul 12, 2011
    Messages:
    43
    Likes Received:
    0
Yes, ATTO with different block sizes (4K, 128K, 4M) would be great.
IOMeter - nice to have ;-)

    Regards,
    Oer
     
  4. tom

    tom Proxmox Staff Member
    Staff Member

    Joined:
    Aug 29, 2006
    Messages:
    13,531
    Likes Received:
    404
I got a CrystalDiskMark result from a Win7 VM running on our Ceph cluster; the hardware and network are described here (http://pve.proxmox.com/wiki/Ceph_Server#Recommended_hardware) - all pools use replication 3.

    Code:
    qm config 104
    bootdisk: virtio0
    cores: 6
    ide0: none,media=cdrom
    memory: 2048
    name: windows7-spice
    net0: virtio=0A:8B:AB:10:10:49,bridge=vmbr0
    ostype: win7
    parent: demo
    sockets: 1
    vga: qxl2
    virtio0: local:104/vm-104-disk-1.qcow2,format=qcow2,cache=writeback,size=32G
    virtio1: ceph3:vm-104-disk-1,cache=writeback,size=32G
[attachment: crystal-disk--win7-and-ceph3.png - CrystalDiskMark results]
     
  5. tom

    tom Proxmox Staff Member
    Staff Member

    Joined:
    Aug 29, 2006
    Messages:
    13,531
    Likes Received:
    404
    some rados benchmarks on the same cluster (replication 3):

    write speed

    Code:
    rados -p test3 bench 60 write --no-cleanup
    
    ...
    2014-03-10 20:56:08.302342min lat: 0.037403 max lat: 4.61637 avg lat: 0.23234
       sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
        60      16      4143      4127   275.094       412  0.141569   0.23234
     Total time run:         60.120882
    Total writes made:      4144
    Write size:             4194304
    Bandwidth (MB/sec):     275.711
    
    Stddev Bandwidth:       131.575
    Max bandwidth (MB/sec): 416
    Min bandwidth (MB/sec): 0
    Average Latency:        0.232095
    Stddev Latency:         0.378471
    Max latency:            4.61637
    Min latency:            0.037403
    read speed

    Code:
    rados -p test3 bench 60 seq
    
    ...
    Total time run:        13.370731
    Total reads made:     4144
    Read size:            4194304
    Bandwidth (MB/sec):    1239.723
    
    Average Latency:       0.0515508
    Max latency:           0.673166
    Min latency:           0.008432
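(For reference, a minimal sketch of how such a run can be set up and cleaned up; the pool name test3 comes from the commands above, while the PG count of 128 is an assumed example value:)

Code:
# create a benchmark pool (PG count is an assumption, tune for your cluster)
ceph osd pool create test3 128 128
# write benchmark, keeping the objects so the sequential read test has data
rados -p test3 bench 60 write --no-cleanup
rados -p test3 bench 60 seq
# remove the benchmark objects afterwards
# (older rados versions may require an explicit --prefix argument)
rados -p test3 cleanup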
     
  6. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,066
    Likes Received:
    24
I ran a 2-node Ceph cluster in a stressful production environment for the last 10 months. No issues. Only recently I added a 3rd node to increase performance, and because we are anticipating growth of our data. Even with 3 nodes you can still use 2 copies.
     
  7. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,066
    Likes Received:
    24
I would just like to point out that if you have more than 6 OSDs per node, it is a wise idea to put the journals on the OSDs themselves. As you increase the number of OSDs, keeping each journal on its own OSD reduces the risk of losing multiple OSDs together; this way you only have to worry about losing a single OSD and its journal.
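(A minimal sketch of the two layouts being contrasted; the device paths are placeholders and the -journal_dev option should be checked against the pveceph version in use:)

Code:
# journal co-located on the OSD disk - losing the disk costs one OSD only
pveceph createosd /dev/sdc
# journal on a shared SSD - faster, but the SSD becomes a single point of
# failure for every OSD whose journal it holds
pveceph createosd /dev/sdd -journal_dev /dev/sdb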
     
  8. Oer2001

    Oer2001 Member

    Joined:
    Jul 12, 2011
    Messages:
    43
    Likes Received:
    0
    Hi Tom,

    thank you for your performance tests.
One question: I think using VMs with writeback disk cache on production systems is not a good decision.
Could you please change to cache=none and post the CrystalDiskMark results again?

    Thank you very much.

    Regards,
    Oer
     
  9. felipe

    felipe Member

    Joined:
    Oct 28, 2013
    Messages:
    152
    Likes Received:
    1
The CrystalDiskMark results are not that impressive; especially the 4K reads/writes are really poor. I get more or less the same 4K speeds on servers with 2 SATA disks (RAID 1).
Does using a replication of 2 perform better?


I also think using an SSD for the journal is a risk, as mentioned above. It can get even worse when the SSDs reach the end of their life cycle or have some other problem: since they are all the same model and see more or less the same data and read/writes because of replication, it can happen that all of them fail at the same time, killing the whole cluster....
     
  10. tom

    tom Proxmox Staff Member
    Staff Member

    Joined:
    Aug 29, 2006
    Messages:
    13,531
    Likes Received:
    404
If you run the same benchmark in parallel - e.g. in 100 guests - you will see the difference between Ceph RBD and your SATA RAID1. If your goal is a very fast single VM, then Ceph is not the winner; a fast hardware RAID with a lot of cache, SSD only, or SSD & SAS HDDs is a good choice there.

You need to be prepared for failing OSD and journal disks, and you need to design your Ceph hardware according to your goals. If money is no concern, just use enterprise-class SSDs for all your OSDs. The really cool feature is that with Ceph you have the freedom to choose your hardware according to your needs, and you can always upgrade it without downtime. Replacing OSDs or journal SSDs - all of this can be done via our new GUI (of course, someone needs to plug the new disk into your servers first).
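(To approximate the "many parallel guests" case from the command line, a minimal sketch; -t is rados bench's concurrent-operations setting and the pool name test3 is reused from the earlier posts:)

Code:
# raise the number of concurrent operations (default 16), or start the same
# command from several client nodes at once, to model many parallel guests
rados -p test3 bench 60 write -t 64 --no-cleanup
rados -p test3 bench 60 seq -t 64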
     
  11. Oer2001

    Oer2001 Member

    Joined:
    Jul 12, 2011
    Messages:
    43
    Likes Received:
    0
    Hi Tom,

Could you please run these performance tests?
It would be very important to me.

    Thank you very much.

    Regards,
    Oer
     
  12. tom

    tom Proxmox Staff Member
    Staff Member

    Joined:
    Aug 29, 2006
    Messages:
    13,531
    Likes Received:
    404
Why not? Writeback is the recommended cache setting for Ceph RBD if you want good write performance.
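(For anyone who wants to run the comparison themselves, a minimal sketch of switching the cache mode of the RBD disk; the VM ID, storage name and disk parameters are taken from the qm config posted earlier and will differ on other setups:)

Code:
# benchmark with writeback, then switch to cache=none and benchmark again
qm set 104 -virtio1 ceph3:vm-104-disk-1,cache=writeback,size=32G
qm set 104 -virtio1 ceph3:vm-104-disk-1,cache=none,size=32G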
     
  13. jleg

    jleg Member

    Joined:
    Nov 24, 2009
    Messages:
    105
    Likes Received:
    2
    Hi,

    here's a test using rbd with "writeback":

[attachment: 2014-01-07 17_48_08-VM 100 ('vm-test-100')_rbd_writeback.png]

    and here the same config using "nocache":

[attachment: 2014-01-08 15_13_38-VM 100 ('vm-test-100')_rbd_nocache.png]

Ceph cluster of 3 nodes, using bonded 2 GBit for the OSD links and bonded 2 GBit for the MONs, 4 OSDs per node, SATA disks.
     
  14. zystem

    zystem New Member

    Joined:
    Feb 5, 2013
    Messages:
    19
    Likes Received:
    0
Feature request: add support for disk partitions. The command pveceph createosd /dev/sd[X] can only use a WHOLE disk, not a disk partition like /dev/sdd4. A plain Ceph installation supports partitions.
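(For context, a minimal sketch of the behaviour being described; the device names are placeholders:)

Code:
# current behaviour: pveceph expects a whole, unpartitioned disk
pveceph createosd /dev/sdd
# requested: pointing it at an existing partition instead, e.g.
#   pveceph createosd /dev/sdd4   <- not supported at the time of this post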
     
  15. Florent

    Florent Member

    Joined:
    Apr 3, 2012
    Messages:
    91
    Likes Received:
    2
  16. m.ardito

    m.ardito Active Member

    Joined:
    Feb 17, 2010
    Messages:
    1,473
    Likes Received:
    12
  17. tom

    tom Proxmox Staff Member
    Staff Member

    Joined:
    Aug 29, 2006
    Messages:
    13,531
    Likes Received:
    404
  18. Norman Uittenbogaart

    Joined:
    Feb 28, 2012
    Messages:
    145
    Likes Received:
    4
How far along is Proxmox with running OpenVZ on Ceph? Either via ploop images or via the filesystem part of Ceph? Will either of the above solutions be available any time soon?
     
  19. tom

    tom Proxmox Staff Member
    Staff Member

    Joined:
    Aug 29, 2006
    Messages:
    13,531
    Likes Received:
    404
Nothing usable for now, but yes, containers on distributed storage would be nice.
     
  20. mo_

    mo_ Member

    Joined:
    Oct 27, 2011
    Messages:
    399
    Likes Received:
    3
I just set up a 3-node Proxmox cluster that is itself virtualized by Proxmox. While I can't run KVM VMs on this, I can test pve-ceph. I noticed the status on the web interface saying HEALTH_WARN without specifying details. I can only speculate, but maybe this display does not cover all the possible circumstances yet?

Anyway, the reason for the HEALTH_WARN is clock skew, meaning the system times of the nodes are too far apart (Ceph allows a 0.05s difference by default). Since this is a virtualized cluster I have no problem blaming this issue solely on KVM, so this is not a bug report or anything.

    I wanted to leave the following hint however:

    in the [global] section of /etc/pve/ceph.conf you can add

    Code:
     mon clock drift allowed = .3
to make the test cluster say HEALTH_OK. It may not be a good idea to do this on production clusters, but then again, the Ceph mailing list does say that setting this to 0.1 or even 0.2 should be okay. Additionally, specifying 1-3 local NTP servers in /etc/ntp.conf might help (it did not for me).
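(A minimal sketch of that NTP change; the server addresses are placeholders for local time sources, and ceph health detail shows whether the clock skew warning has cleared:)

Code:
# /etc/ntp.conf - replace the default pool entries with 1-3 local servers
# (placeholder addresses)
server 192.168.0.1 iburst
server 192.168.0.2 iburst

# then restart ntp on every node and re-check the cluster health:
#   service ntp restart
#   ceph health detail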


Funny sidenote: even though this is a virtual testing cluster, "rados bench -p test 300 write" is STILL giving me rates that exceed a single physical disk! This setup is terribly bad for performance, but I am still getting good rates (for such a test, anyway...). The pool has size=3, this Ceph cluster has 1 GBit networking, and the virtual OSDs are stored on some fibre channel SAN box.

The write bench is giving me 35 MB/s throughput (across 3 Ceph nodes, 2 OSDs each).
     
    #60 mo_, Mar 19, 2014
    Last edited: Mar 19, 2014