Proxmox VE Ceph Server released (beta)

Discussion in 'Proxmox VE: Installation and configuration' started by martin, Jan 24, 2014.

  1. felipe

    felipe Member

    Joined:
    Oct 28, 2013
    Messages:
    152
    Likes Received:
    1
I tried with fio on a Linux guest - same speed. This is also the max speed I get in rados bench for writes; with reads I can do a lot more.
For one VM I am really happy about the 500 MB/s write and 500 MB/s read, but for the whole cluster it's not enough. When I run 2 VMs with write tests, each one only gets half the speed...
     
  2. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
    Thank you very much for sharing. I don't mean to dwell on this thread, but since we are on the topic of need for speed...

I am confused. You ran 10Gb Ethernet on the Ceph nodes - how were you able to transfer 15.7 GBytes as indicated in your iperf output?

How do you connect the 40Gb cards without going through a switch?

    Thanks again for putting up with my newbie questions.
     
  3. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,835
    Likes Received:
    159
    Hi,
the 15.7 GB is the amount of data transferred during the iperf test - the speed was 13.4 Gbit/s.
iperf shows the max. bandwidth between the iperf nodes.

The post was an answer to spirit's remark "Hi, I don't think you can reach 20gbits with IP over infiband. (I think maybe 4 gigabits max)" and has nothing to do with Ceph - as I wrote before, the InfiniBand connection was used for DRBD.
Simply with a cable directly between the InfiniBand cards.

    Udo
     
  4. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
Thank you. So to connect 3 Ceph nodes I need to get dual-port 40Gb cards for each server and daisy-chain them in sequence (server A -> server B -> server C)?

I am still confused about the switch-less setup. I tried looking up Virtual Protocol Interconnect (VPI) and the ConnectX-2 VPI cards on the Mellanox website, but still couldn't fully understand it. Any help would be greatly appreciated. Thanks.
     
  5. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
    Hello,

    The boot drive to one of my ceph nodes (1 out of 3) just died.

I understand the failed node will need to be removed entirely, but I can't seem to be able to remove it. All the documentation I have seen so far only gives examples for replacing failed OSDs. What about replacing the boot drive (the Proxmox host itself)?

In other words, what is the proper way to remove that failed system entirely from the Ceph cluster? Any help or suggestion is greatly appreciated.
     
  6. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,070
    Likes Received:
    24
You should install Proxmox as usual on a new disk and name the node something other than what it was. Then simply join it to the Proxmox cluster; the cluster should pick up the OSDs automatically.
I have replaced many Ceph-only nodes without issue. I have never had to replace a Proxmox+Ceph node yet, but it should work.
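Roughly, the rebuild would look like this (hostnames and the cluster member IP below are placeholders - adjust to your setup):

```shell
# On the freshly reinstalled node (with its NEW name, e.g. "h4"):
pvecm add 192.168.0.10    # join the Proxmox cluster via any existing member's IP
pveceph install           # put the Ceph packages back on the rebuilt host
```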
     
  7. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
Thank you so much for the quick reply. I have no problem setting up the new Ceph node - it's the old Ceph node that I am having a problem with. The other Ceph nodes still see it as missing.

It was set up as one of the monitors, and its quorum status is now shown as "No".

I've tried stopping and removing this monitor, but it came back with "Connection error 595: No route to host".

All the OSDs on that node show a status of "down/out".

I would like a clean removal of the old Ceph node, which no longer exists. What's the best way? Thank you.
     
  8. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,070
    Likes Received:
    24
    What's the output of
    # ceph osd tree
     
  9. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
Ceph states that its software is "the end of RAID".

Many of our existing Dell servers have a built-in RAID controller. We are not using the RAID feature; instead we set up each hard drive individually as a single-disk RAID 0. With this config, Ceph is able to see each drive as a separate OSD.

The battery on some of these RAID controllers is causing an issue. It seems the battery is only needed for the "write back cache" feature; with it removed, the controller falls back to "write through".

Dell states that hard drives without the write back cache feature enabled will suffer a performance hit, but otherwise there is no problem.

Since Ceph does not need RAID controllers to operate, the golden questions are:

Is it better to let the hard drives run in write back cache mode or not? Is there any advantage to using a RAID controller and running it with this feature, and how would Ceph take advantage of it? If we go to write through, does Ceph have its own method to perform write back caching (or similar) to increase performance?

***NOTE*** Most hard drives already come with their own cache, but I do not know how Ceph will utilize those caches.
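If it helps, the difference could be measured directly with fio - the same small-block synchronous write test run once in each controller cache mode (the device name below is a placeholder, and note that this kind of raw-device test destroys the data on it):

```shell
# Compare write-back vs. write-through: run once per controller cache mode.
# WARNING: writing to a raw device destroys its contents - use a spare disk.
fio --name=cachetest --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based
```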
     
  10. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
    root@c2:~# ceph osd tree
# id    weight  type name       up/down reweight
-1      21.72   root default
-2      7.24            host h1
0       1.81                    osd.0   up      1
1       1.81                    osd.1   up      1
2       1.81                    osd.2   up      1
3       1.81                    osd.3   up      1
-3      7.24            host h2
4       1.81                    osd.4   up      1
5       1.81                    osd.5   up      1
6       1.81                    osd.6   up      1
7       1.81                    osd.7   up      1
-4      7.24            host h3
8       1.81                    osd.8   down    0
9       1.81                    osd.9   down    0
10      1.81                    osd.10  down    0
11      1.81                    osd.11  down    0
root@c2:~#
     
  11. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,070
    Likes Received:
    24
Is host h3 the new node you set up, or is it the dead one?
     
  12. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
    h3 is the dead one.
     
  13. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,070
    Likes Received:
    24
    Ok. I am assuming your new host is h4, you have already setup Proxmox+Ceph on it and already the node is part of Proxmox cluster. If the HDDs are still in the same bays, try to move the OSDs from old node to new node using the commands below:
# ceph osd crush set osd.9 1.81 root=default host=h4

    Try with this one osd first and see if the cluster picks up the osd.9 and starts rebalancing. The command above manipulates the CRUSHMap by moving osds.
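If osd.9 comes back in, the remaining OSDs can be moved the same way - a small loop sketch (the osd ids and the 1.81 weight are taken from your `ceph osd tree` output above):

```shell
# move_osds: re-home the given OSD ids onto a new host bucket in the CRUSH map.
# Sketch only - the 1.81 weight matches the osd tree output in this thread.
move_osds() {
  local newhost=$1; shift
  local id
  for id in "$@"; do
    ceph osd crush set "osd.$id" 1.81 root=default host="$newhost"
  done
}
# e.g.: move_osds h4 8 9 10 11
```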
     
  14. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,070
    Likes Received:
    24
    Which writeback/writethrough cache are you talking about? Proxmox or the RAID controller itself?

    I also suggest that you open a new thread with the problem you are having with dead Ceph node.
     
  15. mo_

    mo_ Member

    Joined:
    Oct 27, 2011
    Messages:
    399
    Likes Received:
    3


    you may want to look at this benchmark: http://ceph.com/community/ceph-bobtail-performance-io-scheduler-comparison/

    while it is actually trying to determine which IO scheduler is best for several setups, you can also see that the 8xRAID0 setup (meaning: 8 disks as 8 RAID0s) in almost all cases beats the JBOD mode due to the controller write cache.
     
  16. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
You are awesome! It worked. I moved the OSDs using the suggested command, then rebooted the server, and the OSDs are now working well on the new server h4.

How do I remove the dead Ceph node? Trying to stop or remove the monitor came back with "Connection error 595: No route to host".

Should I just remove it from the cluster using "pvecm delnode h3"? Is it possible to later rename h4 to h3 and change its IP address to match the previous one? My boss wants to keep the same sequence, but I am not sure if that's even possible.
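For reference, this is the cleanup sequence I am considering, pieced together from the docs (command names are my guesses - corrections welcome):

```shell
# Hypothetical cleanup for the dead node "h3": drop its monitor, its now-empty
# CRUSH host bucket, and finally its Proxmox cluster membership.
cleanup_dead_node() {
  local host=$1
  ceph mon remove "$host"    # remove the unreachable monitor from the monmap
  ceph osd crush rm "$host"  # remove the empty host bucket from the CRUSH map
  pvecm delnode "$host"      # drop the node from the Proxmox cluster
}
# e.g.: cleanup_dead_node h3
```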

Thanks for your suggestion about a new thread. I thought this post was created as a new thread - did I not do it correctly?

    Thanks for all of your help.
     
    #236 impire, Jul 7, 2014
    Last edited: Jul 7, 2014
  17. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
I was referring to the RAID controller itself. Is it better to run it with write back cache enabled, or does Ceph have its own method to handle write back caching?
     
  18. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
Thank you very much. I just read through the link you provided.

So, in summary, it is better to run the drives as single-disk RAID 0 and let the RAID controller use its own write back cache?

Ceph states that its software spells "the end of RAID". However, when it comes to hard drive performance, we still need the RAID controller to perform the write back caching, as opposed to depending on Ceph to handle this.

    Any feedback is greatly appreciated. Thank you.
     
  19. mo_

    mo_ Member

    Joined:
    Oct 27, 2011
    Messages:
    399
    Likes Received:
    3
You don't exactly NEED RAID controllers for Ceph to function.

- You need a RAID controller if you want more disks in your system than the mainboard's controller offers.
- You can use a RAID controller to benefit from its battery-backed read/write caches. Do note that regular hard drives already have their own memory cache, just on a much smaller scale than what controllers have. These caches really only exist because non-SSD drives are extremely slow...

Also, while we tend to call them RAID controllers, they really aren't. They are just additional disk controllers which happen to implement some RAID levels (which Ceph neither needs nor wants). With Ceph you don't use the RAID functionality in any way, shape, or form - the single-disk RAID 0 is really just a trick to present individual disks to Ceph while still making use of the controller's cache (which JBOD mode typically doesn't allow for).
     
    #239 mo_, Jul 7, 2014
    Last edited: Jul 7, 2014
  20. sommarnatt

    sommarnatt New Member

    Joined:
    Mar 20, 2014
    Messages:
    22
    Likes Received:
    0
Have any of you had any issues with Ceph and live migrations? I'm not sure whether it's Ceph or something else, but as soon as the live migration completes and the VM is resumed, it takes up 100% CPU and is unreachable.
I've checked the logs and there are no qemu/kvm error messages in them. The VM itself doesn't log anything, as if it lost access to its disks or just died right away.
     