Proxmox VE Ceph Server released (beta)

Discussion in 'Proxmox VE: Installation and configuration' started by martin, Jan 24, 2014.

  1. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,064
    Likes Received:
    17
    There is certainly a way to divide a Ceph cluster between SSD and SAS/SATA pools. You have to edit the CRUSH map. The steps are as follows:
    1. Extract the CRUSH map.
    2. Decompile the CRUSH map to text.
    3. Edit the CRUSH map.
    4. Compile the CRUSH map.
    5. Inject the CRUSH map back into the cluster.

    If you are using the same node, then basically you have to create a virtual host entry in the CRUSH map and assign the OSDs to it. This is a heavily simplified explanation; if you are not familiar with the CRUSH map and its syntax, getting to know it first will help tremendously. Whether or not you want to split the cluster this way, knowing the CRUSH map will help you greatly in customizing your Ceph cluster. A command-line sketch of the steps follows below.

    At this time the CRUSH map cannot be edited through the Proxmox GUI; it can only be viewed.
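
    For reference, the standard Ceph command-line sequence for those five steps looks roughly like this (the file names are arbitrary examples; keep a copy of the original map so you can inject it back if an edit goes wrong):

    Code:
    ceph osd getcrushmap -o crushmap.bin        # 1. extract the compiled CRUSH map
    crushtool -d crushmap.bin -o crushmap.txt   # 2. decompile it to editable text
    # 3. edit crushmap.txt, e.g. add a virtual host entry and move the SSD OSDs under it
    crushtool -c crushmap.txt -o crushmap.new   # 4. compile the edited map
    ceph osd setcrushmap -i crushmap.new        # 5. inject it back into the cluster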
     
  2. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,064
    Likes Received:
    17
    Very good documentation on hardware selection for Ceph! It will help anybody seeking a deeper understanding of Ceph hardware.

    The kind of hardware I use would be considered well below standard based on the documentation's suggestions, but at least I know what can be achieved on the lowest possible budget.

    Note the hardware differences between the Performance and Capacity configurations. This was the most valuable piece of information to me. I had not taken into account that the higher the node count and the smaller the OSD count per node, the faster recovery/rebalancing will actually be. Up to now all of my Ceph setups have been optimized to use as many OSDs per node as possible, to save on node count and rack space.
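
    To put rough, purely illustrative numbers on that (assuming 4 TB OSDs, a whole-node failure, and the same 24-OSD total in both layouts):

    Code:
    #  4 nodes x 6 OSDs : up to 24 TB must re-replicate, spread over only 3 surviving nodes
    # 12 nodes x 2 OSDs : up to  8 TB must re-replicate, spread over 11 surviving nodes
    # Less data to move and more disks/NICs to move it with, so recovery finishes much sooner.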
     
  3. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
    Thank you. I was overly excited for a moment, as I thought the WD Red 2.5" also came in a 4TB version.

    I've noticed the WD Red doesn't list the RPM on the drive. It states a 6 Gb/s interface but doesn't list the RPM. I wonder why.

    Also, this drive is made for NAS and geared toward home and small businesses. Is it a good idea to put them in an enterprise environment?

    Just a thought: wouldn't it be better to have a system with 6 x 4TB 3.5" drives than a system with 10 x 1TB 2.5" drives?

    I have been looking at different systems like the Dell C6220 and such. It is so hard to find larger-capacity drives in the 2.5" size. I gave up on the idea of having a chassis that could house 2.5" drives, as the price and availability were impractical. But the WD Red looks good, and I just hope they come out with a 2TB or 4TB version soon; then it would make a lot more sense than trying to fill a chassis with a bunch of 1TB drives.

    Thanks for your time and kind effort.
     
  4. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,064
    Likes Received:
    17
    Have you looked at this chassis? It takes 24 2.5" SSDs in one node.
    http://www.in-win.com.tw/Server/zh/goods.php?act=view&id=IW-RS224-02
     
  5. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
    Thank you. What if I put them on separate VLANs but on the same switch? This would mean they won't be fighting for traffic on the same ports. Technically they would be on separate networks, just still on the same switch. Is that still not recommended?

    I am trying to avoid having 2 switches, which would mean I'd need 4 switches total, as I always prefer to have backups in case the main switch dies. I would rather just have 1 main and 1 backup. But if it means better Ceph traffic, then a separate switch it shall be.

    I was wishfully thinking about upgrading to a 10Gb network. I took one look at current pricing and realized I've forgotten to do one very important thing: go plant a money tree in my backyard and wait a few seasons. Hopefully by then the tree will have grown enough money for me to buy the 10Gb hardware. We all know the current 10Gb gear is way overpriced.

    Which leads me to another creative thought: what about Fibre Channel? It's made for SANs, but why not? It's cheap, and at current prices it may be worthwhile. I see Fibre Channel switches and HBAs cost next to nothing. They give 2-4 Gb per port; bond two of those 4Gb ports and we would get excellent speed at 8Gb. Any thoughts?

    Thanks in advance for your help.
     
  6. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
    Thank you. Pretty awesome chassis. But then again, even at 24 x 1TB 2.5" drives, that's only 24 TB.

    I like your current chassis better: I could fit 10 x 4TB = 40TB. The price difference between the 1TB 2.5" and the 4TB 3.5" SATA drives is not that large. I have been sticking to SATA largely because larger-capacity SAS drives are pricey at 7200 RPM.

    Not too many people can afford 1TB SSDs nowadays, and 2.5" SATA drives are not available in larger sizes. But I am glad Mir pointed out the WD Red, as I will be watching for future releases with larger capacities.
     
  7. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
    Thanks a lot for this. Back in the '90s I used to build and ship thousands of PCs under my own brand. For the past few years I have been using Dell, but venturing into cloud computing and Ceph, I realized their servers are limited and overpriced. Your link and comments inspired me to build an awesome one.

    So, on to my endeavor to build our dream servers. Questions:

    I know where you get the chassis, but where do you get the rest of the components (motherboards, processors, RAM, RAID cards, etc.)?

    Why do you need the Intel 24-port expander card?

    Have you used any chassis and motherboard that could house 4 physical processors? I know Supermicro makes them.

    What if I build a server with a lot of power (4 x 8 cores = 32 cores) and pack the Ceph nodes with a lot of RAM (128GB)? Can I use the same machines for both the Ceph nodes and the cluster that runs the VMs?

    Or, regardless of the horsepower of the Ceph nodes, is it still better to keep the cluster that runs the VMs separate?
     
  8. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
    Does this also mean the read/write process will be faster if you spread the OSDs across the Ceph nodes rather than packing each node full of OSDs?

    Does this also mean it's better to have 2 x 4TB OSDs than 8 x 1TB OSDs in each node?
     
  9. mir

    mir Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 14, 2012
    Messages:
    3,481
    Likes Received:
    96
    5400 RPM -> http://www.amazon.com/Western-Digital-Cache-Drive-WD10JFCX/dp/B00EHBES1U
    Performance-wise you get higher throughput with 10 disks instead of 6. Reads can theoretically be spread across 10 disks, which gives 10x the read speed of one disk instead of 6x.
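
    As a rough worked example (the per-disk figure is purely illustrative; the ratio is what matters):

    Code:
    # 10 x 1TB : 10 x 150 MB/s = ~1500 MB/s aggregate read ceiling
    #  6 x 4TB :  6 x 150 MB/s =  ~900 MB/s aggregate read ceiling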
     
  10. RobFantini

    RobFantini Active Member
    Proxmox Subscriber

    Joined:
    May 24, 2012
    Messages:
    1,505
    Likes Received:
    21
    For affordable 10Gb, check out InfiniBand. We use it, thanks to active forum support. Here is the wiki link: http://pve.proxmox.com/wiki/Infiniband

    Prices on eBay go up and down... cables usually ship from China.
     
  11. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
    Thank you so much for the info.

    I don't see a lot of NIC cards available out there. Which NICs do you use (single, dual, or quad port)?

    Do they require any special driver for Proxmox/Debian to see them? I have also had problems with third-party NICs and Debian needing the proper driver.

    Which option do you choose for the VMs (e1000, virtio, etc.)?

    I also read that 10Gb is limited by the type of processor and hard drives you have. The current maximum hard drive interface speed is 6 Gb/s. Are you experiencing full 10Gb speed?

    Thanks in advance for your help.
     
  12. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
    Can you also share the part / model number of the switch and NICs you are using in production? Thanks.
     
  13. RobFantini

    RobFantini Active Member
    Proxmox Subscriber

    Joined:
    May 24, 2012
    Messages:
    1,505
    Likes Received:
    21
    1- The drivers are built into the kernel.

    2- We use InfiniBand for the cluster/Ceph network, and vmbr0 etc. for the VMs.

    3- We have not gotten to speed testing Ceph yet...

    4- For the next message in the thread: we use these model cards, per the CLI command lspci: MT25418 and MT25208.

    Code:
    02:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)
    
    03:00.0 InfiniBand: Mellanox Technologies MT25208 [InfiniHost III Ex] (rev a0)
    
    IB is very easy to set up. See the wiki and ask questions, but please use a different thread.
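
    For anyone skimming later, an IPoIB stanza in /etc/network/interfaces typically looks something like the sketch below (interface name, address and MTU are assumptions here; the wiki above is the authoritative reference):

    Code:
    auto ib0
    iface ib0 inet static
        address 10.10.30.1
        netmask 255.255.255.0
        pre-up modprobe ib_ipoib
        pre-up echo connected > /sys/class/net/ib0/mode
        mtu 65520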
     
  14. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
    Thank you very much. Just one last question regarding this topic: may I ask for the brand and model of the switch you are using? I see both Mellanox and Flextronics. Some of these run at 20Gb or 40Gb speeds, and I also wonder whether the higher-speed switches can auto-negotiate down to the lower-speed NICs. Thanks for your help.
     
    #174 impire, Jun 19, 2014
    Last edited: Jun 19, 2014
  15. felipe

    felipe Member

    Joined:
    Oct 28, 2013
    Messages:
    152
    Likes Received:
    1
    Hi,

    I purchased 3 servers for Ceph, each with 2 x 10GbE NICs, and two 10G switches. Unfortunately the switches are not stackable (too expensive).
    What is the best way to get fault tolerance?
    One bonded 10G network for the monitors
    One bonded 10G network for the OSDs
    Without stackable switches I can't make a 20Gb bond, I think...?

    Thank you
     
  16. aderumier

    aderumier Member

    Joined:
    May 14, 2013
    Messages:
    203
    Likes Received:
    18
    Use active-backup bonding (so only one link of the bond is active at a time).

    If you can have 4 x 10Gb NICs per Ceph server, then configure Ceph to use a public network (VM -> OSD) and a private cluster network (OSD -> OSD replication).
    The monitors don't need dedicated bandwidth.
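
    A minimal sketch of what that split could look like, assuming eth2/eth3 are two of the 10GbE ports (one cabled to each switch) and that the 10.10.10.0/24 and 10.10.20.0/24 subnets are free:

    Code:
    # /etc/network/interfaces -- active-backup bond across the two switches
    auto bond0
    iface bond0 inet static
        address 10.10.10.11
        netmask 255.255.255.0
        slaves eth2 eth3
        bond_miimon 100
        bond_mode active-backup

    # /etc/ceph/ceph.conf -- split client traffic from replication traffic
    [global]
        public network  = 10.10.10.0/24   # VM/monitor -> OSD
        cluster network = 10.10.20.0/24   # OSD -> OSD replication (second bond, not shown)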
     
  17. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
    What about the Proxmox host? Is it advantageous to have it on the 10Gb network?

    In the scenario below, the VMs communicate directly with the Ceph nodes (VM -> OSD). Is it worthwhile to have a fast network on the Proxmox host itself, or is that a waste of 10Gb ports? It seems the storage bandwidth is strictly between the VMs and the Ceph nodes.

    Proxmox host = Subnet A (1Gb network)

    VMs = Subnet B (10Gb network)

    Ceph nodes = Subnet C (10Gb network on a separate switch)
     
  18. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
    I am curious: why do we need so much bandwidth when a hard drive's interface can only run at 3 Gb/s or 6 Gb/s? I read in various forums that even high-end graphics houses running these 10Gb networks hit their bottleneck at the hard drive or processor level.

    Unless we run a network of beefed-up servers with many cores and a bunch of RAID drives, I don't see how we can ever take advantage of a full 10Gb network. Even at 20 Gb/s, the only thing that could use it fully would be running the VMs on a RAM disk.
     
  19. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,835
    Likes Received:
    158
    Hi,
    the transfer speed is one thing, but for a storage network the latency is just as (or more) important. And the latency on a 10Gb network is much lower than on a 1Gb network (and 10GBase-T is not as good as an SFP+ network).

    Udo
     
  20. impire

    impire Member

    Joined:
    Jun 10, 2010
    Messages:
    106
    Likes Received:
    0
    Thank you Udo.

    Is there a way to see how much bandwidth the Ceph nodes are using? The only way I can think of right now is to go to the switch and use its web GUI bandwidth monitoring tool.

    It would be nice to be able to see a graphical performance bar for the Ceph nodes from the Proxmox GUI.
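
    For what it's worth, a couple of node-side commands also show this (assuming eth1 is the Ceph-facing interface and the iftop package is installed):

    Code:
    ceph -w              # live cluster status, including client and recovery throughput
    ceph osd pool stats  # per-pool client I/O rates
    iftop -i eth1        # raw traffic on the Ceph-facing NIC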
     