Number of nodes recommended for a Proxmox Cluster with Ceph

Discussion in 'Proxmox VE: Installation and configuration' started by alessice, Jan 23, 2019.

  1. alessice

    alessice New Member
    Proxmox Subscriber

    Joined:
    Sep 18, 2015
    Messages:
    13
    Likes Received:
    1
    Hi,

    how many nodes do you recommend for a Proxmox cluster with Ceph (HCI mode)? We would like to start with 4 nodes with 6 SSDs each, so we would have 6 OSDs per node and the PVE OS on a SATA DOM.

    The other option is 8 nodes with the same 6 SSDs each.

    Is it fine to start with 4 nodes? Somebody suggested not using 4 nodes because of possible quorum problems.

    Thanks
     
  2. tim

    tim Member
    Staff Member

    Joined:
    Oct 1, 2018
    Messages:
    96
    Likes Received:
    9
    That depends on what you intend to do with it; more nodes simply give you more to work with.

    It is not really an issue to have 4 nodes, but as in all quorum-based systems you don't gain more redundancy with an even number of nodes. You need to keep a majority to make decisions, so with 4 nodes you can lose just 1 node, which is the same as with a 3-node cluster. With 5 nodes, on the other hand, you can lose 2 of them and still have a majority.

    3 nodes -> lose 1 node, still quorum -> lose 2 nodes, no quorum
    4 nodes -> lose 1 node, still quorum -> lose 2 nodes, no quorum
    5 nodes -> lose 1 or 2 nodes, still quorum -> lose 3 nodes, no quorum
    ...

    So as you can see, the redundancy only increases again at 5 nodes (the same applies to 7, 9 and so on); that's why they suggested not to use an even number.
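
    Here is that majority rule as a minimal Python sketch (just the arithmetic behind the table above, nothing Proxmox-specific):
    Code:
    # Majority-based quorum: the cluster stays quorate while the surviving
    # nodes still hold a strict majority of the votes.
    def has_quorum(total_nodes: int, failed_nodes: int) -> bool:
        return (total_nodes - failed_nodes) > total_nodes / 2

    for n in (3, 4, 5, 6, 7):
        tolerated = max(f for f in range(n) if has_quorum(n, f))
        print(f"{n} nodes -> can lose {tolerated} and keep quorum")

    # 3 nodes -> can lose 1 and keep quorum
    # 4 nodes -> can lose 1 and keep quorum
    # 5 nodes -> can lose 2 and keep quorum
    # 6 nodes -> can lose 2 and keep quorum
    # 7 nodes -> can lose 3 and keep quorum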
     
  3. alessice

    alessice New Member
    Proxmox Subscriber

    Joined:
    Sep 18, 2015
    Messages:
    13
    Likes Received:
    1
    Thanks Tim,

    I'm evaluating 4 or 8 nodes because the hardware will be SuperMicro Twin, where each 2U case contains 4 servers.

    With 8 nodes with 6 SSDs each we will have a total of 48 SSDs (so we will be able to set up 48 OSDs). Is this a good configuration for PVE 5.3 in HCI mode with Ceph?
     
  4. tim

    tim Member
    Staff Member

    Joined:
    Oct 1, 2018
    Messages:
    96
    Likes Received:
    9
    If you have the opportunity to get 8 of them and the workload stays the same, that would undoubtedly be better, as it is always better to have more. But as said in my first post, it all comes down to the intended use case. If someone asks me whether to get 4 or 8 I would recommend 8, no questions asked; that is not always possible, and sometimes not even necessary, but it's better to have it than to lack it.
    Maybe you can give us a hint what you are going to do with it?
     
  5. alessice

    alessice New Member
    Proxmox Subscriber

    Joined:
    Sep 18, 2015
    Messages:
    13
    Likes Received:
    1
    Currently we have VMs (40 for now, but the number will grow) hosted by a hosting provider. We need to migrate from a public cloud to a private infrastructure, so we are evaluating PVE. We are not interested in dedicated external storage via iSCSI or NFS, so we are looking at Ceph.

    Each PVE node will be (x8):

    2 x Intel Xeon, 10 cores each
    196GB of RAM
    6 x Intel SSD (480 or 960GB)
    2 x SATA DOM for the OS
    4 x 10 Gbit Ethernet
     
  6. tim

    tim Member
    Staff Member

    Joined:
    Oct 1, 2018
    Messages:
    96
    Likes Received:
    9
    With 6x 480GB you will end up with about 6.8TB of usable space considering a replica of 3. That's about 168GB per VM, which could be too little, but maybe it's enough for you. With 960GB you will end up, as you might guess, with about 13.5TB.
    I don't know the exact model of your SSDs, but if I assume a conservative write speed of 500MB/s and you have 6 of them, then with all of them writing at once you end up with 24Gb/s per node. Take this into account when thinking about your network setup; if your SSDs are faster, your network could become the bottleneck.
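
    If you want to adjust the assumptions yourself, here is that arithmetic as a small sketch (using the numbers from this thread: 8 nodes, 6 SSDs per node, 3x replication, ~500MB/s per SSD, 40 VMs; it divides raw capacity by the replica count in TiB, so it comes out slightly above the rounded estimates above):
    Code:
    NODES, SSDS_PER_NODE, REPLICAS, VMS = 8, 6, 3, 40
    GB_PER_TIB = 2**40 / 1e9            # ~1099.5 GB in one TiB

    # usable space = raw capacity / replica count (before any safety margin)
    for ssd_gb in (480, 960):
        usable_tib = NODES * SSDS_PER_NODE * ssd_gb / GB_PER_TIB / REPLICAS
        print(f"{ssd_gb} GB SSDs -> ~{usable_tib:.1f} TiB usable, ~{usable_tib * 1024 / VMS:.0f} GiB per VM")

    # aggregate write speed of one node's SSDs vs. its 10 Gbit links
    PER_SSD_MB_S = 500
    print(f"~{SSDS_PER_NODE * PER_SSD_MB_S * 8 / 1000:.0f} Gbit/s of writes per node")  # -> ~24 Gbit/s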
     
  7. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,845
    Likes Received:
    159
    Hi,
    this is a very optimistic calculation, because Ceph runs into trouble when an OSD gets nearly full.
    Even with enough OSDs you should not fill them more than about 70% (to leave space free for a failed OSD/node).
    With this calculation you can use about 9.8TB with 960GB SSDs (a 960GB SSD is about 0.87 TiB in real life).
    Code:
    6*8*0.873046875/3*0.7
    9.778125
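
    The same rule of thumb as a small helper, in case you want to plug in other disk sizes or replica counts (a sketch; the 0.873... above is roughly 960GB expressed in TiB):
    Code:
    # usable ~= raw capacity / replicas * max fill level (keep ~30% free for recovery)
    def usable_tib(nodes, osds_per_node, osd_tib, replicas=3, max_fill=0.7):
        return nodes * osds_per_node * osd_tib / replicas * max_fill

    print(usable_tib(8, 6, 960e9 / 2**40))   # ~9.78 TiB with 960GB SSDs
    print(usable_tib(8, 6, 480e9 / 2**40))   # ~4.89 TiB with 480GB SSDs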
    
    Udo
     
  8. tim

    tim Member
    Staff Member

    Joined:
    Oct 1, 2018
    Messages:
    96
    Likes Received:
    9
    With my calculation Ceph would recover if 1 node fails; in udo's example you have even more fault tolerance, thanks for pointing that out!
    In both examples, if something fails it is not meant to be ignored, so you have to act anyway (replacing OSDs, nodes, ...).
    The whole point is that you need that extra free space so Ceph can redistribute the data from the failed OSDs onto the remaining ones and get back to 3 replicas.
    You could fill even more than the 13.5TB, which I definitely don't recommend, because in an error case your Ceph pool won't recover since it can no longer replicate.
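
    To illustrate why that margin matters, here is a rough sketch (assuming 8 equal nodes and the failed node's data being spread evenly over the 7 survivors; Ceph's default nearfull/full warnings sit at 85%/95%):
    Code:
    nodes = 8
    for fill in (0.70, 0.80, 0.90):
        after_recovery = fill * nodes / (nodes - 1)
        print(f"{fill:.0%} full before a node failure -> ~{after_recovery:.0%} after re-replication")

    # 70% -> ~80%, 80% -> ~91%, 90% -> ~103%, i.e. recovery could not even complete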
     
  9. alessice

    alessice New Member
    Proxmox Subscriber

    Joined:
    Sep 18, 2015
    Messages:
    13
    Likes Received:
    1
    Thanks to all for the information. Our VMs are small, but after your examples I understand that 480GB SSDs are too small, so we will evaluate at least 960GB SSDs.
     
  10. alessice

    alessice New Member
    Proxmox Subscriber

    Joined:
    Sep 18, 2015
    Messages:
    13
    Likes Received:
    1
    Hi,

    based on my budget I have updated my configuration to 6 x 1.92TB Intel D3-S4510 SSDs on each of the 8 nodes, for a total of 48 SSDs, and 192GB of RAM per node.

    My question is: how much usable space can I count on for a safe environment with x3 replication?

    For the RAM, can I consider 64GB reserved for Ceph and Proxmox operations and 128GB available for the VMs?

    Thanks
     
    elmacus likes this.
  11. alexskysilk

    alexskysilk Active Member

    Joined:
    Oct 16, 2015
    Messages:
    572
    Likes Received:
    61
    Odds are that with this config, each system's SSDs will cost roughly half of the node cost. I would suggest that this may not be the best way to use your budget: not all of your VM data needs to be on Ceph SSDs, especially considering you have only about 40 VMs. Take a hard look at your data needs and deploy a NAS for the non-time-critical data/media/etc. This would allow you to deploy 4 Ceph OSD nodes plus a fifth, full-size node housing large HDDs. That node can serve your slow storage via NFS/iSCSI/SMB, PLUS act as a monitor; or, if you're really paranoid about sustaining a two-node failure, it could house a mix of OSDs and NAS HDDs simultaneously, but I wouldn't suggest that.
     
  12. alessice

    alessice New Member
    Proxmox Subscriber

    Joined:
    Sep 18, 2015
    Messages:
    13
    Likes Received:
    1
    Hi alexskysilk,

    thanks for your suggestions, which are probably right. The SSDs are half of my budget.

    I have updated my configuration like this for each of the 8 nodes:
    • CPU 2 x Intel Xeon 4114 10C/20T
    • RAM 12 x 16GB (192GB)
    • 6 x SSD Intel D3-S4510 960GB
    • 4 x 10Gbit SFP+
    • 2 x 128GB SATADOM for Proxmox
    and will create 64 VMs, each with about:
    • 4 vCPU
    • 16GB RAM
    • 150GB HDD
    with a replica of 3 for Ceph and the ability to keep working with 2 failed nodes.

    I evaluated external storage like NetApp, which we already use, but I prefer to keep all data on SSD, and NetApp is very expensive.

    What do you think?
    Thanks
     
  13. alexskysilk

    alexskysilk Active Member

    Joined:
    Oct 16, 2015
    Messages:
    572
    Likes Received:
    61
    Your VM projection doesn't address how much space you'll actually use, which means I can't project how much you can overprovision. My thoughts can't be relevant until I understand:
    1. How much space you will be using for ESSENTIAL data (read: boot OS and databases)
    2. How much space you need for NON-ESSENTIAL data (read: media, non-latency-critical data)
    3. What your projected growth rate is for both
    4. How easy/difficult it would be for you to add nodes in the future (is the cluster locally on-prem for you, is it in a colo 2000 miles away, etc.)

    edit: ignore the below :)
    [QUOTE="alessice, post: 239115, member: 34195"]with a replica of 3 for Ceph and the ability to keep working with 2 failed nodes.[/QUOTE]
    A 3/2 RG arrangement will not LOSE data with two nodes down; it just means you won't be back in production until you bring up at least one more node. To be able to sustain two node failures and remain in operation you would need a 4/2 RG arrangement, which means you'll only get 25% utilization from your disks. The only other way to achieve this level of fault tolerance is by using erasure coded groups, which Proxmox does not support for main disks, but that could be workable for multiple-disk VMs (and may be a proposed option depending on how you answer the above questions).
     
    #13 alexskysilk, Feb 12, 2019
    Last edited: Feb 12, 2019
  14. Ronny Aasen

    Ronny Aasen New Member

    Joined:
    Mar 15, 2018
    Messages:
    6
    Likes Received:
    0
    If you do need simple NAS-style storage, you can do what I do: 8 nodes with some SSD OSDs for the VM RBD images, and some large spinning OSDs for slow storage.

    Using the OSD device classes I can place pools on either SSD or spinning disk. I have these pools:
    rbd: 3x replication on SSD
    cephfs-metadata: 3x replication on SSD
    cephfs-ec-data: k=4, m=2 erasure coded pool on HDD, giving 66% of raw capacity for the data in this pool
    rbd-hdd: 3x replication on HDD, for secondary VM images where performance does not matter that much.

    The CephFS is mounted wherever needed on Linux, and mounted and re-exported with Samba for Windows clients. The Samba exporter is just a VM in Proxmox.

    Ceph is very flexible in this way, but that flexibility comes at the cost of more complexity. Handling 2 device classes is not that easy in the Proxmox GUI; you will need to use the CLI more (see the sketch below). It also adds monitoring overhead, since you need to watch the classes separately: the SSDs can fill up while the HDDs do not, so the "average" Ceph fill % looks normal while the SSD pools are actually about to burst...
    So if the budget is loose, you may want to keep it simple and use all flash.
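
    For reference, here is roughly how such a class-based layout can be created (a sketch only, driving the ceph CLI from Python; the rule and profile names are made-up examples and the PG counts depend on your cluster):
    Code:
    import subprocess

    def ceph(*args):
        """Run a ceph CLI command and fail loudly if it errors."""
        subprocess.run(["ceph", *args], check=True)

    # CRUSH rules bound to a device class (example rule names)
    ceph("osd", "crush", "rule", "create-replicated", "replicated-ssd", "default", "host", "ssd")
    ceph("osd", "crush", "rule", "create-replicated", "replicated-hdd", "default", "host", "hdd")

    # replicated pools on SSD
    ceph("osd", "pool", "create", "rbd", "128", "128", "replicated", "replicated-ssd")
    ceph("osd", "pool", "create", "cephfs-metadata", "64", "64", "replicated", "replicated-ssd")

    # k=4, m=2 erasure coded data pool on HDD (~66% of raw capacity)
    ceph("osd", "erasure-code-profile", "set", "ec42-hdd", "k=4", "m=2", "crush-device-class=hdd")
    ceph("osd", "pool", "create", "cephfs-ec-data", "128", "128", "erasure", "ec42-hdd")
    ceph("osd", "pool", "set", "cephfs-ec-data", "allow_ec_overwrites", "true")

    # slow replicated pool on HDD for secondary VM images
    ceph("osd", "pool", "create", "rbd-hdd", "128", "128", "replicated", "replicated-hdd")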
     