Ceph placement groups and usable storage capacity

Discussion in 'Proxmox VE: Installation and configuration' started by Yvon, Dec 21, 2018.

  1. Yvon

    Yvon Member

    Joined:
    Dec 20, 2018
    Messages:
    32
    Likes Received:
    0
    Hi all,
    I'm new to Ceph and I want to ask you a few questions:

    I have a test PVE cluster of 3 nodes with Ceph storage and I want to host several VMs on it.
    I have 3 OSDs of 500 GB each.
    I want to know what placement groups are and how they interact with the OSDs.
    The "size" parameter when creating a pool is unclear to me.
    How can I know for sure how much usable storage I have?
    I've tried the PG calculator, but I'm getting confused.
     

  2. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,309
    Likes Received:
    206
    http://docs.ceph.com/docs/mimic/rados/operations/placement-groups/

    http://docs.ceph.com/docs/luminous/rados/operations/pools/#set-the-number-of-object-replicas

    http://docs.ceph.com/docs/luminous/rados/operations/monitoring/#checking-a-cluster-s-usage-stats
    For planning: http://florian.ca/ceph-calculator/

    ( ( Target PGs per OSD ) x ( OSD # ) x ( %Data ) ) / ( Size ) = Total PG Count
    The total PG count has to be divided among the pools that will reside on the cluster, according to their planned %-usage, e.g. the more data a pool needs to hold, the more PGs that pool needs. The calculator has the legend for the calculation below it: https://ceph.com/pgcalc/
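
    For example, with your test cluster (3 OSDs, one pool holding ~100% of the data, size 3 and the usual target of ~100 PGs per OSD), that works out to roughly:

    ( 100 x 3 x 1.00 ) / 3 = 100 -> rounded to the next power of two = 128 PGs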

    In general I recommend the architecture guide.
    http://docs.ceph.com/docs/luminous/architecture/
     
  3. Yvon

    Yvon Member

    Joined:
    Dec 20, 2018
    Messages:
    32
    Likes Received:
    0
    Thanks!
     
  4. Yvon

    Yvon Member

    Joined:
    Dec 20, 2018
    Messages:
    32
    Likes Received:
    0
    Now let's say I have 1 TB worth of VMs.
    How much space would I need per node with a replication size of 3 to safely run my cluster?
    Regarding the calculator and what you said, no OSD (or node?) should store more than a third of the total data, for safety, right?
    I want to know whether Ceph is really efficient in terms of storage space and how much extra storage I should plan for N TB (or GB) of data.
     
  5. sb-jw

    sb-jw Active Member

    Joined:
    Jan 23, 2018
    Messages:
    547
    Likes Received:
    47
    If you need 1 TB of storage capacity for all of your VMs, then you need 3 TB of raw Ceph storage with replica 3. So, for example, you need to install 1x 1 TB disk per node.

    But normally you will install more drives, and at that point it gets a little more complicated. You want to configure your CRUSH map to store replicas on a per-node basis, not per OSD.

    So say you have 2x 500 GB disks per node, a replica count of 3 and one replica stored per node. Then you should not use more than 40% - 50% of the space per node. If OSD #1 is 55% full and OSD #2 is 50% full and you have an OSD failure, Ceph will try to rebalance the data onto the other OSD, but this will fail: 55% + 50% = 105%, so the disk would be more than full.

    So it really depends on your final setup.
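
    (The failure domain is defined by the pool's CRUSH rule. A quick way to check it - "replicated_rule" is the usual default rule name, yours may differ:

    ceph osd crush rule dump replicated_rule

    In the output, the chooseleaf step should show "type": "host" if replicas are placed per node.)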
     
  6. Yvon

    Yvon Member

    Joined:
    Dec 20, 2018
    Messages:
    32
    Likes Received:
    0
    Thanks for the reply!

    I wanted to know how space/cost-efficient Ceph is and how to plan for my future usage.

    I think that covers most of my questions for now...
     
  7. sb-jw

    sb-jw Active Member

    Joined:
    Jan 23, 2018
    Messages:
    547
    Likes Received:
    47
    In normal cases (replicated pools), Ceph isn't very space-efficient. You can use Ceph with erasure coding, then you save space and cost. AFAIK PVE cannot work with EC pools - correct me if I'm wrong.
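
    (For reference, on the Ceph side an EC pool is created roughly like this - the profile and pool names here are just placeholders:

    ceph osd erasure-code-profile set ec-profile k=2 m=1 crush-failure-domain=host
    ceph osd pool create ecpool 64 64 erasure ec-profile

    With only 3 nodes you would be limited to something like k=2, m=1, which only tolerates a single failure.)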

    As you can see from my example, it depends on the final setup. Once you have a final setup, let us know and I will take a look at it :)
     
  8. Yvon

    Yvon Member

    Joined:
    Dec 20, 2018
    Messages:
    32
    Likes Received:
    0
    I'm back with a POC for my company. We will install a 3-node Ceph cluster to get highly available VMs. As I've been told, the final setup should look like this:

    2 powerful servers running OSDs AND monitors, plus a less powerful server for the third monitor.

    Config of the powerful servers running OSDs and monitors:
    2x Xeon Silver 4114 2.2 GHz
    128 GB of RAM
    1 TB 7200 rpm in RAID 1 for Proxmox
    2x 4 TB 7200 rpm for OSDs
    1 Gbit NIC
    10 Gbit NIC

    For the monitor-only server, I don't know the configuration yet.

    So let me know how I should set up my pools and placement groups (I guess for 4 OSDs I should go for 128 PGs, or maybe 256).
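
    (Applying the formula from earlier in the thread as a sanity check, assuming one pool with ~100% of the data and size 3: ( 100 x 4 x 1.00 ) / 3 ≈ 133, which rounds to the nearest power of two, 128, so 128 PGs seems right.)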

    For the high availability setup, since the third server will not host OSDs and has low processing power, I will exclude it from the HA group.
     
    #8 Yvon, Jan 11, 2019
    Last edited: Jan 11, 2019
  9. joshin

    joshin Member
    Proxmox Subscriber

    Joined:
    Jul 23, 2013
    Messages:
    92
    Likes Received:
    8
    Consider a decent (small) SSD on each node for the journal. Otherwise your write performance will reaaally suck.

    As it is, it's not going to be great with only 6 spindles of spinning rust.
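
    (With bluestore the equivalent is putting the OSD's DB/WAL on the SSD. A minimal sketch of that with plain ceph-volume - device names are placeholders, on PVE you would normally do it through pveceph / the GUI:

    ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1
    )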



     
  10. Yvon

    Yvon Member

    Joined:
    Dec 20, 2018
    Messages:
    32
    Likes Received:
    0
    I'll consider it, since a 20 GB VM takes 10-15 minutes to recover from a node shutdown.

    Another question: my HA setup is functional, but when a node shuts down the VM is restarted on another node. Is there any way to keep the VM live?
     
  11. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,269
    Likes Received:
    117
    You can always (live-)migrate the VMs before shutting down the node.
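
    For example, from the CLI (VM ID and target node are placeholders):

    qm migrate 100 pve2 --online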
     
  12. Yvon

    Yvon Member

    Joined:
    Dec 20, 2018
    Messages:
    32
    Likes Received:
    0
    What I meant was an unexpected power-down. I want as little downtime as possible for my VMs and, from what I've seen, Ceph's auto-healing process can take a while.
     
  13. Yvon

    Yvon Member

    Joined:
    Dec 20, 2018
    Messages:
    32
    Likes Received:
    0
    Also, when creating a pool for my VMs, what is the best size/min_size so that I get as much storage space as possible?

    In my test cluster, I have 3x 500 GB hard drives, so roughly 456 GB each for a total of 1.36 TB available.
    When I create my pool with a size of 2 and a min_size of 2, the pool capacity is only 485 GB, so a bit more than a third of my raw storage capacity for only 2 replicas. Why, with a size of 2, don't I get half of my 1.36 TB as pool storage?
     
  14. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,309
    Likes Received:
    206
    Ceph's auto-healing works independently of VM migration. VMs under HA will restart on a different node, as long as the Ceph storage is in RW mode.

    By default, replication is done per host.
     
  15. Yvon

    Yvon Member

    Joined:
    Dec 20, 2018
    Messages:
    32
    Likes Received:
    0
    Alright, so how can I maximise Ceph's efficiency to get as much storage as possible for my pool? I tried to set the default replication size as follows, but maybe I'm wrong:

    [global]
    auth client required = cephx
    auth cluster required = cephx
    auth service required = cephx
    cluster network = 192.168.21.0/24
    fsid = 8be2685d-1a53-4ec2-9596-735f1a22dab3
    keyring = /etc/pve/priv/$cluster.$name.keyring
    mon allow pool delete = true
    osd journal size = 5120
    osd pool default min size = 2
    osd pool default size = 2
    public network = 192.168.21.0/24

    [mds]
    keyring = /var/lib/ceph/mds/ceph-$id/keyring

    [osd]
    keyring = /var/lib/ceph/osd/ceph-$id/keyring

    [mon.pve3]
    host = pve3
    mon addr = 192.168.21.211:6789

    [mon.pve2]
    host = pve2
    mon addr = 192.168.21.221:6789

    [mon.pve1]
    host = pve1
    mon addr = 192.168.21.199:6789
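
    (If I understand the docs correctly, these "osd pool default" values only apply to pools created after the change; for an existing pool I would have to change it with something like:

    ceph osd pool set ceph-vm size 2
    ceph osd pool set ceph-vm min_size 2
    )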


    Another thing: I have two pools with the same size values, but somehow they don't have the same capacity:
    upload_2019-1-14_16-5-33.png

    ceph-vm pool:

    upload_2019-1-14_16-6-6.png

    test3 pool: upload_2019-1-14_16-6-35.png

    When I run the ceph df command, the capacity doesn't match what I get from the GUI:
    upload_2019-1-14_16-46-52.png

    Unless, as you said, replication is done per host and the 486 GB is what I have on pve1, but that doesn't match my 456 GB hard drive.
     
    #15 Yvon, Jan 14, 2019
    Last edited: Jan 14, 2019
  16. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,309
    Likes Received:
    206
    Buy bigger/more disks.

    486 GiB + 150 GiB = 636 GiB of data that could be stored. 636 GiB x 2 (replica) = 1272 GiB of raw space. A little overhead for the OSD DB/WAL needs to be added too.
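
    Roughly speaking: the "MAX AVAIL" that ceph df shows per pool is the remaining raw space already divided by the pool's replica size, and it is shared by all pools living on the same OSDs. The GUI adds each pool's own usage on top of that, which is why two pools with identical size values can show different totals.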
     
  17. Yvon

    Yvon Member

    Joined:
    Dec 20, 2018
    Messages:
    32
    Likes Received:
    0
    Thanks!

    So if I understand correctly, to reduce the overhead I should reduce the pool size from 3 to 2, to have 1 object and its replica.
    I'm still not sure how size / min_size works. From what I understood, size is the total number of copies, the original object included (so 2 replicas plus the object with size 3), and min_size sets the minimum number of copies required for I/O. Am I right?
     
  18. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,309
    Likes Received:
    206
    The replica count is not overhead. You simply need more or bigger disks to gain more space.

    Exactly. This ensures that at least two copies need to be available for IO.
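
    In practice: with size 3 / min_size 2, IO keeps running when one copy is unavailable; with size 2 / min_size 2 (as in the config above), IO on the affected PGs pauses as soon as one copy is down, until recovery has restored the second copy. You can check the current values per pool with:

    ceph osd pool get ceph-vm size
    ceph osd pool get ceph-vm min_size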
     
    Yvon likes this.
  19. Yvon

    Yvon Member

    Joined:
    Dec 20, 2018
    Messages:
    32
    Likes Received:
    0
    Thanks a lot for your help, Alwin!

    So for now there is no other way to achieve a bigger pool capacity than getting bigger (or more) disks; size and min_size don't matter that much here.

    In future releases, will Proxmox support Mimic or erasure-coded pools, by any chance?
     
  20. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,309
    Likes Received:
    206
    No plans for that.
     