Ceph placement groups and usable storage capacity

Yvon
Member · Dec 20, 2018
Hi all,
I'm new to Ceph and I want to ask you a few questions:

I have a test PVE cluster of 3 nodes with Ceph storage, and I want to host several VMs on it.
I have 3 OSDs of 500 GB each.
I want to know what placement groups are and how they interact with the OSDs.
The "size" parameter when I create a pool is vague to me.
How can I know for sure how much usable storage I have?
I've tried the PG calculator but I'm getting confused.
 

Attachment: 2018-12-21 10_33_28-pve3 - Proxmox Virtual Environment.png
I want to know what placement groups are and how they interact with the OSDs.
http://docs.ceph.com/docs/mimic/rados/operations/placement-groups/

The "size" parameter when I create a pool is vague to me.
http://docs.ceph.com/docs/luminous/rados/operations/pools/#set-the-number-of-object-replicas

How can I know for sure how much usable storage I have?
http://docs.ceph.com/docs/luminous/rados/operations/monitoring/#checking-a-cluster-s-usage-stats
For planning: http://florian.ca/ceph-calculator/
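As a rough sketch for a 3x 500 GB test cluster like yours (assuming one replicated pool and roughly equal OSDs), the usable capacity is about the raw capacity divided by the pool's size:

ceph df
# GLOBAL shows raw size and usage, POOLS shows per-pool USED and MAX AVAIL
# 3 x 500 GB = 1.5 TB raw; with size=3 that leaves roughly 500 GB of usable space,
# a bit less once OSD overhead and the headroom needed for recovery are taken off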

I've tried the PG calculator but I'm getting confused.
( ( Target PGs per OSD ) x ( OSD # ) x ( %Data ) ) / ( Size ) = Total PG Count
The total PG count then has to be distributed among the pools that will reside on the cluster, weighted by their planned %-usage. E.g. the more data a pool needs to hold, the more PGs that pool needs. The calculator at https://ceph.com/pgcalc/ has the legend for the calculation.
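As a worked example (assuming a target of 100 PGs per OSD, a single pool holding 100% of the data, and size 3 on your 3-OSD test cluster):

( 100 x 3 x 1.00 ) / 3 = 100 -> rounded to the nearest power of two = 128 PGs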

In general I recommend the architecture guide.
http://docs.ceph.com/docs/luminous/architecture/
 
Now let's say I have 1 TB worth of VMs.
How much space do I need per node with a replication size of 3 to safely run my cluster?
Regarding the calculator and what you said, no OSD (or node?) should store more than a third of the total data for safety, right?
I want to know whether Ceph is really efficient in terms of storage space and how much extra storage I should plan for N TB (or GB) of data.
 
If you need 1 TB of storage capacity for all of your VMs, then you need 3 TB of Ceph storage with replica 3. So, for example, you need to install 1x 1 TB disk per node.

But normally you will install more drives, and at this point it gets a little more complicated. You want to configure your CRUSH map to store replicas per node, not per OSD.
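For reference, a sketch of what such a rule looks like (roughly the default replicated rule on a Luminous cluster; the id and root name may differ on your setup):

rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

The "chooseleaf firstn 0 type host" step is what puts every replica on a different node instead of just a different OSD.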

So say you have 2x 500 GB disks per node, a replica count of 3, and one replica stored per node. Then you cannot use more than about 40-50% of the space per node: if OSD #1 is 55% full and OSD #2 is 50% full and one of them fails, Ceph will try to rebalance the data onto the other OSD, but this will fail, because 55% + 50% = 105% and the disk would be more than full.

So it really depends on your final setup.
 
Thanks for the reply !

I wanted to know how space/cost-efficient Ceph is and how to plan for my future usage.

I think that covered most of my questions for now...
 
I wanted to know how space/cost-efficient Ceph is and how to plan for my future usage.
In the normal case, Ceph isn't space-efficient. You can use Ceph with erasure coding, which will save space and cost. AFAIK PVE cannot work with EC pools - correct me if I'm wrong.
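As a rough comparison (assuming, say, an EC profile with k=4 and m=2):

replica 3: usable = raw / 3, i.e. about 33% of the raw capacity
EC 4+2: usable = raw x k / (k + m) = raw x 4/6, i.e. about 67% of the raw capacity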

As you can see from my example, it depends on the final setup. Once you have a final setup, let us know, I will take a look at it :)
 
I'm back with a POC for my company. We will install a 3-node Ceph cluster to get highly available VMs. As I've been told, the final setup should look like this:

2 powerful servers running OSDs AND monitors, and a less powerful server for the third monitor

Config of the powerful servers running OSDs and monitors:
2x Xeon Silver 4114 2.2 GHz
128 GB of RAM
1 TB 7200 rpm in RAID 1 for Proxmox
2x 4 TB 7200 rpm for OSDs
1 Gbit NIC
10 Gbit NIC

For the monitor-only server, I don't know the configuration yet.

So let me know how I should set up my pools and placement groups (I guess for 4 OSDs I should go for 128 PGs, or maybe 256).

For the high availability setup, since the third server will not host OSDs and will have low processing power, I will exclude it from the HA group.
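(Rough check with the formula quoted above, assuming ~100 target PGs per OSD, one main pool holding ~100% of the data, and size 3:

( 100 x 4 x 1.00 ) / 3 ≈ 133 -> nearest power of two = 128 PGs)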
 
Consider a decent (small) SSD on each node for the journal. Otherwise your write performance will reaaally suck.

As it is, it's not going to be great with only 6 spindles of spinning rust.
 
Consider a decent (small) SSD on each node for the journal. Otherwise your write performance will reaaally suck.

As it is, it's not going to be great with only 6 spindles of spinning rust.

I'll consider it since a 20 GB VM takes 10-15 mins to recover from a node shutdown.

Another question: my HA setup is functional, but when a node shuts down the VM is restarted on another node. Is there any way to keep the VM live?
 
Another question: my HA setup is functional, but when a node shuts down the VM is restarted on another node. Is there any way to keep the VM live?
You can always (live-)migrate the VMs before shutting down the node.
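For a planned shutdown, something like this (VM ID and target node are just placeholders) does an online migration while the VM keeps running:

qm migrate 100 pve2 --online
# with the disks on Ceph, only the RAM state has to be transferred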
 
What I meant was an unexpected power-down. I want as little downtime as possible for my VMs, and from what I've seen, Ceph's auto-healing process can take a while.
 
Also, when I'm creating a pool for my VMs, what are the best size/min_size values so I can get as much storage space as possible?

In my test cluster, I have 3 500 GB hard drives, so roughly 456 GB each for a total of 1.36 TB available.
When I create my pool with a size of 2 and a min_size of 2, the pool capacity is only 485 GB, so a bit more than a third of my raw storage capacity for only 2 replicas. Why, with a size of 2, don't I get half of my 1.36 TB as pool storage?
 

Attachments: 2019-01-14 14_29_29-pve1 - Proxmox Virtual Environment.png, 2019-01-14 14_29_59-pve1 - Proxmox Virtual Environment.png
What I meant was an unexpected power-down. I want as little downtime as possible for my VMs, and from what I've seen, Ceph's auto-healing process can take a while.
Ceph's auto-healing works independently of VM migration. VMs under HA will restart on a different node, as long as the Ceph storage is in RW mode.

In my test cluster, I have 3 500 GB hard drives, so roughly 456 GB each for a total of 1.36 TB available.
When I create my pool with a size of 2 and a min_size of 2, the pool capacity is only 485 GB, so a bit more than a third of my raw storage capacity for only 2 replicas. Why, with a size of 2, don't I get half of my 1.36 TB as pool storage?
By default replication is done per host.
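You can check this on your pool (the pool name here is just an example, use your own):

ceph osd pool get ceph-vm size
ceph osd pool get ceph-vm crush_rule
ceph osd crush rule dump
# the rule used by the pool should contain a "chooseleaf ... type host" step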
 
Ceph's auto-healing works independently of VM migration. VMs under HA will restart on a different node, as long as the Ceph storage is in RW mode.


By default replication is done per host.

Alright, so how can I maximise Ceph's efficiency to get as much storage as possible for my pool? I tried to set the default replication size as follows, but maybe I'm wrong:

[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 192.168.21.0/24
fsid = 8be2685d-1a53-4ec2-9596-735f1a22dab3
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 2
public network = 192.168.21.0/24

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.pve3]
host = pve3
mon addr = 192.168.21.211:6789

[mon.pve2]
host = pve2
mon addr = 192.168.21.221:6789

[mon.pve1]
host = pve1
mon addr = 192.168.21.199:6789
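(If I understand the docs correctly, these "osd pool default" values only apply to pools created after the change; an existing pool has to be changed directly, for example:

ceph osd pool set ceph-vm size 2
ceph osd pool set ceph-vm min_size 2)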


Another thing: I have two pools with the same size values, but somehow they don't have the same capacity:
(screenshot: upload_2019-1-14_16-5-33.png)

ceph-vm pool:
(screenshot: upload_2019-1-14_16-6-6.png)

test3 pool:
(screenshot: upload_2019-1-14_16-6-35.png)

When I run the ceph df command, the capacity doesn't match what I get from the GUI:
(screenshot: upload_2019-1-14_16-46-52.png)

Unless, as you said, replication is done per host, and then the 486 GB is how much I have on pve1, but that doesn't match my 456 GB hard drive.
 
Alright, so how can I maximise Ceph's efficiency to get as much storage as possible for my pool? I tried to set the default replication size as follows, but maybe I'm wrong:
Buy bigger/more disks.

Unless, as you said, replication is done per host, and then the 486 GB is how much I have on pve1, but that doesn't match my 456 GB hard drive.
486 GiB + 150 GiB = 636 GiB of data that could be stored. 636 GiB x 2 (replica) = 1272 GiB of raw space. A little overhead for the OSD DB/WAL needs to be added too.
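As a rule of thumb (ignoring the reserved full/nearfull headroom and any imbalance between the OSDs):

usable data ≈ free raw space / pool size
e.g. ~1.36 TB raw / 2 (replica) ≈ ~680 GB of data, shared between all pools that sit on the same OSDs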
 
Thanks!

So if I understand correctly, to reduce the overhead I should reduce the pool size from 3 to 2, to have 1 object and its replica.
I'm still not sure how size / min_size works. From what I understood, size is the number of replicas with the object included (so 2 replicas plus the object with size 3), and min_size sets the minimum number of replicas required for I/O. Am I right?
 
So if I understand correctly, to reduce the overhead I should reduce the pool size from 3 to 2, to have 1 object and its replica.
The replica count is not overhead. You simply need more or bigger disks to gain more space.

From what I understood, size is the number of replicas with the object included (so 2 replicas plus the object with size 3), and min_size sets the minimum number of replicas required for I/O. Am I right?
Exactly. This ensures that there needs to be at least two copies to have IO.
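To spell it out for a 3-node cluster with one replica per host (simplified):

size 3, min_size 2: one node can be down and I/O continues; a second failure blocks I/O on the affected PGs until a copy is recovered
size 2, min_size 2: any single OSD/node failure already blocks I/O on the affected PGs until recovery
size 2, min_size 1: I/O continues on a single copy, which risks data loss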
 
Thanks a lot for your help Alwin !

So for now there is no other way to achieve a bigger pool capacity than getting bigger (or more) disks; size and min_size don't matter that much for that.

In future releases, will Proxmox support Mimic or erasure-coded pools by any chance?
 
