Ceph placement groups and usable storage capacity

Yvon

Member
Dec 20, 2018
Hi all,
I'm new to Ceph and I want to ask you a few questions:

I have a test PVE cluster of 3 nodes with Ceph storage and I want to host several VMs on it.
I have 3 OSDs of 500 GB each.
I want to know what placement groups are and how they interact with the OSDs.
The "size" parameter when I create a pool is vague to me.
How can I know for sure how much "usable" storage I have?
I've tried the PG calculator but I'm getting confused.
 

Attachments

  • 2018-12-21 10_33_28-pve3 - Proxmox Virtual Environment.png
I want to know what placement groups are and how they interact with the OSDs.
http://docs.ceph.com/docs/mimic/rados/operations/placement-groups/

The "size" parameter when I create a pool is vague to me.
http://docs.ceph.com/docs/luminous/rados/operations/pools/#set-the-number-of-object-replicas

How can I know for sure how much "usable" storage I have?
http://docs.ceph.com/docs/luminous/rados/operations/monitoring/#checking-a-cluster-s-usage-stats
For planning: http://florian.ca/ceph-calculator/

I've tried the PG calculator but I'm getting confused.
( ( Target PGs per OSD ) x ( OSD # ) x ( %Data ) ) / ( Size ) = Total PG Count
The total PG count then has to be split across the pools that will reside on the cluster according to their planned %-usage; e.g. the more data a pool needs to hold, the more PGs that pool needs. The calculator has the legend for the calculation below it. https://ceph.com/pgcalc/
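
As a worked example of that formula (a quick Python sketch; the single pool holding 100% of the data, the target of 100 PGs per OSD and the replica size of 3 are assumed values for the 3-OSD test cluster described above):

# Worked example of the pgcalc formula above (assumed values, single pool).
target_pgs_per_osd = 100   # common target used by the pgcalc tool
osd_count = 3              # three 500 GB OSDs in the test cluster
data_percent = 1.0         # one pool holding 100% of the data
size = 3                   # replica count of the pool

total_pg = target_pgs_per_osd * osd_count * data_percent / size
print(total_pg)            # 100.0 -> the calculator then rounds to a power of two, i.e. pg_num = 128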

In general I recommend the architecture guide.
http://docs.ceph.com/docs/luminous/architecture/
 
Now let's say I have 1 TB worth of VMs.
How much space do I need per node with a replication size of 3 to safely run my cluster?
Regarding the calculator and what you said, no OSD (or node?) should store more than a third of the total data for safety, right?
I want to know if Ceph is really efficient in terms of storage space and how much extra storage I should plan for N TB (or GB) of data.
 
If you need 1 TB of storage capacity for all of your VMs, then you need 3 TB of Ceph storage with replica 3. So, for example, you need to install one 1 TB disk per node.

But normally you will install more drives, and at this point it gets a little more complicated. You want to configure your CRUSH map to store replicas per node, not per OSD.

So say you have 2x 500 GB disks per node, a replica count of 3, and one replica stored per node. Then you cannot use more than 40-50% of the space per node: if OSD #1 is 55% full and OSD #2 is 50% full and one OSD fails, Ceph will try to rebalance the data onto the remaining OSD, but this will fail, because 55% + 50% = 105% and the disk would be more than full.

So it really depends on your final setup.
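
To put rough numbers on that (a minimal sketch, assuming 3 nodes with 2x 500 GB OSDs each, replica 3 and one replica per node; it ignores Ceph's near-full margins and DB/WAL overhead and is not an official sizing formula):

# Rough capacity sketch for the example above (assumed numbers, illustration only).
nodes = 3
osds_per_node = 2
osd_size_gb = 500
replica_size = 3

raw_total_gb = nodes * osds_per_node * osd_size_gb   # 3000 GB of raw disk
usable_gb = raw_total_gb / replica_size              # 1000 GB of VM data before overhead

# To survive the loss of one OSD inside a node, that OSD's data must still fit
# on the node's remaining OSDs, so each node should stay below this fill level:
safe_fill = (osds_per_node - 1) / osds_per_node      # 0.5
safe_usable_gb = usable_gb * safe_fill               # ~500 GB, i.e. the 40-50% mentioned above
print(raw_total_gb, usable_gb, safe_usable_gb)       # 3000 1000.0 500.0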
 
Thanks for the reply !

I wanted to know how space/cost-efficient ceph is and how to plan for my future usage.

I think it covered most of my questions for now...
 
I wanted to know how space/cost-efficient ceph is and how to plan for my future usage.
In normal cases, Ceph isn't space-efficient. You can use Ceph with erasure coding, then you will save space and cost. AFAIK PVE cannot work with EC pools - correct me if I'm wrong.

As you can see from my example, it depends on the final setup. Once you have a final setup, let us know and I will take a look at it :)
 
I'm back with a POC for my company. We will install a 3-node Ceph cluster to get highly available VMs. As I've been told, the final setup should look like this:

2 powerful servers running OSDs AND monitors, and a less powerful server for the third monitor

Config of the powerful servers running OSDs and monitors:
2x Xeon Silver 4114 2.2 GHz
128 GB of RAM
1 TB 7200 rpm in RAID 1 for Proxmox
2x 4 TB 7200 rpm for OSDs
1 Gbit NIC
10 Gbit NIC

For the monitor-only server, I don't know the configuration yet.

So let me know how I should set up my pools and placement groups (I guess for 4 OSDs, I should go for 128 PGs or maybe 256).

For the high availability setup, since the third server will not host OSDs and will have low processing power, I will exclude it from the HA group.
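
Plugging that 4-OSD layout into the pgcalc formula quoted earlier (assuming a single RBD pool holding 100% of the data, a target of 100 PGs per OSD and replica 3; just a sketch, not an official recommendation):

# Same pgcalc formula as above, applied to the planned 4-OSD setup (assumed values).
target_pgs_per_osd = 100
osd_count = 4
data_percent = 1.0
size = 3

total_pg = target_pgs_per_osd * osd_count * data_percent / size
print(total_pg)   # ~133.3 -> nearest power of two is 128, matching the guess above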
 
Consider a decent (small) SSD on each node for the journal. Otherwise your write performance will really suck.

As it is, it's not going to be great with only 6 spindles of spinning rust.
 
Consider a decent (small) SSD on each node for the journal. Otherwise your write performance will really suck.

As it is, it's not going to be great with only 6 spindles of spinning rust.

I'll consider it since a 20 GB VM takes 10-15 mins to recover from a node shutdown.

Another question: my HA setup is functional, but when a node shuts down, the VM is restarted on another node. Is there any way to keep the VM live?
 
Another question: my HA setup is functional, but when a node shuts down, the VM is restarted on another node. Is there any way to keep the VM live?
You can always (live-)migrate the VMs before shutting down the node.
 
What I meant was an unexpected power-down. I want as little downtime as possible for my VMs, and from what I've seen, Ceph's auto-healing process can take a while.
 
Also, when I'm creating a pool for my VMs, what is the best size/min_size so I can get as much storage space as possible?

In my test cluster, I have 3x 500 GB hard drives, so roughly 456 GB each for a total of 1.36 TB available.
When I create my pool with a size of 2 and a min_size of 2, the pool capacity is only 485 GB, so a bit more than a third of my raw storage capacity for only 2 replicas. So why, with a size of 2, don't I get half of my 1.36 TB as pool storage?
 

Attachments

  • 2019-01-14 14_29_29-pve1 - Proxmox Virtual Environment.png
  • 2019-01-14 14_29_59-pve1 - Proxmox Virtual Environment.png
What I meant was an unexpected power-down. I want as little downtime as possible for my VMs, and from what I've seen, Ceph's auto-healing process can take a while.
Ceph's auto-healing works independently of VM migration. VMs under HA will restart on a different node, as long as the Ceph storage is in RW mode.

In my test cluster, I have 3x 500 GB hard drives, so roughly 456 GB each for a total of 1.36 TB available.
When I create my pool with a size of 2 and a min_size of 2, the pool capacity is only 485 GB, so a bit more than a third of my raw storage capacity for only 2 replicas. So why, with a size of 2, don't I get half of my 1.36 TB as pool storage?
By default replication is done per host.
 
Ceph's auto-healing works independently of VM migration. VMs under HA will restart on a different node, as long as the Ceph storage is in RW mode.


By default replication is done per host.

Alright, so how can I maximise Ceph's efficiency to get as much storage as possible for my pool? I tried to set the default replication size as follows, but maybe I'm wrong:

[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 192.168.21.0/24
fsid = 8be2685d-1a53-4ec2-9596-735f1a22dab3
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 2
public network = 192.168.21.0/24

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.pve3]
host = pve3
mon addr = 192.168.21.211:6789

[mon.pve2]
host = pve2
mon addr = 192.168.21.221:6789

[mon.pve1]
host = pve1
mon addr = 192.168.21.199:6789


Another thing: I have two pools with the same size values, but somehow they don't have the same capacity. (screenshot)

ceph-vm pool: (screenshot)

test3 pool: (screenshot)

When I run the ceph df command, the capacity doesn't match what I get from the GUI: (screenshot)

Unless, as you said, replication is done per host and the 486 GB is how much I have on pve1, but that doesn't match my 456 GB hard drive.
 
Alright, so how can I maximise Ceph's efficiency to get as much storage as possible for my pool? I tried to set the default replication size as follows, but maybe I'm wrong:
Buy bigger/more disks.

Unless, as you said, replication is done per host and the 486 GB is how much I have on pve1, but that doesn't match my 456 GB hard drive.
486 GiB + 150 GiB = 636 GiB of data that could be stored. 636 GiB x 2 (replica) = 1272 GiB of raw space. A little overhead for the OSD DB/WAL needs to be added too.
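
Spelled out (a quick sanity check of those figures; the 486 GiB available and 150 GiB used values are taken from the post above, and the DB/WAL overhead is ignored):

# Sanity check of the numbers above (DB/WAL overhead ignored).
max_avail_gib = 486     # space the pool can still store (value from the post above)
used_gib = 150          # data already stored
replica_size = 2

storable_gib = max_avail_gib + used_gib        # 636 GiB of data in total
raw_behind_gib = storable_gib * replica_size   # 1272 GiB of raw disk behind it
print(storable_gib, raw_behind_gib)            # 636 1272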
 
Thanks!

So if I understand correctly, to reduce the overhead I should reduce the pool size from 3 to 2, to have one object and its replica.
I'm still not sure how size/min_size works. From what I understood, size is the number of replicas with the original object included (so 2 replicas plus the object), and min_size sets the minimum number of replicas required for I/O. Am I right?
 
So if I understand correctly, to reduce the overhead I should reduce the pool size from 3 to 2, to have one object and its replica.
The replica count is not overhead. You simply need more or bigger disks to gain more space.

From what I understood, size is the number of replicas with the original object included (so 2 replicas plus the object), and min_size sets the minimum number of replicas required for I/O. Am I right?
Exactly. This ensures that at least two copies must be available for IO to happen.
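
As a tiny illustration of that rule (purely schematic Python, not actual Ceph code; it just restates "IO is allowed while at least min_size copies are available", using the pool values discussed here):

# Schematic illustration of size / min_size on the pool discussed here.
size = 2       # copies Ceph keeps per object
min_size = 2   # copies that must be available for the PG to accept IO

def io_allowed(copies_online: int) -> bool:
    return copies_online >= min_size

for copies in (2, 1):
    print(copies, "copies online ->", "IO allowed" if io_allowed(copies) else "IO blocked")
# 2 copies online -> IO allowed
# 1 copies online -> IO blocked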
 
Thanks a lot for your help, Alwin!

So for now there is no other way to achieve a bigger pool capacity than getting bigger disks; the size and min_size don't matter that much.

In future releases, will Proxmox support Mimic or erasure-coded pools by any chance?
 
