Ceph placement groups and usable storage capacity

Yvon

Alright.
I think that cleared up a lot of my questions, thanks a lot again!
 

Yvon

Alright.
By the way, is cache tiering interesting with replicated pools (I will store VMs on them), since I'll mostly use hard drives (maybe SSDs for logs)?
I may invest in a few SSDs to create a cache tier to increase performance.
 

Yvon

Never mind:
KNOWN BAD WORKLOADS
The following configurations are known to work poorly with cache tiering.

  • RBD with replicated cache and erasure-coded base: This is a common request, but usually does not perform well. Even reasonably skewed workloads still send some small writes to cold objects, and because small writes are not yet supported by the erasure-coded pool, entire (usually 4 MB) objects must be migrated into the cache in order to satisfy a small (often 4 KB) write. Only a handful of users have successfully deployed this configuration, and it only works for them because their data is extremely cold (backups) and they are not in any way sensitive to performance.
  • RBD with replicated cache and base: RBD with a replicated base tier does better than when the base is erasure coded, but it is still highly dependent on the amount of skew in the workload, and very difficult to validate. The user will need to have a good understanding of their workload and will need to tune the cache tiering parameters carefully.
I found that yesterday a few minutes after posting my question...
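
To put a rough number on the amplification the docs describe (my own back-of-the-envelope, using their figures): promoting one ~4 MB object into the cache to serve a single ~4 KB write moves

4 MB / 4 KB = 4096 KB / 4 KB = 1024×

more data than the client actually wrote, which is why cold, small-write-heavy RBD workloads do so badly on a cache tier.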
 

Yvon

So for now I think I'll stick to hard drives and maybe SSDs for journals.

But I wonder what is written in those journals. Is it logs or metadata, and will storing the journal on the same disk as the OSD really impact performance?
 

Yvon

Don't, create two pools: one for the SSDs (enterprise class) and one for the HDDs.
http://docs.ceph.com/docs/luminous/rados/operations/cache-tiering/#known-bad-workloads

Seems like a good idea, since running applications that need high I/O on spinning disks would be nonsense. I thought about something like this:

[attached diagram: upload_2019-1-16_16-23-52.png]

Obviously I'll need to modify the CRUSH map to map a specific pool to a specific type of OSD.

I found what I was looking for: https://forum.proxmox.com/threads/ceph-ssd-and-hdd-pools.42032/
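
For reference, since Luminous this doesn't even require hand-editing the CRUSH map: rules can target a device class straight from the CLI. A minimal sketch (the pool names and PG counts are just examples):

# One replicated rule per device class, on the default root with host failure domain
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd crush rule create-replicated replicated_ssd default host ssd

# Create a pool on each rule...
ceph osd pool create vm_hdd 128 128 replicated replicated_hdd
ceph osd pool create vm_ssd 128 128 replicated replicated_ssd

# ...or point an existing pool at a rule
ceph osd pool set vm_hdd crush_rule replicated_hdd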
 

Yvon

rule replicated_ssd {
    id 2
    type replicated
    min_size 1
    max_size 10
    step take default class ssd
    step chooseleaf firstn 0 type host
    step emit
}

I'm not sure what the two step lines (step take default class ssd and step chooseleaf firstn 0 type host) do.
 

Yvon

Aside from an SSD-only pool, is it possible when creating a pool to target specific OSDs? Or does that go against the fundamental principle of Ceph, which is to dynamically rebalance data?
 

Yvon

Yeah, so you can make a pool which targets a specific device class (HDD, SSD, NVMe), but you can't target a specific OSD by its ID or its name in the CRUSH map, which makes sense since an average production cluster hosts way more than 10 OSDs.

By the way, SAS and SATA hard drives would both still be considered HDDs by Ceph, right?
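
For what it's worth, the class Ceph assigned to each OSD can be checked, and overridden, from the CLI; the osd.2 below is just an example target:

# The CLASS column shows hdd/ssd/nvme for every OSD
ceph osd tree

# To treat a drive differently, remove its class and set a new one
ceph osd crush rm-device-class osd.2
ceph osd crush set-device-class sas osd.2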
 

Yvon

Is it possible to select which OSDs a pool will be placed on (in case I stay with a hard-drive-only cluster)?

Targeting a host instead of a class would be good enough.
 

Alwin

Proxmox Staff Member
Yeah, so you can make a pool which targets a specific device class (HDD, SSD, NVMe), but you can't target a specific OSD by its ID or its name in the CRUSH map, which makes sense since an average production cluster hosts way more than 10 OSDs.
OSD IDs are reusable and not fixed. When you edit the CRUSH map, things like this can be possible. If you don't have a deep understanding of what will be happening with your data, I advise against it. To give you an idea, an older post, but still valid for the most part:
http://cephnotes.ksperis.com/blog/2015/02/02/crushmap-example-of-a-hierarchical-cluster-map

Is it possible to select which OSDs a pool will be placed on (in case I stay with a hard-drive-only cluster)?
See above.
 

Yvon

Thanks Alwin!

From what I've read, the data placement and replication with such a scenario would be awful.

https://www.sebastien-han.fr/blog/2012/12/07/ceph-2-speed-storage-with-crush/

This article looks promising, without the uneven-replication downside of the link you posted.
Since OSDs are numbered (device 0 osd.0 class hdd, device 1 osd.1 class hdd), if an OSD fails and gets removed, do the numbers of all remaining OSDs stay the same, or are they decremented by 1?
 

Yvon

I tried to aggregate OSDs into buckets like this:
pool ssd {
    id -9
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 0.455
    item osd.2 weight 0.454
}

pool sas {
    id -10
    alg straw2
    hash 0  # rjenkins1
    item osd.1 weight 0.455
    item osd.3 weight 0.454
}

with a rule for each of them:

rule ssd {
    ruleset 3
    type replicated
    min_size 1
    max_size 10
    step take ssd
    step choose firstn 0 type osd
    step emit
}

rule sas {
    ruleset 4
    type replicated
    min_size 1
    max_size 10
    step take sas
    step choose firstn 0 type osd
    step emit
}

But as soon as I try to recompile the CRUSH map, I get this error: bucket type 'pool' is not defined
 

Yvon

I found why Ceph printed that error message: at the beginning of the CRUSH map, I need to add pool to the bucket types:

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
type 11 pool
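
For anyone following along, the full edit cycle looks roughly like this (standard commands; the file names are arbitrary):

# Dump and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# ... edit crushmap.txt (add the 'type 11 pool' line, the buckets and the rules) ...

# Recompile and inject it back into the cluster
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin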
 
