ceph pool and max available size

danielc

Member
Feb 28, 2018
Hello,

I am continuing to work on my Ceph servers with Proxmox, and I have run into a big concern regarding the MAX AVAIL value:

root@ceph1:~# ceph -s
cluster:
id:
health: HEALTH_OK

services:
mon: 5 daemons, quorum ceph1,ceph2,ceph3,ceph4,hv4
mgr: ceph1(active), standbys: ceph3, hv4, ceph4, ceph2
osd: 32 osds: 32 up, 32 in

data:
pools: 1 pools, 256 pgs
objects: 200k objects, 802 GB
usage: 3336 GB used, 113 TB / 116 TB avail
pgs: 256 active+clean

root@ceph1:~# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
116T 113T 3335G 2.80
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
hdd_new 15 802G 5.56 13631G 205626

While I have 116T in total, I can only use 13631G (13T). This is a really big concern.
How can I solve this problem and use more space? 116T vs 13T; you can see there is a very big gap here.
My crush map is as follows:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class hdd
device 25 osd.25 class hdd
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class hdd
device 31 osd.31 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ceph1 {
id -3 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
# weight 29.112
alg straw2
hash 0 # rjenkins1
item osd.0 weight 3.639
item osd.1 weight 3.639
item osd.2 weight 3.639
item osd.3 weight 3.639
item osd.4 weight 3.639
item osd.5 weight 3.639
item osd.6 weight 3.639
item osd.7 weight 3.639
}
host ceph2 {
id -5 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
# weight 29.112
alg straw2
hash 0 # rjenkins1
item osd.8 weight 3.639
item osd.9 weight 3.639
item osd.10 weight 3.639
item osd.11 weight 3.639
item osd.12 weight 3.639
item osd.13 weight 3.639
item osd.14 weight 3.639
item osd.15 weight 3.639
}
datacenter A {
id -2 # do not change unnecessarily
id -10 class hdd # do not change unnecessarily
# weight 58.218
alg straw2
hash 0 # rjenkins1
item ceph1 weight 29.109
item ceph2 weight 29.109
}
host ceph3 {
id -7 # do not change unnecessarily
id -11 class hdd # do not change unnecessarily
# weight 29.112
alg straw2
hash 0 # rjenkins1
item osd.16 weight 3.639
item osd.17 weight 3.639
item osd.18 weight 3.639
item osd.19 weight 3.639
item osd.20 weight 3.639
item osd.21 weight 3.639
item osd.22 weight 3.639
item osd.23 weight 3.639
}
host ceph4 {
id -9 # do not change unnecessarily
id -12 class hdd # do not change unnecessarily
# weight 29.112
alg straw2
hash 0 # rjenkins1
item osd.24 weight 3.639
item osd.25 weight 3.639
item osd.26 weight 3.639
item osd.27 weight 3.639
item osd.28 weight 3.639
item osd.29 weight 3.639
item osd.30 weight 3.639
item osd.31 weight 3.639
}
datacenter B {
id -4 # do not change unnecessarily
id -13 class hdd # do not change unnecessarily
# weight 58.218
alg straw2
hash 0 # rjenkins1
item ceph3 weight 29.109
item ceph4 weight 29.109
}
root default {
id -1 # do not change unnecessarily
id -14 class hdd # do not change unnecessarily
# weight 2.000
alg straw2
hash 0 # rjenkins1
item A weight 1.000
item B weight 1.000
}

# rules
rule replicated_rule {
id 0
type replicated
min_size 2
max_size 4
step take default
step take default
step choose firstn 2 type datacenter
step chooseleaf firstn 2 type host
step emit
}
rule erasure_hdd {
id 1
type erasure
min_size 3
max_size 5
step take default class hdd
step set_chooseleaf_tries 5
step set_choose_tries 100
step choose firstn 2 type datacenter
step chooseleaf firstn 2 type host
step emit
}
# end crush map

And have this pool setting:

root@ceph1:~# ceph osd dump | more
......
pool 15 'hdd_new' replicated size 4 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 5061 flags hashpspool stripe_width 0 application rbd
removed_snaps [1~3]

Would you please help on this? Thank you.
 
mon: 5 daemons, quorum ceph1,ceph2,ceph3,ceph4,hv4
You only need 3 MONs; additional MONs may be required for clusters in the 1000s of nodes. The same goes for the managers.

pools: 1 pools, 256 pgs
One pool with 32 OSDs and a replica of 4 should have around 1024 PGs. https://ceph.com/pgcalc/
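
For reference, a rough sketch of the pgcalc heuristic: aim for about 100 PGs per OSD and round up to a power of two. The helper name and the data_percent parameter are illustrative only, not part of any Ceph tool.

def suggested_pg_num(num_osds, replica_size, target_pgs_per_osd=100, data_percent=1.0):
    # total PG replicas the OSDs can carry, scaled by how much data this pool holds
    raw = num_osds * target_pgs_per_osd * data_percent / replica_size
    power = 1
    while power < raw:          # round up to the next power of two
        power *= 2
    return power

print(suggested_pg_num(num_osds=32, replica_size=4))   # -> 1024 for this cluster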

SIZE AVAIL RAW USED %RAW USED
116T 113T 3335G 2.80
Your crush rules split the cluster in half: 116/2 = 58, and with replica 4, 58/4 = 14.5. So in the end you have about 13 TB available for data.
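
A rough sketch of that arithmetic; the real MAX AVAIL figure (13631G) comes out somewhat lower because Ceph also accounts for the full ratio and the fullest OSD.

raw_tb = 116                       # total raw capacity from 'ceph df'
per_datacenter_tb = raw_tb / 2     # the crush rule splits the cluster in half -> 58
usable_tb = per_datacenter_tb / 4  # replica size 4 -> 14.5
print(usable_tb)                   # ~14.5 TB usable for data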

In the case of losing one datacenter, your cluster will go read-only, and the side with more MONs will be favored.

Here some more links to the topic(s).
https://forum.proxmox.com/threads/ceph-raw-usage-grows-by-itself.38395/#post-189842
https://forum.proxmox.com/threads/understanding-ceph-failure-modes.44038/#post-210936
https://forum.proxmox.com/threads/proxmox-ceph-cluster-advice.42891/#post-205912
 
You can check this for yourself: the target PG count per OSD is around 100, and the next power of two close to that is a good match. With 'ceph osd df tree' you will see the PGs on each OSD.
 

OK, but this brings up a confusing question: I increased pg_num and pgp_num to 1024 for the hdd_new Ceph pool,

data:
pools: 1 pools, 1024 pgs
objects: 200k objects, 802 GB


but then, when I create an additional backup pool, I get:

mon_command failed - pg_num 1024 size 3 would mean 7168 total pgs, which exceeds max 6400 (mon_max_pg_per_osd 200 * num_in_osds 32)

This is the most confusing part. How should I solve this problem?
 
Yes, because the PGs per OSD would go above 200. The 1024 was calculated for one pool on 32 OSDs; with multiple pools the numbers have to be divided by the amount of replicas and the share of data that will be placed on each pool. See the calculator linked above, where you can check the PG counts.
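
A small sketch of the check behind that error message: the monitors sum pg_num * size over all pools and compare it against mon_max_pg_per_osd * the number of "in" OSDs. The values below are taken from this thread.

mon_max_pg_per_osd = 200
num_in_osds = 32
limit = mon_max_pg_per_osd * num_in_osds        # 6400

existing_pool = 1024 * 4    # hdd_new: pg_num 1024, replica size 4 -> 4096
new_pool = 1024 * 3         # proposed backup pool: pg_num 1024, size 3 -> 3072
total = existing_pool + new_pool                # 7168

print(total, ">", limit, "->", total > limit)   # 7168 > 6400 -> True, so creation fails

# Sharing the PG budget between the two pools, e.g. 512 PGs each, stays well below the limit:
print(512 * 4 + 512 * 3)                        # 3584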
 
