CEPH: osd_pool_default_size > # nodes possible?

leveche

New Member
Feb 28, 2023
Hello All,


I have allocated two nodes in my PVE cluster for storage. The idea is to use CephFS wherever possible, and possibly to set up NFS-Ganesha or Samba exports in containers on the storage nodes for applications that need them.

I don't particularly trust the disks I have now: all are out of warranty, an eclectic collection ranging from 10TB 'enterprise' Seagates from 2017 to 3TB WD Reds from the early 2010s. Since this is a lab setup, I'd like to burn through what I have before I start purchasing replacements. On the other hand, I'd like to avoid losing data, since restoring from backup is a hassle.

I thought that by setting osd_pool_default_size = 6 with the default CRUSH rule I'd get a three-way mirror on each of the two storage nodes. Not very efficient or fast, but at least redundant enough. It seems I thought wrong: after creating a pool, all of its PGs sit in the active+undersized state, and ceph pg ls shows that only two OSDs are used for allocation, e.g.:
Code:
PG    OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES    OMAP_BYTES*  OMAP_KEYS*  LOG   STATE                  SINCE  VERSION   REPORTED  UP          ACTING            SCRUB_STAMP                      DEEP_SCRUB_STAMP                 LAST_SCRUB_DURATION  SCRUB_SCHEDULING                                          
...
2.0         0         0          0        0        0            0           0     0      active+undersized     7h       0'0    167:42   [17,5]p17         [17,5]p17  2023-02-28T10:10:55.591539-0700  2023-02-28T10:10:55.591539-0700                    0  periodic scrub scheduled @ 2023-03-01T21:02:41.807648+0000
...

I pasted my config below. Do I need to modify the crush rule? Or am I missing something else here?
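
If a rule change is indeed what's needed, this is roughly what I imagine it would look like; the rule name and id below are just placeholders and I haven't tested any of this, so corrections are welcome:
Code:
# untested sketch: pick two hosts, then three OSDs within each, for six replicas total
rule replicated_3_per_host {
    id 1
    type replicated
    step take default
    step choose firstn 2 type host
    step chooseleaf firstn 3 type osd
    step emit
}

Before injecting anything I would of course check the mappings offline with crushtool, along the lines of:
Code:
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt, then recompile and test the new rule
crushtool -c crush.txt -o crush.new
crushtool -i crush.new --test --rule 1 --num-rep 6 --show-mappings
# and only if the mappings look sane: ceph osd setcrushmap -i crush.new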

If I do need to touch the crush rule, one approach I'm considering is to add another bucket type, controller, between osd and host in the hierarchy. Is my understanding correct that the integer type identifier is not meaningful, and if I add e.g.

Code:
type 12 controller

it would not imply that controller becomes the root of the hierarchy?
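
To make that concrete, the hierarchy I have in mind would look roughly like this (the bucket name, the placeholder id, and the grouping of OSDs are made up for illustration):
Code:
# hypothetical excerpt: OSDs grouped under a per-controller bucket inside a host
controller fsm1-ctrl0 {
    id -100        # placeholder; would need a unique negative id
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 5.45799
    item osd.3 weight 3.63869
}
host fsm1 {
    id -3        # keeping the existing id from the map below
    alg straw2
    hash 0    # rjenkins1
    item fsm1-ctrl0 weight 9.09668
}

The intent would be to let a rule chooseleaf at type controller, so that replicas landing on the same host still end up behind different controllers.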

Thanks
leveche

Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 169.254.121.12/24
     mon_allow_pool_delete = true
     mon_host = 169.254.121.12 169.254.121.11 169.254.121.118
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 4
     osd_pool_default_size = 6
     public_network = 169.254.121.12/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.fsm1]
     public_addr = 169.254.121.11

[mon.fsm2]
     public_addr = 169.254.121.118

[mon.fsm3]
     public_addr = 169.254.121.12

Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host fsm1 {
    id -3        # do not change unnecessarily
    id -4 class hdd        # do not change unnecessarily
    # weight 32.74811
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 5.45799
    item osd.3 weight 3.63869
    item osd.5 weight 3.63869
    item osd.7 weight 3.63869
    item osd.9 weight 3.63869
    item osd.10 weight 5.45799
    item osd.11 weight 3.63869
    item osd.12 weight 3.63869
}
host fsm3 {
    id -5        # do not change unnecessarily
    id -6 class hdd        # do not change unnecessarily
    # weight 41.71887
    alg straw2
    hash 0    # rjenkins1
    item osd.1 weight 4.54839
    item osd.2 weight 3.51369
    item osd.4 weight 2.72899
    item osd.6 weight 2.72899
    item osd.8 weight 4.54839
    item osd.13 weight 4.54839
    item osd.14 weight 3.63869
    item osd.15 weight 3.63869
    item osd.16 weight 9.09569
    item osd.17 weight 2.72899
}
root default {
    id -1        # do not change unnecessarily
    id -2 class hdd        # do not change unnecessarily
    # weight 74.46698
    alg straw2
    hash 0    # rjenkins1
    item fsm1 weight 32.74811
    item fsm3 weight 41.71887
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
 
Your reply lacks specificity, and hence is of very limited usefulness.

And let me qualify: I have two storage nodes dedicated to OSDs. As you can see in the posted config, I have three nodes for MON and MDS, with the capacity to grow to, say, five nodes for the latter, but only two chassis with multiple storage bays.
 