Ceph CRUSH Map for 2-Chassis, 8-Node Cluster

Jan 13, 2023
I have two chassis with four nodes each, and each node will have 4 OSDs. I'm trying to write a CRUSH rule that ensures all three copies don't end up in the same chassis. I modified the existing rule as shown below, but half the PGs went "unknown" until I put it back. I fear I just don't understand choose vs. chooseleaf.

Before change:
Code:
rule replicated_nvme {
        id 2
        type replicated
        min_size 1
        max_size 10
        step take default class nvme
        step chooseleaf firstn 0 type host
        step emit
}

After change:
Code:
rule replicated_nvme {
        id 2
        type replicated
        min_size 1
        max_size 10
        step take default class nvme
        step choose firstn 2 type chassis
        step chooseleaf firstn 0 type host
        step emit
}
 
Please post your complete ceph.conf, your crushmap, and the output of ceph osd tree.
Please also post the steps you took that led to the "half the PGs are unknown" error.
 
ceph.conf
Code:
[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster_network = 192.168.88.1/24
         fsid = 8f36590c-ac16-481b-ac8d-b9e6cb8e17e1
         mon_allow_pool_delete = true
         mon_host = 192.168.88.1 192.168.88.2 192.168.88.3 192.168.88.4 192.168.88.8 192.168.88.7
         ms_bind_ipv4 = true
         ms_bind_ipv6 = false
         osd_pool_default_min_size = 2
         osd_pool_default_size = 3
         public_network = 192.168.88.1/24

[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
         keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.pve-11]
         host = pve-11
         mds_standby_for_name = pve

[mds.pve-12]
         host = pve-12
         mds_standby_for_name = pve

[mds.pve-13]
         host = pve-13
         mds_standby_for_name = pve

[mds.pve-14]
         host = pve-14
         mds_standby_for_name = pve

[mon.pve-07]
         public_addr = 192.168.88.7

[mon.pve-08]
         public_addr = 192.168.88.8

[mon.pve-11]
         public_addr = 192.168.88.1

[mon.pve-12]
         public_addr = 192.168.88.2

[mon.pve-13]
         public_addr = 192.168.88.3

[mon.pve-14]
         public_addr = 192.168.88.4

crushmap
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd
device 16 osd.16 class nvme
device 17 osd.17 class nvme
device 18 osd.18 class nvme
device 19 osd.19 class nvme
device 20 osd.20 class nvme
device 21 osd.21 class nvme
device 22 osd.22 class nvme
device 23 osd.23 class nvme
device 24 osd.24 class nvme
device 25 osd.25 class nvme
device 26 osd.26 class nvme
device 27 osd.27 class nvme
device 28 osd.28 class nvme
device 29 osd.29 class nvme
device 30 osd.30 class nvme
device 31 osd.31 class nvme

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host pve-07 {
        id -16          # do not change unnecessarily
        id -17 class ssd                # do not change unnecessarily
        id -18 class nvme               # do not change unnecessarily
        # weight 12.798
        alg straw2
        hash 0  # rjenkins1
        item osd.24 weight 2.906
        item osd.25 weight 2.906
        item osd.8 weight 3.493
        item osd.12 weight 3.493
}
host pve-08 {
        id -19          # do not change unnecessarily
        id -20 class ssd                # do not change unnecessarily
        id -21 class nvme               # do not change unnecessarily
        # weight 5.812
        alg straw2
        hash 0  # rjenkins1
        item osd.26 weight 2.906
        item osd.27 weight 2.906
}
host pve-09 {
        id -22          # do not change unnecessarily
        id -23 class ssd                # do not change unnecessarily
        id -24 class nvme               # do not change unnecessarily
        # weight 5.812
        alg straw2
        hash 0  # rjenkins1
        item osd.28 weight 2.906
        item osd.29 weight 2.906
}
host pve-10 {
        id -25          # do not change unnecessarily
        id -26 class ssd                # do not change unnecessarily
        id -27 class nvme               # do not change unnecessarily
        # weight 5.812
        alg straw2
        hash 0  # rjenkins1
        item osd.30 weight 2.906
        item osd.31 weight 2.906
}
host pve-11 {
        id -3           # do not change unnecessarily
        id -4 class ssd         # do not change unnecessarily
        id -11 class nvme               # do not change unnecessarily
        # weight 12.798
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 3.493
        item osd.4 weight 3.493
        item osd.16 weight 2.906
        item osd.20 weight 2.906
}
host pve-12 {
        id -5           # do not change unnecessarily
        id -6 class ssd         # do not change unnecessarily
        id -12 class nvme               # do not change unnecessarily
        # weight 19.784
        alg straw2
        hash 0  # rjenkins1
        item osd.5 weight 3.493
        item osd.9 weight 3.493
        item osd.13 weight 3.493
        item osd.1 weight 3.493
        item osd.17 weight 2.906
        item osd.21 weight 2.906
}
host pve-13 {
        id -7           # do not change unnecessarily
        id -8 class ssd         # do not change unnecessarily
        id -13 class nvme               # do not change unnecessarily
        # weight 19.784
        alg straw2
        hash 0  # rjenkins1
        item osd.2 weight 3.493
        item osd.6 weight 3.493
        item osd.10 weight 3.493
        item osd.14 weight 3.493
        item osd.18 weight 2.906
        item osd.22 weight 2.906
}
host pve-14 {
        id -9           # do not change unnecessarily
        id -10 class ssd                # do not change unnecessarily
        id -14 class nvme               # do not change unnecessarily
        # weight 19.784
        alg straw2
        hash 0  # rjenkins1
        item osd.3 weight 3.493
        item osd.7 weight 3.493
        item osd.11 weight 3.493
        item osd.15 weight 3.493
        item osd.19 weight 2.906
        item osd.23 weight 2.906
}
root default {
        id -1           # do not change unnecessarily
        id -2 class ssd         # do not change unnecessarily
        id -15 class nvme               # do not change unnecessarily
        # weight 102.382
        alg straw2
        hash 0  # rjenkins1
        item pve-07 weight 12.798
        item pve-08 weight 5.812
        item pve-09 weight 5.812
        item pve-10 weight 5.812
        item pve-11 weight 12.798
        item pve-12 weight 19.784
        item pve-13 weight 19.784
        item pve-14 weight 19.784
}
chassis chassis33 {
        id -29          # do not change unnecessarily
        id -33 class ssd                # do not change unnecessarily
        id -35 class nvme               # do not change unnecessarily
        # weight 0.000
        alg straw2
        hash 0  # rjenkins1
}
chassis chassis35 {
        id -30          # do not change unnecessarily
        id -31 class ssd                # do not change unnecessarily
        id -32 class nvme               # do not change unnecessarily
        # weight 0.000
        alg straw2
        hash 0  # rjenkins1
}
rack rack30 {
        id -28          # do not change unnecessarily
        id -34 class ssd                # do not change unnecessarily
        id -36 class nvme               # do not change unnecessarily
        # weight 0.000
        alg straw2
        hash 0  # rjenkins1
        item chassis33 weight 0.000
        item chassis35 weight 0.000
}

# rules
rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
rule replicated_ssd {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 0 type host
        step emit
}
rule replicated_nvme {
        id 2
        type replicated
        min_size 1
        max_size 10
        step take default class nvme
        step chooseleaf firstn 0 type host
        step emit
}

ceph osd tree
Code:
ID   CLASS  WEIGHT     TYPE NAME              STATUS  REWEIGHT  PRI-AFF
-28                 0  rack rack30
-29                 0      chassis chassis33
-30                 0      chassis chassis35
 -1         102.38208  root default
-16          12.79776      host pve-07
 24   nvme    2.90579          osd.24             up   1.00000  1.00000
 25   nvme    2.90579          osd.25             up   1.00000  1.00000
  8    ssd    3.49309          osd.8              up   1.00000  1.00000
 12    ssd    3.49309          osd.12             up   1.00000  1.00000
-19           5.81158      host pve-08
 26   nvme    2.90579          osd.26             up   1.00000  1.00000
 27   nvme    2.90579          osd.27             up   1.00000  1.00000
-22           5.81158      host pve-09
 28   nvme    2.90579          osd.28             up   1.00000  1.00000
 29   nvme    2.90579          osd.29             up   1.00000  1.00000
-25           5.81158      host pve-10
 30   nvme    2.90579          osd.30             up   1.00000  1.00000
 31   nvme    2.90579          osd.31             up   1.00000  1.00000
 -3          12.79776      host pve-11
 16   nvme    2.90579          osd.16             up   1.00000  1.00000
 20   nvme    2.90579          osd.20             up   1.00000  1.00000
  0    ssd    3.49309          osd.0              up   1.00000  1.00000
  4    ssd    3.49309          osd.4              up   1.00000  1.00000
 -5          19.78394      host pve-12
 17   nvme    2.90579          osd.17             up   1.00000  1.00000
 21   nvme    2.90579          osd.21             up   1.00000  1.00000
  1    ssd    3.49309          osd.1              up   1.00000  1.00000
  5    ssd    3.49309          osd.5              up   1.00000  1.00000
  9    ssd    3.49309          osd.9              up   1.00000  1.00000
 13    ssd    3.49309          osd.13             up   1.00000  1.00000
 -7          19.78394      host pve-13
 18   nvme    2.90579          osd.18             up   1.00000  1.00000
 22   nvme    2.90579          osd.22             up   1.00000  1.00000
  2    ssd    3.49309          osd.2              up   1.00000  1.00000
  6    ssd    3.49309          osd.6              up   1.00000  1.00000
 10    ssd    3.49309          osd.10             up   1.00000  1.00000
 14    ssd    3.49309          osd.14             up   1.00000  1.00000
 -9          19.78394      host pve-14
 19   nvme    2.90579          osd.19             up   1.00000  1.00000
 23   nvme    2.90579          osd.23             up   1.00000  1.00000
  3    ssd    3.49309          osd.3              up   1.00000  1.00000
  7    ssd    3.49309          osd.7              up   1.00000  1.00000
 11    ssd    3.49309          osd.11             up   1.00000  1.00000
 15    ssd    3.49309          osd.15             up   1.00000  1.00000

The error began when I changed the replicated_nvme rule exactly as shown in my first post: I added the line "step choose firstn 2 type chassis" above the existing "step chooseleaf firstn 0 type host" step. Half the PGs went "unknown" until I reverted the change.
 
You don't have any elements in your chassis buckets. The root should be at the top, then rack, then chassis, then hosts, then OSDs, like this:


Code:
 -1         102.38208  root default
-28                 0      rack rack30
-29                 0          chassis chassis33
-16          12.79776              host pve-07
 24   nvme    2.90579                  osd.24     up   1.00000  1.00000
 25   nvme    2.90579                  osd.25     up   1.00000  1.00000
  8    ssd    3.49309                  osd.8      up   1.00000  1.00000

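For this specific cluster, building that hierarchy could look something like the sketch below. Which four hosts sit in which physical chassis is my assumption here, so swap the assignments to match the real hardware:

Code:
# nest the currently empty rack/chassis buckets under the default root
ceph osd crush move rack30 root=default
ceph osd crush move chassis33 rack=rack30
ceph osd crush move chassis35 rack=rack30
# move each host (and its OSDs) into its chassis; the split is assumed
ceph osd crush move pve-07 chassis=chassis33
ceph osd crush move pve-08 chassis=chassis33
ceph osd crush move pve-09 chassis=chassis33
ceph osd crush move pve-10 chassis=chassis33
ceph osd crush move pve-11 chassis=chassis35
ceph osd crush move pve-12 chassis=chassis35
ceph osd crush move pve-13 chassis=chassis35
ceph osd crush move pve-14 chassis=chassis35
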
See https://docs.ceph.com/en/quincy/rados/operations/crush-map/: you can move all entities into the fitting place before even changing the CRUSH rule. I would also recommend creating another CRUSH rule instead of changing the existing one, so you can easily switch rules via the UI in the pool edit view.

Code:
ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
for example (taken from the Ceph docs). Change it according to your needs, check the osd tree again, and once everything looks right, switch rules. Good luck!
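
If you do add a separate rule, the usual pattern for spreading size=3 across two chassis looks roughly like the sketch below; the rule name and the id 3 are placeholders, not something taken from your map:

Code:
rule replicated_nvme_chassis {
        id 3
        type replicated
        min_size 1
        max_size 10
        step take default class nvme
        # pick 2 distinct chassis buckets
        step choose firstn 2 type chassis
        # then up to 2 OSDs on distinct hosts inside each chosen chassis
        step chooseleaf firstn 2 type host
        step emit
}

With pool size 3 this maps 2 copies to one chassis and 1 to the other, so no chassis ever holds all replicas. You can also dry-run a compiled map before injecting it, for example: crushtool -i crushmap.bin --test --rule 3 --num-rep 3 --show-mappings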
 