I'll start at the beginning because I'm not sure where I screwed up.
I had originally set up Ceph and a data pool, and everything seemed to be working. However, I needed to change the CRUSH map to make sure copies couldn't be kept on the same server chassis in case of a power supply failure (I have 8 servers, two per chassis, with shared power supplies). Here's what I ran to make that happen:
Bash:
root@pves01:~# ceph osd crush add-bucket chasis1 row
added bucket chasis1 type row to crush map
root@pves01:~# ceph osd crush add-bucket chasis2 row
added bucket chasis2 type row to crush map
root@pves01:~# ceph osd crush add-bucket chasis3 row
added bucket chasis3 type row to crush map
root@pves01:~# ceph osd crush add-bucket chasis4 row
added bucket chasis4 type row to crush map
root@pves01:~# ceph osd crush move pves01 row=chasis1
moved item id -3 name 'pves01' to location {row=chasis1} in crush map
root@pves01:~# ceph osd crush move pves02 row=chasis1
moved item id -5 name 'pves02' to location {row=chasis1} in crush map
root@pves01:~# ceph osd crush move pves03 row=chasis2
moved item id -7 name 'pves03' to location {row=chasis2} in crush map
root@pves01:~# ceph osd crush move pves04 row=chasis2
moved item id -9 name 'pves04' to location {row=chasis2} in crush map
root@pves01:~# ceph osd crush move pves05 row=chasis3
moved item id -11 name 'pves05' to location {row=chasis3} in crush map
root@pves01:~# ceph osd crush move pves06 row=chasis3
moved item id -13 name 'pves06' to location {row=chasis3} in crush map
root@pves01:~# ceph osd crush move pves07 row=chasis4
moved item id -15 name 'pves07' to location {row=chasis4} in crush map
root@pves01:~# ceph osd crush move pves08 row=chasis4
moved item id -17 name 'pves08' to location {row=chasis4} in crush map
root@pves01:~# ceph osd crush rule create-replicated chasis_rule default row
root@pves01:~# ceph osd pool set CEPH-Pool crush_rule chasis_rule
set pool 2 crush_rule to chasis_rule
root@pves01:~#
After doing this, I verified that PGs for the data pool (named CEPH-Pool) were spread across the cluster as I expected, and they were. This cluster is still being commissioned so there was only a single test VM on the pool.
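(The check I did was roughly along these lines; I'm reconstructing it from memory, so treat it as an illustration rather than an exact transcript.)
Bash:
# List the PGs for the pool and eyeball the UP/ACTING sets to confirm that
# no two copies of a PG land in the same chassis row (illustrative check).
ceph pg ls-by-pool CEPH-Pool
# Confirm the pool is actually using the new rule
ceph osd pool get CEPH-Pool crush_rule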
From there, I decided I didn't need three copies of everything, since that was eating a little too much of my available space. I went into the web GUI (Ceph -> Pools) and edited the pool "CEPH-Pool" from the default size of 3 down to 2. That's where things began to break: under the performance monitoring, Ceph showed -166.667% for rebalancing, reported something like 29425/17655 objects misplaced, and the cluster storage now showed as full.
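(For what it's worth, I believe the GUI edit amounts to roughly the following on the CLI; I only used the web GUI, so this is shown purely for context.)
Bash:
# Roughly the CLI equivalent of the GUI size change, I believe; I made the
# change in the Proxmox web GUI, so this is only for context.
ceph osd pool set CEPH-Pool size 2
# The GUI may also adjust min_size; I left whatever it set.
ceph osd pool get CEPH-Pool min_size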
I dove into the console to try to figure something out, and after trying many, many things to force Ceph to delete the extra copies or rebalance, I decided to simply destroy the pool and start over. I couldn't delete it from the web GUI because it hung while looking for RBD images (due to not being able to access any data on the pool, I'm guessing), so I manually destroyed it with
"ceph osd pool rm CEPH-Pool CEPH-Pool --yes-i-really-really-mean-it". I removed it from the cluster storage locations and thought that was that. I then created a new pool with a size of 2 and my "chasis_rule" as the CRUSH rule (roughly what that looked like is sketched below), but all of its PGs stayed in "unknown" state and never got placed onto OSDs. I deleted that pool, again manually, and then noticed I still had errors of "15/9 objects misplaced." I tracked that down to the .mgr pool, which I hadn't touched up to this point. I found this post on how to delete and recreate the .mgr pool and followed it, successfully recreating the .mgr pool, but now with 1 PG stuck in "unknown."

I've tried destroying it and recreating it a few times now, but the PG never leaves "unknown" status, and trying to repair gives me "pg <#> has no primary osd". Any time I try to create a new pool, its PGs never get placed onto OSDs, which makes me think I somehow screwed up the CRUSH map or the OSD mapping. Can anyone tell me what went wrong and how to fix this?
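For reference, the new-pool attempts looked roughly like this (reconstructed from memory; the pg_num of 128 is a placeholder, not necessarily what I actually used):
Bash:
# Recreate the data pool and point it at the chassis rule (from memory;
# pg_num here is a placeholder).
ceph osd pool create CEPH-Pool 128
ceph osd pool set CEPH-Pool size 2
ceph osd pool set CEPH-Pool crush_rule chasis_rule
ceph osd pool application enable CEPH-Pool rbd
# The new pool's PGs then just sit in "unknown" and never map to any OSDs.
ceph pg dump_stuck inactive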
Here's my CRUSH map:
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class hdd
device 25 osd.25 class hdd
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class hdd
device 31 osd.31 class hdd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root
# buckets
root default {
    id -1       # do not change unnecessarily
    id -2 class hdd     # do not change unnecessarily
    # weight 0.00000
    alg straw2
    hash 0  # rjenkins1
}
host pves01 {
    id -3       # do not change unnecessarily
    id -4 class hdd     # do not change unnecessarily
    # weight 7.27759
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 1.81940
    item osd.1 weight 1.81940
    item osd.2 weight 1.81940
    item osd.3 weight 1.81940
}
host pves02 {
    id -5       # do not change unnecessarily
    id -6 class hdd     # do not change unnecessarily
    # weight 7.27759
    alg straw2
    hash 0  # rjenkins1
    item osd.4 weight 1.81940
    item osd.5 weight 1.81940
    item osd.6 weight 1.81940
    item osd.7 weight 1.81940
}
host pves03 {
    id -7       # do not change unnecessarily
    id -8 class hdd     # do not change unnecessarily
    # weight 7.27759
    alg straw2
    hash 0  # rjenkins1
    item osd.8 weight 1.81940
    item osd.9 weight 1.81940
    item osd.10 weight 1.81940
    item osd.11 weight 1.81940
}
host pves04 {
    id -9       # do not change unnecessarily
    id -10 class hdd    # do not change unnecessarily
    # weight 7.27759
    alg straw2
    hash 0  # rjenkins1
    item osd.12 weight 1.81940
    item osd.13 weight 1.81940
    item osd.14 weight 1.81940
    item osd.15 weight 1.81940
}
host pves05 {
    id -11      # do not change unnecessarily
    id -12 class hdd    # do not change unnecessarily
    # weight 7.27759
    alg straw2
    hash 0  # rjenkins1
    item osd.16 weight 1.81940
    item osd.17 weight 1.81940
    item osd.18 weight 1.81940
    item osd.19 weight 1.81940
}
host pves06 {
    id -13      # do not change unnecessarily
    id -14 class hdd    # do not change unnecessarily
    # weight 7.27759
    alg straw2
    hash 0  # rjenkins1
    item osd.20 weight 1.81940
    item osd.21 weight 1.81940
    item osd.22 weight 1.81940
    item osd.23 weight 1.81940
}
host pves07 {
    id -15      # do not change unnecessarily
    id -16 class hdd    # do not change unnecessarily
    # weight 7.27759
    alg straw2
    hash 0  # rjenkins1
    item osd.24 weight 1.81940
    item osd.25 weight 1.81940
    item osd.26 weight 1.81940
    item osd.27 weight 1.81940
}
host pves08 {
    id -17      # do not change unnecessarily
    id -18 class hdd    # do not change unnecessarily
    # weight 7.27759
    alg straw2
    hash 0  # rjenkins1
    item osd.28 weight 1.81940
    item osd.29 weight 1.81940
    item osd.30 weight 1.81940
    item osd.31 weight 1.81940
}
row chasis1 {
    id -19      # do not change unnecessarily
    id -26 class hdd    # do not change unnecessarily
    # weight 14.55518
    alg straw2
    hash 0  # rjenkins1
    item pves01 weight 7.27759
    item pves02 weight 7.27759
}
row chasis2 {
    id -20      # do not change unnecessarily
    id -25 class hdd    # do not change unnecessarily
    # weight 14.55518
    alg straw2
    hash 0  # rjenkins1
    item pves03 weight 7.27759
    item pves04 weight 7.27759
}
row chasis3 {
    id -21      # do not change unnecessarily
    id -24 class hdd    # do not change unnecessarily
    # weight 14.55518
    alg straw2
    hash 0  # rjenkins1
    item pves05 weight 7.27759
    item pves06 weight 7.27759
}
row chasis4 {
    id -22      # do not change unnecessarily
    id -23 class hdd    # do not change unnecessarily
    # weight 14.55518
    alg straw2
    hash 0  # rjenkins1
    item pves07 weight 7.27759
    item pves08 weight 7.27759
}

# rules
rule host_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule chasis_rule {
    id 1
    type replicated
    step take default
    step chooseleaf firstn 0 type row
    step emit
}
# end crush map
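In case it helps anyone reproduce this, I think the rule can be exercised offline against the compiled map with something like the following (file names are just placeholders):
Bash:
# Dump and decompile the current CRUSH map, then test chasis_rule (rule id 1
# per the map above) for 2 replicas; file names are placeholders.
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
crushtool -i crush.bin --test --rule 1 --num-rep 2 --show-mappings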
Here's the OSD tree:
Code:
root@pves01:~# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-22         14.55518  row chasis4
-15          7.27759      host pves07
 24    hdd   1.81940          osd.24           up   0.95001  1.00000
 25    hdd   1.81940          osd.25           up   1.00000  1.00000
 26    hdd   1.81940          osd.26           up   1.00000  1.00000
 27    hdd   1.81940          osd.27           up   1.00000  1.00000
-17          7.27759      host pves08
 28    hdd   1.81940          osd.28           up   1.00000  1.00000
 29    hdd   1.81940          osd.29           up   1.00000  1.00000
 30    hdd   1.81940          osd.30           up   1.00000  1.00000
 31    hdd   1.81940          osd.31           up   1.00000  1.00000
-21         14.55518  row chasis3
-11          7.27759      host pves05
 16    hdd   1.81940          osd.16           up   1.00000  1.00000
 17    hdd   1.81940          osd.17           up   0.95001  1.00000
 18    hdd   1.81940          osd.18           up   1.00000  1.00000
 19    hdd   1.81940          osd.19           up   1.00000  1.00000
-13          7.27759      host pves06
 20    hdd   1.81940          osd.20           up   1.00000  1.00000
 21    hdd   1.81940          osd.21           up   1.00000  1.00000
 22    hdd   1.81940          osd.22           up   1.00000  1.00000
 23    hdd   1.81940          osd.23           up   1.00000  1.00000
-20         14.55518  row chasis2
 -7          7.27759      host pves03
  8    hdd   1.81940          osd.8            up   0.95001  1.00000
  9    hdd   1.81940          osd.9            up   1.00000  1.00000
 10    hdd   1.81940          osd.10           up   1.00000  1.00000
 11    hdd   1.81940          osd.11           up   1.00000  1.00000
 -9          7.27759      host pves04
 12    hdd   1.81940          osd.12           up   1.00000  1.00000
 13    hdd   1.81940          osd.13           up   1.00000  1.00000
 14    hdd   1.81940          osd.14           up   1.00000  1.00000
 15    hdd   1.81940          osd.15           up   1.00000  1.00000
-19         14.55518  row chasis1
 -3          7.27759      host pves01
  0    hdd   1.81940          osd.0            up   1.00000  1.00000
  1    hdd   1.81940          osd.1            up   1.00000  1.00000
  2    hdd   1.81940          osd.2            up   1.00000  1.00000
  3    hdd   1.81940          osd.3            up   1.00000  1.00000
 -5          7.27759      host pves02
  4    hdd   1.81940          osd.4            up   1.00000  1.00000
  5    hdd   1.81940          osd.5            up   1.00000  1.00000
  6    hdd   1.81940          osd.6            up   1.00000  1.00000
  7    hdd   1.81940          osd.7            up   0.95001  1.00000
 -1          0         root default
root@pves01:~#
And here's the stuck PG:
Code:
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP LAST_SCRUB_DURATION SCRUB_SCHEDULING
6.0 0 0 0 0 0 0 0 0 0 unknown 6m 0'0 0:0 []p-1 []p-1 2025-03-03T13:26:18.239533-0600 2025-03-03T13:26:18.239533-0600 0 --
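For reference, these are roughly the commands I've been using to poke at the stuck PG (reconstructed from memory, so the exact invocations may have differed):
Bash:
# Where does the PG map, and who is the primary?
ceph pg map 6.0
# Trying a repair is what returns "pg 6.0 has no primary osd" for me.
ceph pg repair 6.0
# Overall health detail for the unknown PG
ceph health detail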
Any thoughts? Thanks in advance.