I’m running a Ceph cluster in a Proxmox environment (3 nodes) and am having issues with the cluster not finishing its rebuild. A large number of PGs remain stuck in active+clean+remapped, and a high percentage of objects are reported as misplaced but never seem to move. I’d greatly appreciate any advice or pointers on what to do next.
The Problem
• The majority of PGs in the cluster are either healthy (active+clean) or stuck in a remapped state (active+clean+remapped).
• Over 70% of the objects are currently reported as misplaced, and the cluster doesn’t seem to be making progress on moving them (commands to list the affected PGs are just below this list).
• The cluster has been in this state for a couple of days, and I’m unsure how to kick it back into rebalancing properly.
• There’s also a health warning about “151 pgs not deep-scrubbed in time,” but the bigger issue seems to be the incomplete rebuild/remap.
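The counts above come from ceph -s (full output further down). If a per-PG view would help, I can also attach a listing of the remapped PGs, which I would pull with something like this:
Code:
# list PGs currently remapped (acting set differs from the up set CRUSH wants)
ceph pg ls remapped
# alternative view: anything that has been stuck unclean for a while
ceph pg dump_stuck unclean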
What I’ve Tried
1. Ensuring all OSDs are in and up.
2. Checking for network issues or disk I/O bottlenecks.
3. Verifying CRUSH rules look correct (they appear OK, but I’m open to suggestions).
4. Monitoring for any stuck OSD operations or slow requests (haven’t seen anything conclusive).
Despite these checks (rough commands below), the cluster remains in this half-rebuilt state.
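In case the exact invocations matter, this is approximately what I ran for the checks above (flags may have differed slightly):
Code:
# 1. all OSDs up/in, none flapping
ceph osd tree
ceph osd df tree
# 2. spot checks for network and disk I/O bottlenecks on each node
iperf3 -c <other-node>
iostat -x 5 3
# 3. CRUSH rules and pool settings (full dumps at the end of this post)
ceph osd crush rule dump
ceph osd pool ls detail
# 4. slow requests / per-OSD latency
ceph health detail
ceph osd perf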
Request for Help
• Has anyone encountered a similar issue where Ceph would not finish rebalancing due to large amounts of data being stuck in active+clean+remapped?
• What additional configs or logs should I post or check to diagnose the bottleneck or misconfiguration?
• Are there any recommended ceph.conf tweaks or commands (like ceph tell osd.* injectargs ...) that might help speed up or unlock the rebalancing? (A sketch of what I have in mind is just below this list.)
• Could the erasure-coding rules or CRUSH map weighting be causing excessive backfilling or an imbalance?
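To expand on the third question: what I have in mind (but have not applied yet, since I’m not sure it’s the right lever here) is checking the balancer and nudging the backfill/recovery throttles. This is only a sketch, assuming osd_max_backfills / osd_recovery_max_active are still the relevant knobs on my release:
Code:
# is the balancer enabled, and in which mode (upmap / crush-compat)?
ceph balancer status
# modestly raise backfill/recovery throttles (example values only)
ceph config set osd osd_max_backfills 2
ceph config set osd osd_recovery_max_active 3
# older injectargs-style equivalent of the same change
ceph tell osd.* injectargs '--osd-max-backfills 2 --osd-recovery-max-active 3'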
Any tips, best practices, or next steps would be greatly appreciated. If you need more info (e.g., Proxmox/ceph.conf settings, logs, or hardware details), let me know, and I’ll provide it.
Thanks in advance for any guidance you can offer!
Below are some details and outputs from my cluster:
Code:
root@pve1:~# ceph -s
  cluster:
    id:     c6fa9005-6814-40af-a7a9-dc09cfcf21cf
    health: HEALTH_WARN
            151 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum pve4,pve3,pve1 (age 45h)
    mgr: pve1(active, since 38h), standbys: pve4, pve3
    mds: 1/1 daemons up, 2 standby
    osd: 33 osds: 33 up (since 38h), 33 in (since 2d); 174 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 687 pgs
    objects: 37.90M objects, 144 TiB
    usage:   230 TiB used, 189 TiB / 420 TiB avail
    pgs:     320831397/432349875 objects misplaced (74.206%)
             511 active+clean
             166 active+clean+remapped
             5   active+clean+remapped+scrubbing
             3   active+clean+remapped+scrubbing+deep
             1   active+clean+scrubbing
             1   active+clean+scrubbing+deep

  io:
    client: 1.4 MiB/s rd, 54 MiB/s wr, 16 op/s rd, 54 op/s wr
Code:
ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    403 TiB  177 TiB  226 TiB   226 TiB      56.09
nvme    16 TiB   13 TiB  3.5 TiB   3.5 TiB      21.32
TOTAL  420 TiB  190 TiB  230 TiB   230 TiB      54.73

--- POOLS ---
POOL                 ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                  1    1   73 MiB       20  218 MiB      0    3.2 TiB
nvme                  2  128  1.7 TiB  457.27k  3.5 TiB  26.37    4.8 TiB
hdd                   3  174  135 TiB   35.57M  203 TiB  74.89     45 TiB
cephfs_metadata       6  128  224 MiB   71.17k  672 MiB      0    3.2 TiB
hdd_replicated        7  128  6.5 TiB    1.72M   13 TiB  16.12     34 TiB
.rgw.root             8   32  1.4 KiB        4   48 KiB      0    3.2 TiB
default.rgw.log       9   32  3.6 KiB      209  408 KiB      0    3.2 TiB
default.rgw.control  10   32      0 B        8      0 B      0    3.2 TiB
default.rgw.meta     11   32      0 B        0      0 B      0    3.2 TiB
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 200
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class nvme
device 1 osd.1 class nvme
device 2 osd.2 class nvme
device 3 osd.3 class nvme
device 4 osd.4 class nvme
device 5 osd.5 class nvme
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class nvme
device 23 osd.23 class nvme
device 24 osd.24 class nvme
device 25 osd.25 class hdd
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class hdd
device 31 osd.31 class hdd
device 32 osd.32 class hdd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root
# buckets
host pve4 {
id -3 # do not change unnecessarily
id -4 class nvme # do not change unnecessarily
id -7 class hdd # do not change unnecessarily
# weight 136.42641
alg straw2
hash 0 # rjenkins1
item osd.0 weight 1.81926
item osd.1 weight 1.81926
item osd.2 weight 1.81926
item osd.6 weight 16.37108
item osd.7 weight 16.37108
item osd.8 weight 16.37108
item osd.9 weight 16.37108
item osd.10 weight 16.37108
item osd.11 weight 16.37108
item osd.12 weight 16.37108
item osd.13 weight 16.37108
}
host pve1 {
id -5 # do not change unnecessarily
id -6 class nvme # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
# weight 136.42641
alg straw2
hash 0 # rjenkins1
item osd.3 weight 1.81926
item osd.4 weight 1.81926
item osd.5 weight 1.81926
item osd.25 weight 16.37108
item osd.26 weight 16.37108
item osd.27 weight 16.37108
item osd.28 weight 16.37108
item osd.29 weight 16.37108
item osd.30 weight 16.37108
item osd.31 weight 16.37108
item osd.32 weight 16.37108
}
host pve3 {
id -10 # do not change unnecessarily
id -11 class nvme # do not change unnecessarily
id -12 class hdd # do not change unnecessarily
# weight 136.42641
alg straw2
hash 0 # rjenkins1
item osd.14 weight 16.37108
item osd.15 weight 16.37108
item osd.16 weight 16.37108
item osd.17 weight 16.37108
item osd.18 weight 16.37108
item osd.19 weight 16.37108
item osd.20 weight 16.37108
item osd.21 weight 16.37108
item osd.22 weight 1.81926
item osd.23 weight 1.81926
item osd.24 weight 1.81926
}
root default {
id -1 # do not change unnecessarily
id -2 class nvme # do not change unnecessarily
id -9 class hdd # do not change unnecessarily
# weight 409.27972
alg straw2
hash 0 # rjenkins1
item pve4 weight 136.42656
item pve1 weight 136.42653
item pve3 weight 136.42662
}
# rules
rule replicated_rule {
id 0
type replicated
step take default class nvme
step chooseleaf firstn 0 type host
step emit
}
rule ec_rule_hdd {
id 1
type erasure
step set_chooseleaf_tries 100
step set_choose_tries 200
step take default class hdd
step chooseleaf indep 3 type host
step emit
}
rule replicated_rule_hdd {
id 4
type replicated
step take default class hdd
step chooseleaf firstn 3 type host
step emit
}
# end crush map
Code:
ceph osd erasure-code-profile get ec-profile-hd
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=8
m=4
plugin=jerasure
technique=reed_sol_van
w=8
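If it helps, I can also post how each pool maps onto these rules and profiles; I would pull that with something like:
Code:
# per-pool crush_rule, erasure_code_profile, size/min_size, pg_num, etc.
ceph osd pool ls detail
# settings of the pool I believe is EC-backed (named 'hdd' above)
ceph osd pool get hdd all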