Ceph Cluster Won’t Finish Rebuild (Many PGs Stuck in active+clean+remapped) — Need Help!

ralphte
New Member · Mar 16, 2024
I’m running a Ceph cluster in a 3-node Proxmox environment, and the cluster won’t finish its rebuild. A large number of PGs remain stuck in active+clean+remapped, and a high percentage of objects are misplaced and just won’t move. I’d greatly appreciate any advice or pointers on what to do next.

The Problem

• The majority of PGs are either healthy (active+clean) or stuck in a remapped state (active+clean+remapped).

• Roughly 74% of the objects are currently misplaced (see the ceph -s output below), and the number doesn’t seem to decrease.

• The cluster has been in this state for a couple of days, and I’m unsure how to kick it back into rebalancing properly.

• There’s also a health warning about “151 pgs not deep-scrubbed in time,” but the bigger issue seems to be the incomplete rebuild/remap. The commands I’ve been using to watch the remapped PGs are right below this list.
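
For what it’s worth, this is roughly how I’ve been checking the stuck PGs (nothing exotic, just the standard status commands):

Code:
# list PGs currently in a remapped state, with their up/acting OSD sets
ceph pg ls remapped | head -20

# one-line summary that includes the misplaced-object percentage
ceph pg stat

# full health detail (also lists the not-deep-scrubbed PGs)
ceph health detail | head -40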

What I’ve Tried

1. Ensuring all OSDs are in and up.

2. Checking for network issues or disk I/O bottlenecks.

3. Verifying CRUSH rules look correct (they appear OK, but I’m open to suggestions).

4. Monitoring for any stuck OSD operations or slow requests (haven’t seen anything conclusive). A sketch of the exact checks I ran is just below.

Despite these checks, the cluster remains in this half-rebuilt state.
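
For reference, these are the checks behind points 1, 2 and 4 (osd.6 is just an example ID; the daemon command has to be run on the node that hosts that OSD):

Code:
# confirm every OSD is up/in and compare per-OSD utilisation and PG counts
ceph osd df tree

# commit/apply latency per OSD, a rough way to spot a single slow disk
ceph osd perf

# on the node hosting osd.6 (example ID), look for blocked or long-running ops
ceph daemon osd.6 dump_ops_in_flight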

Request for Help

Has anyone encountered a similar issue where Ceph would not finish rebalancing due to large amounts of data being stuck in active+clean+remapped?

What additional configs or logs should I post or check to diagnose the bottleneck or misconfiguration?
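
Happy to post any of the following outputs if they would help:

Code:
ceph osd tree
ceph balancer status
ceph osd pool ls detail
ceph config dump | grep -iE 'backfill|recovery'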

Are there any recommended ceph.conf tweaks or commands (like ceph tell osd.* injectargs ...) that might help speed up or unlock the rebalancing?
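
For example, this is the kind of thing I had in mind; the values here are only guesses on my part, not settings I have verified:

Code:
# raise backfill/recovery concurrency at runtime (example values only)
ceph tell 'osd.*' injectargs '--osd-max-backfills=2 --osd-recovery-max-active=4'

# or persist the same settings via the config database
ceph config set osd osd_max_backfills 2
ceph config set osd osd_recovery_max_active 4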

Could the erasure-coding rules or CRUSH map weighting be causing excessive backfilling or an imbalance?
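
If it helps, I can also run a placement simulation against the compiled CRUSH map, something along these lines (rule id 1 is my ec_rule_hdd, and 12 is k+m from the EC profile, assuming the hdd pool actually uses that profile):

Code:
# export the current CRUSH map and simulate placements for the EC rule
ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -i /tmp/crushmap.bin --test --rule 1 --num-rep 12 --show-mappings | head -20
crushtool -i /tmp/crushmap.bin --test --rule 1 --num-rep 12 --show-bad-mappings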

Any tips, best practices, or next steps would be greatly appreciated. If you need more info (e.g., Proxmox/ceph.conf settings, logs, or hardware details), let me know, and I’ll provide it.

Thanks in advance for any guidance you can offer!

Below are some details and outputs from my cluster:

Code:
root@pve1:~# ceph -s
  cluster:
    id:     c6fa9005-6814-40af-a7a9-dc09cfcf21cf
    health: HEALTH_WARN
            151 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum pve4,pve3,pve1 (age 45h)
    mgr: pve1(active, since 38h), standbys: pve4, pve3
    mds: 1/1 daemons up, 2 standby
    osd: 33 osds: 33 up (since 38h), 33 in (since 2d); 174 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 687 pgs
    objects: 37.90M objects, 144 TiB
    usage:   230 TiB used, 189 TiB / 420 TiB avail
    pgs:     320831397/432349875 objects misplaced (74.206%)
             511 active+clean
             166 active+clean+remapped
             5   active+clean+remapped+scrubbing
             3   active+clean+remapped+scrubbing+deep
             1   active+clean+scrubbing
             1   active+clean+scrubbing+deep

  io:
    client:   1.4 MiB/s rd, 54 MiB/s wr, 16 op/s rd, 54 op/s wr

ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    403 TiB  177 TiB  226 TiB   226 TiB      56.09
nvme    16 TiB   13 TiB  3.5 TiB   3.5 TiB      21.32
TOTAL  420 TiB  190 TiB  230 TiB   230 TiB      54.73

--- POOLS ---
POOL                 ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                  1    1   73 MiB       20  218 MiB      0    3.2 TiB
nvme                  2  128  1.7 TiB  457.27k  3.5 TiB  26.37    4.8 TiB
hdd                   3  174  135 TiB   35.57M  203 TiB  74.89     45 TiB
cephfs_metadata       6  128  224 MiB   71.17k  672 MiB      0    3.2 TiB
hdd_replicated        7  128  6.5 TiB    1.72M   13 TiB  16.12     34 TiB
.rgw.root             8   32  1.4 KiB        4   48 KiB      0    3.2 TiB
default.rgw.log       9   32  3.6 KiB      209  408 KiB      0    3.2 TiB
default.rgw.control  10   32      0 B        8      0 B      0    3.2 TiB
default.rgw.meta     11   32      0 B        0      0 B      0    3.2 TiB

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 200
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class nvme
device 1 osd.1 class nvme
device 2 osd.2 class nvme
device 3 osd.3 class nvme
device 4 osd.4 class nvme
device 5 osd.5 class nvme
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class nvme
device 23 osd.23 class nvme
device 24 osd.24 class nvme
device 25 osd.25 class hdd
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class hdd
device 31 osd.31 class hdd
device 32 osd.32 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host pve4 {
  id -3   # do not change unnecessarily
  id -4 class nvme    # do not change unnecessarily
  id -7 class hdd   # do not change unnecessarily
  # weight 136.42641
  alg straw2
  hash 0  # rjenkins1
  item osd.0 weight 1.81926
  item osd.1 weight 1.81926
  item osd.2 weight 1.81926
  item osd.6 weight 16.37108
  item osd.7 weight 16.37108
  item osd.8 weight 16.37108
  item osd.9 weight 16.37108
  item osd.10 weight 16.37108
  item osd.11 weight 16.37108
  item osd.12 weight 16.37108
  item osd.13 weight 16.37108
}
host pve1 {
  id -5   # do not change unnecessarily
  id -6 class nvme    # do not change unnecessarily
  id -8 class hdd   # do not change unnecessarily
  # weight 136.42641
  alg straw2
  hash 0  # rjenkins1
  item osd.3 weight 1.81926
  item osd.4 weight 1.81926
  item osd.5 weight 1.81926
  item osd.25 weight 16.37108
  item osd.26 weight 16.37108
  item osd.27 weight 16.37108
  item osd.28 weight 16.37108
  item osd.29 weight 16.37108
  item osd.30 weight 16.37108
  item osd.31 weight 16.37108
  item osd.32 weight 16.37108
}
host pve3 {
  id -10    # do not change unnecessarily
  id -11 class nvme   # do not change unnecessarily
  id -12 class hdd    # do not change unnecessarily
  # weight 136.42641
  alg straw2
  hash 0  # rjenkins1
  item osd.14 weight 16.37108
  item osd.15 weight 16.37108
  item osd.16 weight 16.37108
  item osd.17 weight 16.37108
  item osd.18 weight 16.37108
  item osd.19 weight 16.37108
  item osd.20 weight 16.37108
  item osd.21 weight 16.37108
  item osd.22 weight 1.81926
  item osd.23 weight 1.81926
  item osd.24 weight 1.81926
}
root default {
  id -1   # do not change unnecessarily
  id -2 class nvme    # do not change unnecessarily
  id -9 class hdd   # do not change unnecessarily
  # weight 409.27972
  alg straw2
  hash 0  # rjenkins1
  item pve4 weight 136.42656
  item pve1 weight 136.42653
  item pve3 weight 136.42662
}

# rules
rule replicated_rule {
  id 0
  type replicated
  step take default class nvme
  step chooseleaf firstn 0 type host
  step emit
}
rule ec_rule_hdd {
  id 1
  type erasure
  step set_chooseleaf_tries 100
  step set_choose_tries 200
  step take default class hdd
  step chooseleaf indep 3 type host
  step emit
}
rule replicated_rule_hdd {
  id 4
  type replicated
  step take default class hdd
  step chooseleaf firstn 3 type host
  step emit
}

# end crush map

ceph osd erasure-code-profile get ec-profile-hd
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=8
m=4
plugin=jerasure
technique=reed_sol_van
w=8
 
Hi @ralphte

I am not a Ceph expert, but to me it looks as if you removed and re-added a host with its existing OSDs to the cluster. If you do that, the host might change its id, which will make the CRUSH algorithm produce a different desired placement for a lot of PGs. I see you don't have any backfills, which means the data is still available and redundant, just not in the right place according to the algorithm. That also means any parameters for backfill priority will likely not help. And I believe Ceph does not consider the remapping a priority task at all... You can check whether the number of misplaced objects decreases; if it does, you may just let it do its job.
You can ignore the scrubbing warning; it will catch up after the remapping is complete...
Sorry, I don't know any specific working commands to increase the remapping speed... You might try the commands from here: https://www.suse.com/support/kb/doc/?id=000019693, but again I am not sure they will help at all, as you are not doing backfills or recovery...
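
Something simple like this should at least show whether it is moving at all:

Code:
# watch the misplaced percentage; if it slowly drops, Ceph is still working on it
watch -n 60 'ceph pg stat'

# or just grep the summary
ceph -s | grep misplaced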

I had a similar situation (though not at your scale), and I just painfully waited for Ceph to run its course...
If anybody has relevant experience with increasing the remapping priority, I would like to hear it as well...

If you still have your old CRUSH map, you can try setting the id of the host (that negative number) back to the original value and see if that reduces the number of misplaced objects... A warning: I have never tried that myself, I just believe it might work...
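
If you want to attempt it, the usual way to edit the map by hand is roughly this (untested by me for this particular change, so please keep a copy of the current map and understand the risk before injecting anything):

Code:
# export and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# edit crushmap.txt (e.g. the "id -NN" lines of the host bucket), then recompile
crushtool -c crushmap.txt -o crushmap-new.bin

# inject the modified map back into the cluster
ceph osd setcrushmap -i crushmap-new.bin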