Ceph - Backfill & Remapping processes will not finish

Quasar90

Member
Nov 24, 2021
Hello,

we have a problem with the remap/backfill process of our Ceph cluster and can't get it back into a healthy state. One node of the cluster had an outage related to overheating of the network controller that is dedicated to the Ceph communication.

Our general configuration is a 4-node Ceph cluster with 12 HDD drives per node; 3 of the nodes are monitor nodes. Each node has a dedicated dual 10 Gbit network card for the Ceph communication. These network cards are not connected directly but run over a dedicated switch that handles only this Ceph traffic.

After we had cooled down the overheated server and improved the cooling, the server was restarted and Ceph started its repair. Everything looked fine at the beginning. After a few days we saw that the backfill and remapping process was not getting past a certain point, and scrub and deep-scrub would not finish either. The scrub backlog is even rising slowly.

In the Ceph log we see the backfill and remap progress, which won't get under 5% remaining. Whenever it reaches the 5% mark it resets itself and goes back to about 5.6%.
Code:
2026-02-23T16:30:15.359845+0100 mgr.genzsrp00227 (mgr.27472003) 1087138 : cluster [DBG] pgmap v1090403: 545 pgs: 7 active+remapped+backfilling, 510 active+clean, 28 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 293 KiB/s wr, 47 op/s; 2519182/50377914 objects misplaced (5.001%); 66 MiB/s, 16 objects/s recovering
2026-02-23T16:30:17.361515+0100 mgr.genzsrp00227 (mgr.27472003) 1087139 : cluster [DBG] pgmap v1090404: 545 pgs: 7 active+remapped+backfilling, 510 active+clean, 28 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 330 KiB/s wr, 50 op/s; 2519182/50377914 objects misplaced (5.001%); 54 MiB/s, 13 objects/s recovering
2026-02-23T16:30:19.363932+0100 mgr.genzsrp00227 (mgr.27472003) 1087140 : cluster [DBG] pgmap v1090405: 545 pgs: 7 active+remapped+backfilling, 510 active+clean, 28 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 244 KiB/s wr, 38 op/s; 2519043/50377914 objects misplaced (5.000%); 74 MiB/s, 18 objects/s recovering
2026-02-23T16:30:21.365387+0100 mgr.genzsrp00227 (mgr.27472003) 1087141 : cluster [DBG] pgmap v1090406: 545 pgs: 7 active+remapped+backfilling, 510 active+clean, 28 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 227 KiB/s wr, 36 op/s; 2518983/50377914 objects misplaced (5.000%); 63 MiB/s, 15 objects/s recovering
2026-02-23T16:30:23.368000+0100 mgr.genzsrp00227 (mgr.27472003) 1087142 : cluster [DBG] pgmap v1090407: 545 pgs: 7 active+remapped+backfilling, 510 active+clean, 28 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 286 KiB/s wr, 46 op/s; 2518868/50377914 objects misplaced (5.000%); 70 MiB/s, 17 objects/s recovering
2026-02-23T16:30:23.893317+0100 mon.genzsrp00226 (mon.0) 2169033 : cluster [DBG] osdmap e10559: 48 total, 48 up, 48 in
2026-02-23T16:30:24.919811+0100 mon.genzsrp00226 (mon.0) 2169034 : cluster [DBG] osdmap e10560: 48 total, 48 up, 48 in
2026-02-23T16:30:25.369241+0100 mgr.genzsrp00227 (mgr.27472003) 1087143 : cluster [DBG] pgmap v1090410: 545 pgs: 7 active+remapped+backfilling, 510 active+clean, 28 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 278 KiB/s wr, 46 op/s; 2518787/50377914 objects misplaced (5.000%); 82 MiB/s, 20 objects/s recovering
2026-02-23T16:30:25.961504+0100 mon.genzsrp00226 (mon.0) 2169037 : cluster [DBG] osdmap e10561: 48 total, 48 up, 48 in
2026-02-23T16:30:25.982845+0100 osd.27 (osd.27) 783 : cluster [DBG] 6.1ces0 starting backfill to osd.8(0) from (0'0,0'0] MAX to 10551'8522533
2026-02-23T16:30:26.038020+0100 osd.27 (osd.27) 784 : cluster [DBG] 6.1ces0 starting backfill to osd.25(2) from (0'0,0'0] MAX to 10551'8522533
2026-02-23T16:30:26.068392+0100 osd.27 (osd.27) 785 : cluster [DBG] 6.1ces0 starting backfill to osd.40(1) from (0'0,0'0] MAX to 10551'8522533
2026-02-23T16:30:26.474535+0100 mon.genzsrp00226 (mon.0) 2169038 : cluster [DBG] osdmap e10562: 48 total, 48 up, 48 in
2026-02-23T16:30:27.370926+0100 mgr.genzsrp00227 (mgr.27472003) 1087144 : cluster [DBG] pgmap v1090413: 545 pgs: 7 active+remapped+backfilling, 510 active+clean, 28 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 1023 B/s rd, 310 KiB/s wr, 46 op/s; 2518787/50377914 objects misplaced (5.000%); 34 MiB/s, 8 objects/s recovering
2026-02-23T16:30:27.495868+0100 mon.genzsrp00226 (mon.0) 2169041 : cluster [DBG] osdmap e10563: 48 total, 48 up, 48 in
2026-02-23T16:30:27.522331+0100 osd.25 (osd.25) 854 : cluster [DBG] 6.1cfs0 starting backfill to osd.4(0) from (0'0,0'0] MAX to 10535'7035790
2026-02-23T16:30:27.546024+0100 osd.25 (osd.25) 855 : cluster [DBG] 6.1cfs0 starting backfill to osd.30(2) from (0'0,0'0] MAX to 10535'7035790
2026-02-23T16:30:27.559394+0100 osd.25 (osd.25) 856 : cluster [DBG] 6.1cfs0 starting backfill to osd.47(1) from (0'0,0'0] MAX to 10535'7035790
2026-02-23T16:30:28.517567+0100 mon.genzsrp00226 (mon.0) 2169043 : cluster [DBG] osdmap e10564: 48 total, 48 up, 48 in
2026-02-23T16:30:29.372667+0100 mgr.genzsrp00227 (mgr.27472003) 1087145 : cluster [DBG] pgmap v1090416: 545 pgs: 1 unknown, 1 activating+remapped, 7 active+remapped+backfilling, 507 active+clean, 29 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 2715404/50278431 objects misplaced (5.401%); 51 MiB/s, 12 objects/s recovering
2026-02-23T16:30:29.541652+0100 mon.genzsrp00226 (mon.0) 2169044 : cluster [DBG] osdmap e10565: 48 total, 48 up, 48 in
2026-02-23T16:30:29.566328+0100 osd.1 (osd.1) 97 : cluster [DBG] 6.1d0s0 starting backfill to osd.16(2) from (0'0,0'0] MAX to 10558'5661303
2026-02-23T16:30:29.604344+0100 osd.1 (osd.1) 98 : cluster [DBG] 6.1d0s0 starting backfill to osd.21(1) from (0'0,0'0] MAX to 10558'5661303
2026-02-23T16:30:29.633046+0100 osd.1 (osd.1) 99 : cluster [DBG] 6.1d0s0 starting backfill to osd.37(0) from (0'0,0'0] MAX to 10558'5661303
2026-02-23T16:30:31.374332+0100 mgr.genzsrp00227 (mgr.27472003) 1087146 : cluster [DBG] pgmap v1090418: 545 pgs: 1 activating+remapped, 7 active+remapped+backfilling, 507 active+clean, 30 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 8.2 KiB/s rd, 53 MiB/s wr, 50 op/s; 2814885/50378043 objects misplaced (5.588%); 60 MiB/s, 15 objects/s recovering
2026-02-23T16:30:33.376931+0100 mgr.genzsrp00227 (mgr.27472003) 1087147 : cluster [DBG] pgmap v1090419: 545 pgs: 7 active+remapped+backfilling, 507 active+clean, 31 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 268 KiB/s rd, 148 MiB/s wr, 331 op/s; 2814812/50378424 objects misplaced (5.587%); 78 MiB/s, 19 objects/s recovering
2026-02-23T16:30:35.378914+0100 mgr.genzsrp00227 (mgr.27472003) 1087148 : cluster [DBG] pgmap v1090420: 545 pgs: 7 active+remapped+backfilling, 507 active+clean, 31 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 205 KiB/s rd, 127 MiB/s wr, 276 op/s; 2814735/50378505 objects misplaced (5.587%); 77 MiB/s, 19 objects/s recovering
2026-02-23T16:30:37.380546+0100 mgr.genzsrp00227 (mgr.27472003) 1087149 : cluster [DBG] pgmap v1090421: 545 pgs: 7 active+remapped+backfilling, 507 active+clean, 31 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 104 MiB/s rd, 594 MiB/s wr, 1.32k op/s; 2814735/50378505 objects misplaced (5.587%); 50 MiB/s, 12 objects/s recovering
2026-02-23T16:30:39.383030+0100 mgr.genzsrp00227 (mgr.27472003) 1087150 : cluster [DBG] pgmap v1090422: 545 pgs: 7 active+remapped+backfilling, 507 active+clean, 31 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 92 MiB/s rd, 514 MiB/s wr, 1.17k op/s; 2814596/50378505 objects misplaced (5.587%); 64 MiB/s, 16 objects/s recovering
We thought maybe the scrub process interfered, so we deactivated scrub and deep-scrub for now, but even then the 5% loop persists. We also tried deactivating the autoscaler for the data pool and setting "noout" for the OSDs, but nothing helped.
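
For reference, we set the flags with the standard commands (pool name below is a placeholder):

Code:
ceph osd set noscrub
ceph osd set nodeep-scrub
ceph osd set noout
ceph osd pool set <datapool> pg_autoscale_mode off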

We need some advice on how to bring our Ceph cluster back into a healthy state.
 
Hey,

the 5% sounds very much like the default value of target_max_misplaced_ratio.
You can check it with:

ceph config get mgr target_max_misplaced_ratio
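
If it is still at the default, it should print something like:

Code:
0.050000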

Not sure if the autoscaler (which you disabled) is at fault here; check

ceph osd pool ls detail
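
Illustrative output line (pool name and numbers are made up, yours will differ); the two values to compare are pg_num and pgp_num:

Code:
pool 6 'data' ... pg_num 512 pgp_num 512 autoscale_mode on ...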

If pg_num and pgp_num match, it is not the autoscaler but rather the normal Ceph balancer.
Both use target_max_misplaced_ratio to limit how much data they set in motion at once.

When your OSDs came back online, the balancer started rebalancing, but it only ever queues up about 5% of your data at a time.

ceph balancer status
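
The output is a small JSON blob, roughly like the example below; the interesting bits are "active": true and the mode (usually upmap):

Code:
{
    "active": true,
    "last_optimize_duration": "...",
    "last_optimize_started": "...",
    "mode": "upmap",
    "no_optimization_needed": false,
    "optimize_result": "Optimization plan created successfully",
    "plans": []
}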

If I understood it correctly, this is what is happening right now:
  1. The cluster backfills data until the total misplaced objects drop to exactly 5.000%.
  2. The mgr sees the ratio has hit the safe threshold and updates the osdmap with new balancer optimizations to move data to the recovered node.
  3. Because the placement rules just shifted, new misplaced objects are created, bumping your percentage back up (in your case, to around 5.5%).
  4. The OSDs begin backfilling to the new locations until the ratio drops to 5.000% again.
  5. The cycle repeats (you can watch it live with the one-liner below).
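
To observe the loop (plain shell; nothing assumed beyond ceph -s):

watch -n 10 "ceph -s | grep -E 'misplaced|backfill'"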

We could verify this by disabling the balancer temporarily; if the misplaced ratio then drops below 5% and eventually to 0%,
it definitely is the balancer.

ceph balancer off

Turn it back on afterwards with:
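
ceph balancer on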

So you have two options: either wait a bit longer, OR increase target_max_misplaced_ratio with

ceph config set mgr target_max_misplaced_ratio .10

EDIT: You can also increase the ratio directly, without disabling the balancer. The misplaced percentage should then jump to around 10% right away.
This would also prove the balancer is the "cause".

You can go to around 15-20% if you want to speed things up, but I would not recommend going higher, since you only have HDDs and the extra recovery traffic could decrease client I/O performance. Use 15-20% only if you have low client load.


If this is not the case, we need to dig deeper (only do this if none of the steps above checked out for you):

If you really suspect just a stuck backfill, you could set noout, nobackfill and norebalance and wait for a few minutes,
then unset them and see if it happens again.
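
That is, with the standard OSD flag commands:

Code:
ceph osd set noout
ceph osd set nobackfill
ceph osd set norebalance
# wait a few minutes, then unset again:
ceph osd unset norebalance
ceph osd unset nobackfill
ceph osd unset noout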

Are you maybe encountering network problems / packet loss? Maybe the card took some damage due to the overheating.
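
Checking the NIC error counters is cheap (eth0 below is just a placeholder for your cluster-network interface):

Code:
ip -s link show eth0
ethtool -S eth0 | grep -iE 'err|drop|crc'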
Could you please also share the output of the following commands:

- ceph -v
- ceph health detail
- ceph osd df tree
- ceph df detail
- ceph -s

Anything in the kernel logs (command: dmesg)?
Time properly synchronized between the nodes? (timedatectl)


Cheers :)
 
Hi @Khensu,

thanks for your reply. I appreciate your solution steps and will definitely save them for later.
In the meantime our Ceph magically seems to have repaired itself o_O. At the time of this reply the backfill and remap is already down to 0.1% and still making progress.
I'll wait to see whether the process finishes. If so, I will reactivate the scrub process and check whether that finishes as well.

Thanks a lot
 
Hi @Quasar90,

you are welcome. It was fun exploring the ceph docs regarding this behavior :)

The fact that Ceph is healing itself here confirms my previous assessment. It was probably just the balancer and the 5% limit, which took some time.

But you do have the balancer and the autoscaler active, right? Because if you deactivate the balancer, the misplaced ratio can also drop to 0% and your cluster can report healthy, but the data may not be evenly distributed (one node may then have significantly more or fewer PGs than others, for example).