Ceph - Backfill & Remapping processes will not finish

Quasar90

Member
Nov 24, 2021
Hello,

we have a problem with the remap/backfill process of our Ceph cluster and can't get it back into a healthy state. There was an outage of one node of the Ceph cluster, related to overheating of the network controller that is dedicated to the Ceph traffic.

Our general configuration is a 4-node Ceph cluster with 12 HDD drives per node; 3 of the nodes are monitor nodes. Each node has a dedicated dual 10 Gbit network card for the Ceph traffic. These network cards are not connected directly; they run over a dedicated switch that handles only this Ceph traffic.

After we cooled down the overheated server and improved the cooling, the server was restarted and Ceph started its repair. Everything looked fine at the beginning. After a few days we saw that the backfill and remapping process does not get past a certain point, and the scrub and deep-scrub do not finish either. The scrub backlog is even rising slowly.

In the Ceph log we can follow the backfill/remap progress, which never gets under 5 % remaining. Whenever it reaches the 5 % mark, it resets itself and jumps back up to about 5.6 %.
Code:
2026-02-23T16:30:15.359845+0100 mgr.genzsrp00227 (mgr.27472003) 1087138 : cluster [DBG] pgmap v1090403: 545 pgs: 7 active+remapped+backfilling, 510 active+clean, 28 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 293 KiB/s wr, 47 op/s; 2519182/50377914 objects misplaced (5.001%); 66 MiB/s, 16 objects/s recovering
2026-02-23T16:30:17.361515+0100 mgr.genzsrp00227 (mgr.27472003) 1087139 : cluster [DBG] pgmap v1090404: 545 pgs: 7 active+remapped+backfilling, 510 active+clean, 28 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 330 KiB/s wr, 50 op/s; 2519182/50377914 objects misplaced (5.001%); 54 MiB/s, 13 objects/s recovering
2026-02-23T16:30:19.363932+0100 mgr.genzsrp00227 (mgr.27472003) 1087140 : cluster [DBG] pgmap v1090405: 545 pgs: 7 active+remapped+backfilling, 510 active+clean, 28 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 244 KiB/s wr, 38 op/s; 2519043/50377914 objects misplaced (5.000%); 74 MiB/s, 18 objects/s recovering
2026-02-23T16:30:21.365387+0100 mgr.genzsrp00227 (mgr.27472003) 1087141 : cluster [DBG] pgmap v1090406: 545 pgs: 7 active+remapped+backfilling, 510 active+clean, 28 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 227 KiB/s wr, 36 op/s; 2518983/50377914 objects misplaced (5.000%); 63 MiB/s, 15 objects/s recovering
2026-02-23T16:30:23.368000+0100 mgr.genzsrp00227 (mgr.27472003) 1087142 : cluster [DBG] pgmap v1090407: 545 pgs: 7 active+remapped+backfilling, 510 active+clean, 28 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 286 KiB/s wr, 46 op/s; 2518868/50377914 objects misplaced (5.000%); 70 MiB/s, 17 objects/s recovering
2026-02-23T16:30:23.893317+0100 mon.genzsrp00226 (mon.0) 2169033 : cluster [DBG] osdmap e10559: 48 total, 48 up, 48 in
2026-02-23T16:30:24.919811+0100 mon.genzsrp00226 (mon.0) 2169034 : cluster [DBG] osdmap e10560: 48 total, 48 up, 48 in
2026-02-23T16:30:25.369241+0100 mgr.genzsrp00227 (mgr.27472003) 1087143 : cluster [DBG] pgmap v1090410: 545 pgs: 7 active+remapped+backfilling, 510 active+clean, 28 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 278 KiB/s wr, 46 op/s; 2518787/50377914 objects misplaced (5.000%); 82 MiB/s, 20 objects/s recovering
2026-02-23T16:30:25.961504+0100 mon.genzsrp00226 (mon.0) 2169037 : cluster [DBG] osdmap e10561: 48 total, 48 up, 48 in
2026-02-23T16:30:25.982845+0100 osd.27 (osd.27) 783 : cluster [DBG] 6.1ces0 starting backfill to osd.8(0) from (0'0,0'0] MAX to 10551'8522533
2026-02-23T16:30:26.038020+0100 osd.27 (osd.27) 784 : cluster [DBG] 6.1ces0 starting backfill to osd.25(2) from (0'0,0'0] MAX to 10551'8522533
2026-02-23T16:30:26.068392+0100 osd.27 (osd.27) 785 : cluster [DBG] 6.1ces0 starting backfill to osd.40(1) from (0'0,0'0] MAX to 10551'8522533
2026-02-23T16:30:26.474535+0100 mon.genzsrp00226 (mon.0) 2169038 : cluster [DBG] osdmap e10562: 48 total, 48 up, 48 in
2026-02-23T16:30:27.370926+0100 mgr.genzsrp00227 (mgr.27472003) 1087144 : cluster [DBG] pgmap v1090413: 545 pgs: 7 active+remapped+backfilling, 510 active+clean, 28 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 1023 B/s rd, 310 KiB/s wr, 46 op/s; 2518787/50377914 objects misplaced (5.000%); 34 MiB/s, 8 objects/s recovering
2026-02-23T16:30:27.495868+0100 mon.genzsrp00226 (mon.0) 2169041 : cluster [DBG] osdmap e10563: 48 total, 48 up, 48 in
2026-02-23T16:30:27.522331+0100 osd.25 (osd.25) 854 : cluster [DBG] 6.1cfs0 starting backfill to osd.4(0) from (0'0,0'0] MAX to 10535'7035790
2026-02-23T16:30:27.546024+0100 osd.25 (osd.25) 855 : cluster [DBG] 6.1cfs0 starting backfill to osd.30(2) from (0'0,0'0] MAX to 10535'7035790
2026-02-23T16:30:27.559394+0100 osd.25 (osd.25) 856 : cluster [DBG] 6.1cfs0 starting backfill to osd.47(1) from (0'0,0'0] MAX to 10535'7035790
2026-02-23T16:30:28.517567+0100 mon.genzsrp00226 (mon.0) 2169043 : cluster [DBG] osdmap e10564: 48 total, 48 up, 48 in
2026-02-23T16:30:29.372667+0100 mgr.genzsrp00227 (mgr.27472003) 1087145 : cluster [DBG] pgmap v1090416: 545 pgs: 1 unknown, 1 activating+remapped, 7 active+remapped+backfilling, 507 active+clean, 29 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 2715404/50278431 objects misplaced (5.401%); 51 MiB/s, 12 objects/s recovering
2026-02-23T16:30:29.541652+0100 mon.genzsrp00226 (mon.0) 2169044 : cluster [DBG] osdmap e10565: 48 total, 48 up, 48 in
2026-02-23T16:30:29.566328+0100 osd.1 (osd.1) 97 : cluster [DBG] 6.1d0s0 starting backfill to osd.16(2) from (0'0,0'0] MAX to 10558'5661303
2026-02-23T16:30:29.604344+0100 osd.1 (osd.1) 98 : cluster [DBG] 6.1d0s0 starting backfill to osd.21(1) from (0'0,0'0] MAX to 10558'5661303
2026-02-23T16:30:29.633046+0100 osd.1 (osd.1) 99 : cluster [DBG] 6.1d0s0 starting backfill to osd.37(0) from (0'0,0'0] MAX to 10558'5661303
2026-02-23T16:30:31.374332+0100 mgr.genzsrp00227 (mgr.27472003) 1087146 : cluster [DBG] pgmap v1090418: 545 pgs: 1 activating+remapped, 7 active+remapped+backfilling, 507 active+clean, 30 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 8.2 KiB/s rd, 53 MiB/s wr, 50 op/s; 2814885/50378043 objects misplaced (5.588%); 60 MiB/s, 15 objects/s recovering
2026-02-23T16:30:33.376931+0100 mgr.genzsrp00227 (mgr.27472003) 1087147 : cluster [DBG] pgmap v1090419: 545 pgs: 7 active+remapped+backfilling, 507 active+clean, 31 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 268 KiB/s rd, 148 MiB/s wr, 331 op/s; 2814812/50378424 objects misplaced (5.587%); 78 MiB/s, 19 objects/s recovering
2026-02-23T16:30:35.378914+0100 mgr.genzsrp00227 (mgr.27472003) 1087148 : cluster [DBG] pgmap v1090420: 545 pgs: 7 active+remapped+backfilling, 507 active+clean, 31 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 205 KiB/s rd, 127 MiB/s wr, 276 op/s; 2814735/50378505 objects misplaced (5.587%); 77 MiB/s, 19 objects/s recovering
2026-02-23T16:30:37.380546+0100 mgr.genzsrp00227 (mgr.27472003) 1087149 : cluster [DBG] pgmap v1090421: 545 pgs: 7 active+remapped+backfilling, 507 active+clean, 31 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 104 MiB/s rd, 594 MiB/s wr, 1.32k op/s; 2814735/50378505 objects misplaced (5.587%); 50 MiB/s, 12 objects/s recovering
2026-02-23T16:30:39.383030+0100 mgr.genzsrp00227 (mgr.27472003) 1087150 : cluster [DBG] pgmap v1090422: 545 pgs: 7 active+remapped+backfilling, 507 active+clean, 31 active+remapped+backfill_wait; 64 TiB data, 97 TiB used, 252 TiB / 349 TiB avail; 92 MiB/s rd, 514 MiB/s wr, 1.17k op/s; 2814596/50378505 objects misplaced (5.587%); 64 MiB/s, 16 objects/s recovering
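To narrow down which PGs are caught in this loop, we have been inspecting them with the standard ceph CLI; a sketch of what we run (the PG id 6.1ce is one of ours from the log above, the rest are stock commands):

```shell
# List the PGs that are currently backfilling or still waiting for backfill
ceph pg ls backfilling
ceph pg ls backfill_wait

# Dump the detailed state of one of the stuck PGs (6.1ce appears in our log)
ceph pg 6.1ce query

# Check whether the balancer module keeps generating new remap targets,
# which would explain the misplaced counter jumping back up
ceph balancer status
```

The osdmap epochs incrementing in the log (e10559 → e10565) while no OSD goes down suggest something is repeatedly changing the mappings, which is why we looked at the balancer.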
We thought maybe the scrub process was interfering, so we deactivated scrub and deep-scrub for now, but even then the 5 % loop persists. We also tried deactivating the autoscaler for the data pool and setting "noout" on the OSDs, but nothing helped.
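For completeness, these are the flags and settings we applied, using the standard ceph commands (`<data-pool>` is a placeholder for our pool name):

```shell
# Pause scrubbing and deep-scrubbing cluster-wide
ceph osd set noscrub
ceph osd set nodeep-scrub

# Prevent OSDs from being marked "out" and triggering further rebalancing
ceph osd set noout

# Disable the PG autoscaler on the data pool
ceph osd pool set <data-pool> pg_autoscale_mode off
```

All flags are still set; `ceph status` reports them under the health warnings as expected.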

We would appreciate some advice on how we can bring our Ceph cluster back into a healthy state.