Greetings. I have an 11-node PVE cluster with Ceph running on all nodes. 4 of the nodes each have 22 x 1.92 TB SSDs, and 7 of the nodes each have 10 x HDDs, varying in size from 12 to 16 TB. The OSDs are of course split into two device classes (ssd, hdd), and there is a pool on each, size/min_size 3/2 (default). A few weeks ago we lost a couple of HDDs, which I then replaced. The issue we are having is that the rebuild has been going on for weeks. In the midst of this, two placement groups have gone "inconsistent" and have been that way for a while as well.
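The device-class split and the two pools were set up roughly like this (rule and pool names here are illustrative, not necessarily the exact ones on the cluster):
Code:
# replicated CRUSH rules pinned to each device class, host failure domain
ceph osd crush rule create-replicated replicated_ssd default host ssd
ceph osd crush rule create-replicated replicated_hdd default host hdd
# point each pool at its class-specific rule
ceph osd pool set pool_ssd crush_rule replicated_ssd
ceph osd pool set pool_hdd crush_rule replicated_hdd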
Every attempt I make to speed up recovery (by increasing the number of simultaneous backfills and so forth) has had no effect. Considering the number of drives I have and the little activity on this cluster, it's quite alarming to me how slow this recovery is. I feel like I am missing something obvious.
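By "increasing simultaneous" I mean the usual recovery/backfill knobs, set along these lines (the values varied between attempts, so treat this as a sketch rather than my exact settings):
Code:
# allow more concurrent backfills and recovery ops per OSD
ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_max_active 8
# remove the throttling sleep on the HDD OSDs
ceph config set osd osd_recovery_sleep_hdd 0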
At first the recovery was quick (hundreds of megabytes/sec, hundreds of objects/sec), but it has slowly come down to a crawl.
Help!
Code:
  cluster:
    id:     6eddcc19-bd51-45da-bbaa-49e9fcaddc85
    health: HEALTH_ERR
            8 scrub errors
            Possible data damage: 2 pgs inconsistent
            4275 pgs not deep-scrubbed in time
            3306 pgs not scrubbed in time

  services:
    mon: 5 daemons, quorum ceph1-hyp,ceph7-hyp,ceph9-hyp,ceph3-hyp,ceph5-hyp (age 3w)
    mgr: ceph6-hyp(active, since 3w), standbys: ceph2-hyp, ceph4-hyp
    mds: 1/1 daemons up, 3 standby
    osd: 158 osds: 158 up (since 23h), 158 in (since 23h); 204 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 4409 pgs
    objects: 44.20M objects, 167 TiB
    usage:   516 TiB used, 527 TiB / 1.0 PiB avail
    pgs:     7353441/132602229 objects misplaced (5.545%)
             4088 active+clean
             201  active+remapped+backfill_wait
             80   active+clean+scrubbing
             36   active+clean+scrubbing+deep
             2    active+remapped+backfilling
             1    active+clean+inconsistent
             1    active+remapped+inconsistent+backfill_wait

  io:
    client:   72 KiB/s rd, 26 MiB/s wr, 29 op/s rd, 61 op/s wr
    recovery: 21 MiB/s, 5 objects/s