CEPH rebuild issues

arubenstein

New Member
Jul 17, 2023
Greetings. I have an 11-node PVE cluster with Ceph running on all nodes. Four of the nodes each have 22 x 1.92 TB SSDs, and seven of the nodes each have 10 HDDs varying in size from 12 to 16 TB. They are of course split into two device classes (ssd, hdd), and there is a pool on each, size/min_size 3/2 (default). A few weeks ago we lost a couple of HDDs, which I then replaced. The issue we are having is that the rebuild has been going on for weeks. In the midst of this, two placement groups have gone "inconsistent" and have been that way for a while as well.

Every attempt I make to speed up recovery (by increasing the number of simultaneous backfills and so forth) has had no effect. Considering the number of drives I have and how little activity there is on this cluster, it's quite alarming how slow this recovery is. I feel like I am missing something obvious.

At first the recovery was quick - hundreds of megabytes/sec, hundreds of objects/sec - but it has slowly come down to a crawl.

Help!

Code:
  cluster:
    id:     6eddcc19-bd51-45da-bbaa-49e9fcaddc85
    health: HEALTH_ERR
            8 scrub errors
            Possible data damage: 2 pgs inconsistent
            4275 pgs not deep-scrubbed in time
            3306 pgs not scrubbed in time

  services:
    mon: 5 daemons, quorum ceph1-hyp,ceph7-hyp,ceph9-hyp,ceph3-hyp,ceph5-hyp (age 3w)
    mgr: ceph6-hyp(active, since 3w), standbys: ceph2-hyp, ceph4-hyp
    mds: 1/1 daemons up, 3 standby
    osd: 158 osds: 158 up (since 23h), 158 in (since 23h); 204 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 4409 pgs
    objects: 44.20M objects, 167 TiB
    usage:   516 TiB used, 527 TiB / 1.0 PiB avail
    pgs:     7353441/132602229 objects misplaced (5.545%)
             4088 active+clean
             201  active+remapped+backfill_wait
             80   active+clean+scrubbing
             36   active+clean+scrubbing+deep
             2    active+remapped+backfilling
             1    active+clean+inconsistent
             1    active+remapped+inconsistent+backfill_wait

  io:
    client:   72 KiB/s rd, 26 MiB/s wr, 29 op/s rd, 61 op/s wr
    recovery: 21 MiB/s, 5 objects/s
 
At first the recovery was quick - hundreds of megabytes/sec, hundreds of objects/sec - but it has slowly come down to a crawl.
That's normal. The closer Ceph gets to completing the recovery, the fewer targets are available to process in parallel. Here are some tunables that may help:

sudo ceph tell 'osd.*' injectargs --osd_recovery_sleep_hdd=0
(probably not an issue but may as well dot the i's)

sudo ceph tell 'osd.*' injectargs --osd-max-backfills=6 --osd-recovery-max-active=3
In my experience, these values are optimal for HDD-based pools; pushing them any higher only ends up slowing the pool without any tangible benefit.
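
If you want to confirm the injected values actually took effect, or make them persist across daemon restarts, the centralized config database can be used on reasonably recent Ceph releases; osd.0 below is just an arbitrary example daemon:

Code:
  # check what a given daemon is actually running with (osd.0 is an arbitrary example)
  ceph config show osd.0 | grep -E 'osd_max_backfills|osd_recovery_max_active'

  # optionally persist the values in the config database instead of using injectargs
  ceph config set osd osd_max_backfills 6
  ceph config set osd osd_recovery_max_active 3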

Now, you will need to repair your broken pgs. I'm lazy so I'll just link the manual page: https://docs.ceph.com/en/latest/rados/operations/pg-repair/
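
In practice it boils down to identifying the inconsistent PGs and telling their primary OSDs to repair them; <pgid> below is a placeholder for the IDs reported by ceph health detail:

Code:
  # list the inconsistent PGs and inspect what is actually wrong with them
  ceph health detail
  rados list-inconsistent-obj <pgid> --format=json-pretty

  # instruct the primary OSD of the PG to repair it
  ceph pg repair <pgid>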

If you want this to happen automatically, add

osd_scrub_auto_repair = true

to the [osd] section of your ceph.conf (and restart the OSDs for it to take effect)
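
Alternatively, on recent Ceph releases the same option can be set at runtime through the config database, without editing ceph.conf or restarting anything; a minimal sketch:

Code:
  ceph config set osd osd_scrub_auto_repair true
  # verify
  ceph config get osd osd_scrub_auto_repair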
 
I should have mentioned that I'd done a considerable amount of research and reading on this over the last couple of weeks and have already applied most of the tuning you suggest, with no change in recovery rate. It remains astonishing to me that on a cluster with 70 HDDs and almost no activity, only 2-3 objects/second can be recovered while rebuilding 4 disks at the same time. Something still seems wrong.

Are there any tools in Ceph to determine which OSDs are currently recovering, backfilling, and so forth, and how each of those OSDs is performing?

As to the repair, I've issued repairs on the two placement groups several times, to no avail.
 
Are there any tools in Ceph to determine which OSDs are currently recovering, backfilling, and so forth
Not directly, but it's a fairly direct deduction: look at your OSD latencies to see which OSDs are busy. To see what's keeping an OSD busy,

tail -f /var/log/ceph/ceph-osd.xx.log (where xx is the osd number)
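
Two stock commands also help with that deduction: ceph osd perf shows per-OSD latency at a glance, and ceph pg ls can filter PGs by state so you can see which OSDs sit in their up/acting sets:

Code:
  # per-OSD commit/apply latency overview
  ceph osd perf

  # PGs currently backfilling or waiting, including their up and acting OSDs
  ceph pg ls backfilling
  ceph pg ls backfill_wait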


As to the repair, I've issued repairs on the two placement groups several times, to no avail.
You have thousands of scrubs pending. You can either wait for the PGs in question to be processed, or stop all scrubs.
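
Stopping scrubs cluster-wide is a matter of setting two flags (and unsetting them again once the backfill has finished):

Code:
  # pause all scrubbing while recovery/backfill catches up
  ceph osd set noscrub
  ceph osd set nodeep-scrub

  # re-enable once the cluster is healthy again
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub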
 
OK, more. This is what is really perplexing to me: only one PG is backfilling right now. I can't for the life of me understand why that is the case.



Code:
    pgs:     7272558/132317004 objects misplaced (5.496%)
             4099 active+clean
             198  active+remapped+backfill_wait
             70   active+clean+scrubbing
             39   active+clean+scrubbing+deep
             1    active+clean+inconsistent
             1    active+remapped+inconsistent+backfill_wait
             1    active+remapped+backfilling
 
And more. I discovered that a few of the replaced OSDs were not doing any I/O. So I restarted them, and voilà, backfilling performance increased. I find this peculiar. The OSDs were not offline or out or anything; they were just sitting there. It's like the manager forgot about them or something.

In any event, the cluster is still not honoring the max simultaneous backfills.


Code:
  data:
    volumes: 1/1 healthy
    pools:   5 pools, 4409 pgs
    objects: 44.11M objects, 167 TiB
    usage:   515 TiB used, 528 TiB / 1.0 PiB avail
    pgs:     7261318/132330570 objects misplaced (5.487%)
             4130 active+clean
             194  active+remapped+backfill_wait
             48   active+clean+scrubbing
             31   active+clean+scrubbing+deep
             4    active+remapped+backfilling
             1    active+remapped+inconsistent+backfill_wait
             1    active+clean+inconsistent

  io:
    client:   254 KiB/s rd, 34 MiB/s wr, 68 op/s rd, 131 op/s wr
    recovery: 48 MiB/s, 12 objects/s

PG      OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   LOG_DUPS  STATE                                       SINCE  VERSION         REPORTED        UP                 ACTING             SCRUB_STAMP                      DEEP_SCRUB_STAMP                 LAST_SCRUB_DURATION  SCRUB_SCHEDULING
30.36c    42960         0      41508        0  178813131264            0           0  2938      3000                 active+remapped+backfilling     6m    38573'460435   38575:1245332       [5,148,29]p5   [148,29,146]p148  2024-03-31T15:52:01.450044-0400  2024-03-24T18:02:36.787487-0400                   36  periodic scrub scheduled @ 2024-05-24T21:04:35.514092-0400
30.462    21225         0      21099        0   88375496192            0           0  4553         0                 active+remapped+backfilling    37s    38573'399987    38575:718263    [24,138,150]p24     [24,138,44]p24  2024-04-03T17:12:47.383832-0400  2024-03-28T21:13:16.352707-0400                   25  periodic scrub scheduled @ 2024-05-02T02:09:23.185236-0400
30.488    21258         0      19787        0   88459260416            0           0  3793         0                 active+remapped+backfilling     6m    38573'419792    38575:854755     [15,142,51]p15    [15,142,151]p15  2024-04-04T09:47:56.085245-0400  2024-03-29T10:10:43.475865-0400                   19  periodic scrub scheduled @ 2024-05-14T11:53:21.277987-0400
30.678    21241         0      39630        0   88400918016            0           0  3763         0                 active+remapped+backfilling     7m    38559'420645    38575:925623       [33,6,16]p33     [20,16,152]p20  2024-03-30T14:36:33.061160-0400  2024-03-30T14:36:33.061160-0400                10595  queued for scrub
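
A likely explanation for why the backfill limit never seems to be reached: osd_max_backfills is a per-OSD reservation that is counted on every OSD taking part in a backfill (the primary as well as the backfill targets), so a PG sits in backfill_wait whenever any OSD in its up or acting set has no free slot. With only a handful of freshly replaced HDDs acting as backfill targets, those few OSDs become the bottleneck no matter how many drives the cluster has overall. To see which reservations a given OSD is holding or queueing, you can query its admin socket on the node that hosts it; osd.15 is just an example, and the exact command may vary between releases:

Code:
  # run on the host where osd.15 lives
  ceph daemon osd.15 dump_recovery_reservations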
 
