Ceph recovery of HDD cluster slow

Dec 31, 2021
Hi,
I have a PVE cluster with 7 hosts, each of which has two 16 TB HDDs.
The HDDs all use NVMe drives as DB disks.
There are no running VMs on the HDDs. They are only used as cold storage.

A few days ago I had to swap two of these HDDs on PVE1, and since I already had the server open, I added two more 16 TB disks.
Since then the recovery speed has been terrible.

[Attached screenshots: 1702025776404.png, 1702025874236.png]

I tried the following without success and then reset to the default values:

  • disabled the delay between recovery operations on HDDs
  • osd_max_backfills changed to different values
  • osd_recovery_max_active changed to different values
Can someone help a Ceph noob out? Any idea why this is so slow?
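
(For reference, the "delay between recovery operations on HDDs" presumably maps to osd_recovery_sleep_hdd; a quick sketch of how to check what an OSD is actually running with, where osd.0 is just an arbitrary example ID:)

Code:
# values the daemon is actually running with (osd.0 is an arbitrary example ID)
ceph tell osd.0 config get osd_max_backfills
ceph tell osd.0 config get osd_recovery_max_active
# per-op sleep on HDD-backed OSDs; the likely candidate for the "delay" mentioned above
ceph tell osd.0 config get osd_recovery_sleep_hdd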

Code:
root@pve1:~# ceph status
  cluster:
    id:     f9ffe0dc-5126-4e05-92d9-18018cdae35a
    health: HEALTH_WARN
            nodeep-scrub flag(s) set
            Degraded data redundancy: 27263656/224226776 objects degraded (12.159%), 52 pgs degraded, 59 pgs undersized
            1 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum pve3,pve5,pve1 (age 45h)
    mgr: pve1(active, since 45h)
    mds: 2/2 daemons up, 1 standby
    osd: 44 osds: 44 up (since 45h), 38 in (since 75m); 127 remapped pgs
         flags nodeep-scrub

  data:
    volumes: 2/2 healthy
    pools:   12 pools, 369 pgs
    objects: 39.66M objects, 58 TiB
    usage:   93 TiB used, 209 TiB / 302 TiB avail
    pgs:     27263656/224226776 objects degraded (12.159%)
             48876253/224226776 objects misplaced (21.798%)
             235 active+clean
             50  active+undersized+degraded+remapped+backfill_wait
             45  active+remapped+backfill_wait
             25  active+clean+remapped
             7   active+undersized
             5   active+remapped+backfilling
             2   active+undersized+degraded+remapped+backfilling

  io:
    client:   1.1 MiB/s rd, 29 MiB/s wr, 72 op/s rd, 360 op/s wr
    recovery: 18 MiB/s, 10 objects/s
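
A few standard commands that show where the backfill is actually going and which limits are in effect; nothing here assumes details beyond the status output above:

Code:
# per-OSD utilization and position in the CRUSH tree (new/empty disks stand out here)
ceph osd df tree
# list only the PGs that are currently backfilling, with their up/acting OSD sets
ceph pg ls backfilling
# recovery/backfill overrides stored in the cluster configuration database
ceph config dump | grep -E 'backfill|recovery'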
 
  • osd_max_backfills changed to different values
  • osd_recovery_max_active changed to different values
What values did you set that to? How long did you wait for something to happen?

Does the pool have replica 2 or replica 3? Does CEPH distribute the data across nodes or OSDs? What network connection do the nodes have? How long has it been running? Was it faster at the beginning?
 
Code:
ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
I waited several hours, at least a whole night, for something to happen.

There is a replica 3 pool and an EC pool on the OSDs.
Ceph distributes across hosts.
10 Gbit network.
It has been running for 3 days now; it was not faster at the beginning.
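
For completeness, the pool layout and failure domain can be double-checked from the CLI rather than from memory; no pool or rule names from this thread are assumed:

Code:
# replicated size / EC profile and PG count of every pool
ceph osd pool ls detail
# CRUSH rules: the chooseleaf step shows whether data is spread across hosts or individual OSDs
ceph osd crush rule dump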
 
12 pools, 369 pgs
You have 12 pools but only 369 PGs? Personally, that seems very low to me and could make your problem worse or even be the cause. With so few PGs, each PG is naturally huge, so moving a single one takes a really long time. CEPH usually gets through a lot at the beginning, then it tapers off and becomes much slower.

For comparison, one of my CEPH clusters has 30 OSDs and 1345 PGs across 5 pools (one of which is the metrics pool and two of which are for CephFS). Absolutely no problems; after a reboot the CEPH is healthy again within 2 minutes.
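
If the low PG count is indeed part of the problem, the autoscaler view is a quick sanity check, and pg_num can be raised per pool. The pool name below is a placeholder, and raising pg_num during an ongoing recovery will cause additional data movement:

Code:
# what the autoscaler considers a reasonable PG count per pool
ceph osd pool autoscale-status
# raise pg_num on one pool (placeholder name; pgp_num follows automatically on recent releases)
ceph osd pool set <pool-name> pg_num 256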

ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
Did you also verify that the value was set that way?

There are also corresponding _ssd and _hdd variants of that option; you may have already configured something there, which would explain why the change has no effect.

https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_recovery_max_active
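
One way to confirm which variant is actually in effect on a given OSD (osd.0 is just an example ID; on recent releases the generic osd_recovery_max_active defaults to 0, in which case the _hdd/_ssd value applies):

Code:
# running values including compiled-in defaults, filtered to the relevant options
ceph config show-with-defaults osd.0 | grep -E 'osd_max_backfills|osd_recovery_max_active'
# or query a single option directly from the daemon
ceph tell osd.0 config get osd_recovery_max_active_hdd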
 
