Greetings. I have an 11-node PVE cluster with Ceph running on all nodes. 4 of the nodes each have 22 x 1.92 TB SSDs, and 7 of the nodes each have 10 x HDDs, varying in size from 12 to 16 TB. The OSDs are of course split into two device classes (ssd, hdd), and there is a pool on each, size/min_size 3/2 (default). A few weeks ago we lost a couple of HDDs, which I then replaced. The issue we are having is that the rebuild has been going on for weeks. In the midst of this, two placement groups have gone "inconsistent" and have been that way for a while as well.
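The device-class split and the two pools were set up roughly like this (rule and pool names here are illustrative, not necessarily the exact ones on the cluster):
Code:
# replicated CRUSH rules pinned to each device class, host failure domain
ceph osd crush rule create-replicated replicated_ssd default host ssd
ceph osd crush rule create-replicated replicated_hdd default host hdd
# point each pool at its class-specific rule
ceph osd pool set pool_ssd crush_rule replicated_ssd
ceph osd pool set pool_hdd crush_rule replicated_hdd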
Every attempt I make to speed up recovery (by increasing the number of simultaneous backfills and so forth) has had no effect. Considering the number of drives I have and the little activity on this cluster, it's quite alarming to me how slow this recovery is. I feel like I am missing something obvious.
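By "increasing simultaneous" I mean the usual recovery/backfill knobs, set along these lines (the values varied between attempts, so treat this as a sketch rather than my exact settings):
Code:
# allow more concurrent backfills and recovery ops per OSD
ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_max_active 8
# remove the throttling sleep on the HDD OSDs
ceph config set osd osd_recovery_sleep_hdd 0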
At first the recovery was quick (hundreds of megabytes/sec, hundreds of objects/sec), but it has slowly come down to a crawl.
Help!
Code:
  cluster:
    id:     6eddcc19-bd51-45da-bbaa-49e9fcaddc85
    health: HEALTH_ERR
            8 scrub errors
            Possible data damage: 2 pgs inconsistent
            4275 pgs not deep-scrubbed in time
            3306 pgs not scrubbed in time

  services:
    mon: 5 daemons, quorum ceph1-hyp,ceph7-hyp,ceph9-hyp,ceph3-hyp,ceph5-hyp (age 3w)
    mgr: ceph6-hyp(active, since 3w), standbys: ceph2-hyp, ceph4-hyp
    mds: 1/1 daemons up, 3 standby
    osd: 158 osds: 158 up (since 23h), 158 in (since 23h); 204 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 4409 pgs
    objects: 44.20M objects, 167 TiB
    usage:   516 TiB used, 527 TiB / 1.0 PiB avail
    pgs:     7353441/132602229 objects misplaced (5.545%)
             4088 active+clean
             201  active+remapped+backfill_wait
             80   active+clean+scrubbing
             36   active+clean+scrubbing+deep
             2    active+remapped+backfilling
             1    active+clean+inconsistent
             1    active+remapped+inconsistent+backfill_wait

  io:
    client:   72 KiB/s rd, 26 MiB/s wr, 29 op/s rd, 61 op/s wr
    recovery: 21 MiB/s, 5 objects/s