Greetings. I have an 11-node PVE cluster with Ceph running on all nodes. 4 of the nodes each have 22 x 1.92 TB SSDs, and 7 of the nodes each have 10 HDDs varying in size from 12 to 16 TB. The OSDs are of course split into two device classes (ssd, hdd), and there is a pool on each, with size/min_size 3/2 (the default). A few weeks ago we lost a couple of HDDs, which I then replaced. The issue we are having is that the rebuild has been going on for weeks. In the midst of this, two placement groups have gone "inconsistent" and have stayed that way for a while as well.
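I haven't attempted a repair on the inconsistent PGs yet; my working assumption is that the usual flow is roughly the following once backfill settles (the PG ID 2.1a below is just a placeholder):

Code:
  # show which PGs are inconsistent and which OSDs reported the scrub errors
  ceph health detail

  # list the objects that deep scrub flagged in the affected PG (placeholder PG ID)
  rados list-inconsistent-obj 2.1a --format=json-pretty

  # ask Ceph to repair the PG from the remaining healthy replicas
  ceph pg repair 2.1a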
Every attempt I make to speed up recovery (by increasing the number of simultaneous backfills and so forth) has had no effect. Considering the number of drives I have and how little client activity there is on this cluster, it's quite alarming how slow this recovery is. I feel like I am missing something obvious.
At first the recovery was quick (hundreds of MB/s, hundreds of objects/s), but it has slowly come down to a crawl.
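For reference, the tuning I've been experimenting with is roughly along these lines (the values are illustrative, and I'm assuming a recent Ceph release where the mClock scheduler caps recovery unless told otherwise):

Code:
  # allow more concurrent backfill/recovery work per OSD
  ceph config set osd osd_max_backfills 4
  ceph config set osd osd_recovery_max_active 8

  # with the mClock scheduler the two options above are ignored unless overridden...
  ceph config set osd osd_mclock_override_recovery_settings true
  # ...or the profile can simply be switched to favour recovery over client I/O
  ceph config set osd osd_mclock_profile high_recovery_ops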
Help!
Code:
  cluster:
    id:     6eddcc19-bd51-45da-bbaa-49e9fcaddc85
    health: HEALTH_ERR
            8 scrub errors
            Possible data damage: 2 pgs inconsistent
            4275 pgs not deep-scrubbed in time
            3306 pgs not scrubbed in time
  services:
    mon: 5 daemons, quorum ceph1-hyp,ceph7-hyp,ceph9-hyp,ceph3-hyp,ceph5-hyp (age 3w)
    mgr: ceph6-hyp(active, since 3w), standbys: ceph2-hyp, ceph4-hyp
    mds: 1/1 daemons up, 3 standby
    osd: 158 osds: 158 up (since 23h), 158 in (since 23h); 204 remapped pgs
  data:
    volumes: 1/1 healthy
    pools:   5 pools, 4409 pgs
    objects: 44.20M objects, 167 TiB
    usage:   516 TiB used, 527 TiB / 1.0 PiB avail
    pgs:     7353441/132602229 objects misplaced (5.545%)
             4088 active+clean
             201  active+remapped+backfill_wait
             80   active+clean+scrubbing
             36   active+clean+scrubbing+deep
             2    active+remapped+backfilling
             1    active+clean+inconsistent
             1    active+remapped+inconsistent+backfill_wait
  io:
    client:   72 KiB/s rd, 26 MiB/s wr, 29 op/s rd, 61 op/s wr
    recovery: 21 MiB/s, 5 objects/s 