Proxmox 5.2-5 + Ceph = Slow recovery after adding new node, osds and increasing pgs

thoff · Sep 17, 2018

Hey,

We are seeing slow recovery after adding another node and 10 OSDs and increasing our PG count from 512 to 1024, and we want to know if there is any way to speed the process up.

This is our environment:

node-a-01
2x 1.92TB SSD
1x 12TB HDD

node-a-02
2x 1.92TB SSD
1x 12TB HDD

node-a-03
2x 1.92TB SSD
1x 12TB HDD

node-a-04
2x 1.92TB SSD
1x 12TB HDD

node-a-05
2x 1.92TB SSD
1x 12TB HDD

node-a-06 < Newly added.
2x 1.92TB SSD < Newly added.
1x 12TB HDD < Newly added.

node-b-01
4x 1.92TB SSD < Newly added.

node-b-02
4x 1.92TB SSD < Newly added.

node-c-01
2x 3.8TB SSD
2x 1.92TB SSD < Newly added.

Code:

  cluster:
    id:     b2f1455f-5ba3-403c-a82a-659aad72638f
    health: HEALTH_ERR
            599520/11645616 objects misplaced (5.148%)
            Reduced data availability: 54 pgs inactive
            2678 slow requests are blocked > 32 sec
            89270 stuck requests are blocked > 4096 sec

  services:
    mon: 17 daemons, quorum *omitted*
    mgr: 4c-03-ceph(active), standbys: *omitted*
    osd: 30 osds: 30 up, 30 in; 76 remapped pgs

  data:
    pools:   1 pools, 1024 pgs
    objects: 3790k objects, 15065 GB
    usage:   45679 GB used, 67874 GB / 110 TB avail
    pgs:     5.273% pgs not active
             599520/11645616 objects misplaced (5.148%)
             948 active+clean
             54  activating+remapped
             16  active+remapped+backfilling
             6   active+remapped+backfill_wait

  io:
    client:   6756 B/s wr, 0 op/s rd, 1 op/s wr
    recovery: 33740 kB/s, 8 objects/s

janos · Sep 17, 2018

Hi,

You can speed up recovery, but your normal I/O will be slower while it runs. If you want faster recovery, increase the number of concurrent backfill and recovery operations per OSD:

Code:

ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
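
A minimal sketch of raising and later reverting these two settings. It assumes a Luminous-era cluster (the Ceph release that ships with Proxmox 5.2), where the defaults are believed to be osd_max_backfills=1 and osd_recovery_max_active=3; check your own values before relying on that. The helper name recovery_throttle_cmds is hypothetical, and it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Print (not run) the two injectargs commands for the given values.
# recovery_throttle_cmds is a hypothetical helper name, not a Ceph tool.
#   $1 = value for osd-max-backfills
#   $2 = value for osd-recovery-max-active
recovery_throttle_cmds() {
    backfills=$1
    active=$2
    printf "ceph tell 'osd.*' injectargs '--osd-max-backfills %s'\n" "$backfills"
    printf "ceph tell 'osd.*' injectargs '--osd-recovery-max-active %s'\n" "$active"
}
```

Running `recovery_throttle_cmds 16 4` prints the commands above, and `recovery_throttle_cmds 1 3` prints the ones that go back to the assumed defaults once recovery is done. Pipe the output to `sh` to execute it. Values set with injectargs are runtime-only, so an OSD that gets restarted comes back with whatever is in its configuration.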

thoff · Sep 17, 2018

I tried that; however, it hasn't really changed anything.

janos · Sep 17, 2018

What is the output of this command?

Code:

ceph health detail

thoff · Sep 18, 2018

Code:

HEALTH_ERR 529670/11643771 objects misplaced (4.549%); Reduced data availability: 54 pgs inactive; 2475 slow requests are blocked > 32 sec; 94620 stuck requests are blocked > 4096 sec
OBJECT_MISPLACED 529670/11643771 objects misplaced (4.549%)
PG_AVAILABILITY Reduced data availability: 54 pgs inactive
    pg 2.22a is stuck inactive for 92378.384316, current state activating+remapped, last acting [17,20,27]
    pg 2.22c is stuck inactive for 92378.398597, current state activating+remapped, last acting [7,16,18]
    pg 2.22e is stuck inactive for 92378.400341, current state activating+remapped, last acting [14,17,22]
    pg 2.233 is stuck inactive for 92378.384728, current state activating+remapped, last acting [17,22,23]
    pg 2.275 is stuck inactive for 92378.392766, current state activating+remapped, last acting [10,18,29]
    pg 2.296 is stuck inactive for 92378.344993, current state activating+remapped, last acting [18,22,27]
    pg 2.2a5 is stuck inactive for 92378.340397, current state activating+remapped, last acting [18,22,26]
    pg 2.2a7 is stuck inactive for 92379.397013, current state activating+remapped, last acting [18,15,22]
    pg 2.2b5 is stuck inactive for 92378.337107, current state activating+remapped, last acting [18,19,27]
    pg 2.2ba is stuck inactive for 92378.339846, current state activating+remapped, last acting [18,21,27]
    pg 2.2c1 is stuck inactive for 92378.344479, current state activating+remapped, last acting [18,21,23]
    pg 2.2c7 is stuck inactive for 92378.383855, current state activating+remapped, last acting [17,24,19]
    pg 2.2d0 is stuck inactive for 92378.377305, current state activating+remapped, last acting [17,19,27]
    pg 2.2db is stuck inactive for 92378.399461, current state activating+remapped, last acting [7,11,22]
    pg 2.2df is stuck inactive for 92379.408335, current state activating+remapped, last acting [16,22,27]
    pg 2.2eb is stuck inactive for 92378.400998, current state activating+remapped, last acting [9,25,29]
    pg 2.2f2 is stuck inactive for 92378.390344, current state activating+remapped, last acting [13,16,27]
    pg 2.2f5 is stuck inactive for 92378.404985, current state activating+remapped, last acting [3,12,19]
    pg 2.300 is stuck inactive for 92378.376228, current state activating+remapped, last acting [17,19,28]
    pg 2.311 is stuck inactive for 92378.402479, current state activating+remapped, last acting [3,23,26]
    pg 2.314 is stuck inactive for 92378.402519, current state activating+remapped, last acting [3,20,15]
    pg 2.317 is stuck inactive for 92378.402766, current state activating+remapped, last acting [14,22,27]
    pg 2.31b is stuck inactive for 92379.407048, current state activating+remapped, last acting [16,11,19]
    pg 2.31c is stuck inactive for 92378.345336, current state activating+remapped, last acting [18,27,22]
    pg 2.320 is stuck inactive for 92379.392176, current state activating+remapped, last acting [18,12,16]
    pg 2.32f is stuck inactive for 92379.394264, current state activating+remapped, last acting [18,22,27]
    pg 2.335 is stuck inactive for 92378.373388, current state activating+remapped, last acting [17,21,28]
    pg 2.339 is stuck inactive for 92379.405878, current state activating+remapped, last acting [16,19,21]
    pg 2.344 is stuck inactive for 92378.321613, current state activating+remapped, last acting [19,24,27]
    pg 2.346 is stuck inactive for 92378.380092, current state activating+remapped, last acting [16,18,19]
    pg 2.349 is stuck inactive for 92378.401023, current state activating+remapped, last acting [9,22,27]
    pg 2.34c is stuck inactive for 92378.385280, current state activating+remapped, last acting [10,19,16]
    pg 2.350 is stuck inactive for 92378.401728, current state activating+remapped, last acting [4,27,29]
    pg 2.36d is stuck inactive for 92378.335458, current state activating+remapped, last acting [18,22,27]
    pg 2.374 is stuck inactive for 92379.411093, current state activating+remapped, last acting [27,18,21]
    pg 2.380 is stuck inactive for 92378.401611, current state activating+remapped, last acting [9,16,18]
    pg 2.38e is stuck inactive for 92378.403326, current state activating+remapped, last acting [3,18,22]
    pg 2.396 is stuck inactive for 92378.384510, current state activating+remapped, last acting [10,19,28]
    pg 2.3a6 is stuck inactive for 92378.385963, current state activating+remapped, last acting [10,16,21]
    pg 2.3a7 is stuck inactive for 92378.379113, current state activating+remapped, last acting [17,19,20]
    pg 2.3ac is stuck inactive for 92378.343452, current state activating+remapped, last acting [18,19,27]
    pg 2.3b5 is stuck inactive for 92379.386907, current state activating+remapped, last acting [18,12,29]
    pg 2.3b6 is stuck inactive for 92379.400957, current state activating+remapped, last acting [18,22,13]
    pg 2.3b8 is stuck inactive for 92378.401084, current state activating+remapped, last acting [7,16,19]
    pg 2.3b9 is stuck inactive for 92379.409093, current state activating+remapped, last acting [16,13,29]
    pg 2.3c8 is stuck inactive for 92378.384855, current state activating+remapped, last acting [17,19,22]
    pg 2.3cf is stuck inactive for 92378.348326, current state activating+remapped, last acting [18,22,24]
    pg 2.3d6 is stuck inactive for 92378.394656, current state activating+remapped, last acting [10,19,20]
    pg 2.3d8 is stuck inactive for 92378.407224, current state activating+remapped, last acting [8,10,19]
    pg 2.3e1 is stuck inactive for 92378.345559, current state activating+remapped, last acting [18,23,26]
    pg 2.3ea is stuck inactive for 92378.400909, current state activating+remapped, last acting [15,18,22]
REQUEST_SLOW 2475 slow requests are blocked > 32 sec
    1254 ops are blocked > 2097.15 sec
    800 ops are blocked > 1048.58 sec
    222 ops are blocked > 524.288 sec
    107 ops are blocked > 262.144 sec
    53 ops are blocked > 131.072 sec
    27 ops are blocked > 65.536 sec
    12 ops are blocked > 32.768 sec
REQUEST_STUCK 94620 stuck requests are blocked > 4096 sec
    22839 ops are blocked > 134218 sec
    34848 ops are blocked > 67108.9 sec
    19183 ops are blocked > 33554.4 sec
    10044 ops are blocked > 16777.2 sec
    5117 ops are blocked > 8388.61 sec
    2589 ops are blocked > 4194.3 sec
    osds 3,4,7,8,9,10,13,14,15,16,17,18,19,27 have stuck requests > 134218 sec
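
With this many inactive PGs it can help to see which OSDs show up most often in their acting sets. A rough sketch, assuming the "is stuck inactive ... last acting [a,b,c]" line format shown above; tally_acting_osds is a hypothetical helper name, and a high count is only a hint about where to look, not proof that the OSD is at fault.

```shell
#!/bin/sh
# Read `ceph health detail` text on stdin and count how many stuck-inactive
# PGs list each OSD in their acting set.
# Output: "<count> osd.<id>", highest count first, ties ordered by OSD id.
# tally_acting_osds is a hypothetical helper name.
tally_acting_osds() {
    grep 'is stuck inactive' |
        sed -n 's/.*last acting \[\([0-9,]*\)\].*/\1/p' |
        tr ',' '\n' |
        sort -n | uniq -c | sort -k1,1nr -k2,2n |
        awk '{ print $1 " osd." $2 }'
}
```

Usage: `ceph health detail | tally_acting_osds | head`.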

janos · Sep 18, 2018

Please stop all VMs and try restarting all OSDs. As a first step, this will clear the stuck requests.

thoff · Sep 18, 2018

janos said:
Please stop all VMs and try restarting all OSDs. As a first step, this will clear the stuck requests.

We ended up shutting down as many VMs as we could, then identified the OSDs that had the most stuck requests by running

Code:

ceph health detail

and restarted the OSD daemons one by one, which resolved the issue.
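
The approach described in this post can be sketched roughly as follows. It assumes the "osds ... have stuck requests" line format from the health output earlier in the thread and the standard ceph-osd@<id> systemd unit name. Both helper names (stuck_osd_ids, restart_cmds) are hypothetical, and the restart commands are only printed, not executed.

```shell
#!/bin/sh
# Read `ceph health detail` text on stdin and print the ids of OSDs named on
# "osds ... have stuck requests" lines, one id per line, de-duplicated.
# stuck_osd_ids is a hypothetical helper name.
stuck_osd_ids() {
    sed -n 's/^ *osds \([0-9,]*\) have stuck requests.*/\1/p' |
        tr ',' '\n' | sort -n -u
}

# Read OSD ids on stdin and print (not run) one restart command per id.
# Each command has to be run on the node that hosts that OSD.
# restart_cmds is a hypothetical helper name.
restart_cmds() {
    while read -r id; do
        printf 'systemctl restart ceph-osd@%s\n' "$id"
    done
}
```

Usage: `ceph health detail | stuck_osd_ids | restart_cmds`. Run the printed commands one at a time, and check `ceph -s` after each so the OSD is back up and in before the next one is restarted.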
