Hi,
ceph health reports:
1 MDSs report slow metadata IOs
1 MDSs report slow requests
This is the complete output of ceph -s:
root@ld3955:~# ceph -s
  cluster:
    id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
    health: HEALTH_ERR
            1 MDSs report slow metadata IOs
            1 MDSs report slow requests
            72 nearfull osd(s)
            1 pool(s) nearfull
            Reduced data availability: 33 pgs inactive, 32 pgs peering
            Degraded data redundancy: 123285/153918525 objects degraded (0.080%), 27 pgs degraded, 27 pgs undersized
            Degraded data redundancy (low space): 116 pgs backfill_toofull
            3 pools have too many placement groups
            54 slow requests are blocked > 32 sec
            179 stuck requests are blocked > 4096 sec

  services:
    mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 21h)
    mgr: ld5507(active, since 21h), standbys: ld5506, ld5505
    mds: pve_cephfs:1 {0=ld3955=up:active} 1 up:standby
    osd: 368 osds: 368 up, 368 in; 140 remapped pgs

  data:
    pools:   6 pools, 8872 pgs
    objects: 51.31M objects, 196 TiB
    usage:   591 TiB used, 561 TiB / 1.1 PiB avail
    pgs:     0.372% pgs not active
             123285/153918525 objects degraded (0.080%)
             621911/153918525 objects misplaced (0.404%)
             8714 active+clean
             90   active+remapped+backfill_toofull
             26   active+undersized+degraded+remapped+backfill_toofull
             16   peering
             16   remapped+peering
             7    active+remapped+backfill_wait
             1    activating
             1    active+recovery_wait+degraded
             1    active+recovery_wait+undersized+remapped
In the log I find these relevant entries:
2019-09-24 13:24:37.073695 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18618.873983 secs
2019-09-24 13:24:42.073757 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18623.874055 secs
2019-09-24 13:24:47.073852 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18628.874149 secs
2019-09-24 13:24:52.073941 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18633.874237 secs
2019-09-24 13:24:57.074073 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18638.874354 secs
2019-09-24 13:25:02.074118 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18643.874415 secs
CephFS resides on a pool "hdd" backed by dedicated HDDs (4x 17 1.6TB).
This pool is also used for RBDs.
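For reference, the pool layout can be confirmed with:
ceph fs ls
ceph osd pool ls detail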
The output of ceph daemon mds.ld3955 objecter_requests shows that only 5 OSDs are affected:
8, 9, 38, 75, 187
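The OSD ids should be extractable from that JSON with something like this (a sketch only; the "osd" field name is an assumption and may differ between Ceph releases, and it needs jq on the MDS host):
ceph daemon mds.ld3955 objecter_requests | jq -r '.ops[].osd' | sort -nu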
When I compare this to the output of ceph health detail, these OSDs are all listed with REQUEST_SLOW or REQUEST_STUCK:
REQUEST_SLOW 85 slow requests are blocked > 32 sec
    33 ops are blocked > 2097.15 sec
    25 ops are blocked > 1048.58 sec
    12 ops are blocked > 524.288 sec
    1 ops are blocked > 262.144 sec
    4 ops are blocked > 131.072 sec
    10 ops are blocked > 65.536 sec
    osd.68 has blocked requests > 262.144 sec
    osds 8,9 have blocked requests > 1048.58 sec
    osd.63 has blocked requests > 2097.15 sec
REQUEST_STUCK 224 stuck requests are blocked > 4096 sec
    50 ops are blocked > 33554.4 sec
    80 ops are blocked > 16777.2 sec
    63 ops are blocked > 8388.61 sec
    31 ops are blocked > 4194.3 sec
    osds 75,187 have stuck requests > 8388.61 sec
    osd.38 has stuck requests > 33554.4 sec
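I assume the blocked ops on an individual OSD could then be inspected via its admin socket on the node hosting it, e.g. for osd.38 (ceph osd find locates the host; the daemon commands must be run on that host):
ceph osd find 38
ceph daemon osd.38 dump_ops_in_flight
ceph daemon osd.38 dump_historic_ops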
Question:
How can I identify the 2 slow requests?
And how can I kill these requests?
Regards
Thomas