Hello List,
We upgraded from Ceph 14.2 to 15.2.15, still on Debian 10.
The upgrade went quite smoothly and ended with HEALTH_OK.
About two hours later some IO problems started and VMs became unresponsive, right around the time the snapshot cron jobs run...
It looks like the OSDs (SSDs) are really busy now:
Code:
root@cluster5-node01:~# ceph osd perf
osd  commit_latency(ms)  apply_latency(ms)
  4                 341                341
 11                   0                  0
 15                 314                314
 10                   0                  0
 16                1293               1293
  9                1373               1373
 18                 568                568
  1                 986                986
 20                   0                  0
  3                 377                377
 17                2034               2034
  6                 233                233
  5                1960               1960
 14                1116               1116
 13                1336               1336
 12                 657                657
  8                1364               1364
  2                  96                 96
Those high latencies cause OSDs to be marked down.
I don't think the SSDs all broke all of a sudden.
There is also some snaptrim going on.
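The snaptrim throttle settings should still be at their defaults; I would read the current values from one OSD's admin socket like this (osd.1 is just an example, run on the node it lives on):
Code:
root@cluster5-node01:~# ceph daemon osd.1 config get osd_snap_trim_sleep
root@cluster5-node01:~# ceph daemon osd.1 config get osd_pg_max_concurrent_snap_trims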
"iostat -dx 3" looks good. 0-1 %util.
Not much reading/writing is happening but top shows 100% CPU wait for the OSD Process.
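For reference, this is roughly how I'm watching the disks and the OSD processes (the device name is only an example for one of the SSDs):
Code:
root@cluster5-node01:~# iostat -dx 3 /dev/sdb
root@cluster5-node01:~# pidstat -u -p $(pgrep -d, ceph-osd) 3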
Any idea what's going on here and how I can fix it?
Here is my ceph status:
Code:
root@cluster5-node01:~# ceph -s
  cluster:
    id:     e1153ea5-bb07-4548-83a9-edd8bae3eeec
    health: HEALTH_WARN
            noout flag(s) set
            1 osds down
            4 nearfull osd(s)
            Degraded data redundancy: 833161/14948934 objects degraded (5.573%), 296 pgs degraded, 193 pgs undersized
            1 pool(s) do not have an application enabled
            3 pool(s) nearfull
            9 daemons have recently crashed
            36 slow ops, oldest one blocked for 225 sec, daemons [osd.1,osd.12,osd.13,osd.14,osd.15,osd.17,osd.9] have slow ops.

  services:
    mon: 3 daemons, quorum cluster5-node01,cluster5-node02,cluster5-node03 (age 112m)
    mgr: cluster5-node03(active, since 66m), standbys: cluster5-node02, cluster5-node01
    osd: 18 osds: 17 up (since 2m), 18 in (since 9M); 39 remapped pgs
         flags noout

  data:
    pools:   3 pools, 1143 pgs
    objects: 4.98M objects, 17 TiB
    usage:   53 TiB used, 10 TiB / 63 TiB avail
    pgs:     833161/14948934 objects degraded (5.573%)
             172492/14948934 objects misplaced (1.154%)
             400 active+clean+snaptrim_wait
             231 active+clean
             177 active+undersized+degraded
             164 active+clean+snaptrim
             103 active+recovery_wait+degraded
              32 active+remapped+backfill_wait
               6 active+undersized+degraded+remapped+backfill_wait
               5 active+recovery_wait+undersized+degraded
               5 active+clean+snaptrim+laggy
               4 active+recovery_wait
               4 active
               3 active+recovering+degraded
               2 active+undersized
               2 active+recovering+undersized+degraded
               2 active+clean+laggy
               1 active+recovering
               1 active+undersized+remapped+backfill_wait
               1 active+clean+snaptrim_wait+laggy

  io:
    client:   7.9 KiB/s rd, 5.0 KiB/s wr, 1 op/s rd, 1 op/s wr
    recovery: 0 B/s, 0 objects/s
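If it helps, I can also pull details on the slow ops from the admin socket of one of the affected OSDs (osd.17 picked as an example since it has the worst latency above) and list the recent crashes:
Code:
root@cluster5-node01:~# ceph daemon osd.17 dump_historic_ops
root@cluster5-node01:~# ceph crash ls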