proxmox ceph slow ops, oldest one blocked for 2531 sec

berkaybulut

Hello,
Yesterday I updated all the hosts in my Proxmox cluster. After restarting the OSDs one by one for the new version, client I/O in my Ceph cluster almost stopped. There are no problems on the network side or with disk health, and restarting all Ceph services and the hosts did not solve the problem.
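
To see which operations are actually stuck, the admin socket of an affected OSD can be queried on its host; osd.17 below is just an example ID taken from the health output further down:

ceph daemon osd.17 dump_ops_in_flight    # operations currently blocked on this OSD
ceph daemon osd.17 dump_historic_ops     # recent operations that took unusually long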

proxmox-ve: 8.4.0 (running kernel: 6.8.12-10-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.8.12-10-pve-signed: 6.8.12-10
proxmox-kernel-6.8: 6.8.12-10
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-7-pve-signed: 6.8.12-7
proxmox-kernel-6.8.12-6-pve-signed: 6.8.12-6
proxmox-kernel-6.8.12-5-pve-signed: 6.8.12-5
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.8.12-3-pve-signed: 6.8.12-3
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.12-1-pve-signed: 6.8.12-1
proxmox-kernel-6.8.8-4-pve-signed: 6.8.8-4
proxmox-kernel-6.8.8-3-pve-signed: 6.8.8-3
proxmox-kernel-6.8.8-2-pve-signed: 6.8.8-2
proxmox-kernel-6.8.8-1-pve-signed: 6.8.8-1
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-6-pve-signed: 6.5.13-6
proxmox-kernel-6.5: 6.5.13-6
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph: 18.2.6-pve1
ceph-fuse: 18.2.6-pve1
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.1-1
proxmox-backup-file-restore: 3.4.1-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.3.10
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: 4.2025.02-3
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.1
pve-firmware: 3.15-3
pve-ha-manager: 4.0.7
pve-i18n: 3.4.2
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.0.10.0/24
fsid = 9319dafb-3408-46cb-9b09-b3d381114545
mon_allow_pool_delete = true
mon_host = X.X.X.1 X.X.X.2 X.X.X.3
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 1
osd_pool_default_size = 2
public_network = X.X.X.0/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.cmt6770]
host = cmt6770
mds_standby_for_name = pve

[mds.dc2943]
host = dc2943
mds_standby_for_name = pve

[mds.dc3658]
host = dc3658
mds_standby_for_name = pve

[mon.cmt5923]
public_addr = X.X.X.3

[mon.cmt6770]
public_addr = X.X.X.1

[mon.dc3658]
public_addr = X.X.X.2
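
Since the config only sets pool defaults (osd_pool_default_size = 2, osd_pool_default_min_size = 1), the replication settings actually in effect per pool are worth double-checking, for example with:

ceph osd pool ls detail              # size, min_size and crush_rule of every pool
ceph osd pool get <poolname> size    # <poolname> is a placeholder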

root@cmt7773:~# ceph -s
cluster:
id: 9319dafb-3408-46cb-9b09-b3d381114545
health: HEALTH_WARN
1 MDSs report slow metadata IOs
1 MDSs report slow requests
1 nearfull osd(s)
Reduced data availability: 12 pgs incomplete
Degraded data redundancy: 65830/6950657 objects degraded (0.947%), 25 pgs degraded, 27 pgs undersized
11 pool(s) nearfull
812 slow ops, oldest one blocked for 2717 sec, daemons [osd.17,osd.18,osd.29,osd.3,osd.9] have slow ops.

services:
mon: 3 daemons, quorum cmt6770,dc3658,cmt5923 (age 15m)
mgr: cmt6461(active, since 7m), standbys: dc3658, cmt6770
mds: 1/1 daemons up, 1 standby
osd: 25 osds: 25 up (since 15m), 25 in (since 59m); 479 remapped pgs

data:
volumes: 1/1 healthy
pools: 12 pools, 1589 pgs
objects: 3.48M objects, 13 TiB
usage: 22 TiB used, 12 TiB / 34 TiB avail
pgs: 0.755% pgs not active
65830/6950657 objects degraded (0.947%)
1143278/6950657 objects misplaced (16.448%)
1097 active+clean
451 active+remapped+backfill_wait
24 active+undersized+degraded+remapped+backfill_wait
12 incomplete
1 active+undersized+remapped+backfill_wait
1 active+clean+scrubbing
1 active+undersized+degraded+remapped+backfilling
1 active+undersized+remapped+backfilling
1 active+remapped+backfilling

io:
client: 74 MiB/s rd, 2.4 MiB/s wr, 1.33k op/s rd, 344 op/s wr
recovery: 276 MiB/s, 73 objects/s

root@cmt7773:~#
root@cmt7773:~# ceph health
HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 1 nearfull osd(s); Reduced data availability: 12 pgs incomplete; Degraded data redundancy: 65550/6950657 objects degraded (0.943%), 25 pgs degraded, 27 pgs undersized; 11 pool(s) nearfull; 812 slow ops, oldest one blocked for 2737 sec, daemons [osd.17,osd.18,osd.29,osd.3,osd.9] have slow ops.
root@cmt7773:~# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 1 nearfull osd(s); Reduced data availability: 12 pgs incomplete; Degraded data redundancy: 65550/6950657 objects degraded (0.943%), 25 pgs degraded, 27 pgs undersized; 11 pool(s) nearfull; 812 slow ops, oldest one blocked for 2737 sec, daemons [osd.17,osd.18,osd.29,osd.3,osd.9] have slow ops.
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
mds.dc3658(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 513 secs
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
mds.dc3658(mds.0): 102 slow requests are blocked > 30 secs
[WRN] OSD_NEARFULL: 1 nearfull osd(s)
osd.24 is near full
[WRN] PG_AVAILABILITY: Reduced data availability: 12 pgs incomplete
pg 7.25e is incomplete, acting [17,8]
pg 7.26a is incomplete, acting [3,4]
pg 7.2fb is incomplete, acting [17,8]
pg 7.36e is incomplete, acting [9,4]
pg 7.3ab is incomplete, acting [9,0]
pg 13.c is incomplete, acting [8,1]
pg 19.16 is incomplete, acting [9,4]
pg 19.27 is incomplete, acting [9,7]
pg 19.56 is incomplete, acting [9,4]
pg 19.5f is incomplete, acting [18,8]
pg 19.67 is incomplete, acting [9,7]
pg 20.17 is incomplete, acting [29,9]

[WRN] SLOW_OPS: 812 slow ops, oldest one blocked for 2737 sec, daemons [osd.17,osd.18,osd.29,osd.3,osd.9] have slow ops.
root@cmt7773:~#

When I stop the OSDs that report slow ops, other OSDs start reporting slow ops instead.
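
To check whether the slow ops keep pointing at the same placement groups, the PGs hosted by one of the affected OSDs can be listed, for example:

ceph pg ls-by-osd 17                 # all PGs whose acting set includes osd.17
ceph pg ls-by-osd 17 incomplete      # only the incomplete PGs on osd.17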

root@cmt7773:~# ceph pg stat
1589 pgs: 1 active+remapped+backfilling, 1 active+undersized+remapped+backfilling, 1 active+undersized+degraded+remapped+backfilling, 1 active+clean+scrubbing, 1 active+undersized+remapped+backfill_wait, 12 incomplete, 451 active+remapped+backfill_wait, 1096 active+clean, 1 active+clean+scrubbing+deep, 24 active+undersized+degraded+remapped+backfill_wait; 13 TiB data, 22 TiB used, 12 TiB / 34 TiB avail; 65 MiB/s rd, 2.7 MiB/s wr, 1.48k op/s; 65017/6950659 objects degraded (0.935%); 1140661/6950659 objects misplaced (16.411%); 173 MiB/s, 0 keys/s, 45 objects/s recovering
root@cmt7773:~#

root@cmt7773:~# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
3 ssd 1.74660 1.00000 1.7 TiB 1.2 TiB 1.2 TiB 294 KiB 2.4 GiB 509 GiB 71.53 1.10 160 up
8 ssd 1.81940 1.00000 1.8 TiB 179 GiB 178 GiB 16 KiB 855 MiB 1.6 TiB 9.60 0.15 26 up
9 ssd 1.81940 1.00000 1.8 TiB 260 GiB 259 GiB 21 KiB 1010 MiB 1.6 TiB 13.94 0.21 42 up
24 nvme 0.90970 1.00000 932 GiB 794 GiB 793 GiB 143 KiB 1.3 GiB 137 GiB 85.28 1.31 122 up
2 ssd 1.74660 0.95001 1.7 TiB 1.4 TiB 1.4 TiB 258 KiB 2.4 GiB 319 GiB 82.15 1.26 174 up
26 ssd 1.74660 0.95001 1.7 TiB 1.2 TiB 1.2 TiB 197 KiB 2.1 GiB 516 GiB 71.17 1.09 174 up
4 nvme 1.86299 1.00000 1.9 TiB 1.5 TiB 1.5 TiB 328 KiB 2.5 GiB 354 GiB 81.42 1.25 250 up
0 ssd 0.87329 1.00000 894 GiB 316 GiB 315 GiB 35 KiB 796 MiB 578 GiB 35.33 0.54 39 up
1 ssd 0.87329 1.00000 894 GiB 192 GiB 192 GiB 64 KiB 459 MiB 702 GiB 21.53 0.33 24 up
17 ssd 0.87329 0.85675 894 GiB 685 GiB 684 GiB 165 KiB 1.2 GiB 209 GiB 76.65 1.18 88 up
18 ssd 0.87329 0.90002 894 GiB 669 GiB 668 GiB 143 KiB 1.1 GiB 225 GiB 74.81 1.15 88 up
5 nvme 1.81940 1.00000 1.8 TiB 1.2 TiB 1.2 TiB 254 KiB 2.2 GiB 587 GiB 68.50 1.05 208 up
19 nvme 1.81940 0.95001 1.8 TiB 1.4 TiB 1.4 TiB 298 KiB 2.3 GiB 435 GiB 76.67 1.18 215 up
7 ssd 1.74660 1.00000 1.7 TiB 1.3 TiB 1.3 TiB 193 KiB 2.1 GiB 482 GiB 73.07 1.12 166 up
29 ssd 1.86299 1.00000 1.9 TiB 1.3 TiB 1.3 TiB 332 KiB 2.1 GiB 596 GiB 68.77 1.06 181 up
22 nvme 0.90970 1.00000 932 GiB 661 GiB 660 GiB 178 KiB 1.1 GiB 271 GiB 70.92 1.09 101 up
23 nvme 0.90970 1.00000 932 GiB 622 GiB 621 GiB 113 KiB 1.1 GiB 310 GiB 66.74 1.02 104 up
6 ssd 1.74660 1.00000 1.7 TiB 1.3 TiB 1.3 TiB 242 KiB 2.2 GiB 418 GiB 76.60 1.18 170 up
12 ssd 0.87329 1.00000 894 GiB 729 GiB 728 GiB 101 KiB 1.2 GiB 166 GiB 81.49 1.25 89 up
13 ssd 0.87329 1.00000 894 GiB 570 GiB 569 GiB 120 KiB 1.0 GiB 324 GiB 63.78 0.98 79 up
14 ssd 0.87329 1.00000 894 GiB 587 GiB 586 GiB 124 KiB 1.0 GiB 307 GiB 65.63 1.01 80 up
15 ssd 0.87329 1.00000 894 GiB 542 GiB 541 GiB 137 KiB 955 MiB 352 GiB 60.60 0.93 75 up
16 ssd 0.87329 1.00000 894 GiB 670 GiB 669 GiB 97 KiB 1.1 GiB 224 GiB 74.94 1.15 83 up
20 nvme 1.86299 0.95001 1.9 TiB 1.4 TiB 1.4 TiB 274 KiB 2.1 GiB 495 GiB 74.04 1.14 227 up
21 nvme 1.86299 0.90002 1.9 TiB 1.5 TiB 1.5 TiB 369 KiB 2.4 GiB 340 GiB 82.19 1.26 236 up
TOTAL 34 TiB 22 TiB 22 TiB 4.4 MiB 39 GiB 12 TiB 65.17
MIN/MAX VAR: 0.15/1.31 STDDEV: 21.03
root@cmt7773:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 34.05125 root default
-3 5.38539 host cmt5923
3 ssd 1.74660 osd.3 up 1.00000 1.00000
8 ssd 1.81940 osd.8 up 1.00000 1.00000
9 ssd 1.81940 osd.9 up 1.00000 1.00000
-15 4.40289 host cmt6461
24 nvme 0.90970 osd.24 up 1.00000 1.00000
2 ssd 1.74660 osd.2 up 0.95001 1.00000
26 ssd 1.74660 osd.26 up 0.95001 1.00000
-5 5.35616 host cmt6770
4 nvme 1.86299 osd.4 up 1.00000 1.00000
0 ssd 0.87329 osd.0 up 1.00000 1.00000
1 ssd 0.87329 osd.1 up 1.00000 1.00000
17 ssd 0.87329 osd.17 up 0.85675 1.00000
18 ssd 0.87329 osd.18 up 0.90002 1.00000
-9 7.24838 host cmt7773
5 nvme 1.81940 osd.5 up 1.00000 1.00000
19 nvme 1.81940 osd.19 up 0.95001 1.00000
7 ssd 1.74660 osd.7 up 1.00000 1.00000
29 ssd 1.86299 osd.29 up 1.00000 1.00000
-13 7.93245 host dc2943
22 nvme 0.90970 osd.22 up 1.00000 1.00000
23 nvme 0.90970 osd.23 up 1.00000 1.00000
6 ssd 1.74660 osd.6 up 1.00000 1.00000
12 ssd 0.87329 osd.12 up 1.00000 1.00000
13 ssd 0.87329 osd.13 up 1.00000 1.00000
14 ssd 0.87329 osd.14 up 1.00000 1.00000
15 ssd 0.87329 osd.15 up 1.00000 1.00000
16 ssd 0.87329 osd.16 up 1.00000 1.00000
-11 3.72598 host dc3658
20 nvme 1.86299 osd.20 up 0.95001 1.00000
21 nvme 1.86299 osd.21 up 0.90002 1.00000
root@cmt7773:~#
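
osd.24 is already at about 85% use and is flagged nearfull. The current cluster thresholds can be checked, and if necessary raised slightly as a temporary measure, with something like the following (the ratio values are only examples):

ceph osd dump | grep -E 'full_ratio|backfillfull_ratio|nearfull_ratio'
ceph osd set-nearfull-ratio 0.90         # example value, temporary relief only
ceph osd set-backfillfull-ratio 0.92     # example value, temporary relief only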

Currently, client I/O has almost completely stopped. I need help.
 
Here is the crush map as well:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class nvme
device 5 osd.5 class nvme
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd
device 16 osd.16 class ssd
device 17 osd.17 class ssd
device 18 osd.18 class ssd
device 19 osd.19 class nvme
device 20 osd.20 class nvme
device 21 osd.21 class nvme
device 22 osd.22 class nvme
device 23 osd.23 class nvme
device 24 osd.24 class nvme
device 26 osd.26 class ssd
device 29 osd.29 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host cmt5923 {
id -3 # do not change unnecessarily
id -4 class ssd # do not change unnecessarily
id -17 class nvme # do not change unnecessarily
# weight 5.38539
alg straw2
hash 0 # rjenkins1
item osd.3 weight 1.74660
item osd.8 weight 1.81940
item osd.9 weight 1.81940
}
host cmt6770 {
id -5 # do not change unnecessarily
id -6 class ssd # do not change unnecessarily
id -18 class nvme # do not change unnecessarily
# weight 5.35616
alg straw2
hash 0 # rjenkins1
item osd.17 weight 0.87329
item osd.18 weight 0.87329
item osd.4 weight 1.86299
item osd.1 weight 0.87329
item osd.0 weight 0.87329
}
host cmt7773 {
id -9 # do not change unnecessarily
id -10 class ssd # do not change unnecessarily
id -20 class nvme # do not change unnecessarily
# weight 7.24838
alg straw2
hash 0 # rjenkins1
item osd.5 weight 1.81940
item osd.19 weight 1.81940
item osd.7 weight 1.74660
item osd.29 weight 1.86299
}
host dc3658 {
id -11 # do not change unnecessarily
id -12 class ssd # do not change unnecessarily
id -21 class nvme # do not change unnecessarily
# weight 3.72598
alg straw2
hash 0 # rjenkins1
item osd.20 weight 1.86299
item osd.21 weight 1.86299
}
host dc2943 {
id -13 # do not change unnecessarily
id -14 class ssd # do not change unnecessarily
id -22 class nvme # do not change unnecessarily
# weight 7.93245
alg straw2
hash 0 # rjenkins1
item osd.6 weight 1.74660
item osd.12 weight 0.87329
item osd.13 weight 0.87329
item osd.14 weight 0.87329
item osd.15 weight 0.87329
item osd.16 weight 0.87329
item osd.22 weight 0.90970
item osd.23 weight 0.90970
}
host cmt6461 {
id -15 # do not change unnecessarily
id -16 class ssd # do not change unnecessarily
id -23 class nvme # do not change unnecessarily
# weight 4.40289
alg straw2
hash 0 # rjenkins1
item osd.24 weight 0.90970
item osd.26 weight 1.74660
item osd.2 weight 1.74660
}
root default {
id -1 # do not change unnecessarily
id -2 class ssd # do not change unnecessarily
id -24 class nvme # do not change unnecessarily
# weight 34.05125
alg straw2
hash 0 # rjenkins1
item cmt5923 weight 5.38539
item cmt6770 weight 5.35616
item cmt7773 weight 7.24838
item dc3658 weight 3.72598
item dc2943 weight 7.93245
item cmt6461 weight 4.40289
}

# rules
rule replicated_rule {
id 0
type replicated
step take default
step chooseleaf firstn 0 type host
step emit
}
rule nvme-only {
id 1
type replicated
step take default class nvme
step chooseleaf firstn 0 type host
step emit
}
rule ssd-only {
id 2
type replicated
step take default class ssd
step chooseleaf firstn 0 type host
step emit
}

# end crush map
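
To sanity-check a rule against this map, the decompiled text can be recompiled and test-mapped with crushtool; the file names are just examples:

crushtool -c crushmap.txt -o crushmap.bin
crushtool -i crushmap.bin --test --rule 2 --num-rep 2 --show-mappings    # sample mappings for the ssd-only rule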
I suspect the incomplete PGs listed below. Even when the rest of the cluster becomes clean, these incomplete PGs do not recover, so I am considering recreating them (see the pg query sketch after the list).

Reduced data availability: 12 pgs incomplete
pg 7.25e is incomplete, acting [17,8]
pg 7.26a is incomplete, acting [3,4]
pg 7.2fb is incomplete, acting [17,8]
pg 7.36e is incomplete, acting [9,4]
pg 7.3ab is incomplete, acting [9,0]
pg 13.c is incomplete, acting [8,1]
pg 19.16 is incomplete, acting [9,4]
pg 19.27 is incomplete, acting [9,7]
pg 19.56 is incomplete, acting [9,4]
pg 19.5f is incomplete, acting [18,8]
pg 19.67 is incomplete, acting [9,7]
pg 20.17 is incomplete, acting [29,9]
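
To see why a specific PG stays incomplete, it can be queried directly; 7.26a below is taken from the list above:

ceph pg 7.26a query
# in the JSON output, recovery_state (in particular peering_blocked_by and
# down_osds_we_would_probe) usually shows what the PG is still waiting for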
 
Journal output from host cmt5923 after the OSD restarts:
May 10 15:38:32 cmt5923 systemd[1]: ceph-osd@9.service: Failed with result 'signal'.
May 10 15:38:37 cmt5923 ceph-osd[2383902]: 2025-05-10T15:38:37.579+0300 764caf13f880 -1 osd.8 100504 log_to_monitors true
May 10 15:38:37 cmt5923 ceph-osd[2383902]: 2025-05-10T15:38:37.791+0300 764c8c64b6c0 -1 log_channel(cluster) log [ERR] : 7.26a past_intervals [96946,100253) start interval does not contain the required bound [93903,100253) start
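
Before trying anything destructive on a PG such as 7.26a, its copy can first be exported from a stopped OSD with ceph-objectstore-tool as a backup. This is only a sketch; the OSD ID and PG are taken from the log above, and the export path is an arbitrary example:

systemctl stop ceph-osd@8       # on the host that carries osd.8
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-8 --pgid 7.26a --op export --file /root/pg-7.26a.export
systemctl start ceph-osd@8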
May 10 15:38:37 cmt5923 ceph-osd[2383902]: 2025-05-10T15:38:37.791+0300 764c8c64b6c0 -1 osd.8 pg_epoch: 100377 pg[7.26a( empty local-lis/les=0/0 n=0 ec=96946/96946 lis/c=96236/93898 les/c/f=96237/93903/91308 sis=100253) [3,1] r=-1 lpr=100376 pi=[96946,100253)/3 crt=0'0 mlcod 0'0 unknown mbc={}] PeeringState::check_past_interval_bounds 7.26a past_intervals [96946,100253) start interval does not contain the required bound [93903,100253) start