[SOLVED] Ceph periodically stalls read IOPS after upgrade to PVE 8.4.13 from 7.4.20

Antohoza

Active Member
Hi all,

Currently having the strangest behaviour in my cluster after the upgrade: it periodically drops read IOPS to basically nothing. The behaviour started when upgrading from Proxmox 7.4.20 to 8.4.13 (Ceph still left at Quincy), and I waited until all nodes were upgraded to 8, thinking it would resolve itself after the upgrade to Reef, but it did not.
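While a stall is happening, these standard Ceph commands show whether the cluster is reporting slow/blocked ops (which turned out to be the cause here, see the note at the end):

# overall cluster state; look for "slow ops" warnings during a stall
ceph -s

# which OSDs/daemons are reporting slow or blocked operations
ceph health detail

# per-OSD commit/apply latency, to spot a lagging OSD
ceph osd perf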

Connectivity has been tested with iperf3 and I get full bandwidth (10 Gigabit) between all nodes.
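For reference, the bandwidth test was just the usual iperf3 server/client pair; the hostname below is a placeholder:

# on the receiving node
iperf3 -s

# from each of the other nodes (pve-node1 is a placeholder hostname)
iperf3 -c pve-node1 -t 30

# and the reverse direction
iperf3 -c pve-node1 -t 30 -R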

pveversion output:

proxmox-ve: 8.4.0 (running kernel: 6.8.12-15-pve)
pve-manager: 8.4.13 (running version: 8.4.13/5b08ebc2823dd9cb)
proxmox-kernel-helper: 8.1.4
pve-kernel-5.15: 7.4-10
proxmox-kernel-6.8: 6.8.12-15
proxmox-kernel-6.8.12-15-pve-signed: 6.8.12-15
pve-kernel-5.4: 6.4-20
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.15.136-1-pve: 5.15.136-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 18.2.7-pve1
ceph-fuse: 18.2.7-pve1
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u2
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.2
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.2
libpve-cluster-perl: 8.1.2
libpve-common-perl: 8.3.4
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.7
libqb0: 1.0.5-1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.6-1
proxmox-backup-file-restore: 3.4.6-1
proxmox-backup-restore-image: 0.7.0
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.4
proxmox-mail-forward: 0.3.3
proxmox-mini-journalreader: 1.5
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.13
pve-cluster: 8.1.2
pve-container: 5.3.2
pve-docs: 8.4.1
pve-edk2-firmware: 4.2025.02-4~bpo12+1
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.2
pve-firmware: 3.16-3
pve-ha-manager: 4.0.7
pve-i18n: 3.4.5
pve-qemu-kvm: 9.2.0-7
pve-xtermjs: 5.5.0-2
qemu-server: 8.4.3
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.8-pve1
 
Seems it was Ceph blocking on slow ops; this was rectified by adding another dedicated network for the Ceph OSDs.
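For anyone wanting to do the same: it boils down to setting a separate cluster_network in /etc/pve/ceph.conf and restarting the OSDs so replication/heartbeat traffic moves off the public network. The subnets below are only examples, adjust them to your own setup:

# /etc/pve/ceph.conf (example subnets, adjust to your environment)
[global]
    public_network  = 10.10.10.0/24    # client / VM traffic
    cluster_network = 10.10.20.0/24    # dedicated OSD replication network

# restart the OSDs on each node, one node at a time, so they pick up the new network
systemctl restart ceph-osd.target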