Ceph - VM with high IO wait

tchaikov · Mar 31, 2026

I concur with @spirit. I also double-checked the Ceph source code: all Ceph daemons (including OSDs) and Ceph clients respect this setting. The call chain looks like this:

-- global_init() ->
---- crush_location.init_on_startup() ->
----- update_from_hook() or use the default host=<hostname>; root=default, if neither `crush_location` nor `crush_location_hook` is specified.
------ start a subprocess calling the script like

Bash:

$hook --cluster $cluster_name --id $id --type $type

where $type is the Ceph application's entity type and $id is the entity id. For example, if the entity is 'client.admin', they are client and admin respectively; if it's osd.0, they are osd and 0.
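
For illustration, here is a minimal sketch of a hook that follows this calling convention. It only assumes what is described above: the hook receives --cluster, --id and --type, and prints one line of key=value pairs on stdout. The DATACENTER value and the LOG path are placeholders I made up, not anything Ceph defines, and the log line is only there to make it easy to confirm whether the hook ran at all.

```shell
#!/bin/sh
# Sketch of a crush_location_hook script (not taken from the Ceph sources).
# Ceph is expected to call it as:
#   $hook --cluster <cluster_name> --id <id> --type <type>
# DATACENTER and LOG are hypothetical, site-specific placeholders.

DATACENTER="${DATACENTER:-dc1}"
LOG="${LOG:-/tmp/crush-location-hook.log}"

crush_location() {
    cluster="" id="" type=""
    # Walk the arguments one at a time; option values are skipped on the
    # next iteration because they match none of the option names.
    while [ $# -gt 0 ]; do
        case "$1" in
            --cluster) cluster="${2:-}" ;;
            --id)      id="${2:-}" ;;
            --type)    type="${2:-}" ;;
        esac
        shift
    done

    # Record which entity called the hook (e.g. client.admin or osd.0).
    echo "$(date) cluster=$cluster entity=$type.$id" >> "$LOG"

    # The location itself: a single line of key=value pairs on stdout.
    echo "root=default datacenter=$DATACENTER host=$(hostname -s)"
}

crush_location "$@"
```

If the log file stays empty after a daemon or VM start, the hook was never invoked for that entity, which narrows the problem down to configuration rather than the script.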

Alwin Antreich · Mar 31, 2026

spirit said:
It's really done when the qemu process starts and librbd is initialized.

Good to know.

spirit said:
so the cleanest way is through the hook (it's a simple bash script anyway; qemu also calls more complex scripts for the network bridge, so it's not really a problem).

That's true, but I think it would be more efficient to have the node call it once after Ceph is up to add the crush_location, instead of each VM/CT doing it separately.
Anyway, just my two cents.

djsami · Mar 31, 2026

Phreak said:
Hello everyone,

I have spent a lot of time trying to figure out what is causing this IO wait. On this cluster, VMs that do a high amount of IO have a lot of IO wait (like 30k read I/O, 50% IO wait).

Summary of my setup:

PVE

Code:

proxmox-ve: 9.0.0 (running kernel: 6.17.2-2-pve)
pve-manager: 9.0.18 (running version: 9.0.18/5cacb35d7ee87217)
proxmox-kernel-helper: 9.0.4
proxmox-kernel-6.17.2-2-pve-signed: 6.17.2-2
proxmox-kernel-6.17: 6.17.2-2
proxmox-kernel-6.8: 6.8.12-17
proxmox-kernel-6.8.12-17-pve-signed: 6.8.12-17
amd64-microcode: 3.20250311.1
ceph: 19.2.3-pve4
ceph-fuse: 19.2.3-pve4
corosync: 3.1.9-pve2
criu: 4.1.1-1
frr-pythontools: 10.3.1-1+pve4
ifupdown2: 3.3.0-1+pmx11
intel-microcode: 3.20251111.1~deb13u1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.1
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.4
libpve-apiclient-perl: 3.4.2
libpve-cluster-api-perl: 9.0.7
libpve-cluster-perl: 9.0.7
libpve-common-perl: 9.0.15
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.5
libpve-network-perl: 1.2.3
libpve-rs-perl: 0.11.3
libpve-storage-perl: 9.0.18
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.5-3
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
openvswitch-switch: 3.5.0-1+b1
proxmox-backup-client: 4.1.0-1
proxmox-backup-file-restore: 4.1.0-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.1
proxmox-kernel-helper: 9.0.4
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.3
proxmox-widget-toolkit: 5.1.2
pve-cluster: 9.0.7
pve-container: 6.0.18
pve-docs: 9.0.9
pve-edk2-firmware: not correctly installed
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.4
pve-firmware: 3.17-2
pve-ha-manager: 5.0.8
pve-i18n: 3.6.4
pve-qemu-kvm: 10.1.2-4
pve-xtermjs: 5.5.0-3
qemu-server: 9.0.30
smartmontools: 7.4-pve1
spiceterm: 3.4.1
swtpm: 0.8.0+pve3
vncterm: 1.9.1
zfsutils-linux: 2.3.4-pve1

Hardware

I have 6 servers in a 3-AZ (OVH) setup (2 servers per AZ) with this hardware:

AMD EPYC GENOA 9354 ( 32C/64T )

512 GB DDR5

6 x 7 TB enterprise-grade NVMe for OSDs

4 x 25Gb network cards ( Mellanox ) bonded together

Network

I have no choice but to bond the four 25G ports; for that I am using Open vSwitch with balance-tcp mode.

View attachment 96612

I have a dedicated interface/vlan for ceph and corosync ( using the bond )

CEPH

1 MGR per node

1 MON per node

6 OSD per node

1 replicated pool ( size 3, min_size 2 )

Simplified Crush Map :

Code:

root ceph-prod-3az
  zone 1
    host a
      osd.0 osd.1 osd.2 osd.3 osd.4 osd.5
    host b
      osd.6 osd.7 osd.8 osd.9 osd.10 osd.11
  zone 2
    host c
      osd.0 osd.1 osd.2 osd.3 osd.4 osd.5
    host d
      osd.6 osd.7 osd.8 osd.9 osd.10 osd.11
  zone 3
    host e
      osd.0 osd.1 osd.2 osd.3 osd.4 osd.5
    host f
      osd.6 osd.7 osd.8 osd.9 osd.10 osd.11

Crush rule :

Code:

rule 3az_rule {
    id 1
    type replicated
    step take ceph-prod-3az class nvme
    step choose firstn 3 type zone
    step chooseleaf firstn 2 type host
    step emit
}

Problems

I mainly run PostgreSQL database VMs. Below is an htop from a VM with high IO wait:
View attachment 96607

IO view of this VM's data disk:
View attachment 96608

A node exporter view of this VM:
View attachment 96609

VM drive config:

View attachment 96613

What I have verified

There is no OSD latency problem

Network is not saturated

Host is not overloaded ( ~ 20% cpu usage, 50% RAM usage per node )

Tuned the softnet settings a bit to eliminate packet drops

I have noticed some network errors, but they seem OK?

Graph of one node
View attachment 96610

Has anyone experienced this and have an idea?

I'm experiencing the same thing. There's something strange going on with Proxmox 9.1.1, and for some reason, my IOPS jumps to 500%. I haven't found a solution.

The actual disk is fast, but slower than Proxmox

I think there's something wrong, a bug. By the way, I've formatted it 14 times.

spirit · Apr 1, 2026

Alwin Antreich said:
Good to know.

That's true, but I think it would be more efficient to have the node call it once after Ceph is up to add the crush_location, instead of each VM/CT doing it separately.
Anyway, just my two cents.

The thing is that the host itself doesn't mount the RBD image. It's done directly by each qemu process: librbd runs inside qemu, and it is the qemu process that makes the connections to the monitors and OSDs.

(With krbd it's different, as it's done by the host, but I don't have any experience with it, so I'm not sure if the hook works.)

spirit · Apr 1, 2026

djsami said:
I'm experiencing the same thing. There's something strange going on with Proxmox 9.1.1, and for some reason, my IOPS jumps to 500%. I haven't found a solution.

View attachment 96837 View attachment 96838

View attachment 96840

The actual disk is fast, but slower than Proxmox I think there's something wrong, a bug. By the way, I've formatted it 14 times.

Hi, please don't double post

[SOLVED] Thread 'Applying pve-qemu-kvm 10.2.1-1 may cause extremely high “I/O Delay” and extremely high “I/O pressure stalls”. (Patches in the test repository'

Mar 29, 2026

Applying patches to the Test Repository may have caused severe I/O delays and I/O pressure stalls.

The I/O pressure stall value has reached nearly 100, but I can't see the load when I run `zpool iostat 1`.

If you reinstall PVE using `proxmox-ve_9.1-1.iso`, the value drops to between 0 and 1 (or at most around 5), but the problem recurs when you apply the test repository.

If you reinstall PVE using `proxmox-ve_9.0-1.iso` and then apply the non-subscription repositories, this issue does not occur.

I haven’t been able to pinpoint the cause yet because I don’t have...

as your problem seems to be a bug with pve-qemu 10.2.1 from the test repo, whereas here we are talking about a specific Ceph latency problem across multiple zones with qemu 10.1.2.

(but thanks for the report on the other thread)

neilmcm · Jun 30, 2026

tchaikov said:
I concur with @spirit. I also double-checked the Ceph source code: all Ceph daemons (including OSDs) and Ceph clients respect this setting. The call chain looks like this:

-- global_init() ->
---- crush_location.init_on_startup() ->
----- update_from_hook() or use the default host=<hostname>; root=default, if neither `crush_location` nor `crush_location_hook` is specified.
------ start a subprocess calling the script like

Bash:

$hook --cluster $cluster_name --id $id --type $type

where $type is the Ceph application's entity type and $id is the entity id. For example, if the entity is 'client.admin', they are client and admin respectively; if it's osd.0, they are osd and 0.

We've been trying to get read localization working with a similar environment to that described here - in our case across 3 datacenters. This is with 9.2.2 (Enterprise) and Tentacle 20.2.1. We've also tried reverting to Squid in our lab to see if that made any difference, to no avail.

The only working scenario we've found is by hardcoding crush_location into the [client] section of /etc/pve/ceph.conf - every other option we've tried that's been discussed in this thread doesn't appear to actually set the CRUSH location for the VM.

With debug enabled we can see the CRUSH location being used. When we hardcode crush_location into the [client] section of /etc/pve/ceph.conf we see the location we'd expect, including a datacenter reference:

Code:

2026-06-30T17:27:33.203+0100 71614e7fc6c0  5 get_common_ancestor_distance 0 {{datacenter=dc1,host=nub-lab-pve01,root=default}}

In all other scenarios we only see the default location:

Code:

2026-06-30T17:30:45.700+0100 7a0a627fc6c0  5 get_common_ancestor_distance 0 {{host=nub-lab-pve01,root=default}}
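
To compare the two cases quickly across a larger log, a small filter along these lines can pull out just the location part of each line. It is only a sketch that assumes the log format shown in the two samples above; the function name is arbitrary.

```shell
#!/bin/sh
# Print the CRUSH location from each get_common_ancestor_distance debug line.
# Assumes the format seen above:
#   <timestamp> <thread> 5 get_common_ancestor_distance <n> {{k=v,k=v,...}}
extract_crush_location() {
    sed -n 's/.*get_common_ancestor_distance [0-9-]* {{\(.*\)}}.*/\1/p'
}
```

Piping the client debug log through `extract_crush_location | sort | uniq -c` then gives a count per distinct location, so a missing datacenter= entry stands out immediately.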

We've tried using crush_location_hook in both [global] and [client] - with output to both stdout and a log file so we can confirm if it's being executed. This script never gets called when the VM is started (but works fine when run manually).

Code:

#!/bin/sh

LOCATION="root=default datacenter=dc1 host=nub-lab-pve01"
echo $LOCATION
echo "$(date) - $LOCATION" >> /var/log/crush-location-wrapper.log

We've also tried setting crush_location via 'ceph config set':

Code:

root@nub-lab-pve01:~# ceph config dump
WHO                   MASK  LEVEL     OPTION                                     VALUE                                           RO
...
client.nub-lab-pve01        advanced  crush_location                             root=default datacenter=dc1 host=nub-lab-pve01  *
client.nub-lab-pve02        advanced  crush_location                             root=default datacenter=dc2 host=nub-lab-pve02  *
client.nub-lab-pve03        advanced  crush_location                             root=default datacenter=dc3 host=nub-lab-pve03  *

In all scenarios, we've tried setting rbd_read_from_replica_policy to localize at pool level, via 'ceph config set' and in the [global] and [client] sections of ceph.conf. We've also shut down and restarted the VM in each test.

Any pointers much appreciated.
