Ceph - VM with high IO wait

Phreak

New Member
Mar 18, 2026
Hello everyone,

I have spent a lot of time trying to figure out what is causing this IO wait. On this cluster, VMs that do a high amount of IO show a lot of IO wait (e.g. ~30k read IOPS with 50% IO wait).

Summary of my setup:

PVE

Code:
proxmox-ve: 9.0.0 (running kernel: 6.17.2-2-pve)
pve-manager: 9.0.18 (running version: 9.0.18/5cacb35d7ee87217)
proxmox-kernel-helper: 9.0.4
proxmox-kernel-6.17.2-2-pve-signed: 6.17.2-2
proxmox-kernel-6.17: 6.17.2-2
proxmox-kernel-6.8: 6.8.12-17
proxmox-kernel-6.8.12-17-pve-signed: 6.8.12-17
amd64-microcode: 3.20250311.1
ceph: 19.2.3-pve4
ceph-fuse: 19.2.3-pve4
corosync: 3.1.9-pve2
criu: 4.1.1-1
frr-pythontools: 10.3.1-1+pve4
ifupdown2: 3.3.0-1+pmx11
intel-microcode: 3.20251111.1~deb13u1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.1
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.4
libpve-apiclient-perl: 3.4.2
libpve-cluster-api-perl: 9.0.7
libpve-cluster-perl: 9.0.7
libpve-common-perl: 9.0.15
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.5
libpve-network-perl: 1.2.3
libpve-rs-perl: 0.11.3
libpve-storage-perl: 9.0.18
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.5-3
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
openvswitch-switch: 3.5.0-1+b1
proxmox-backup-client: 4.1.0-1
proxmox-backup-file-restore: 4.1.0-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.1
proxmox-kernel-helper: 9.0.4
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.3
proxmox-widget-toolkit: 5.1.2
pve-cluster: 9.0.7
pve-container: 6.0.18
pve-docs: 9.0.9
pve-edk2-firmware: not correctly installed
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.4
pve-firmware: 3.17-2
pve-ha-manager: 5.0.8
pve-i18n: 3.6.4
pve-qemu-kvm: 10.1.2-4
pve-xtermjs: 5.5.0-3
qemu-server: 9.0.30
smartmontools: 7.4-pve1
spiceterm: 3.4.1
swtpm: 0.8.0+pve3
vncterm: 1.9.1
zfsutils-linux: 2.3.4-pve1

Hardware

I have 6 servers in a 3-AZ (OVH) setup (2 servers per AZ) with this hardware:
  • AMD EPYC Genoa 9354 (32C/64T)
  • 512GB DDR5
  • 6 × 7TB enterprise-grade NVMe for OSDs
  • 4 × 25Gb NICs (Mellanox) bonded together

Network

I have no choice but to bond the four 25G ports; for that I am using Open vSwitch in balance-tcp mode.
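For reference, the bond is created roughly like this (a sketch; bridge and interface names here are placeholders, the actual config is in the screenshot below):

```shell
# Sketch of an OVS bridge with a 4-port LACP bond in balance-tcp mode
# (bridge/interface names are placeholders, not my exact config):
ovs-vsctl add-br vmbr0
ovs-vsctl add-bond vmbr0 bond0 eno1 eno2 eno3 eno4 \
    lacp=active bond_mode=balance-tcp
```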

1773826978503.png

I have a dedicated interface/VLAN for Ceph and Corosync (on top of the bond).



CEPH

  • 1 MGR per node
  • 1 MON per node
  • 6 OSD per node
  • 1 replicated pool (size 3, min_size 2)
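The replication settings can be confirmed directly on the pool (assuming the pool name shown in my dashboards):

```shell
# Confirm replication settings on the pool:
ceph osd pool get pve_ceph_prod_3az size
ceph osd pool get pve_ceph_prod_3az min_size
```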
Simplified Crush Map :
Code:
root ceph-prod-3az

zone 1
  host a
    osd.0
    osd.1
    osd.2
    osd.3
    osd.4
    osd.5
  host b
    osd.6
    osd.7
    osd.8
    osd.9
    osd.10
    osd.11

zone 2
  host c
    osd.0
    osd.1
    osd.2
    osd.3
    osd.4
    osd.5
  host d
    osd.6
    osd.7
    osd.8
    osd.9
    osd.10
    osd.11

zone 3
  host e
    osd.0
    osd.1
    osd.2
    osd.3
    osd.4
    osd.5
  host f
    osd.6
    osd.7
    osd.8
    osd.9
    osd.10
    osd.11

Crush rule :
Code:
rule 3az_rule {
    id 1
    type replicated
    step take ceph-prod-3az class nvme
    step choose firstn 3 type zone
    step chooseleaf firstn 2 type host
    step emit
}
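To sanity-check that this rule really spreads replicas across zones as intended, crushtool can replay it offline against the live map (rule id 1 as above):

```shell
# Dump the compiled crush map and test rule 1 with 3 replicas:
ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings | head
```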

Problems


I mainly run PostgreSQL database VMs. Below is htop on a VM with high IO wait:
1773823319582.png


IO View on data disk of this VM
1773823442951.png

A node export view on this VM
1773823578965.png

VM drive config:

1773827421329.png

What i have verified


  • There is no OSD latency problem
  • The network is not saturated
  • The host is not overloaded (~20% CPU usage, 50% RAM usage per node)
  • Tuned softnet settings a bit to eliminate packet drops
I have noticed some network errors, but they seem OK?:


Graph of one node
1773824244932.png

Has anyone experienced this and have an idea?
 
Last edited:
Looking at the graphs you posted, a few things stand out.

The Ceph pool metrics show almost no writes — IOPS and throughput on `pve_ceph_prod_3az` are nearly 100% read. Cross-AZ write latency is not what's hurting you here. The good news is that Ceph read latency (200–500μs) is actually healthy for this setup.

The IO wait is coming from somewhere above Ceph. A few things are worth addressing.

Wrong disk cache mode for PostgreSQL

Your VM config shows `cache=writeback` on both disks. For PostgreSQL this is the wrong setting.

For RBD-backed disks, `cache=writeback` and `cache=none` don't control a QEMU-level buffer — they control librbd's own client-side write-back cache (`rbd_cache`). With `cache=writeback`, QEMU enables `rbd_cache=true` in the librbd connection, so librbd buffers writes in QEMU process memory before flushing to the OSDs. With `cache=none`, QEMU sets `rbd_cache=false`, and I/O goes directly to the OSDs with no librbd buffer.

For PostgreSQL, `rbd_cache=true` (writeback) creates a redundant caching layer: the same data sits in PostgreSQL's `shared_buffers`, in the guest OS page cache, and again in librbd's in-process buffer — wasting host RAM for a copy that is already held in the guest. There is also a durability concern: librbd's write-back cache buffers writes before flushing to the OSDs; if the QEMU process crashes between a write and the next fsync(), data in that buffer is lost. (Unlike `cache=unsafe`, `cache=writeback` does honor guest fsync() by flushing librbd's cache — the risk window is between individual writes and the fsync.)

The correct setting for database workloads is `cache=none`, which disables librbd's write-back cache and sends I/O directly to the OSDs:

Bash:
# Run on the Proxmox host (takes effect after VM restart):
qm set 108 --scsi0 pve_ceph_prod_3az:vm-108-disk-0,cache=none,discard=on,iothread=1,size=34G,ssd=1
qm set 108 --scsi1 pve_ceph_prod_3az:vm-108-disk-1,cache=none,discard=on,iothread=1,size=1T,ssd=1


OVS bond — check for NIC imbalance​


Your bond config has `other_config:bond-rebalance-interval=0`. With `balance-tcp`, OVS hashes each flow to a NIC based on the TCP 4-tuple. Since Ceph uses long-lived persistent TCP connections between fixed OSD pairs, a hot read path could end up permanently pinned to one 25G NIC. Whether `bond-rebalance-interval=0` disables rebalancing entirely or uses a default is worth checking in your OVS version — but the per-NIC stats will tell you directly whether traffic is imbalanced:

Bash:
ovs-appctl bond/show bond0

# Replace <slave> with the interface names listed under "slave" in bond/show output above:
ethtool -S <slave1> | grep tx_bytes
ethtool -S <slave2> | grep tx_bytes
ethtool -S <slave3> | grep tx_bytes
ethtool -S <slave4> | grep tx_bytes


If one NIC is carrying significantly more traffic than the others, enabling periodic rebalancing will redistribute the load. OVS balance-tcp assigns flows to hash buckets (256 total), and rebalancing moves buckets — along with all their established connections — from overloaded NICs to underloaded ones. Long-lived Ceph connections are affected immediately when their bucket is migrated:

Bash:
ovs-vsctl set port bond0 other_config:bond-rebalance-interval=10000

The TCP Errors graph shows a RetransSegs spike to 128 around 06:00 and a baseline of ~2.5/s retransmits. This could indicate momentary congestion on a hot NIC, though it could also be from other causes (LACP events, Ceph messenger connection recycling). The per-NIC stats will clarify.
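A quick way to watch that retransmit counter live while the workload runs (nstat is from iproute2; it reads the same /proc/net/snmp counters your graph is built on):

```shell
# Absolute retransmit counter, including zero values:
nstat -az TcpRetransSegs

# Or watch the per-interval delta:
watch -n1 nstat TcpRetransSegs
```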

PostgreSQL shared_buffers (if the above don't fully resolve it)​


The htop shows 73.4 MiB/s of reads at 87.5% disk utilization going through to Ceph. At ~4KB per random read, that's ~18K IOPS of page cache misses. If PostgreSQL's `shared_buffers` is at or near the default (128MB), most of the active working set doesn't fit in the buffer pool, causing frequent page misses that reach Ceph. Check:

SQL:
SHOW shared_buffers;
SHOW effective_cache_size;

A starting point (adjust to your VM's actual RAM — the standard recommendation is 25%/75%):

Code:
shared_buffers = 6GB          # 25% of VM RAM, assuming ~24GB
effective_cache_size = 18GB   # 75% of VM RAM

This won't eliminate Ceph reads entirely, but should absorb the hot working set in memory and significantly reduce the read IOPS hitting Ceph.
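To confirm whether the buffer pool is actually the problem, the cache hit ratio per database is worth checking (pg_stat_database is a standard view; connection options omitted here):

```shell
# Per-database buffer cache hit ratio; a healthy OLTP DB is usually >99%:
psql -c "SELECT datname,
                round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS hit_pct
         FROM pg_stat_database
         WHERE datname IS NOT NULL;"
```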



One question: is the 30k IOPS figure from `iostat` inside the VM, or from the Ceph dashboard? The Grafana shows 15–20K at peak on the pool — since both disks are on `pve_ceph_prod_3az`, the pool already aggregates reads from both, so they should be directly comparable. A mismatch suggests the measurements were taken at different times, or that `rbd_cache` is serving some reads client-side before they reach the OSDs.
 
Strange that you also have high memory pressure ("PSI some memory"). Have you enabled the NUMA option on the VM?
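Enabling it is a one-liner if it's off (takes effect after a full VM stop/start; <vmid> is your VM id):

```shell
# Expose the host NUMA topology to the guest:
qm set <vmid> --numa 1
```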

You can also look at the host NUMA stats:
Code:
# apt install numactl
# numastat
and check whether you have a lot of "numa_miss" vs "numa_hit".

On the RBD side, you can also give krbd a try instead of librbd; it is usually a little faster (maybe 10%).
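On Proxmox the switch is a storage-level option (existing VMs pick it up after a stop/start or migration; <storage> is your RBD storage name):

```shell
# Map RBD images through the kernel client instead of librbd:
pvesm set <storage> --krbd 1
```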

If you can (I'm not sure with OVH), on the host hardware side, check that your server is set to "max performance". You want the CPUs always running at their max frequencies and max_cstate=1 to avoid latencies.
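Even without BIOS access, you can limit C-states from the kernel command line (a sketch; edit /etc/default/grub, then run update-grub and reboot):

```shell
# /etc/default/grub — limit deep C-states (sketch, merge with your existing options):
GRUB_CMDLINE_LINUX_DEFAULT="quiet processor.max_cstate=1"
# then: update-grub && reboot
```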

Also, where are the 3 AZs physically located at OVH? Is it their new DC in Paris, or the older DCs at Roubaix-Gravelines-Strasbourg?
What is the latency between sites? For writes, with queue depth = 1, if you have 1ms latency you can do at most 1000 IOPS. I'm not sure about the behaviour of the PostgreSQL journal && parallelism.


If you use Proxmox hyperconverged, you can also set up read locality (to avoid doing reads across AZs):

Code:
rbd config pool set POOLNAME rbd_read_from_replica_policy localize

And also look at PostgreSQL shared_buffers as thaicov said. (I'm not sure, but maybe the memory pressure is related to too-low shared_buffers, i.e. constant RBD reads plus swapping memory pages in and out of shared_buffers.)
 
Thanks all for your responses !!

This 3AZ setup is in Paris ( ~ 30km between each zone )

Latency


I have measured latency with ping:
Code:
Host → Host                        ms
Host 1 zone C → Host 2 zone A     0.6
Host 1 zone C → Host 4 zone B     0.9
Host 1 zone C → Host 6 zone C     0.06
Host 1 zone C → Host 5 zone A     0.6
Host 1 zone C → Host 3 zone B     0.9
Host 3 zone B → Host 1 zone C     0.9
Host 3 zone B → Host 2 zone A     0.8
Host 2 zone A → Host 1 zone C     0.6
Host 2 zone A → Host 3 zone B     0.8

disk cache mode​


I didn't know that writeback enables rbd_cache, good to know. On a test VM I tried cache=none with a pgbench run and didn't notice a real difference; iowait rapidly grew to near 30%.
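To compare the two cache modes without PostgreSQL in the loop, I could also run a direct-I/O random-read test inside the VM (device path here is an example; adjust to the data disk):

```shell
# 4K random reads with O_DIRECT; --readonly guarantees the disk is untouched:
fio --name=randread --filename=/dev/sdb --readonly --direct=1 \
    --rw=randread --bs=4k --iodepth=32 --runtime=60 --time_based
```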

NIC imbalance​


I have checked NIC imbalance and yes, one of the 4 NICs had a 3× higher RX value than the others. I have adjusted rebalance-interval to 10000 and now the RX values are more homogeneous. I think I had set rebalance-interval=0 because I read somewhere that rebalancing caused packet drops in some cases.

PostgreSQL​


I don't think the PostgreSQL config is the issue: shared_buffers is 50% of RAM and effective_cache_size is 75% of RAM. Maybe this particular database needs more RAM, but the application was battle-tested and worked well on the previous VMware setup.

C-STATE

Officially I cannot modify BIOS settings. It is technically possible, but I don't want problems with OVH support ;)
Yesterday I noticed this while reading other topics about slow IO, so I have added processor.max_cstate=1 and am currently rebooting each node to try it.

Numa

I have read about this subject but did nothing, because I don't fully understand what to do exactly.

The stats were reset one hour ago when I rebooted the hosts to set processor.max_cstate=1.

Node 1
Code:
                           node0           node1           node2           node3
numa_hit                75983484        90862186        88750035        96473790
numa_miss                      0         1842506               0               0
numa_foreign             1842506               0               0               0
interleave_hit               704             802             700             795
local_node              75625908        89483045        88302607        95943651
other_node                357576         3221647          447428          530139

Node 1 topology
Code:
Machine (504GB total)
  Package L#0
    Group0 L#0
      NUMANode L#0 (P#0 63GB)
      Die L#0 + L3 L#0 (32MB)
        L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
          PU L#0 (P#0)
          PU L#1 (P#32)
        L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
          PU L#2 (P#1)
          PU L#3 (P#33)
        L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
          PU L#4 (P#2)
          PU L#5 (P#34)
        L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
          PU L#6 (P#3)
          PU L#7 (P#35)
      Die L#1 + L3 L#1 (32MB)
        L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
          PU L#8 (P#4)
          PU L#9 (P#36)
        L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
          PU L#10 (P#5)
          PU L#11 (P#37)
        L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
          PU L#12 (P#6)
          PU L#13 (P#38)
        L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
          PU L#14 (P#7)
          PU L#15 (P#39)
      HostBridge
        PCIBridge
          PCI 81:00.0 (Ethernet)
            Net "eno1n0"
            OpenFabrics "mlx5_2"
          PCI 81:00.1 (Ethernet)
            Net "eno1n1"
            OpenFabrics "mlx5_3"
    Group0 L#1
      NUMANode L#1 (P#1 189GB)
      Die L#2 + L3 L#2 (32MB)
        L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
          PU L#16 (P#8)
          PU L#17 (P#40)
        L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
          PU L#18 (P#9)
          PU L#19 (P#41)
        L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
          PU L#20 (P#10)
          PU L#21 (P#42)
        L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
          PU L#22 (P#11)
          PU L#23 (P#43)
      Die L#3 + L3 L#3 (32MB)
        L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
          PU L#24 (P#12)
          PU L#25 (P#44)
        L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
          PU L#26 (P#13)
          PU L#27 (P#45)
        L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
          PU L#28 (P#14)
          PU L#29 (P#46)
        L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
          PU L#30 (P#15)
          PU L#31 (P#47)
      HostBridge
        PCIBridge
          PCI c1:00.0 (Ethernet)
            Net "eno0n0"
            OpenFabrics "mlx5_0"
          PCI c1:00.1 (Ethernet)
            Net "eno0n1"
            OpenFabrics "mlx5_1"
        PCIBridge
          PCI c2:00.0 (NVMExp)
            Block(Disk) "nvme0n1"
        PCIBridge
          PCI c3:00.0 (NVMExp)
            Block(Disk) "nvme2n1"
        PCIBridge
          PCI c4:00.0 (NVMExp)
            Block(Disk) "nvme4n1"
        PCIBridge
          PCI c5:00.0 (NVMExp)
            Block(Disk) "nvme7n1"
        PCIBridge
          PCIBridge
            PCI c7:00.0 (VGA)
        PCIBridge
          2 x { PCI c9:00.0-1 (SATA) }
    Group0 L#2
      NUMANode L#2 (P#2 126GB)
      Die L#4 + L3 L#4 (32MB)
        L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
          PU L#32 (P#16)
          PU L#33 (P#48)
        L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
          PU L#34 (P#17)
          PU L#35 (P#49)
        L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
          PU L#36 (P#18)
          PU L#37 (P#50)
        L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
          PU L#38 (P#19)
          PU L#39 (P#51)
      Die L#5 + L3 L#5 (32MB)
        L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
          PU L#40 (P#20)
          PU L#41 (P#52)
        L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
          PU L#42 (P#21)
          PU L#43 (P#53)
        L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
          PU L#44 (P#22)
          PU L#45 (P#54)
        L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
          PU L#46 (P#23)
          PU L#47 (P#55)
    Group0 L#3
      NUMANode L#3 (P#3 126GB)
      Die L#6 + L3 L#6 (32MB)
        L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
          PU L#48 (P#24)
          PU L#49 (P#56)
        L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
          PU L#50 (P#25)
          PU L#51 (P#57)
        L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
          PU L#52 (P#26)
          PU L#53 (P#58)
        L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
          PU L#54 (P#27)
          PU L#55 (P#59)
      Die L#7 + L3 L#7 (32MB)
        L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
          PU L#56 (P#28)
          PU L#57 (P#60)
        L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
          PU L#58 (P#29)
          PU L#59 (P#61)
        L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
          PU L#60 (P#30)
          PU L#61 (P#62)
        L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
          PU L#62 (P#31)
          PU L#63 (P#63)
      HostBridge
        PCIBridge
          PCI 03:00.0 (NVMExp)
            Block(Disk) "nvme1n1"
        PCIBridge
          PCI 04:00.0 (NVMExp)
            Block(Disk) "nvme3n1"
        PCIBridge
          PCI 05:00.0 (NVMExp)
            Block(Disk) "nvme6n1"
        PCIBridge
          PCI 06:00.0 (NVMExp)
            Block(Disk) "nvme5n1"
        PCIBridge
          2 x { PCI 08:00.0-1 (SATA) }

Node 2
Code:
                           node0           node1           node2           node3
numa_hit               146924907       127109111       145433246       145334667
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit               714             795             706             785
local_node             146219749       125379348       145121207       144783312
other_node                705158         1729763          312039          551355

Node 3
Code:
                           node0           node1           node2           node3
numa_hit                76381512        65817387        60399964        78683224
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit               726             759             713             760
local_node              75685128        65723073        59819075        78134445
other_node                696384           94314          580889          548779


Ceph read policy

I have read the Ceph docs about rbd_read_from_replica_policy, but what is the drawback of using it? It seems to be exactly what I need.

Graph
One question: is the 30k IOPS figure from `iostat` inside the VM, or from the Ceph dashboard? The Grafana shows 15–20K at peak on the pool — since both disks are on `pve_ceph_prod_3az`, the pool already aggregates reads from both, so they should be directly comparable. A mismatch suggests the measurements were taken at different times, or that `rbd_cache` is serving some reads client-side before they reach the OSDs.

The 30K IOPS is from the Ceph dashboard, and each screenshot was taken at the same time. As you say, it is probably caused by rbd_cache; I can't disable writeback on a production VM to test it as I would like :p