Ceph performance in VM not that good

Marco Lucarelli

Feb 5, 2019
Hello,
I have a 3-node full-mesh Ceph cluster with 4x 100 GbE NIC ports per node. Via the SDN stack I use OpenFabric routing across these NICs, with Ceph on top.
Corosync gets 2 dedicated NICs via a switch (no LACP, bonding, or teaming), and each Corosync link sits in its own small network. VM traffic and management share another 2 NICs with LACP.

Components are (each node):
- 5x NVMe KIOXIA KCD8DPUG6T40
- 2x Broadcom 57508 100GbE QSFP56 2-port PCIe 4
- 4x Broadcom 57414 10/25GbE SFP28 2-port

Each OSD uses the full capacity of its NVMe SSD. My CRUSH map is configured with step chooseleaf ... host, ensuring that replicas are distributed across different nodes.
How did I run the benchmarks?
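For reference, the placement rule behind that "chooseleaf host" step looks roughly like this in decompiled CRUSH syntax (rule name and id are assumptions; the actual rule can be checked with ceph osd crush rule dump):

Code:
rule replicated_rule {
    id 0
    type replicated
    # start from the root of the default hierarchy
    step take default
    # pick one OSD per host, so no two replicas share a node
    step chooseleaf firstn 0 type host
    step emit
}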

Code:
# on a pve node
:~# rados bench -p ceph-pool 60 write -b 4K -t 180

Total time run:         60.0012
Total writes made:      4877074
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     317.512
Stddev Bandwidth:       36.8077
Max bandwidth (MB/sec): 362.797
Min bandwidth (MB/sec): 206.008
Average IOPS:           81282
Stddev IOPS:            9422.78
Max IOPS:               92876
Min IOPS:               52738
Average Latency(s):     0.00221341
Stddev Latency(s):      0.00353256
Max latency(s):         0.223371
Min latency(s):         0.000384809
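As a back-of-the-envelope sanity check (my own arithmetic, not part of the benchmark output): with -t 180 concurrent operations, the reported average latency already determines the achievable IOPS via Little's Law, so this run is latency-bound at that queue depth rather than bandwidth-bound:

```python
# Little's Law: IOPS ~= in-flight operations / average latency.
# rados bench ran with -t 180 concurrent ops.
concurrent_ops = 180
avg_latency_s = 0.00221341  # average latency from the output above

predicted_iops = concurrent_ops / avg_latency_s
print(f"predicted ~{predicted_iops:.0f} IOPS")  # reported average was 81282
```

The prediction lands within a fraction of a percent of the reported 81282 IOPS.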

The maximum I reached with fio and randwrite:
Code:
# on a pve node
# create 16 10 GB images
:~# for i in {0..15}; do rbd create ceph-pool/disk.$i --size 10G; done
# run fio
:~# fio --ioengine=rbd --clientname=admin --pool=ceph-pool \
  --rw=randwrite --bs=4k --direct=1 --iodepth=64 --runtime=30 --group_reporting \
  --name=job0 --rbdname=disk.0 \
  --name=job1 --rbdname=disk.1 \
  --name=job2 --rbdname=disk.2 \
  --name=job3 --rbdname=disk.3 \
  --name=job4 --rbdname=disk.4 \
  --name=job5 --rbdname=disk.5 \
  --name=job6 --rbdname=disk.6 \
  --name=job7 --rbdname=disk.7 \
  --name=job8 --rbdname=disk.8 \
  --name=job9 --rbdname=disk.9 \
  --name=job10 --rbdname=disk.10 \
  --name=job11 --rbdname=disk.11 \
  --name=job12 --rbdname=disk.12 \
  --name=job13 --rbdname=disk.13 \
  --name=job14 --rbdname=disk.14 \
  --name=job15 --rbdname=disk.15

job0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job1: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job2: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job3: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job4: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job6: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job7: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job8: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job9: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job10: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job11: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job12: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job13: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job14: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job15: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
fio-3.39
Starting 16 processes
Jobs: 16 (f=16): [w(16)][100.0%][w=822MiB/s][w=210k IOPS][eta 00m:00s]
job0: (groupid=0, jobs=16): err= 0: pid=707247: Thu Apr  9 09:21:29 2026
write: IOPS=203k, BW=792MiB/s (831MB/s)(23.2GiB/30016msec); 0 zone resets
slat (nsec): min=611, max=2237.3k, avg=7741.25, stdev=6652.07
clat (usec): min=404, max=129706, avg=5038.16, stdev=5990.21
lat (usec): min=406, max=129709, avg=5045.90, stdev=5990.25
clat percentiles (usec):
|  1.00th=[  848],  5.00th=[ 1156], 10.00th=[ 1401], 20.00th=[ 1795],
| 30.00th=[ 2180], 40.00th=[ 2606], 50.00th=[ 3097], 60.00th=[ 3752],
| 70.00th=[ 4752], 80.00th=[ 6456], 90.00th=[10552], 95.00th=[16057],
| 99.00th=[31065], 99.50th=[38536], 99.90th=[57934], 99.95th=[64750],
| 99.99th=[74974]
bw (  KiB/s): min=610792, max=907760, per=100.00%, avg=811879.11, stdev=3704.60, samples=954
iops        : min=152698, max=226940, avg=202969.74, stdev=926.15, samples=954
lat (usec)   : 500=0.01%, 750=0.45%, 1000=2.08%
lat (msec)   : 2=22.73%, 4=37.71%, 10=26.22%, 20=7.63%, 50=2.96%
lat (msec)   : 100=0.21%, 250=0.01%
cpu          : usr=11.85%, sys=5.49%, ctx=4875027, majf=0, minf=10411
IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=0,6088327,0,0 short=0,0,0,0 dropped=0,0,0,0
latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
WRITE: bw=792MiB/s (831MB/s), 792MiB/s-792MiB/s (831MB/s-831MB/s), io=23.2GiB (24.9GB), run=30016-30016msec

Disk stats (read/write):
dm-5: ios=0/1319, sectors=0/18104, merge=0/0, ticks=0/106, in_queue=106, util=0.09%, aggrios=12/564, aggsectors=3072/18104, aggrmerge=0/0, aggrticks=3/87, aggrin_queue=90, aggrutil=0.08%
nvme0n1: ios=12/564, sectors=3072/18104, merge=0/0, ticks=3/87, in_queue=90, util=0.08%

Maximum randread:
Code:
# on a pve node
# create 16 10 GB images
:~# for i in {0..15}; do rbd create ceph-pool/disk.$i --size 10G; done
# run fio
:~# fio --ioengine=rbd --clientname=admin --pool=ceph-pool \
  --rw=randread --bs=4k --direct=1 --iodepth=64 --runtime=30 --group_reporting \
  --name=job0 --rbdname=disk.0 \
  --name=job1 --rbdname=disk.1 \
  --name=job2 --rbdname=disk.2 \
  --name=job3 --rbdname=disk.3 \
  --name=job4 --rbdname=disk.4 \
  --name=job5 --rbdname=disk.5 \
  --name=job6 --rbdname=disk.6 \
  --name=job7 --rbdname=disk.7 \
  --name=job8 --rbdname=disk.8 \
  --name=job9 --rbdname=disk.9 \
  --name=job10 --rbdname=disk.10 \
  --name=job11 --rbdname=disk.11 \
  --name=job12 --rbdname=disk.12 \
  --name=job13 --rbdname=disk.13 \
  --name=job14 --rbdname=disk.14 \
  --name=job15 --rbdname=disk.15

job0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job2: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job3: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job4: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job5: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job6: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job7: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job8: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job9: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job10: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job11: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job12: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job13: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job14: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
job15: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64
fio-3.39
Starting 16 processes
Jobs: 16 (f=16): [r(16)][100.0%][r=3720MiB/s][r=952k IOPS][eta 00m:00s]
job0: (groupid=0, jobs=16): err= 0: pid=726172: Thu Apr  9 09:51:29 2026
read: IOPS=880k, BW=3437MiB/s (3604MB/s)(101GiB/30002msec)
slat (nsec): min=200, max=1808.1k, avg=3147.88, stdev=4942.71
clat (usec): min=20, max=92586, avg=1159.71, stdev=1758.45
lat (usec): min=84, max=92588, avg=1162.85, stdev=1758.50
clat percentiles (usec):
|  1.00th=[  215],  5.00th=[  314], 10.00th=[  383], 20.00th=[  510],
| 30.00th=[  644], 40.00th=[  766], 50.00th=[  898], 60.00th=[ 1029],
| 70.00th=[ 1188], 80.00th=[ 1401], 90.00th=[ 1811], 95.00th=[ 2311],
| 99.00th=[ 6325], 99.50th=[15533], 99.90th=[24773], 99.95th=[27657],
| 99.99th=[34866]
bw (  MiB/s): min= 2495, max= 3941, per=99.96%, avg=3435.82, stdev=26.05, samples=944
iops        : min=638770, max=1009144, avg=879570.19, stdev=6667.98, samples=944
lat (usec)   : 50=0.01%, 100=0.01%, 250=1.98%, 500=17.12%, 750=19.44%
lat (usec)   : 1000=19.40%
lat (msec)   : 2=34.48%, 4=6.08%, 10=0.80%, 20=0.40%, 50=0.30%
lat (msec)   : 100=0.01%
cpu          : usr=27.02%, sys=10.88%, ctx=10612785, majf=0, minf=5192
IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=26398359,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=3437MiB/s (3604MB/s), 3437MiB/s-3437MiB/s (3604MB/s-3604MB/s), io=101GiB (108GB), run=30002-30002msec

Disk stats (read/write):
dm-5: ios=0/1133, sectors=0/48208, merge=0/0, ticks=0/160, in_queue=160, util=0.15%, aggrios=12/571, aggsectors=3072/48208, aggrmerge=0/0, aggrticks=3/1261, aggrin_queue=1264, aggrutil=0.15%
nvme0n1: ios=12/571, sectors=3072/48208, merge=0/0, ticks=3/1261, in_queue=1264, util=0.15%
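Both fio runs are internally consistent with the same queue-depth arithmetic (again my own cross-check, using the latencies printed above): 16 jobs at iodepth 64 means 1024 I/Os in flight, and dividing by the average completion latency reproduces the reported IOPS almost exactly:

```python
# Cross-check both fio runs: IOPS ~= in-flight I/Os / average latency.
outstanding = 16 * 64  # 16 jobs x iodepth 64 = 1024 outstanding I/Os

write_lat_s = 5045.90e-6  # randwrite: avg lat 5045.90 usec
read_lat_s = 1162.85e-6   # randread:  avg lat 1162.85 usec

print(f"predicted write IOPS: {outstanding / write_lat_s:.0f}")  # reported ~203k
print(f"predicted read IOPS:  {outstanding / read_lat_s:.0f}")   # reported ~880k
```

So the cluster itself delivers what its per-op latency allows at this level of parallelism.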

In my test VM (Debian 13, no DE) on the same Ceph pool, the best I could measure was around 32k IOPS. Does anyone have tips on how to get more IOPS out of a single VM?
My test VM config:
Code:
agent: 1
bios: ovmf
boot: order=scsi0
cores: 16
cpu: host
efidisk0: ceph-pool:vm-1000-disk-0,efitype=4m,ms-cert=2023w,pre-enrolled-keys=1,size=1M
ide2: none,media=cdrom
machine: q35
memory: 8196
meta: creation-qemu=10.1.2,ctime=1775721997
name: abamalu-test
net0: virtio=BC:24:CD:43:EF:1A,bridge=vmbr0,firewall=1
numa: 1
onboot: 1
ostype: l26
scsi0: ceph-pool:vm-1000-disk-1,aio=io_uring,cache=writeback,discard=on,iothread=1,queues=16,size=150G
scsihw: virtio-scsi-single
smbios1: uuid=3c27e036-38f2-4ade-b18a-7fede8da5884
sockets: 1
vmgenid: d2022e80-d957-443d-8db1-00549a28f3f1
vmstatestorage: ceph-pool
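One observation on the gap (my own reasoning, with an assumed in-guest queue depth): the 203k-IOPS host result came from 16 parallel librbd clients, while one virtual disk is a single client, so the VM number mostly reflects per-I/O latency times available parallelism:

```python
# Why one VM sees far fewer IOPS than the 16-job host benchmark:
# the host run had 16 rbd clients x iodepth 64 = 1024 I/Os in flight,
# while a single virtual disk is one librbd client.
host_outstanding = 16 * 64   # 1024 in-flight I/Os on the host
guest_outstanding = 64       # ASSUMPTION: in-guest fio also ran at iodepth 64
guest_iops = 32_000          # "around 32k IOPS" measured in the VM

implied_latency_ms = guest_outstanding / guest_iops * 1000
print(f"implied per-I/O latency in the guest: {implied_latency_ms:.1f} ms")
```

Roughly 2 ms per write is in line with the per-op latency the host run showed, which suggests the limit is parallelism per client rather than a broken datapath.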