Disk Speeds inside a VM.

- Changed from LVM-Thin to LVM; now getting 694 IOPS on the guest, so close to 25% of bare metal.
- Then moved the disk to ext4 storage; now getting 767 IOPS on the guest, ~28% of bare metal.
- Also tried the disk on XFS storage; about the same, 766 IOPS on the guest.
- Last thing I tried: turned off mitigations on the PVE host (still enabled on the guest) and saw about 900 IOPS. Still not 50%, but double where I started...

So I guess what we'd all like to know is: how the heck are people getting 50%-80% IO performance in their VM guests? What are we all doing wrong? I see so many threads/tickets complaining about the same thing...

So far these seem to get me better performance (rough example commands after the list):
- LVM, not LVM-Thin (no thin provisioning, so it will use more disk and snapshots are larger?)
- Use Virtio-blk with IO Thread, Discard
- Use a raw image (I think that's the default for LVM anyway)
- Make sure you're using the Host CPU type, and that NUMA is enabled with a socket count matching the host (if your host has multiple sockets)
- Make sure you set WCE=1 for your drives (enterprise SAS drives only, though!)
- Set mitigations=off - not for prod or any system with untrusted actors; make your own risk assessment.
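For reference, here is roughly how those settings translate into commands on my box. Treat it as a sketch: the VM ID (100), storage name (local-lvm), volume and device names are placeholders for my setup, so adjust them before copying anything.
Code:
# VirtIO-blk disk with an IO thread and discard (raw volume on LVM)
qm set 100 --virtio0 local-lvm:vm-100-disk-0,iothread=1,discard=on
# Host CPU type, NUMA on, sockets/cores matching the host topology
qm set 100 --cpu host --numa 1 --sockets 2 --cores 8
# Enable the drive write cache (WCE=1) - enterprise SAS drives only!
sdparm --set WCE=1 --save /dev/sdb
# mitigations=off on the PVE host: add it to GRUB_CMDLINE_LINUX_DEFAULT
# in /etc/default/grub, then run update-grub and reboot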
 
- Use Virtio-blk with IO Thread, Discard
Don't use discard; trim weekly (or daily if you want) via cron - discard kills performance!
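For example, something like this as a weekly cron script instead of discard=on (path and schedule are just an example; on systemd hosts the fstrim.timer unit does the same job):
Code:
#!/bin/sh
# /etc/cron.weekly/fstrim - trim all mounted filesystems that support it, once a week
/sbin/fstrim -av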
- Make sure you're using the Host CPU type, and that NUMA is enabled with a socket count matching the host (if your host has multiple sockets)
We mostly use the default, and select "host" only when trying some nested-virtualization simulations.
- LVM, not LVM-Thin (no thin provisioning, so it will use more disk and snapshots are larger?)
We use XFS, or NFS on XFS, without LVM, and have tested some zvol/ZFS. Tune XFS on the raidset and the RAID controller, at mkfs and mount time, via /proc and /sys, and tune the NFS mount if you use one.
Where is this setting ?
 
Same, I think I'm at about 25% performance based on the stats I get from the host vs. running the same tests in the guest.
I'm really surprised I can't find out more about it; there's not a lot of info on the forums/Google that I can find.
 
Same, I think I'm at about 25% performance based on the stats I get from the host vs. running the same tests in the guest.
I'm really surprised I can't find out more about it; there's not a lot of info on the forums/Google that I can find.
Oh yeah, I have read probably every thread/SO post/video and there never seems to be an actual "this is what's going on"; they just tail off, either with no answer or with the OP giving up. The improvements I made were definitely noticeable (GitHub Runner).
If the paid support weren't stupidly expensive (and per CPU socket - I have 6 in my homelab) I would buy it, but I can't really justify $2k (I could just buy a couple of Mac minis for those CI jobs and call it a day...)
 
So I guess what we'd all like to know is: how the heck are people getting 50%-80% IO performance in their VM guests? What are we all doing wrong?
Tune XFS on the host: on the raidset (1M stripe) and RAID controller (LSI: 1s flush, 5x rates 90%, direct I/O; HP: r/w cache 90/10%), at mkfs time (mkfs.xfs -n size=16384 /dev/...), at mount time (pqnoenforce,logbufs=8,logbsize=256k), via /proc and /sys, and tune the NFS mount if you use one.
After boot:
echo 1 > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio
echo 1000 > /proc/sys/vm/dirty_expire_centisecs
echo 300 > /proc/sys/vm/dirty_writeback_centisecs
echo 1000 > /proc/sys/fs/xfs/xfssyncd_centisecs
echo 25 > /proc/sys/vm/vfs_cache_pressure
echo 0 > /proc/sys/vm/swappiness
echo 2097152 > /proc/sys/vm/min_free_kbytes
echo 1 > /proc/sys/kernel/task_delayacct
d=<disk|nvme> # eg sdb, nvme1n1
echo mq-deadline > /sys/block/$d/queue/scheduler
echo 1024 > /sys/block/$d/queue/nr_requests
echo 4096 > /sys/block/$d/queue/read_ahead_kb
NFS mount options: vers=4.2,rsize=1048576,wsize=1048576,hard,proto=tcp,nconnect=4
After the NFS mount, for the mountpoint directory, e.g. m=/nfs-mnt:
echo 8192 > /sys/class/bdi/$(mountpoint -d $m)/read_ahead_kb
Create the VM, add a 2nd disk as VirtIO-block with cache=writeback; inside the guest run mkfs.xfs /dev/vd<b?> and mount /dev/vdb /usr2 (?),
then run fio on the host and fio in the VM.
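If the values above survive testing, the /proc/sys ones can be persisted via sysctl instead of being re-echoed after every boot. A minimal sketch (the file name is arbitrary; the /sys/block and bdi settings still need the echo lines or a udev rule):
Code:
# /etc/sysctl.d/90-storage-tuning.conf - same values as the echo lines above
vm.dirty_background_ratio = 1
vm.dirty_ratio = 10
vm.dirty_expire_centisecs = 1000
vm.dirty_writeback_centisecs = 300
fs.xfs.xfssyncd_centisecs = 1000
vm.vfs_cache_pressure = 25
vm.swappiness = 0
vm.min_free_kbytes = 2097152
kernel.task_delayacct = 1
# apply without a reboot: sysctl --system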
 
@waltar - we really appreciate your input, but this is something more fundamental. Two partitions off the same physical device produce massively different results, depending on whether the partition is being accessed via the host, or attached as storage for a VM to use.
 
Two partitions off the same physical device produce massively different results, depending on whether the partition is being accessed via the host, or attached as storage for a VM to use.
Where are these 2 partitions? I only have a minimal number of partitions on the OS disk, and the host's optional disks don't have any at all.
And the disk is attached to the host AND, via a virtual disk, to the VM at the same time - not one "or" the other.
 
I am experiencing exactly this too. I have done everything I can find; enabling huge pages helped by 15-18%, but even when I can get remotely decent read/write results, it still seems to lag, and latency increases horribly under any kind of load - sometimes to 500-2000 ms or more of access time. That can leave VMs almost unusable under any load unless nothing at all is running, which seems counterproductive if I can't run anything inside the VMs without massive performance losses building up quickly.

I am using NVMe drives and HGST enterprise drives. The host is extremely snappy and fast (i7-7820X + 96 GB DDR4) with no lag at all; using vGPU, GPU passthrough or neither doesn't seem to matter, but the VMs are barely on the edge of usable when anything is happening.

(All testing was done with a single VM running, to ensure nothing else is competing and causing the problems.)
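For anyone wanting to try the hugepages part, this is roughly what it looks like; treat it as a sketch - the VM ID (100) and the page count are examples, size them to your own VM.
Code:
# On the host: reserve 8192 x 2 MiB hugepages (= 16 GiB; size this to the VM's RAM)
echo 8192 > /proc/sys/vm/nr_hugepages
# Back VM 100's memory with 2 MiB hugepages (1024 selects 1 GiB pages instead)
qm set 100 --hugepages 2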
 
Everybody has to (and can) decide which storage to implement for PVE. As we use file storage, even remote over NFS, QEMU image files are what we end up with for VMs; that doesn't mean we don't test zvols, ZFS datasets and LVM, but those tests just confirm our decision when rated for overall usage and administration. Nevertheless, different requirements, circumstances and personal opinions lead to the many other possible solutions.
 
Everybody has to (and can) decide which storage to implement for PVE. As we use file storage, even remote over NFS, QEMU image files are what we end up with for VMs; that doesn't mean we don't test zvols, ZFS datasets and LVM, but those tests just confirm our decision when rated for overall usage and administration. Nevertheless, different requirements, circumstances and personal opinions lead to the many other possible solutions.
Be that as it may, I suspect what we are seeing here is quite likely a speed/latency issue within Proxmox itself, and that the issue is typically overcome by scale that many small home setups lack.

We are likely to find that once you hit a certain point, the degradation in performance is negligible compared to the raw performance of the hardware. But I am not sure where that line is, and hopefully this is something that can be resolved within Proxmox itself to lower latency and increase throughput in the virtualization layer.

With some of the latest updates, moving from 8.1 to 8.2.7, I have seen quite a large improvement here, but it is still too slow in my opinion.

Latency is still too high; the slightest load seems to overwhelm the system despite plenty of resources being available and unused, so it seems something is bottlenecking storage performance inside Proxmox.

Possibly a buffer/queue of sorts in Linux, or a driver, is causing the bottleneck/delay? I am not sure, but it is definitely measurable and something you can noticeably feel.
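One way to narrow that down is to watch block-layer latency on both sides at the same time while the guest is under load: if the guest's await climbs while the host device stays calm, the time is being lost in the virtualization layer rather than on the disk. The device names below are just examples (iostat is in the sysstat package):
Code:
# On the PVE host: extended stats for the physical disk, 1-second interval
iostat -x nvme1n1 1
# Inside the guest, in parallel: the same view of the virtual disk
iostat -x vdb 1
# Compare r_await/w_await and aqu-sz between host and guest under load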

Even running small applications that touch the disk, the VM becomes noticeably slower. Even with just a few MB/s of disk usage it feels like I am on an XP machine that is about to freeze but is still barely working, windows flashing "not responding", etc., because the system can't handle doing two things.

It also seems from benchmarks that other hypervisors do not suffer from this, or at least nowhere near to the same degree (Hyper-V, ESXi, etc.).
 
It also seems from benchmarks that other hypervisors do not suffer from this, or at least nowhere near to the same degree (Hyper-V, ESXi, etc.).
Do you happen to have a link to any relevant benchmarks in this regard? It would be especially interesting to compare results not just with Hyper-V and ESXi but also with Xen (another hypervisor engine) or other KVM-based frontends (so we know whether the problem lies in KVM or the way it's utilized in Proxmox VE).
 
Do you happen to have a link to any relevant benchmarks in this regard? It would be especially interesting to compare results not just with Hyper-V and ESXi but also with Xen (another hypervisor engine) or other KVM-based frontends (so we know whether the problem lies in KVM or the way it's utilized in Proxmox VE).
I will see if I can find them again and link them here; I was reading about it a week or two ago. (I have been trying to solve this problem for a while, to no avail.)

I would also like to see how Proxmox holds up vs. a Xen hypervisor; I hear they are better for Windows virtualization. For now I am on Proxmox for NVIDIA vGPU use.

It certainly could lie within KVM or even Linux itself and not Proxmox at all.

It really does feel like there is some sort of buffer/queue in the middle that just cannot keep up, based on all the tests I have been running trying to find what is actually causing the problem.

I have tried all the cache options, CPU settings, and disk settings in general, including every controller type; I just kept rebooting the VM with different configurations and nothing there helps.

So far, enabling hugepages helped throughput the most, giving an additional 15-18%.

I did have some success with iothreads by manually boosting the CPU priority of the process, but somehow the VM died a few hours later and was basically irrecoverable. It did seem to lower latency and more than double performance, although that was very short-lived. (Not sure why temporarily increasing IO priority would destroy the VM, but it seemed to.)
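Not recommending it given how that ended, but for completeness this is roughly what the priority boost looked like; the VM ID (100) is an example, and the pidfile path is the standard PVE location:
Code:
# Raise the scheduling priority of VM 100's QEMU process (negative nice = higher priority)
pid=$(cat /var/run/qemu-server/100.pid)
renice -n -5 -p "$pid"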

Edit: another thing I have recently noticed is that GPU VRAM read/write also maxes out at 2.5 GB/s, which is the max I can get from the NVMe drives too, and some tests read at 4 MB/s, which also lines up with some things reading incredibly slowly from the disks; GPU-intensive tasks also tend to cause the same slowdowns. Just adding this in case it helps diagnose exactly where the bottleneck is. (It could also be completely irrelevant, but it seems weird that the symptoms and bottlenecks are so similar.) One GPU is vGPU, the other is fully passed through to the VM; both max out at 2.5 GB/s despite being on PCIe 3.0 x8-x16 each.
 
You asked for some benchmarks:

As I am just reading here and not using Proxmox yet, these are from stock Debian 12, host and guest, reaching 75% of the native IOPS. The guest has only 1/2 GB RAM, the host 32 GB, so testing with a 2 GB file should be fine. 4-core Haswell w/o HT. QEMU version is 7.2.

Disk is a KINGSTON SA400S37240G, normally used only for booting (it replaced a USB thumb drive).
The partition is on LVM (thick).
Driver is virtio. Why does everyone here use paravirtualized SCSI?

So far I'm not using the newest features from https://kvm-forum.qemu.org/2024/IOThread_Virtqueue_Mapping_EGsYiZC.pdf, as they are not supported by the old QEMU.

Relevant parts of the configuration (libvirt used).
Code:
  <memory unit="KiB">524288</memory>
  <currentMemory unit="KiB">524288</currentMemory>
  <vcpu placement="static" current="2">4</vcpu>
  <iothreads>8</iothreads>
  <os>
    <type arch="x86_64" machine="pc-q35-7.2">hvm</type>
    <loader readonly="yes" secure="no" type="pflash">/usr/share/OVMF/OVMF_CODE_4M.fd</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/testqemu72_VARS.fd</nvram>
    <boot dev="hd"/>
  </os>

<disk type="block" device="disk">
  <driver name="qemu" type="raw" cache="none" io="native" discard="unmap"/>
  <source dev="/dev/vg9/testqemu72_root" index="2"/>
  <backingStore/>
  <target dev="vdb" bus="virtio"/>
  <alias name="virtio-disk1"/>
  <address type="pci" domain="0x0000" bus="0x05" slot="0x00" function="0x0"/>
</disk>

Host:

Code:
$ fio --filename=test --sync=1 --rw=randwrite --bs=4k --numjobs=1   --iodepth=4 --group_reporting --name=test --filesize=2G --runtime=300 --end_fsync=1 && rm test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=4
fio-3.33
Starting 1 process
test: Laying out IO file (1 file / 2048MiB)
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
Jobs: 1 (f=1): [w(1)][100.0%][w=4340KiB/s][w=1085 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=649612: Mon Nov 25 21:09:27 2024
  write: IOPS=1179, BW=4716KiB/s (4829kB/s)(1382MiB/300001msec); 0 zone resets
    clat (usec): min=669, max=21702, avg=845.91, stdev=688.95
     lat (usec): min=670, max=21702, avg=846.17, stdev=688.96
    clat percentiles (usec):
     |  1.00th=[  693],  5.00th=[  701], 10.00th=[  709], 20.00th=[  725],
     | 30.00th=[  734], 40.00th=[  742], 50.00th=[  758], 60.00th=[  766],
     | 70.00th=[  783], 80.00th=[  807], 90.00th=[  881], 95.00th=[ 1336],
     | 99.00th=[ 1532], 99.50th=[ 2073], 99.90th=[10945], 99.95th=[11600],
     | 99.99th=[17171]
   bw (  KiB/s): min= 2784, max= 5296, per=100.00%, avg=4719.56, stdev=317.31, samples=599
   iops        : min=  696, max= 1324, avg=1179.89, stdev=79.33, samples=599
  lat (usec)   : 750=45.41%, 1000=46.65%
  lat (msec)   : 2=7.43%, 4=0.07%, 10=0.07%, 20=0.37%, 50=0.01%
  cpu          : usr=0.45%, sys=4.54%, ctx=789903, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,353722,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=4716KiB/s (4829kB/s), 4716KiB/s-4716KiB/s (4829kB/s-4829kB/s), io=1382MiB (1449MB), run=300001-300001msec

Disk stats (read/write):
    dm-49: ios=0/2194694, merge=0/0, ticks=0/324232, in_queue=324232, util=90.52%, aggrios=0/1469908, aggrmerge=0/725417, aggrticks=0/270334, aggrin_queue=485601, aggrutil=90.17%
  sdc: ios=0/1469908, merge=0/725417, ticks=0/270334, in_queue=485601, util=90.17%

Guest

Code:
$ fio --filename=test --sync=1 --rw=randwrite --bs=4k --numjobs=1   --iodepth=4 --group_reporting --name=test --filesize=2G --runtime=300 --end_fsync=1 && rm test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=4
fio-3.33
Starting 1 process
test: Laying out IO file (1 file / 2048MiB)
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
Jobs: 1 (f=1): [w(1)][100.0%][w=3868KiB/s][w=967 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=838: Mon Nov 25 21:02:57 2024
  write: IOPS=892, BW=3570KiB/s (3655kB/s)(1046MiB/300002msec); 0 zone resets
    clat (usec): min=825, max=30214, avg=1117.87, stdev=471.30
     lat (usec): min=826, max=30214, avg=1118.19, stdev=471.30
    clat percentiles (usec):
     |  1.00th=[  898],  5.00th=[  922], 10.00th=[  947], 20.00th=[  979],
     | 30.00th=[ 1004], 40.00th=[ 1020], 50.00th=[ 1045], 60.00th=[ 1074],
     | 70.00th=[ 1106], 80.00th=[ 1139], 90.00th=[ 1254], 95.00th=[ 1745],
     | 99.00th=[ 2147], 99.50th=[ 2311], 99.90th=[ 4948], 99.95th=[ 6325],
     | 99.99th=[26870]
   bw (  KiB/s): min= 2024, max= 3920, per=100.00%, avg=3571.29, stdev=212.13, samples=599
   iops        : min=  506, max=  980, avg=892.80, stdev=53.04, samples=599
  lat (usec)   : 1000=29.05%
  lat (msec)   : 2=68.65%, 4=2.13%, 10=0.15%, 20=0.01%, 50=0.02%
  cpu          : usr=0.48%, sys=3.49%, ctx=789422, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,267730,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=3570KiB/s (3655kB/s), 3570KiB/s-3570KiB/s (3655kB/s-3655kB/s), io=1046MiB (1097MB), run=300002-300002msec

Disk stats (read/write):
  vdb: ios=110/841084, merge=0/546440, ticks=53/283294, in_queue=472551, util=90.44%
 
OK, I resized the LVs and the ext4 filesystem; similar results.

Host:
Code:
$ fio --filename=test --sync=1 --rw=randwrite --bs=4k --numjobs=1   --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 --end_fsync=1 && rm test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=4
fio-3.33
Starting 1 process
test: Laying out IO file (1 file / 10240MiB)
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
Jobs: 1 (f=1): [w(1)][100.0%][w=3739KiB/s][w=934 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=651591: Mon Nov 25 22:49:41 2024
  write: IOPS=1090, BW=4363KiB/s (4468kB/s)(1278MiB/300001msec); 0 zone resets
    clat (usec): min=667, max=21405, avg=914.71, stdev=680.58
     lat (usec): min=667, max=21405, avg=914.95, stdev=680.58
    clat percentiles (usec):
     |  1.00th=[  693],  5.00th=[  709], 10.00th=[  717], 20.00th=[  725],
     | 30.00th=[  734], 40.00th=[  750], 50.00th=[  758], 60.00th=[  775],
     | 70.00th=[  807], 80.00th=[ 1029], 90.00th=[ 1352], 95.00th=[ 1434],
     | 99.00th=[ 1876], 99.50th=[ 2343], 99.90th=[11469], 99.95th=[11731],
     | 99.99th=[15401]
   bw (  KiB/s): min= 2848, max= 5160, per=100.00%, avg=4365.65, stdev=421.01, samples=599
   iops        : min=  712, max= 1290, avg=1091.41, stdev=105.25, samples=599
  lat (usec)   : 750=42.03%, 1000=36.71%
  lat (msec)   : 2=20.59%, 4=0.24%, 10=0.09%, 20=0.33%, 50=0.01%
  cpu          : usr=0.40%, sys=4.35%, ctx=839293, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,327241,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=4363KiB/s (4468kB/s), 4363KiB/s-4363KiB/s (4468kB/s-4468kB/s), io=1278MiB (1340MB), run=300001-300001msec

Disk stats (read/write):
    dm-49: ios=0/2113928, merge=0/0, ticks=0/329768, in_queue=329768, util=91.20%, aggrios=0/1432032, aggrmerge=0/682740, aggrticks=0/276444, aggrin_queue=497385, aggrutil=90.81%
  sdc: ios=0/1432032, merge=0/682740, ticks=0/276444, in_queue=497385, util=90.81%

Guest:
Code:
$ fio --filename=test --sync=1 --rw=randwrite --bs=4k --numjobs=1   --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 --end_fsync=1 && rm test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=4
fio-3.33
Starting 1 process
test: Laying out IO file (1 file / 10240MiB)
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
Jobs: 1 (f=1): [w(1)][100.0%][w=3015KiB/s][w=753 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=475: Mon Nov 25 22:59:05 2024
  write: IOPS=817, BW=3269KiB/s (3347kB/s)(958MiB/300002msec); 0 zone resets
    clat (usec): min=887, max=28396, avg=1221.23, stdev=548.22
     lat (usec): min=888, max=28396, avg=1221.54, stdev=548.23
    clat percentiles (usec):
     |  1.00th=[  938],  5.00th=[  963], 10.00th=[  979], 20.00th=[ 1020],
     | 30.00th=[ 1045], 40.00th=[ 1074], 50.00th=[ 1090], 60.00th=[ 1123],
     | 70.00th=[ 1156], 80.00th=[ 1270], 90.00th=[ 1582], 95.00th=[ 2040],
     | 99.00th=[ 2474], 99.50th=[ 2671], 99.90th=[10159], 99.95th=[11863],
     | 99.99th=[17957]
   bw (  KiB/s): min= 1688, max= 3792, per=100.00%, avg=3271.11, stdev=267.70, samples=599
   iops        : min=  422, max=  948, avg=817.76, stdev=66.93, samples=599
  lat (usec)   : 1000=15.03%
  lat (msec)   : 2=79.16%, 4=5.53%, 10=0.18%, 20=0.10%, 50=0.01%
  cpu          : usr=0.41%, sys=3.41%, ctx=762874, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,245171,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=3269KiB/s (3347kB/s), 3269KiB/s-3269KiB/s (3347kB/s-3347kB/s), io=958MiB (1004MB), run=300002-300002msec

Disk stats (read/write):
  vdb: ios=73/801340, merge=0/511575, ticks=38/304685, in_queue=501296, util=90.80%
 
