Disk Speeds inside a VM.

As far as I know, it is still limited by the single-core speed of the processor. I also tested with an NVMe drive, a Samsung SSD 970 EVO Plus 500GB, which is unexpectedly slow in this test (perhaps QLC and large internal blocks), so the VM is as fast as native. (Debian 12, AMD Ryzen)

Host:
Code:
$ fio --filename=test --sync=1 --rw=randwrite --bs=4k --numjobs=1   --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 --end_fsync=1 && rm test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=4
fio-3.33
Starting 1 process
test: Laying out IO file (1 file / 10240MiB)
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
Jobs: 1 (f=1): [w(1)][100.0%][w=588KiB/s][w=147 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=209549: Tue Nov 26 23:34:07 2024
  write: IOPS=203, BW=816KiB/s (835kB/s)(239MiB/300002msec); 0 zone resets
    clat (usec): min=1933, max=20006, avg=4900.50, stdev=1947.35
     lat (usec): min=1933, max=20007, avg=4900.99, stdev=1947.47
    clat percentiles (usec):
     |  1.00th=[ 2704],  5.00th=[ 2769], 10.00th=[ 2802], 20.00th=[ 2900],
     | 30.00th=[ 2999], 40.00th=[ 3195], 50.00th=[ 5866], 60.00th=[ 6259],
     | 70.00th=[ 6456], 80.00th=[ 6587], 90.00th=[ 6915], 95.00th=[ 7177],
     | 99.00th=[ 9634], 99.50th=[10945], 99.90th=[13566], 99.95th=[14222],
     | 99.99th=[16450]
   bw (  KiB/s): min=  520, max= 1448, per=100.00%, avg=816.48, stdev=328.66, samples=599
   iops        : min=  130, max=  362, avg=204.05, stdev=82.21, samples=599
  lat (msec)   : 2=0.01%, 4=46.87%, 10=52.38%, 20=0.75%, 50=0.01%
  cpu          : usr=0.17%, sys=1.79%, ctx=123590, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,61171,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=816KiB/s (835kB/s), 816KiB/s-816KiB/s (835kB/s-835kB/s), io=239MiB (251MB), run=300002-300002msec

Disk stats (read/write):
    dm-6: ios=0/374388, merge=0/0, ticks=0/298244, in_queue=298244, util=94.44%, aggrios=74/437647, aggrmerge=0/0, aggrticks=40/402580, aggrin_queue=402720, aggrutil=94.29%
    dm-0: ios=74/437647, merge=0/0, ticks=40/402580, in_queue=402720, util=94.29%, aggrios=74/424993, aggrmerge=0/12654, aggrticks=36/380708, aggrin_queue=492026, aggrutil=89.05%
  nvme0n1: ios=74/424993, merge=0/12654, ticks=36/380708, in_queue=492026, util=89.05%

Guest:
Code:
$ fio --filename=test --sync=1 --rw=randwrite --bs=4k --numjobs=1   --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 --end_fsync=1 && rm test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=4
fio-3.33
Starting 1 process
test: Laying out IO file (1 file / 10240MiB)
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
Jobs: 1 (f=1): [w(1)][100.0%][w=680KiB/s][w=170 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=757: Tue Nov 26 23:39:48 2024
  write: IOPS=226, BW=906KiB/s (928kB/s)(265MiB/300005msec); 0 zone resets
    clat (usec): min=1815, max=22301, avg=4411.09, stdev=1864.26
     lat (usec): min=1815, max=22301, avg=4411.51, stdev=1864.31
    clat percentiles (usec):
     |  1.00th=[ 1991],  5.00th=[ 2073], 10.00th=[ 2147], 20.00th=[ 2245],
     | 30.00th=[ 2638], 40.00th=[ 2868], 50.00th=[ 5342], 60.00th=[ 5473],
     | 70.00th=[ 5735], 80.00th=[ 5932], 90.00th=[ 6259], 95.00th=[ 6587],
     | 99.00th=[ 8979], 99.50th=[10290], 99.90th=[12387], 99.95th=[13173],
     | 99.99th=[17695]
   bw (  KiB/s): min=  558, max= 1928, per=99.98%, avg=906.95, stdev=415.05, samples=599
   iops        : min=  139, max=  482, avg=226.71, stdev=103.78, samples=599
  lat (msec)   : 2=1.21%, 4=41.35%, 10=56.85%, 20=0.59%, 50=0.01%
  cpu          : usr=0.11%, sys=1.59%, ctx=179539, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,67964,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=906KiB/s (928kB/s), 906KiB/s-906KiB/s (928kB/s-928kB/s), io=265MiB (278MB), run=300005-300005msec

Disk stats (read/write):
  vdb: ios=16/206027, merge=29/142768, ticks=17/291859, in_queue=548255, util=96.05%
 
Last test: host on Debian trixie, so QEMU 9.1, without and with the new feature of multiple I/O threads per device. The disk is a file on an NFS share on a different system, exported with async, so a UPS is needed!

In the second test, the -device entry in QEMU has iothread-vq-mapping added.

Part from command-line of qemu:

Code:
-blockdev {"driver":"file","filename":"/vmimages/root","node-name":"libvirt-2-storage","read-only":false,"cache":{"direct":true,"no-flush":false}}
-device {"driver":"virtio-blk-pci",
"iothread-vq-mapping":[{"iothread":"iothread1"},{"iothread":"iothread2"},{"iothread":"iothread3"},{"iothread":"iothread4"}],
"bus":"pci.5","addr":"0x0","drive":"libvirt-2-storage","id":"virtio-disk1","write-cache":"on"}

Without iothread-vq-mapping:

Code:
$ fio --filename=test --sync=1 --rw=randwrite --bs=4k --numjobs=1   --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 --end_fsync=1 && rm test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=4
fio-3.33
Starting 1 process
test: Laying out IO file (1 file / 10240MiB)
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
Jobs: 1 (f=1): [w(1)][100.0%][w=1080KiB/s][w=270 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=507: Thu Nov 28 21:54:38 2024
  write: IOPS=284, BW=1140KiB/s (1167kB/s)(334MiB/300003msec); 0 zone resets
    clat (usec): min=1001, max=1329.4k, avg=3500.55, stdev=13418.76
     lat (usec): min=1002, max=1329.4k, avg=3501.77, stdev=13418.78
    clat percentiles (usec):
     |  1.00th=[  1188],  5.00th=[  1565], 10.00th=[  2540], 20.00th=[  2966],
     | 30.00th=[  3130], 40.00th=[  3261], 50.00th=[  3359], 60.00th=[  3458],
     | 70.00th=[  3556], 80.00th=[  3654], 90.00th=[  3851], 95.00th=[  3982],
     | 99.00th=[  5145], 99.50th=[  5604], 99.90th=[  6456], 99.95th=[ 72877],
     | 99.99th=[843056]
   bw (  KiB/s): min=    7, max= 3312, per=100.00%, avg=1168.31, stdev=318.65, samples=585
   iops        : min=    1, max=  828, avg=291.87, stdev=79.69, samples=585
  lat (msec)   : 2=6.40%, 4=88.73%, 10=4.80%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.54%, sys=4.26%, ctx=231047, majf=1, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,85476,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=1140KiB/s (1167kB/s), 1140KiB/s-1140KiB/s (1167kB/s-1167kB/s), io=334MiB (350MB), run=300003-300003msec

Disk stats (read/write):
  vdb: ios=0/261309, merge=0/178168, ticks=0/403720, in_queue=445327, util=87.53%


With iothread-vq-mapping:

Code:
$ fio --filename=test --sync=1 --rw=randwrite --bs=4k --numjobs=1   --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 --end_fsync=1 && rm test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=4
fio-3.33
Starting 1 process
test: Laying out IO file (1 file / 10240MiB)
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
Jobs: 1 (f=1): [w(1)][100.0%][w=1224KiB/s][w=306 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=512: Thu Nov 28 22:19:14 2024
  write: IOPS=358, BW=1432KiB/s (1467kB/s)(420MiB/300002msec); 0 zone resets
    clat (usec): min=907, max=1571.3k, avg=2784.75, stdev=14160.02
     lat (usec): min=907, max=1571.3k, avg=2785.81, stdev=14160.05
    clat percentiles (usec):
     |  1.00th=[   971],  5.00th=[  1012], 10.00th=[  1057], 20.00th=[  1221],
     | 30.00th=[  2376], 40.00th=[  2737], 50.00th=[  2868], 60.00th=[  2999],
     | 70.00th=[  3097], 80.00th=[  3228], 90.00th=[  3392], 95.00th=[  3556],
     | 99.00th=[  4686], 99.50th=[  5014], 99.90th=[  5932], 99.95th=[ 31065],
     | 99.99th=[943719]
   bw (  KiB/s): min=    7, max= 4024, per=100.00%, avg=1473.41, stdev=744.57, samples=583
   iops        : min=    1, max= 1006, avg=368.16, stdev=186.18, samples=583
  lat (usec)   : 1000=3.41%
  lat (msec)   : 2=23.53%, 4=71.25%, 10=1.74%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.59%, sys=4.59%, ctx=295965, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,107434,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=1432KiB/s (1467kB/s), 1432KiB/s-1432KiB/s (1467kB/s-1467kB/s), io=420MiB (440MB), run=300002-300002msec

Disk stats (read/write):
  vdb: ios=21/330686, merge=0/224819, ticks=25/392636, in_queue=426373, util=87.57%
 
So with all these tests, what does this actually mean? I still don't understand why the SSD used as the disk for my VMs runs at about 25% of its speed inside the guest compared to on the host.
 
Well, is it slow? Or is it just fast enough?
The fio test uses random, synced, small writes for a long time. In reality this load occurs only on heavily loaded SQL databases, which require high consistency. But even they do not write randomly, because new data goes first into a transaction log, which is written sequentially.

How much of your workload looks like this? Data is read more often than it is written, so caches help a lot, but they give less consistency when writing.

Disclaimer: the following suggestions are about performance, not consistency, (high) availability, or simple and fast administration.

Know and keep your stack small. My stack from the first message was:

Code:
Host

ext4 filesystem
logical volume
volume-group
uefi-partition
sata
ssd

Guest

ext4 filesystem
virtual disk
virtio
logical volume
volume-group
uefi-partition
sata
ssd
from the second message:
Code:
Host

ext4 filesystem
logical volume
volume-group
uefi-partition
pcie
nvme

Guest

ext4 filesystem
virtual disk
virtio
logical volume
volume-group
uefi-partition
pcie
nvme
and from the third message:
Code:
ext4 filesystem
virtual disk
virtio
1Gbit Ethernet
nfs4
nfs-kernel-server
block cache (because nfs exported with async)
imagefile
ext4 filesystem
logical volume
volume-group
uefi-partition
sata sata
ssd  hdd

Each layer can cost performance, depending on configuration and implementation. E.g. exporting NFS without "async" in the last example brings the IOPS (in this test) down from 250 to 10. Writing data twice (RAID, ZFS checksums, Ceph) costs half of the disk performance, and even copying data within RAM costs a lot of CPU time spent waiting on RAM.
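
For the NFS example, "async" versus "sync" is a single word in the export definition; a minimal sketch of /etc/exports (the client network is a placeholder, /vmimages is the path from my QEMU command line above):

Code:
# async: the server acknowledges writes before they reach the disk -
# fast, but a power loss can lose data (hence the UPS)
/vmimages 192.168.0.0/24(rw,async,no_subtree_check)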

Each layer must support multiple CPU cores and other forms of parallelism. My last test shows the improvement from using multiple I/O threads in the virtio driver.

Each layer must support the required features, e.g. discard, to keep the flash memory fast. The discard (assuming fixed block-device sizes) is initiated at the top layer.
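
In a Linux guest, for example, a periodic fstrim starts that discard (assuming every layer below, including the virtual disk, has discard enabled):

Code:
$ fstrim -av                            # trim all mounted filesystems that support it
$ systemctl enable --now fstrim.timer   # or run it weekly via systemd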

Even HDDs, SSDs, and NVMe drives run their own firmware (three ARM cores inside a Samsung MEX controller, for example) and can help with a cache (RAM or SLC) to speed up short writes.
A dedicated NVMe drive can also be attached to the VM directly over PCIe, using the IOMMU.
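
With Proxmox, such a passthrough is roughly one command (the PCI address and VM id are placeholders; IOMMU support must be enabled in firmware and kernel first):

Code:
$ lspci -nn | grep -i nvme                    # find the NVMe's PCI address on the host
$ qm set 100 -hostpci0 0000:01:00.0,pcie=1    # pcie=1 requires the q35 machine type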

Benchmark yourself, and find the weak point!
 
Personally, with all of my tests, I have found that real-world tasks rather than benchmarks are where you really see the loss, and it is latency-focused. One great example: indexing documents.

I have documents on my laptop and index them with a program called Recoll (great if you haven't used it, like a search engine for your file contents).

On my laptop, which stores the documents on a Seagate 2TB that was one of their flop drives and drops to 5 MB/s under sustained load, it took 3-4 hours, which, yes, is slow. But in a VM? It took more than 4 days. The latency lag adds up terribly. (The Proxmox host also has enterprise drives and far more power/performance, of course; it should have been done in an hour or two at most, not days.)

This is just one example. Performance loss inside VMs is actually massive no matter how you configure it, it seems.

Part of this, I feel, might be because of the latency lag: the queue fills up and the OS running inside the VM becomes easily overwhelmed if you are using SATA drives and not a SAS RAID or similar.

With NVMe it appears to be another issue, related to the driver, if you do not pass the whole NVMe device to the guest OS. We possibly need a new NVMe controller/driver option in Proxmox instead of SCSI, since they work differently, and I think it's possible that going from guest > SCSI driver > host > NVMe driver could be causing some of these issues.

(If anyone cares to test passing the whole NVMe device to the guest, I bet you will lose these issues.)

It is painfully apparent, in my experience at least: you see Windows flicker "not responding" a lot, glitches, even tiny loads causing unresponsiveness, etc.
 
I have documents on my laptop and index them with a program called Recoll (great if you haven't used it, like a search engine for your file contents).

On my laptop, which stores the documents on a Seagate 2TB that was one of their flop drives and drops to 5 MB/s under sustained load, it took 3-4 hours, which, yes, is slow. But in a VM? It took more than 4 days. The latency lag adds up terribly. (The Proxmox host also has enterprise drives and far more power/performance, of course; it should have been done in an hour or two at most, not days.)

Recoll seems to do its I/O in a single thread, so it is hit by the increased latency, while the fio tests from @Monstrous1704 in this thread were configured for 4 parallel requests (iodepth=4; note that fio caps the effective depth at 1 with the synchronous psync engine). Parallel requests to an SSD or NVMe drive are necessary even in tests on bare metal to reach the maximum IOPS.
The dd test @rofo69 ran is also single-threaded.
I did not expect it to make such a big difference myself, but when virtualization started 25 years ago, HDDs were common and I/O was not limited by using a single thread. Today even the Linux kernel has switched from synchronous to asynchronous operation in the block layer (as with networking) to keep up with NVMe, and QEMU 9 improves parallelism.
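
A variant of the test above that actually keeps multiple requests in flight needs an asynchronous engine; a sketch (the engine and depths are illustrative, not what was used above):

Code:
$ fio --filename=test --rw=randwrite --bs=4k --ioengine=libaio --direct=1 \
      --iodepth=16 --numjobs=4 --group_reporting --name=par \
      --filesize=10G --runtime=300 --end_fsync=1 && rm test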
 
Recoll seems to do its I/O in a single thread, so it is hit by the increased latency, while the fio tests from @Monstrous1704 in this thread were configured for 4 parallel requests (iodepth=4; note that fio caps the effective depth at 1 with the synchronous psync engine). Parallel requests to an SSD or NVMe drive are necessary even in tests on bare metal to reach the maximum IOPS.
The dd test @rofo69 ran is also single-threaded.
I did not expect it to make such a big difference myself, but when virtualization started 25 years ago, HDDs were common and I/O was not limited by using a single thread. Today even the Linux kernel has switched from synchronous to asynchronous operation in the block layer (as with networking) to keep up with NVMe, and QEMU 9 improves parallelism.
Thank you for pointing that out. I will have to modify it and see if I can't get a slight boost out of it, as I am currently using the Windows version, and only the Linux and macOS versions are listed as using multiple threads by default.

This showcases the latency impact pretty well. My laptop's Seagate drive is junk and very slow: it's an ST2000LM015, I believe, an SMR drive with a hardware flaw (I forget what it is specifically, but it happens to be one of the Seagate flop drives that shipped with a hardware fault). The drive goes abysmally slow under load after a few seconds and drops to a sustained read/write of 4-5 MB/s with pretty high latency, and it still runs 100x faster than the VM latency-wise. That is just sad, since Proxmox is running on an NVMe, the VM is on an NVMe, and the storage drive is an HGST enterprise drive that hits 250 MB/s with fairly low latency and no other load on it except the VM itself accessing the virtual disk it hosts for storage.

Is it possibly an isolation feature or something in Proxmox causing this? Or are there ways to give the guest more direct access without passing the device/controller through to the VM?
 
It's not PVE-specific; I've just done some tests with CrystalDiskMark (worst case Q1T1, then max IOPS Q32T16).
Hyper-V is equal. I tried VMware Workstation too: a little better, reads even better, but I guess a cache fakes it.
Screenshots attached.
On an HPE DL20 Gen11 / 1 x SATA SSD 480GB HPE Read Intensive (haven't looked at the vendor yet).
 

Is it possibly an isolation feature or something in Proxmox causing this? Or are there ways to give the guest more direct access without passing the device/controller through to the VM?

Proxmox is a manager for QEMU. I/O in QEMU needs additional CPU time for copying data from the VM to the host and back. How much data can a CPU copy? RAM delivers on the order of 10 GB/s; the data needs to be read and written, in addition to a copy into an area that DMA can reach.
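
A rough back-of-the-envelope, following that reasoning (the 10 GB/s figure is the assumption from above):

Code:
10 GB/s   usable RAM bandwidth (assumed)
   /2     each copied byte is read once and written once  ->  5 GB/s copied
   /2     plus a second copy into a DMA-reachable area    ->  ~2.5 GB/s effective

That is comfortably above SATA's ~500 MB/s, but within reach of what a fast NVMe can demand from a single core.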

Copying this data needs more than one CPU core, but QEMU up to version 8.x cannot use multiple threads per virtual device for this purpose, only a shared thread or one dedicated thread per device. QEMU 9.x can, given the appropriate configuration. For the highest performance, multiple I/O threads per device are needed, and also pinning threads to the same CPU, which allows the thread running a vCore and the thread copying the data to share the CPU's internal cache. So a lot of available CPU time is needed to get good results.
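
In libvirt terms, the pinning part could look roughly like this (a sketch; the CPU numbers are placeholders, chosen so that a vCPU and an I/O thread share a core or cache):

Code:
<iothreads>2</iothreads>
<cputune>
  <vcpupin vcpu="0" cpuset="2"/>
  <vcpupin vcpu="1" cpuset="3"/>
  <iothreadpin iothread="1" cpuset="2"/>
  <iothreadpin iothread="2" cpuset="3"/>
</cputune>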

For details see the blog http://blog.vmsplice.net/ or the slides I already linked.

On bare metal, the data also goes from the CPU cache to the PCIe bus and the disk.

So with SATA disks, limited to about 500 MB/s, one CPU core may manage to copy the data without adding too much latency, and 50 or 75% can be reached. For enterprise NVMe drives with a RAM cache, a single CPU core and ordinary RAM are just not fast enough. I cannot test this myself, as I have no appropriate hardware available.

You can run top -H -p ... on the host against the QEMU process to see the threads working, and may spot the limiting thread.
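
For example (the VM name and PID are placeholders):

Code:
$ pgrep -f 'qemu.*myvm'    # find the PID of the VM's qemu process
12345
$ top -H -p 12345          # -H lists the individual threads with their CPU usage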
 
The fio test with QEMU 9.1 on Debian 13 needs 6 to 8% host CPU (sys and usr added, from top -1) when run on the host, and 30 to 40% when run in the guest, delivering 323 and 221 IOPS respectively. This test uses only one I/O thread, so QEMU 9.1 does not really help here.
(The CPU is a Haswell, with some SATA SSD.)
So the CPU also limits the I/O performance.
 
So according to 'kvm --version', my QEMU version is:

QEMU emulator version 9.0.2 (pve-qemu-kvm_9.0.2-4)
Copyright (c) 2003-2024 Fabrice Bellard and the QEMU Project developers

How would I go about configuring Proxmox to use multiple threads to help increase the disk throughput in the VM?

I've tried changing AIO to "threads" (from the default io_uring), but that has made no tangible difference to disk speeds at all.

I also have the IO thread checkbox enabled... no difference.
 
So according to 'kvm --version', my QEMU version is:

QEMU emulator version 9.0.2 (pve-qemu-kvm_9.0.2-4)
Copyright (c) 2003-2024 Fabrice Bellard and the QEMU Project developers

How would I go about configuring Proxmox to use multiple threads to help increase the disk throughput in the VM?

I've tried changing AIO to "threads" (from the default io_uring), but that has made no tangible difference to disk speeds at all.

I also have the IO thread checkbox enabled... no difference.
Proxmox does not have a GUI for the new QEMU multiple-I/O-thread feature yet; I think you would need to configure the file manually to test it.

Now that QEMU 9.0.x is in PVE 8.3, it's on my list of things to test.
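
Until then, the generic escape hatch is the args: line in the VM config file, which appends raw arguments to the QEMU command line. A heavily hedged sketch (untested; defining the iothread objects is the easy half, and since the -device line Proxmox generates does not carry iothread-vq-mapping, the disk device itself would also have to be defined by hand via args):

Code:
# /etc/pve/qemu-server/<vmid>.conf
args: -object iothread,id=iothread1 -object iothread,id=iothread2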
 
