>10x worse random I/O performance within Windows guest using "host" vs. native on host hardware

sbrocket

New Member
Jun 23, 2025
I've been setting up a new Windows 11 guest and struggling to debug poor I/O performance within the guest. Qualitatively, "real usage" I/O performance in the guest seems sluggish, with apps launching slowly and operations that are snappy on similarly spec'd non-virtualized Windows devices showing noticeable latency. Quantitatively, I used CrystalDiskMark within the guest to get some basic measurements, and it appears that high-queue-depth random read performance (RND4K QD32T1) is where the guest is struggling in particular.

I've tried a number of different configurations now in an attempt to isolate the most relevant variable:
  • different NVMe drives, swapping out the Corsair P3 Plus QLC drive I'd initially repurposed for a Samsung 990 PRO, without significant improvement,
  • VM images on ZFS vs LVM with both thin and thick provisioning, which made slight but not significant differences probably within measurement error,
  • various guest settings other than the generally recommended Virtio SCSI Single w/ IO Thread (example below), all of which were worse.
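(For reference, the recommended setup mentioned in that last bullet looks roughly like the following from the CLI; the VM ID and storage/volume names are just examples from my own setup, not a prescription.)

Code:
# VirtIO SCSI single controller; iothread=1 is the "IO Thread" checkbox in the GUI
qm set 103 --scsihw virtio-scsi-single
qm set 103 --scsi0 local-zfs:vm-103-disk-1,iothread=1,discard=on,ssd=1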
My second-most-recent test was to clone the guest's drive from the VM image over to a dedicated NVMe drive and pass that entire drive through to the guest using PCI passthrough. The goal of this test was to isolate the Windows Virtio drivers as a potential cause; given all the other variables I'd eliminated to this point, some issue with the Windows Virtio driver seemed like the next most likely guess. But that test showed similar performance between Virtio SCSI and the standard Microsoft NVMe driver, so still no hint as to the root cause.

Since at this point I had the Windows boot drive cloned over to a physical NVMe drive (the Corsair P3 Plus I'd had spare), the next test was to boot the same drive used within the Proxmox guest natively on the host hardware. That test is the first one to show a significant difference on the RND4K QD32T1 test, and a massive one at that: on reads, ~60-70 MB/s within the guest vs. ~800 MB/s for the same Windows install booted natively on the same hardware. Sequential reads and writes show similar results between native and guest, no worse than I might expect from virtualization overhead, but random read and write performance is horrible within the guest.
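(In case anyone wants to reproduce a roughly comparable number from the Proxmox host itself, something like the fio run below should approximate CrystalDiskMark's RND4K Q32T1 read pass. The device path is just an example, and --readonly keeps fio from writing to the drive.)

Code:
fio --name=rnd4k-q32t1-read --filename=/dev/nvme1n1 --readonly \
    --direct=1 --ioengine=libaio --rw=randread --bs=4k --iodepth=32 \
    --numjobs=1 --runtime=30 --time_based --group_reporting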

(Attachments: cdm_guest.png, cdm_native.png)

Note that this test was without Virtio, ZFS/LVM, or any of that in the path to the drive; the entire NVMe drive is passed through and uses the default Windows NVMe driver. (That said, these guest results are about the same as previous tests using Virtio SCSI with a raw image either in a ZFS zvol or an LVM LV on the same NVMe drive.) The Windows guest run was performed with all other guests stopped and with all 28 vCPUs available to the Windows guest. The native test was performed immediately afterwards by simply shutting down the guest and Proxmox and then booting the same drive natively. No changes were made other than whatever Windows performs automatically when it detects hardware changes; this was not the first time the drive had been booted natively, and unlike the first native boot, Windows did not appear to make any such changes between these two runs.
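(Side note: to confirm the host really hands the NVMe controller to vfio-pci while the guest is running, i.e. that no host driver sits in the I/O path, checking the kernel driver binding should be enough; 04:00.0 is the address from my hostpci1 line below.)

Code:
lspci -nnk -s 04:00.0
# while the guest is running, this should report: Kernel driver in use: vfio-pci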

This seems to imply that something about the virtualization itself is impacting the random read I/O performance of the Windows guest. I haven't seen this noted clearly as expected behavior anywhere else (though there are of course lots of scattered reports of poor I/O performance that often lack details). While I might expect some performance impact due to the inherent overhead of virtualization, this difference seems much too large to be that alone...right?

Is this at all expected? Any tips on what to try next? I'm stumped as to how to diagnose this further. I'm just a prosumer homelab-type user with flexible needs, but this is still significant enough that it may rule out virtualizing Windows entirely for me. It's hard for me to believe that people would be happy with such a drastic I/O performance hit, or that this is typical of VM performance in 2025.

Config for the Windows guest is below. The two unused disks are old clones from previous tests, hostpci0 is an Nvidia GTX 1080 GPU passed through to the guest (irrelevant as far as I know, but noted for completeness), and hostpci1 is the NVMe boot drive. The physical CPU is a single Intel i7-14700K. The one remaining significant difference between guest and native is that the guest is configured with 24 GiB of RAM while the native boot has the hardware's full 128 GiB available, but it's hard to believe that's relevant when the guest has 24 GiB and was booted shortly before this test.

Code:
❯ qm config 103
agent: 1
bios: ovmf
boot: order=hostpci1
cores: 28
cpu: host
efidisk0: local-zfs:vm-103-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:01:00,pcie=1,x-vga=1
hostpci1: 0000:04:00,pcie=1
machine: pc-q35-9.2+pve1
memory: 24576
meta: creation-qemu=9.2.0,ctime=1750044503
name: WinVM
net0: e1000e=BC:24:11:73:12:E7,bridge=vmbr0
numa: 0
ostype: win11
scsihw: virtio-scsi-single
smbios1: uuid=1d966852-58ee-45dd-a184-08ca499493ae
sockets: 1
tpmstate0: local-zfs:vm-103-disk-2,size=4M,version=v2.0
unused0: local-zfs:vm-103-disk-3
unused1: local-lvm:vm-103-disk-0
vmgenid: c28ad20a-b7a4-49e8-9f7f-21aac8b41c78
 
I stumbled across https://forum.proxmox.com/threads/t...-of-windows-when-the-cpu-type-is-host.163114/ shortly after posting this and gave changing my CPU type to x86_64-v3 a try, and that seems to be the ticket! I/O performance in the guest is up to near-native performance after changing the guest's configuration from "host" to "x86_64-v3", which is the highest my i7-14700k seems to support.
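For anyone else finding this later, the change itself is just the CPU type on the VM; from the CLI it's a one-liner (note that the config value appears to be spelled with hyphens, x86-64-v3, even though I've been writing it with an underscore):

Code:
qm set 103 --cpu x86-64-v3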

(Attachment: cdm_guest_x86_64-v3.png)

It's a bit frustrating that the documented best practices (https://pve.proxmox.com/wiki/Windows_11_guest_best_practices) point you in the complete opposite direction from this, at least if you're interested in using WSL within the VM like I am! That said, using "x86_64-v3" does seem to have broken nested virtualization for a WSL install that was working just fine with "host", so I still have that problem to solve now. o_O
 
That said, using "x86_64-v3" does seem to have broken nested virtualization
That's expected. The x86_64-* models will never enable nested virtualization by default. Nested virtualization is not the most common use case; I think there are only a few use cases for it. You can create your own custom CPU model:
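A rough sketch of what that can look like in /etc/pve/virtual-guest/cpu-models.conf (the model name here is arbitrary, and I'm assuming reported-model accepts the x86-64-v3 base; if it doesn't, one of the Skylake-Client models should work instead):

Code:
# /etc/pve/virtual-guest/cpu-models.conf (create the file if it doesn't exist)
cpu-model: x86-64-v3-vmx
    flags +vmx
    reported-model x86-64-v3

Custom models are then referenced in the VM config with a "custom-" prefix, e.g.:

Code:
qm set 103 --cpu custom-x86-64-v3-vmx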
 

Yep I discovered that, thanks!

I'm a bit surprised that nested virtualization seems to be treated like an exotic or at least uncommon use case, honestly, but I suppose that's a narrow view based primarily on WSL being a tool I reach for frequently on Windows installs. It's not that surprising that nested virtualization use cases other than WSL are uncommon.

I did see that thread, and I'm going to try creating a custom CPU model based on either x86_64-v3 or Skylake-Client-v4, which is a closer match for my host processor. The main question I have now is whether "+vmx" alone, as that thread suggests, is likely to have little downside, or whether it's worth going beyond that and also trying to match the vmx flags of my host processor.
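If it helps anyone following along, my rough plan for that comparison (just what I intend to try, nothing authoritative) is to look at the VMX sub-features the host kernel reports, which recent kernels print on a dedicated "vmx flags" line in /proc/cpuinfo, and compare against the feature flags of the QEMU model I base the custom model on:

Code:
# VMX sub-features the host CPU exposes (one line per logical CPU, so the first match is enough)
grep -m1 'vmx flags' /proc/cpuinfo

# CPU models and feature flags known to the QEMU build Proxmox ships
qemu-system-x86_64 -cpu help | less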