Hi all, this is the result of my deep analysis of this problem over the last few days.
Just a quick recap for newcomers - if you're affected by sudden Windows VM hangs, reboots, file system errors and related I/O issues, especially with messages like:
Code:
PVE host:
QEMU[ ]: kvm: Desc next is 3
QEMU[ ]: kvm: virtio: zero sized buffers are not allowed
Windows guest:
Warning | vioscsi | Event ID 129 | Reset to device, \Device\RaidPort[X], was issued.
Warning | disk | Event ID 153 | The IO operation at logical block address [XXX] for Disk 1 (PDO name: \Device\000000XX) was retried.
this thread and this post are probably for you (under specific circumstances).
I'm quite confident in my findings, as they are reproducible, (almost) deterministic and analyzed/captured in three different setups.
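If you want to quickly check whether your PVE host has already logged these messages, a simple search of the system journal should be enough (assuming a default PVE install, where the QEMU/KVM messages end up in the journal):
Code:
# On the PVE host - search the current boot's journal for the typical KVM/virtio messages
journalctl -b | grep -Ei 'zero sized buffers|Desc next is'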
First of all -
there is (probably) nothing wrong with your storage (despite what some posts - here or in GitHub threads - suggest, though in good faith).
If storage stress-tests pass with correct drivers, with VirtIO Block or non-virtualized, then all the recommendations about modifying some Windows storage timeouts seem unfounded and don't make any sense to me. All my tests have been done with local storage and all are affected.
No iSCSI, no network storage, not even RDMA storage here. I'm absolutely sure about this.
TLDR:
- It's all about drivers.
- VirtIO SCSI drivers (vioscsi) up to 0.1.208 are stable. I can say that 0.1.208 (virtio-win-0.1.208.iso) is super stable. No problems so far, with extensive testing. And it's sufficient to apply only the "vioscsi" driver, not all the other drivers from the ISO (i.e. only the SCSI driver downgrade is needed).
- Drivers since 0.1.215 (virtio-win-0.1.215.iso) are crippled. Howgh. Note - this is the first ISO version (from January 2022) that includes dedicated drivers for Windows 2022 (although it's only a separate folder with a binary-identical version of the Windows 2019 driver).
- The most dangerous config is VirtIO SCSI Single with aio=io_uring (i.e. default). I'm able to crash it almost "on-demand" on each of my three independent setups.
If you're affected,
these are the steps to become stable:
- Update (i.e. downgrade) your "vioscsi" driver to 0.1.208. It's a definitive solution.
- If you cannot downgrade the „vioscsi“ driver (e.g. too many VMs, etc.), switch to VirtIO Block - at least for the problematic „data“ drives.
You'll lose some performance (but max. 20% in general with SSDs, and only under massive random load - many threads, high QL/QD), but don't worry - there's a performance boost coming with QEMU 9.0 (more at the end of this post). It's also a definitive solution.
- Get rid of „io_uring“ until the bug is fixed. Switch to „Native“, and if that doesn't work, switch also to the base VirtIO SCSI (not „Single“). But wait - this is just a partial mitigation, not a solution, though it may help a lot (a command-line sketch for these changes follows the note below).
Note: switching to VirtIO Block or to the base VirtIO SCSI looks like a storage change to the Windows guest. As a result, you'll probably need to switch the affected disks back online after the first boot.
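To make the last two steps more concrete, here is a minimal host-side sketch using qm set. The VM ID (102), the <path> storage placeholder and the volume name are taken from my example config further below and are assumptions for your setup - adjust them before running anything. The driver downgrade itself is done inside the guest, typically via Device Manager → Update driver, pointing at the mounted virtio-win-0.1.208.iso.
Code:
# Mitigation only: switch the data disk from io_uring to native AIO
qm set 102 --scsi1 <path>:vm-102-disk-2,aio=native,iothread=1,size=80G,ssd=1
# Optionally fall back from "VirtIO SCSI single" to the base "VirtIO SCSI" controller
qm set 102 --scsihw virtio-scsi-pci

# Definitive alternative: detach the data disk and re-attach it as VirtIO Block
# (after the detach, the volume shows up as an "unused" disk until it's re-attached)
qm set 102 --delete scsi1
qm set 102 --virtio1 <path>:vm-102-disk-2,iothread=1
Apply these with the VM shut down, or do a full stop/start afterwards - a reboot triggered from inside the guest is typically not enough for pending hardware changes to take effect.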
And now a deeper dive.
I have three identical VMs:
VM101, VM102 and VM103.
They are clones of the same original VM.
Windows Server 2022 ISO has been used for installation (updated 5/5/2023), no online Windows updates applied.
CrystalDiskMark for x64, version 8.0.5 (URL:
https://sourceforge.net/projects/crystaldiskmark/files/8.0.5/CrystalDiskMark8_0_5.exe/download).
Full VirtIO driver set installed (virtio-win-gt-x64.msi) + QEMU agent (qemu-ga-x86_64.msi),
version 0.1.240 (
virtio-win-0.1.240.iso).
Initial config of each VM is as follows:
Code:
bios: ovmf
boot: order=scsi0
cpu: x86-64-v2-AES
machine: pc-q35-8.1
memory: 8192
net0: virtio=<MAC>,bridge=vmbr0,firewall=1,tag=<VLAN>
agent: 1
ostype: win11
efidisk0: <path>:vm-XXX-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
scsi0: <path>:vm-XXX-disk-1,aio=io_uring,iothread=1,size=40G,ssd=1
scsi1: <path>:vm-XXX-disk-2,aio=io_uring,iothread=1,size=80G,ssd=1
scsihw: virtio-scsi-single
sockets: 1
cores: 4
numa: 0
scsi0 is the „system disk“ (with Windows installed),
scsi1 is the „data disk“, where the CrystalDiskMark tests are performed (with default NTFS format).
During the tests, the data disk's "aio" setting is subject to change, as well as the SCSI controller (switching from SCSI Single to the base SCSI) - see the snippet below.
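For clarity, these are the only config lines that differ between the scenarios below (the commented labels are my annotations, not valid config syntax):
Code:
# Scenarios 2 and 3 - aio switched from io_uring to native:
scsi1: <path>:vm-XXX-disk-2,aio=native,iothread=1,size=80G,ssd=1
# Scenario 3 - base VirtIO SCSI controller instead of Single:
scsihw: virtio-scsi-pci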
Each VM is running on different hardware (all SSDs are Enterprise / Datacenter editions):
- VM101 - AMD EPYC3 CPU (74F3), local ZFS RAID 50 (2x RAID-Z1 across 8 SATA SSDs), repo PVE Enterprise (with QEMU 8.1), low or medium load
- VM102 - AMD EPYC3 CPU (7343), LVM on local HW RAID - Broadcom MR9460-16i (RAID 5 across 8 SATA SSDs, stripe size 64kB), repo PVE Enterprise (with QEMU 8.1), empty node
- VM103 - ancient Intel XEON E5-2667 v4 (dual-socket), LVM on local HW RAID - Areca ARC-1226-8i (RAID 5 across 8 SATA SSDs, stripe size 64kB), repo PVE No-subscription (with QEMU 9.0), for failover and testing only
I’m able to demonstrate this issue on each VM, i.e. on three different servers with PVE, although the sensitivity is different.
Workload type -
CrystalDiskMark is used here, not as a benchmark, but purely as a disk stress-tool.
For this use-case it's more than sufficient, but some important settings are required:
- Use „NVMe SSD“ mode (menu Settings) for higher random load (up to Q32T16)
- For all tests, use Disk D only (aka scsi1 = „data disk“)
- Use the default 5 repetitions (it's usually sufficient)
- Use a sample file size of 8 - 32GB. Start with 8GB; a higher data size does not mean a higher probability of the issue
Fresh VM boot, > 5 mins. without activity.
These are some typical results (within the first run):
Scenario 1 -
aio=io_uring, controller
SCSI Single with „IO Thread“
This is the most buggy combination. The probability of a crippled (i.e. hung) test is very high (> 80% for each VM, and 100% across all three VMs)
VM101: https://i.postimg.cc/nrW05Yjn/VM101-SCSI-Single-IO-uring.png
VM102: https://i.postimg.cc/RVVWbx4C/VM102-SCSI-Single-IO-uring.png
VM103: https://i.postimg.cc/8zLgRs97/VM103-SCSI-Single-IO-uring.png
Scenario 2 -
aio=native, controller
SCSI Single with „IO Thread“
This is a more stable combo, but there's still at least one VM crippled and the total probability of a hang is > 70%
VM101: https://i.postimg.cc/gk20S88L/VM101-SCSI-Single-Native.png
VM103: https://i.postimg.cc/zfC5XGT1/VM103-SCSI-Single-Native.png
VM102 was OK in this run
Scenario 3 -
aio=native, controller
base SCSI (not Single; the „IO Thread“ option is useless with the base controller)
A decent combo, but there are still some crashes and the total probability of a hang is approx. 50-70%
VM102:
https://i.postimg.cc/JnR8yCQH/VM102-SCSI-Basic-Native.png
VM103: https://i.postimg.cc/FRG43CPW/VM103-SCSI-Basic-Native.png
VM101 was OK in this run
Scenario 4 -
aio=io_uring, controller
SCSI Single with „IO Thread“ (i.e. back to default Scenario 1), but with
„vioscsi“ driver 0.1.208 used
Super-stable combo (the same results with aio=native, base SCSI, etc. - it doesn't matter).
I wasn't able to break the 0.1.208 driver at all. I did have one test with 0.1.204 where the issue appeared, but surprisingly, after some time the test was able to continue (aka test resurrection) and finished normally (which was really interesting and very rare).
VM102 as an example of a successful test: https://i.postimg.cc/hvF8ngW6/VM102-SCSI-Single-IO-uring-0-1-208.png
Notes to the pictures:
- Each atomic test starts with the creation (i.e. write) of a random test file of the selected size.
In random access tests (with high QD/QL) this is usually visible as the first CPU peak.
In sequential tests no peaks are usually visible.
- When it freezes (i.e. a vioscsi hang), the CPU load drops to "zero" and the bandwidth (on the data disk) is also zero, but the utilization of disk D stays at 100% (of Active Time). There are also cases where Resource Monitor itself is suddenly crippled by this buggy IO behavior and keeps doing something for dozens of seconds (like the case in Scenario 1, VM103 - see the last stretch of higher CPU load).
- In almost all cases the problem occurs during the Read or Write of the RND 4kB Q32T16 test, and the corresponding (last frozen) text in the window is "Random Read (X/5)" or "Random Write (X/5)", so the high QD (512) is a proven killer. For example, if you see "Random Write (2/5)", there was a problem during (or at the end of) the 2nd attempt and you see 3 CPU peaks (including the initial one). But it can also mean that the 3rd attempt was not successfully started (which is the usual case).
Notes to the tests:
- If your test crashes even with simple sequential IO, downgrade vioscsi to 0.1.208 ASAP.
- After each hang or crash, a VM "Hard Stop" is required to get rid of this bug, followed by a fresh boot.
- If you kill the VM shortly after the hang, there will probably be no error from KVM/QEMU; even the Windows Event Log needs some time to catch this bug (but usually there are 2-3 warnings regarding "vioscsi" with Event ID 129).
- Contrary to some other reports, I've never seen a "broken" hypervisor, so every hang or crash was isolated and never affected other running VMs (luckily - but the other VMs are on shared storage, not local).
- In the first batch of my tests, scsi0 (the system disk) was on aio=native, and the incidence was the same. Still, I highly recommend keeping the system drive on "native", at least for this reason: the bug will be captured in the Event Log with higher probability (and the performance drop is zero or very low for such usage). The worst thing you can encounter is no logs at all, because the system drive hung too (or was the only one that hung).
- I've also done some tests with "aio=io_uring" without "IO Thread" (on SCSI Single), but the incidence was roughly the same, so I dropped this combo from further extensive tests.
Long story short - what are my tests showing:
- There’s nothing wrong with your storage.
- There’s nothing wrong with Proxmox itself.
- There’s no difference between Intel and AMD.
- It’s not true that this issue is common only on network storage. This is the definitive rebuttal. The local incidence is very high, across three different and independent servers / VMs.
- The root cause is some change in the „vioscsi“ driver stack between versions 0.1.208 and 0.1.215. That’s where everyone should look first, in my opinion. Even the newest driver (0.1.262) is still buggy.
So, there are no ghosts. For me it’s very easy to demonstrate the difference in stability between these two driver versions.
I’m quite sure that this problem is much more frequent than one may expect. What makes the difference in incidence is the load, mainly a very high QD/QL (IO queue depth / length), as the
random R/W test with Q32T16 is the „serial killer“ here (with an effective QD of 512). And this is the reason why DB servers in particular are affected first (such a high QD is not generally common elsewhere, and synchronous writes are required there).
All the problems regarding „io_uring“, reported by the community, are probably true. They are not fairytales.
But
„io_uring“ is just the most visible „victim“ of this vioscsi driver bug/change. So the problem is not „io_uring“ itself. With the correct driver it’s super-stable and the best performer for very high random/transaction load (in the range of Q32T16, for example). So generally it’s good that it’s the default for PVE - but not with a buggy driver.
A nice example of such a thread is here:
- starring
@chrcoluk:
https://forum.proxmox.com/threads/p...ive-io_uring-and-iothreads.116755/post-528359 - „My only feedback of significance here is I now routinely switch away from the default io_uring, was found to be the cause of data corruption in my windows guests.“
- and
@bbgeek17:
https://forum.proxmox.com/threads/p...ive-io_uring-and-iothreads.116755/post-528521, with this answer: „Our guidance towards aio=native is based on a history of anecdotal reports like yours.“
One may ask about the tests with
aio=threads. It’s also widely covered in that thread, but in general: it’s a good workaround-like solution, usually topping sequential performance, but not ideal for random access, as its real performance depends on system load, CPU core count, etc. I didn’t focus on it; it may help in some scenarios, but its results are not very reproducible.
Last but not least: I’m quite sure that my findings are very close to the real root cause (i.e. driver bug/change), but
I can’t extrapolate my knowledge to every use-case in the wild. So you can try my way and let all of us know. Of course, in the world of drivers, kernels, open source, etc., „the truth“ can always be inverted: „up to 0.1.208 there was a bug, which was resolved as of 0.1.215“. So I’m not hunting for the „culprit“, I just want to move quickly past this serious bug, which is the only „not enterprise-ready“ bug in the PVE world I know of (in other words, it’s a show-stopper for many enterprise setups if you don’t have a stable solution without many drawbacks).
Disclaimer: I’m posting this after three independent and extensive test runs, with a „hype-free“ pause in between. Results and observations remained the same; no serious volatility was observed. And a fun fact: no animal was hurt, but many NAND cells probably died. In total, I’ve run more than 50 tests, and each test ran 5 repetitions with a sample size of 8, 16 or 32GB per atomic test (
SEQ 1MB Q8T1,
SEQ 128kB Q32T1,
RND 4kB Q32T16,
RND 4kB Q1T1, for Read and Write), i.e. dozens of TBs were written.
And the bonus topic -
VirtIO Block is not dead! I’m very happy that PVE adopted
QEMU 9.0 in the non-enterprise repos so far (
https://forum.proxmox.com/threads/qemu-9-0-available-on-pve-no-subscription-as-of-now.149772). But in my opinion, one important option in PVE is missing, as one of the greatest advancements of QEMU 9.0 is that
„virtio-blk now supports multiqueue where different queues of a single disk can be processed by different I/O threads“ (
https://www.qemu.org/2024/04/23/qemu-9-0-0). It works in addition to „IO Thread“, but it needs to be activated via extra arguments/options - see
iothread-vq-mapping in „kvm -device virtio-blk-pci,help“
For more, see:
https://patchwork.kernel.org/project/qemu-devel/cover/20231220134755.814917-1-stefanha@redhat.com
up to:
https://patchwork.kernel.org/project/qemu-devel/patch/20231220134755.814917-5-stefanha@redhat.com
„Note that JSON --device syntax is required for the iothread-vq-mapping parameter because it's non-scalar.“
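For illustration only, here is a minimal sketch of what such a mapping could look like on a raw QEMU command line, based on the patch series and release notes linked above (PVE does not generate this yet; the file name, node name and iothread IDs are made up for this example, and only the multiqueue-relevant parts are shown):
Code:
qemu-system-x86_64 \
  -object iothread,id=iot0 \
  -object iothread,id=iot1 \
  -blockdev '{"driver":"raw","node-name":"disk0","file":{"driver":"file","filename":"/path/to/vm-disk.raw"}}' \
  -device '{"driver":"virtio-blk-pci","drive":"disk0","iothread-vq-mapping":[{"iothread":"iot0"},{"iothread":"iot1"}]}'
If I read the patches correctly, the virtqueues are then spread round-robin across the listed I/O threads, and an explicit "vqs" list per mapping entry can pin specific queues to specific threads.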
Update: Added example of a successful test (see Scenario 4) + Notes to the tests
CC
@fiona,
@fweber and
@t.lamprecht