Hi all, this is the result of my deep analysis of this problem over the last few days.
Just a quick recap for newcomers - if you're affected by sudden Windows VM hangs, reboots, file system errors and related I/O issues, especially with messages like:
Code:
PVE host:
QEMU[ ]: kvm: Desc next is 3
QEMU[ ]: kvm: virtio: zero sized buffers are not allowed
Windows guest:
Warning | vioscsi | Event ID 129 | Reset to device, \Device\RaidPort[X], was issued.
Warning | disk | Event ID 153 | The IO operation at logical block address [XXX] for Disk 1 (PDO name: \Device\000000XX) was retried.
this thread and this post are probably for you (under specific circumstances).
I'm quite confident in my findings, as they are reproducible, (almost) deterministic and analyzed/captured in three different setups.
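If you want to quickly check whether your PVE host has already logged these messages, a simple search of the system journal should be enough (assuming a default PVE install, where the QEMU/KVM messages end up in the journal):
Code:
# On the PVE host - search the current boot's journal for the typical KVM/virtio messages
journalctl -b | grep -Ei 'zero sized buffers|Desc next is'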
First of all -
there is (probably) nothing wrong with your storage (despite what some posts - here or in GitHub threads - suggest, though in good faith).
If storage stress-tests pass with correct drivers, with VirtIO Block or non-virtualized, then all the recommendations about modifying some Windows storage timeouts seem unfounded and don't make any sense to me. All my tests have been done with local storage and all are affected.
No iSCSI, no network storage, not even RDMA storage here. I'm absolutely sure about this.
TLDR:
- It's all about drivers.
- VirtIO SCSI drivers (vioscsi) up to 0.1.208 are stable. I can say that 0.1.208 (virtio-win-0.1.208.iso) is super stable. No problems so far, with extensive testing. And it's sufficient to apply only the "vioscsi" driver, not all the other drivers from the ISO (i.e. only the SCSI driver downgrade is needed).
- Drivers since 0.1.215 (virtio-win-0.1.215.iso) are crippled. Howgh. Note - this is the first ISO version (from January 2022) that includes dedicated drivers for Windows 2022 (although it's only a separate folder with a binary-identical version of the Windows 2019 driver).
- The most dangerous config is VirtIO SCSI Single with aio=io_uring (i.e. default). I'm able to crash it almost "on-demand" on each of my three independent setups.
If you're affected,
these are the steps to become stable:
- Update (i.e. downgrade) your "vioscsi" driver to 0.1.208. It's a definitive solution.
- If you cannot downgrade the „vioscsi“ driver (e.g. too many VMs, etc.), switch to VirtIO Block - at least for the problematic „data“ drives.
You'll lose some performance (but max. 20% in general with SSDs, and only under massive random load - many threads, high QL/QD), but don't worry - there's a performance boost coming with QEMU 9.0 (more at the end of this post). It's also a definitive solution.
- Get rid of „io_uring“ until the bug is fixed. Switch to „Native“, and if that doesn't work, switch also to the base VirtIO SCSI (not „Single“). But wait - this is just a partial mitigation, not a solution, though it may help a lot (a command-line sketch for these changes follows the note below).
Note: switching to VirtIO Block or to the base VirtIO SCSI looks like a storage change to the Windows guest. As a result, you'll probably need to switch the affected disks back online after the first boot.
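To make the last two steps more concrete, here is a minimal host-side sketch using qm set. The VM ID (102), the <path> storage placeholder and the volume name are taken from my example config further below and are assumptions for your setup - adjust them before running anything. The driver downgrade itself is done inside the guest, typically via Device Manager → Update driver, pointing at the mounted virtio-win-0.1.208.iso.
Code:
# Mitigation only: switch the data disk from io_uring to native AIO
qm set 102 --scsi1 <path>:vm-102-disk-2,aio=native,iothread=1,size=80G,ssd=1
# Optionally fall back from "VirtIO SCSI single" to the base "VirtIO SCSI" controller
qm set 102 --scsihw virtio-scsi-pci

# Definitive alternative: detach the data disk and re-attach it as VirtIO Block
# (after the detach, the volume shows up as an "unused" disk until it's re-attached)
qm set 102 --delete scsi1
qm set 102 --virtio1 <path>:vm-102-disk-2,iothread=1
Apply these with the VM shut down, or do a full stop/start afterwards - a reboot triggered from inside the guest is typically not enough for pending hardware changes to take effect.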
And now a deeper dive.
I have three identical VMs:
VM101, VM102 and VM103.
They are clones of the same original VM.
Windows Server 2022 ISO has been used for installation (updated 5/5/2023), no online Windows updates applied.
CrystalDiskMark for x64, version 8.0.5 (URL:
https://sourceforge.net/projects/crystaldiskmark/files/8.0.5/CrystalDiskMark8_0_5.exe/download).
Full VirtIO driver set installed (virtio-win-gt-x64.msi) + QEMU agent (qemu-ga-x86_64.msi),
version 0.1.240 (
virtio-win-0.1.240.iso).
Initial config of each VM is as follows:
Code:
bios: ovmf
boot: order=scsi0
cpu: x86-64-v2-AES
machine: pc-q35-8.1
memory: 8192
net0: virtio=<MAC>,bridge=vmbr0,firewall=1,tag=<VLAN>
agent: 1
ostype: win11
efidisk0: <path>:vm-XXX-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
scsi0: <path>:vm-XXX-disk-1,aio=io_uring,iothread=1,size=40G,ssd=1
scsi1: <path>:vm-XXX-disk-2,aio=io_uring,iothread=1,size=80G,ssd=1
scsihw: virtio-scsi-single
sockets: 1
cores: 4
numa: 0
scsi0 is the „system disk“ (with Windows installed),
scsi1 is the „data disk“, where the CrystalDiskMark tests are performed (with default NTFS format).
During the tests, the data disk's "aio" setting is subject to change, as well as the SCSI controller (switching from SCSI Single to the base SCSI) - see the snippet below.
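For clarity, these are the only config lines that differ between the scenarios below (the commented labels are my annotations, not valid config syntax):
Code:
# Scenarios 2 and 3 - aio switched from io_uring to native:
scsi1: <path>:vm-XXX-disk-2,aio=native,iothread=1,size=80G,ssd=1
# Scenario 3 - base VirtIO SCSI controller instead of Single:
scsihw: virtio-scsi-pci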
Each VM is running on different hardware (all SSDs are Enterprise / Datacenter editions):
- VM101 - AMD EPYC3 CPU (74F3), local ZFS RAID 50 (2x RAID-Z1 across 8 SATA SSDs), repo PVE Enterprise (with QEMU 8.1), low or medium load
- VM102 - AMD EPYC3 CPU (7343), LVM on local HW RAID - Broadcom MR9460-16i (RAID 5 across 8 SATA SSDs, stripe size 64kB), repo PVE Enterprise (with QEMU 8.1), empty node
- VM103 - ancient Intel XEON E5-2667 v4 (dual-socket), LVM on local HW RAID - Areca ARC-1226-8i (RAID 5 across 8 SATA SSDs, stripe size 64kB), repo PVE No-subscription (with QEMU 9.0), for failover and testing only
I’m able to demonstrate this issue on each VM, i.e. on three different servers with PVE, although the sensitivity is different.
Workload type -
CrystalDiskMark is used here, not as a benchmark, but purely as a disk stress-tool.
For this use-case it's more than sufficient, but some important settings are required:
- Use „NVMe SSD“ mode (menu Settings) for higher random load (up to Q32T16)
- For all tests, use Disk D only (aka scsi1 = „data disk“)
- Use the default 5 repetitions (it's usually sufficient)
- Use a sample file size of 8 - 32GB. Start with 8GB; a higher data size does not mean a higher probability of the issue
Fresh VM boot, > 5 mins. without activity.
These are some typical results (within the first run):
Scenario 1 -
aio=io_uring, controller
SCSI Single with „IO Thread“
This is the most buggy combination. The probability of a crippled (i.e. hung) test is very high (> 80% for each VM, and 100% across all three VMs)
VM101: https://i.postimg.cc/nrW05Yjn/VM101-SCSI-Single-IO-uring.png
VM102: https://i.postimg.cc/RVVWbx4C/VM102-SCSI-Single-IO-uring.png
VM103: https://i.postimg.cc/8zLgRs97/VM103-SCSI-Single-IO-uring.png
Scenario 2 -
aio=native, controller
SCSI Single with „IO Thread“
This is a more stable combo, but there's still at least one VM crippled and the total probability of a hang is > 70%
VM101: https://i.postimg.cc/gk20S88L/VM101-SCSI-Single-Native.png
VM103: https://i.postimg.cc/zfC5XGT1/VM103-SCSI-Single-Native.png
VM102 was OK in this run
Scenario 3 -
aio=native, controller
base SCSI (not Single; the „IO Thread“ option is useless with the base controller)
A decent combo, but there are still some crashes and the total probability of a hang is approx. 50-70%
VM102:
https://i.postimg.cc/JnR8yCQH/VM102-SCSI-Basic-Native.png
VM103: https://i.postimg.cc/FRG43CPW/VM103-SCSI-Basic-Native.png
VM101 was OK in this run
Scenario 4 -
aio=io_uring, controller
SCSI Single with „IO Thread“ (i.e. back to default Scenario 1), but with
„vioscsi“ driver 0.1.208 used
Super-stable combo (the same results with aio=native, base SCSI, etc. - it doesn't matter).
I wasn't able to break the 0.1.208 driver at all. I did have one test with 0.1.204 where the issue appeared, but surprisingly, after some time the test was able to continue (aka test resurrection) and finished normally (which was really interesting and very rare).
VM102 as an example of a successful test: https://i.postimg.cc/hvF8ngW6/VM102-SCSI-Single-IO-uring-0-1-208.png
Notes to the pictures:
- Each atomic test starts with the creation (i.e. write) of a random test file of the selected size.
In random access tests (with high QD/QL) this is usually visible as the first CPU peak.
In sequential tests no peaks are usually visible.
- When it freezes (i.e. a vioscsi hang), the CPU load drops to "zero" and the bandwidth (on the data disk) is also zero, but the utilization of disk D stays at 100% (of Active Time). There are also cases where Resource Monitor itself is suddenly crippled by this buggy IO behavior and keeps doing something for dozens of seconds (like the case in Scenario 1, VM103 - see the last stretch of higher CPU load).
- In almost all cases the problem occurs during the Read or Write of the RND 4kB Q32T16 test, and the corresponding (last frozen) text in the window is "Random Read (X/5)" or "Random Write (X/5)", so the high QD (512) is a proven killer. For example, if you see "Random Write (2/5)", there was a problem during (or at the end of) the 2nd attempt and you see 3 CPU peaks (including the initial one). But it can also mean that the 3rd attempt was not successfully started (which is the usual case).
Notes to the tests:
- If your test crashes even with simple sequential IO, downgrade vioscsi to 0.1.208 ASAP.
- After each hang or crash, a VM "Hard Stop" is required to get rid of this bug, followed by a fresh boot.
- If you kill the VM shortly after the hang, there will probably be no error from KVM/QEMU; even the Windows Event Log needs some time to catch this bug (but usually there are 2-3 warnings regarding "vioscsi" with Event ID 129).
- Contrary to some other reports, I've never seen a "broken" hypervisor, so every hang or crash was isolated and never affected other running VMs (luckily - but the other VMs are on shared storage, not local).
- In the first batch of my tests, scsi0 (the system disk) was on aio=native, and the incidence was the same. Still, I highly recommend keeping the system drive on "native", at least for this reason: the bug will be captured in the Event Log with higher probability (and the performance drop is zero or very low for such usage). The worst thing you can encounter is no logs at all, because the system drive hung too (or was the only one that hung).
- I've also done some tests with "aio=io_uring" without "IO Thread" (on SCSI Single), but the incidence was roughly the same, so I dropped this combo from further extensive tests.
Long story short - what are my tests showing:
- There’s nothing wrong with your storage.
- There’s nothing wrong with Proxmox itself.
- There’s no difference between Intel and AMD.
- It’s not true that this issue is common only on network storage. This is the definitive rebuttal. The local incidence is very high, across three different and independent servers / VMs.
- The root cause is some change in the „vioscsi“ driver stack between versions 0.1.208 and 0.1.215. That’s where everyone should look first, in my opinion. Even the newest driver (0.1.262) is still buggy.
So, there are no ghosts. For me it’s very easy to demonstrate the difference in stability between these two driver versions.
I’m quite sure that this problem is much more frequent than one may expect. What makes the difference in incidence is the load, mainly a very high QD/QL (IO queue depth / length), as the
random R/W test with Q32T16 is the „serial killer“ here (with an effective QD of 512). And this is the reason why DB servers in particular are affected first (such a high QD is not generally common elsewhere, and synchronous writes are required there).
All the problems regarding „io_uring“, reported by the community, are probably true. They are not fairytales.
But
„io_uring“ is just the most visible „victim“ of this vioscsi driver bug/change. So the problem is not „io_uring“ itself. With the correct driver it’s super-stable and the best performer for very high random/transaction load (in the range of Q32T16, for example). So generally it’s good that it’s the default for PVE - but not with a buggy driver.
A nice example of such a thread is here:
- starring
@chrcoluk:
https://forum.proxmox.com/threads/p...ive-io_uring-and-iothreads.116755/post-528359 - „My only feedback of significance here is I now routinely switch away from the default io_uring, was found to be the cause of data corruption in my windows guests.“
- and
@bbgeek17:
https://forum.proxmox.com/threads/p...ive-io_uring-and-iothreads.116755/post-528521, with this answer: „Our guidance towards aio=native is based on a history of anecdotal reports like yours.“
One may ask about the tests with
aio=threads. It’s also widely covered in that thread, but in general: it’s a good workaround-like solution, usually topping sequential performance, but not ideal for random access, as its real performance depends on system load, CPU core count, etc. I didn’t focus on it; it may help in some scenarios, but its results are not very reproducible.
Last but not least: I’m quite sure that my findings are very close to the real root cause (i.e. driver bug/change), but
I can’t extrapolate my knowledge to every use-case in the wild. So you can try my way and let all of us know. Of course, in the world of drivers, kernels, open source, etc., „the truth“ can always be inverted: „up to 0.1.208 there was a bug, which was resolved as of 0.1.215“. So I’m not hunting for the „culprit“, I just want to move quickly past this serious bug, which is the only „not enterprise-ready“ bug in the PVE world I know of (in other words, it’s a show-stopper for many enterprise setups if you don’t have a stable solution without many drawbacks).
Disclaimer: I’m posting this after three independent and extensive test runs, with a „hype-free“ pause in between. Results and observations remained the same; no serious volatility was observed. And a fun fact: no animal was hurt, but many NAND cells probably died. In total, I’ve run more than 50 tests, and each test ran 5 repetitions with a sample size of 8, 16 or 32GB per atomic test (
SEQ 1MB Q8T1,
SEQ 128kB Q32T1,
RND 4kB Q32T16,
RND 4kB Q1T1, for Read and Write), i.e. dozens of TBs were written.
And the bonus topic -
VirtIO Block is not dead! I’m very happy that PVE adopted
QEMU 9.0 in the non-enterprise repos so far (
https://forum.proxmox.com/threads/qemu-9-0-available-on-pve-no-subscription-as-of-now.149772). But in my opinion, one important option in PVE is missing, as one of the greatest advancements of QEMU 9.0 is that
„virtio-blk now supports multiqueue where different queues of a single disk can be processed by different I/O threads“ (
https://www.qemu.org/2024/04/23/qemu-9-0-0). It works in addition to „IO Thread“, but it needs to be activated via extra arguments/options - see
iothread-vq-mapping in „kvm -device virtio-blk-pci,help“
For more, see:
https://patchwork.kernel.org/project/qemu-devel/cover/20231220134755.814917-1-stefanha@redhat.com
up to:
https://patchwork.kernel.org/project/qemu-devel/patch/20231220134755.814917-5-stefanha@redhat.com
„Note that JSON --device syntax is required for the iothread-vq-mapping parameter because it's non-scalar.“
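For illustration only, here is a minimal sketch of what such a mapping could look like on a raw QEMU command line, based on the patch series and release notes linked above (PVE does not generate this yet; the file name, node name and iothread IDs are made up for this example, and only the multiqueue-relevant parts are shown):
Code:
qemu-system-x86_64 \
  -object iothread,id=iot0 \
  -object iothread,id=iot1 \
  -blockdev '{"driver":"raw","node-name":"disk0","file":{"driver":"file","filename":"/path/to/vm-disk.raw"}}' \
  -device '{"driver":"virtio-blk-pci","drive":"disk0","iothread-vq-mapping":[{"iothread":"iot0"},{"iothread":"iot1"}]}'
If I read the patches correctly, the virtqueues are then spread round-robin across the listed I/O threads, and an explicit "vqs" list per mapping entry can pin specific queues to specific threads.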
Update: Added example of a successful test (see Scenario 4) + Notes to the tests
CC
@fiona,
@fweber and
@t.lamprecht