Redhat VirtIO developers would like to coordinate with Proxmox devs re: "[vioscsi] Reset to device ... system unresponsive"

Okay, here's a bit more information from our lab:
  • PVE pve-manager/8.2.2/9355359cd7afbae4 (running kernel: 6.8.4-3-pve)
  • Windows 2022, 4 Cores, 8G Memory
  • Virtio Driver Version: virtio-win-0.1.262-2.iso (latest, released 8/7/24)
  • Both aio=native and aio=io_uring reproduce the issue.
We're seeing three distinct signatures reported by QEMU/KVM that may indicate corruption in the queue memory shared between the guest and QEMU.

So far, we've been able to reproduce the issue with iothread enabled. We're going to perform some more focused testing without an iothread option.
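If it helps others cross-check, one rough way to look for the same signatures on the PVE host is to grep the journal for the message fragments quoted in this thread; the exact wording can differ between QEMU versions, so treat the patterns below as assumptions rather than a definitive list.

# Rough sketch (patterns are assumptions; adjust to your QEMU version):
journalctl -b -o short-iso | grep -E "Desc next is|zero sized buffers" | tail -n 50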


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thanks @RoCE-geek for your extensive testing.



This might appear to be true, but I don't think that is actually the case. I do acknowledge, though, that you are trying to establish an effective workaround in order to confidently use the product in production.



I should mention aio=native did not work for me. I had to use aio=threads, per https://bugzilla.kernel.org/show_bug.cgi?id=199727. In combination with VirtIO SCSI Single, this resulted in better performance for my workloads. YMMV I guess.
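For anyone wanting to try the same combination, a minimal sketch of the change from the PVE host is below; the VM ID and volume name are placeholders, and the VM needs a cold stop/start for the new disk options to take effect.

# Minimal sketch (VM ID 100 and the local-lvm volume name are placeholders):
qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi0 local-lvm:vm-100-disk-0,aio=threads,iothread=1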



Bold claim imho. It's worth mentioning the comments in GitHub issue #756 stating that the issue is not seen in RH and Nutanix environments, but note that Jon Kohler's comment mentions Nutanix's use of custom "host data path plumbing", which I think is somewhat telling.

Comment #14 in the kernel bug report mentions "aio=threads avoids softlockups because the preadv(2)/pwritev(2)/fdatasync(2) syscalls run in worker threads that don't take the QEMU global mutex. Therefore vcpu threads can execute even when I/O is stuck in the kernel due to a lock."
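As a side note, it's easy to verify which data path a VM is actually getting by looking at the QEMU command line Proxmox generates; a small sketch (the VM ID is a placeholder):

# Sketch: show the generated QEMU invocation and check the aio/iothread settings
qm showcmd 100 --pretty | grep -E "aio=|iothread"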

To me the root cause probably lies in the Debian (and maybe +Proxmox) implementation of virtio in relation to the QEMU global mutex. The driver issue might be coincident to a change not implemented in Debian (+Proxmox?) but implemented in RHEL, i.e. the driver might depend on a capability not present in Debian (+Proxmox?). As far as I'm aware the driver is platform agnostic, and probably does not interrogate for hypervisor capabilities.

Alternatively, there may have been a change in the Debian implementation coincident with the Bullseye release (as I mentioned above). IIRC, the Bullseye release falls between the 0.1.204 and 0.1.208 driver releases. Such a change may not be present in the RHEL implementation and thus not considered in the driver implementation. Similarly, the issue appears between RHEL releases 8.4 and 8.5.
Hi @benyamin, I'm very happy about your input, so let me add some comments.

Bold claim imho.
Yes, you're absolutely right, and I'm usually skeptical of such statements too.

But let's do some short recap:
  • This problem is not new. It has been in the PVE community for more than two years, although in a fragmented / isolated way.
  • I've found approx. 10 threads regarding "Desc next is 3", "virtio: zero sized buffers are not allowed" or "Reset to device".
  • Affected people are sad, often even desperate. Although others are willing to help, I've seen many more or less strange pieces of advice on how to solve it, or at least mitigate it. That was a first warning for me, as there is a mix of unrelated tips, often confusing, often illogical. At least everyone is trying to help, and that's very valuable, but it's hard for any newcomer to this problem - it's a mess.
  • I've checked all the relevant threads known to me, including the github ones and even those with more general issues.
  • And to be honest, many posts seemed to me like a discussion among the members of a gentlemen's "Old England Club".
    It doesn't make sense to me that so many smart people have been discussing this problem for so long without any serious resolution.
    It's full of assumptions, speculations, impressions and suspicions, but very few (more or less hard) facts. That was another serious warning for me.
  • After more than 20 years in enterprise IT, I've learned that endless theorizing, albeit in the best spirit and with sincere motivation, has never delivered a solution. I simply got the feeling that nobody wanted to get their hands dirty.
So I decided to "solve it" my way, as usual. And I have one secret weapon: in the "bug hunting" world, I expect nothing, but I'm ready for anything. In other words, when I'm deep in some topic, I don't care about the opinions of others. I'm immune to feelings, impressions, imaginings, etc. (even my own).

I just need some measure, some numbers to compare, and then to do the hard work. So only the "incidence analysis" counts.

When I realized that I'd found a "1/0" switch (in my environment), I had no emotions, but I had many doubts about whether it was enough. So I tried all the harder to shoot this conclusion down, but no way. It's fully reproducible on my side, and it's causal. The driver version rules.

If you find that 0.1.208 is (super) stable and the next higher version (0.1.215) is defective, what will you do?
I've tried some other versions < 0.1.208 and others > 0.1.215, but the conclusion was the same.
The "sweet spot" between 0.1.208 and 0.1.215 is simply there, and I can't ignore it.
And it was the same in the 2nd round, 3rd round, etc.

I don't presume to speculate, just to summarize the "facts" (although I can still have some doubts).

And the only "facts" I have are those I've described. I don't know whether the solution is in the diff between 0.1.208 and 0.1.215, but I'm quite sure there is a key there - a key to some definitive solution (or at least an explanation).

I should mention aio=native did not work for me.
That's not in contradiction with my findings. I've stated that this is just a mitigation.
The only stable solution (for me) is the vioscsi downgrade to 0.1.208, or VirtIO Block.

Last but not least: I'm open to any meaningful solution, or any incidence analysis from others.

If the tip with vioscsi 0.1.208 doesn't work for others, that's absolutely OK, but I'm not aware of more such "dirty hands".

But still, I can't ignore my findings, although I know very well that demonstrations like a "1/0 switch" are usually rare and suspicious.

I have deep respect for anyone who is working hard on this problem, but as for me, it's really time for a solution, not for more speculation.

I apologize if I've oversimplified some things in this post; the goal was not to disparage anyone (much less you), but to explain my way of thinking and problem solving.

So let's move ahead to a promising future :)
 
Just to add to the chorus - one of my Server 2022 guests does the same. I have tried many variations of attaching the disk via sata0, scsi0, then using the SCSI controller in single and non-single mode, virtio block, aio=threads, io_uring, etc. If a Windows update is installing, the system is sluggish, and if I open Event Viewer while the update is installing, eventvwr freezes briefly and there is an entry in the System log saying disk I/O was retried, which is quite worrying. I have other Server 2022 VMs on the same host which strangely don't do this, but perhaps their workload is just very different. The host is a quite powerful PowerEdge R7615, EPYC 9174F, PERC H965i with 4x Intel SSDs in RAID5. Guest disks are lvm-thin.
 
Okay, here's a bit more information from our lab:
  • PVE pve-manager/8.2.2/9355359cd7afbae4 (running kernel: 6.8.4-3-pve)
  • Windows 2022, 4 Cores, 8G Memory
  • Virtio Driver Version: virtio-win-0.1.262-2.iso (latest, released 8/7/24)
  • Both aio=native and aio=io_uring reproduce the issue.
We're seeing three distinct signatures reported by QEMU/KVM that may indicate corruption in the queue memory shared between the guest and QEMU.

So far, we've been able to reproduce the issue with iothread enabled. We're going to perform some more focused testing without an iothread option.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
Hi @bbgeek17, excellent news!

I'm very happy to see more "dirty hands" rising :)

And BTW, I have to seriously acknowledge your contribution to this forum in general.
It's all the more valuable as you're basically a "storage vendor representative", yet you are totally objective and dedicated to serving the community. You have my respect - please keep going.
 
Wanted to join the thread here, as we have been running into the same issue. We have standardized all high-I/O VMs (SQL and Exchange typically) on the 204 virtio release (we have not tried 208 but will add that).

We can easily reproduce it on all clusters; the Windows Server version does not matter (we have 2016 - 2022 in production). Cluster sizes vary, but the primary is 9 nodes, AMD EPYC 3rd Gen, TB RAM per node, and 105 NVMe OSDs with 25x4GB networking.

Would love a modern solution to this, but at this time 204 is working great for us, and as someone already stated, the 2019 drivers from 204 work fine on 2022.

Happy to help in any way we can - logs, etc.
 
Just to add to the chorus - one of my Server 2022 guests does the same. I have tried many variations of attaching the disk via sata0, scsi0, then using the SCSI controller in single and non-single mode, virtio block, aio=threads, io_uring, etc. If a Windows update is installing, the system is sluggish, and if I open Event Viewer while the update is installing, eventvwr freezes briefly and there is an entry in the System log saying disk I/O was retried, which is quite worrying. I have other Server 2022 VMs on the same host which strangely don't do this, but perhaps their workload is just very different. The host is a quite powerful PowerEdge R7615, EPYC 9174F, PERC H965i with 4x Intel SSDs in RAID5. Guest disks are lvm-thin.
Hi @carl0s, if the vioscsi driver downgrade doesn't help, and neither SATA nor VirtIO Block is a mitigation either, there may be some serious issue in the VM itself, and hence a complete reinstall is an option.
 
Wanted to join the thread here, as we have been running into the same issue. We have standardized all high-I/O VMs (SQL and Exchange typically) on the 204 virtio release (we have not tried 208 but will add that).

We can easily reproduce it on all clusters; the Windows Server version does not matter (we have 2016 - 2022 in production). Cluster sizes vary, but the primary is 9 nodes, AMD EPYC 3rd Gen, TB RAM per node, and 105 NVMe OSDs with 25x4GB networking.

Would love a modern solution to this, but at this time 204 is working great for us, and as someone already stated, the 2019 drivers from 204 work fine on 2022.

Happy to help in any way we can - logs, etc.
Hi @JCNED, welcome to the club :)

If you're on 0.1.204, it's absolutely OK, no action is needed. It's the closest lower version to 0.1.208.

0.1.204 seems to be stable too; I had just one hang, but the test was able to recover and finish successfully.

And yes, the drivers for Windows 2016 - 2022 are identical. The only exception is Win2025, but it's still in Preview, so we will see.
 
Looks like I wasn't sent a few post notifications, and after a double refresh here I see further comments above.

@RoCE-geek, I do agree it's a very frustrating problem. I hit the issue pretty quickly following migration from VMware in late 2022. Looking back, there was a long history of the problem, beginning in August 2021 as reported by @RolandK at that time. It's been frustrating to watch the problem be effectively ignored for the most part - at least that's been my perception. I too have a long history in enterprise IT - over 30 years - and sometimes this is just how it rolls when an issue is effectively worked around (usually when such a workaround provides a competitive advantage imho).

For me, the problem with using such a down-level, archaic driver revision is supportability and the related inherent risks. It is circa three years and a dozen revisions old. That just wouldn't fly in most enterprise shops, if only from a security perspective.

I noticed @bbgeek17's comments above and I am hopeful that an issue in the Debian/Proxmox QEMU implementation might reveal itself.

Anyway, hopefully it's getting some traction now...
 
Just to add to the chorus - one of my Server 2022 guests does the same. I have tried many variations of attaching the disk via sata0, scsi0, then using the SCSI controller in single and non-single mode, virtio block, aio=threads, io_uring, etc. If a Windows update is installing, the system is sluggish, and if I open Event Viewer while the update is installing, eventvwr freezes briefly and there is an entry in the System log saying disk I/O was retried, which is quite worrying. I have other Server 2022 VMs on the same host which strangely don't do this, but perhaps their workload is just very different. The host is a quite powerful PowerEdge R7615, EPYC 9174F, PERC H965i with 4x Intel SSDs in RAID5. Guest disks are lvm-thin.
@carl0s, I am very confident your problem will be resolved if you use aio=threads and iothread=1 with VirtIO SCSI Single on all VMs.
 
We can confirm @RoCE-geek's findings that:
a) the vioscsi.sys driver from ISO 0.1.208-1 does not produce the failure;
b) the next official ISO release, 0.1.215-1, easily produces the failure.

For those who need the mapping between ISO release and driver version:
  • virtio-win-0.1.208-1.iso : 100.85.104.20800 8/30/2021 [issue is not reproducible]
  • virtio-win-0.1.215-1.iso : 100.90.104.21500 12/2/2021 [issue confirmed]
  • virtio-win-0.1.262-2.iso : 100.95.104.26200 7/15/2024 [issue confirmed]
Through our testing, we can conclude that
  • the failure occurs with aio=io_uring, aio=native, and aio=threads
  • the failure occurs with and without an iothread.
I'll see if we can have our developers contact the virtio/Redhat folks to resolve this. It might take a bit due to ongoing PTO.

@benyamin - please note that we can reproduce the failure in our testing harness even with aio=threads, iothread=1, and virtio-SCSI-single. As such, we do not see this combination as a solution or workaround to the issue.
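For reference, cycling a dedicated test VM through the aio modes between runs can be scripted on the host; the sketch below is only an illustration of that idea (the VM ID, volume name and guest workload step are placeholders), not our actual harness.

# Illustrative sketch only (VM ID, volume and workload step are placeholders):
for mode in io_uring native threads; do
  qm stop 100
  qm set 100 --scsi0 local-lvm:vm-100-disk-0,aio=$mode,iothread=1
  qm start 100
  # ...run the guest I/O workload here and watch the host journal for errors...
done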


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
@benyamin - please note that we can reproduce the failure in our testing harness even with aio=threads, iothread=1, and virtio-SCSI-single. As such, we do not see this combination as a solution or workaround to the issue.
That's very interesting. Perhaps we are dealing with more than one issue. Is there an I/O threshold at which the issue occurs?

Can you share more info re your environment and reproducer...? What caching config are you using? Did you switch all disks on all VMs to aio=threads, iothread=1 and virtio-scsi-single? I note this is required.
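If it helps, a rough way to spot disks on a node that are still on another aio mode is sketched below; it only checks scsi*/virtio* disk lines and assumes disks without an explicit aio= option are on the io_uring default.

# Rough sketch (only scsi*/virtio* disk lines are checked):
for vmid in $(qm list | awk 'NR>1 {print $1}'); do
  qm config "$vmid" | grep -E '^(scsi|virtio)[0-9]+:' | grep -v 'aio=threads' | sed "s/^/VM $vmid: /"
done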

How does one account for the issue being triggered by I/O in another VM, sometimes not even a Windows VM...?

Also, why is this issue neither observed nor reported in FreeBSD, RH or Nutanix environments?

We found on various hardware platforms over 17 sites with 70+ servers that the aforementioned aio=threads combo resolved the issue in all but one case, where "kvm: Desc next is 4" was also observed. For that case it turned out to be a storage subsystem issue.

EDIT: Fixed immaterial typo.
 
Did you switch all disks on all VMs to aio=threads, iothread=1 and virtio-scsi-single? I note this is required.
Just to clarify, aio=threads, iothread=1 and virtio-blk can co-exist. I note I also did not test without iothread=1.
 
We can confirm @RoCE-geek's findings that:
a) the vioscsi.sys driver from ISO 0.1.208-1 does not produce the failure;
b) the next official ISO release, 0.1.215-1, easily produces the failure.

For those who need the mapping between ISO release and driver version:
  • virtio-win-0.1.208-1.iso : 100.85.104.20800 8/30/2021 [issue is not reproducible]
  • virtio-win-0.1.215-1.iso : 100.90.104.21500 12/2/2021 [issue confirmed]
  • virtio-win-0.1.262-2.iso : 100.95.104.26200 7/15/2024 [issue confirmed]
Through our testing, we can conclude that
  • the failure occurs with aio=io_uring, aio=native, and aio=threads
  • the failure occurs with and without an iothread.
I'll see if we can have our developers contact the virtio/Redhat folks to resolve this. It might take a bit due to ongoing PTO.

@benyamin - please note that we can reproduce the failure in our testing harness even with aio=threads, iothread=1, and virtio-SCSI-single. As such, we do not see this combination as a solution or workaround to the issue.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
Hi @bbgeek17, I'm very happy about such a confirmation. I had no other goal than this.

I was also thinking about sending an email to the most active contributor in the initial VirtIO GitHub thread - Vadim Rozenfeld (vrozenfe_at_redhat_dot_com - "vrozenfe") - but now I'm quite sure you're the right person. Many thanks for your effort - this is a small step for "us", but (really) a big step for the community (without any pathos).

And @benyamin - regarding aio=threads, my finding is that this option is usually not well understood. It seems like a "magical solution" for many mixed issues, but in reality it just hides the root cause. And this is the exact reason why I haven't been doing extensive testing with it, as its principle works against the goal of bug hunting, where I need to invoke the bad behavior clearly.

And one more observation - there is a difference between hardcore (synthetic) stress testing and real use cases. I also have one machine with SCSI (base) and aio=native. While I'm able to break it under a stress test, under real load it's stable so far (but anyway, it's a time-limited test, as I don't like playing roulette in production). In other words, this bug is all about probability and race conditions, exactly as @bbgeek17 stated first. And this is also a reason why DB servers are usually on the front line of incidence.

And regarding the security impact of "old" drivers - the most dangerous version is the one with some proven bad behavior. So if there is a "security first" policy, any driver/library with a proven bug is the biggest danger, at least by my measure.
 
@RoCE-geek, in your testing of aio=threads, did you have any disks on any VM using aio=native or aio=io_uring at the same time?
Well, it's hard to say. On one EPYC3 (VM102), there was only this one machine, so it's probable, but I cannot confirm with confidence. I can just try it once more as a dedicated test.

Update: To be correct, it's probable that this one VM had both disks on aio=threads.
 
Well, it's hard to say. On one EPYC3 (VM102), there was only this one machine, so it's probable, but I cannot confirm with confidence. I can just try it once more as a dedicated test.
That would be helpful if you can.

I think the theory - if you will permit it - is that by ensuring there is no disk I/O bound by the QEMU global mutex, the issue cannot be reproduced.
 
Hi @benyamin, a few quick answers:
- no caching is used in our testing. We also never recommend the use of caching (related to storage) for production.
- no, we did not switch all disks on all VMs to iothreads. We were not aware that a mass change was needed to achieve stability. We may perform more tests next week, but frankly we would rather concentrate on zooming in on the issue than on a workaround.
- no opinion on reproducibility on other OSes, as we did not try that.
- no opinion on cross-VM triggering; we did not observe this in our limited testing.

More testing could be helpful. However, given the reproducibility of the problem in a common use case - one that is also the default in the product to which this forum belongs - it seems more useful to try to narrow down the actual issue.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi @benyamin, a few quick answers:
- no caching is used in our testing. We also never recommend the use of caching (related to storage) for production.
- no, we did not switch all disks on all VMs to iothreads. We were not aware that a mass change was needed to achieve stability. We may perform more tests next week, but frankly we would rather concentrate on zooming in on the issue than on a workaround.
- no opinion on reproducibility on other OSes, as we did not try that.
- no opinion on cross-VM triggering; we did not observe this in our limited testing.

More testing could be helpful. However, given the reproducibility of the problem in a common use case - one that is also the default in the product to which this forum belongs - it seems more useful to try to narrow down the actual issue.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
Thanks for the follow-up @bbgeek17.

We share the same opinion re caching. I just wanted to check it wasn't hiding the issue as discussed in the kernel bug report.

Re root cause analysis: whether concentrating on the driver, or on Debian, QEMU, or Proxmox-specific patches, is the solution or the workaround is a matter of perspective, I guess. For me, because the driver works in other environments, fixing it would be the workaround. The virtio/RH team might still be willing to make a change, but I'm pretty sure the driver is platform agnostic, and there might be a few fingers pointing back our way before it gets resolved.

The kernel bug report is really worth a read. The OP, Gergely Kovacs, gives a concise synopsis at Comment 18, but practically every post in that thread is informative, especially those from Stefan Hajnoczi and Roland Kletzing (@RolandK I have presumed).
 
That would be helpful if you can.

I think the theory - if you will permit it - is that by ensuring there is no disk I/O bound by the QEMU global mutex, the issue cannot be reproduced.
@benyamin - I'm sorry, but it's still buggy.

VM102 - the only one VM on a PVE node
Drivers back to 0.1.240
SCSI Single
Both disks (scsi0 + scsi1) on aio=threads, iothread=1, ssd=1
EFI disk unchanged (as there is no such option for it)

It hung in the first run, in the initialization phase of the Write test for Q32T16.

The picture is from recent tests (SCSI Basic + Native), but the behavior is now the same for the "threads/single/iothread" setup: https://i.postimg.cc/JnR8yCQH/VM102-SCSI-Basic-Native.png
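For completeness, the corresponding part of /etc/pve/qemu-server/102.conf would look roughly like this (volume names and sizes are placeholders):

scsihw: virtio-scsi-single
scsi0: local-lvm:vm-102-disk-0,aio=threads,iothread=1,ssd=1,size=64G
scsi1: local-lvm:vm-102-disk-1,aio=threads,iothread=1,ssd=1,size=64G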
 
