Redhat VirtIO developers would like to coordinate with Proxmox devs re: "[vioscsi] Reset to device ... system unresponsive"

We’ve seen similar behavior, but we use cache=writeback in our environment for performance reasons. In our testing, only virtio-scsi driver versions 0.1.204 and earlier remain stable under high I/O. Anything from 0.1.208 onward becomes unstable under stress, and we can easily reproduce the instability by running multiple synthetic workloads with DiskSpd. Without cache=writeback, 0.1.208, 0.1.266 and 0.1.271 appear stable, but we don't want to give up writeback caching.
 
This has been my anecdotal experience as well. I installed 0.1.285 on a Windows/SQL host and it caused a lot of suspect virtio-related messages in Event Viewer. Rolling back to the previous version (0.1.271) seems to have cleared it up. The older, established bug-free versions (0.1.204 and I think .208) also still work well, though like the rest of us, I'm sure, I worry a little about missing the other unrelated bugfixes etc. in the newer versions.

Thanks for continuing your efforts here!


In our environment, we run dozens of Windows Server 2019/2022 VMs, and we don't see any issues with the 0.1.285 drivers. However, there is only one server with the MSSQL database engine (2019, if I'm not mistaken), and I'm not entirely sure whether it has been updated to virtio drivers version 0.1.285 or not.


P.S. @RoCE-geek
Maybe you can write a T-SQL script to reproduce the problem? I'm sure this can greatly speed up the solution.
 
P.S. @RoCE-geek
Maybe you can write a T-SQL script to reproduce the problem? I'm sure this can greatly speed up the solution.
This can be hard, because there's no clear initial condition. We have another two VMs with WS2025 and SQL2022, but there's low traffic/load, so these errors/issues are quite rare and there have been no service hangs so far.

But I'm quite sure that the majority of setups with WS2025 and SQL Server are affected. Even with low SQL load, there are similar reports/errors, at least a few dozen per week, but probably no one is aware of them. Only if you dig into the SQL logs or the Windows Application logs will you find similar reports. Still, they are all "just" informational, no warnings, no errors. Just a sign that something is buggy inside the storage/VM stack.

Last but not least, since it's not just about storage (the network may be affected as well), there's probably something more general in the Windows Server 2025 stack differences, i.e. maybe some general abstraction layer is the root cause.

But as always, a deep and comprehensive diff between 0.1.271 and 0.1.285 should be the starting point for the initial analysis.
 
I did some new tests, this time with an unreleased, Pre-WHQL (i.e. non-certified) vioscsi driver 0.1.292, found here: attestation-virtio-win-prewhql-0.1-292.zip

And it was quite fast: the bug is still present, still with the same high incidence, so even the new, not-yet-released driver is buggy. Those affected by the network issues can try the newer NetKVM driver, though.

So for the WS2025 and SQL Server combo, still go with 0.1.271 only; it's the "safe" rollback solution for the problems described here.
 
OK, I'm quite confident that I've found and isolated the problem (and will drop it on github soon).

So let's start with the analysis. First, be aware that virtio release dates do not correspond to the commit dates.
I mean that if you see e.g. that version 0.1.271 was released on fedorapeople.org on 2025-04-07, it was definitely not committed on that date.
If you're really interested in what is inside, you have to download the *.src.rpm package from the same path (here e.g. virtio-win-0.1.271-1).
This can be done within the Windows VM, no Linux needed. It's just a doubly-nested archive that 7-Zip can open, so keep unpacking until you see the final folder (e.g. virtio-win-prewhql-0.1-271-sources -> internal-kvm-guest-drivers-windows). This is the real source folder.
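With the 7-Zip command-line tool, the two unpacking layers can be sketched like this (the intermediate payload name varies per release, so treat these exact file names as placeholders):

```shell
7z x virtio-win-0.1.271-1.src.rpm   # first layer: RPM wrapper -> cpio payload
7z x *.cpio*                        # second layer: cpio -> source tree (incl. internal-kvm-guest-drivers-windows)
```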

Next, check the status.txt file immediately. The first surprise is that these fedora releases are not based on the public GitHub repo.
This is clearly visible, as this file refers to RHEL's internal git://git.engineering.redhat.com/users/vrozenfe/internal-kvm-guest-drivers-windows/.git
"vrozenfe", aka Vadim Rozenfeld, is the key maintainer of the whole virtio project, specially dedicated to storage + balloon + serial + gpu.

So in this file you'll see the latest commit first; for 0.1.271 it's something like this:

Date 14 Jan 2025
repo git://git.engineering.redhat.com/users/vrozenfe/internal-kvm-guest-drivers-windows/.git
tag mm291

Fixed issues:

RHEL-69076: Formatting viomem driver with clang-format
RHEL-69073: broken style fix for viofs driver
RHEL-69079: clang-format for vioserial folder
...

Now you have to check the official repo for the corresponding commits around the date: virtio-commits-master

After checking the content, we decided to go with the Jan 13, 2025 commits, as they contain all the RHEL issues mentioned in status.txt.
So we can conclude that the 0.1.271 build commit is 0e263be. And FYI, Jan 13, 2025 is also the driver date visible in Windows Device Manager.
You can also see the tag "mm291". These are RHEL's incremental tags, and I worked out that the virtio release number is probably the mm-tag number minus 20, so here it's 0.1.271. I don't know why, but it's not that important. The same holds for mm312, which corresponds to 0.1.292, etc.

And here you can see another problem: release timing. It's not driven by community needs, but by RHEL releases/milestones.
This is why I'm saying that Proxmox Server Solutions GmbH should fork the repo and maintain regular updates, i.e. based on user reports and/or bug-resolution importance. In the case of 0.1.271, there was an almost three-month pointless delay.

The same delay is true for 0.1.285: the corresponding commit bd965ef is from Jul 2, 2025, the version was built 7 days later (2025-07-09), but it was released on fedorapeople only on 2025-09-12, more than 2 months later.

OK, back to the core problem. I was sure that in 0.1.271 everything worked well (in terms of vioscsi with SQL Server and Windows Server 2025), and that the next published version, 0.1.285, was already crippled.

So we have to check commits > 0e263be (Jan 13) and <= bd965ef (Jul 2), and hope that limiting to vioscsi changes would be enough.

Looking at those vioscsi commits, there's only one suspicious one, related to the symptoms I described earlier:

Date (UTC on GitHub): 2025-03-05
Commit: 1bbc422 – “Address possible memory management issues when receiving interrupts for already completed requests”
What it changed: Reworks how completions are matched: introduces an SRB ID counter, uses that ID (cast to a pointer) as the virtqueue cookie, adds a free-list for SRB extensions, and guards against interrupts for “already completed” SRBs.
Why it’s suspicious for WS 2025: This is a hot path (ISR/DPC ↔ completion mapping). If any path leaves the id uninitialized or reused, or a race slips in, you can get a one-off bad read that succeeds on retry, very similar to your “succeeded after failing once” SQL Server messages. WS 2025 uses a newer Storport and can expose timing/locking bugs that WS 2022 never hits.

Sounds like rocket science? Maybe, but that's not the point. SRB is the omnipresent buzzword across the whole vioscsi stack, as it's the "SCSI Request Block". Almost all the bugs in vioscsi are related to SRB calls and their processing. And last year, when we were all fighting the "Reset to device, \Device\RaidPort[X], was issued." bugs, it was all about SRBs as well. At that time, after a brief period of hesitation and doubts (regarding my firm conviction that it was all in the driver itself), @benyamin quickly became the greatest vioscsi contributor. Based on his deep analysis and corresponding changes, that issue was successfully resolved, which is why 0.1.266 was a bug-free version (kudos to him, but it seems he's no longer active here).

In theory I had been hunting for bugs between 0.1.271 and 0.1.285, but I finally decided to simply work on the latest master (checked out a few days ago), which corresponds to the not-yet-released (and latest) 0.1.292. Ideally, I wanted one simple revert that would clearly mitigate the issue.

Spoiler: after almost 48 hours in production, I can say that reverting 1bbc422 is the only mitigation needed. So really, reverting this one commit solves the problem entirely. And while it's probably not your use case, the same bug was introduced into viostor as well (aka virtio-block), so for a universal mitigation, reverting c09af90 will be needed too.

OK, we can now revert and test the buggy commit, but what's the root cause? As I wrote in the first post: "some kind of race conditions (my expectation)". And really, that's more than valid here. While it's true that on Windows Server 2022 there are no such issues so far (according to others' reports), Storport in Windows Server 2025 has changed in favor of performance. It clearly scales much better on multi-core systems, so while on WS2022 there was low or zero risk that driver requests would run in parallel, the opposite is true for WS2025.

In other words, on WS2025’s Storport, the timing mix can cause a rare mismatch or stale completion → one read returns the wrong bytes (page checksum/pageid “wrong”), SQL Server retries immediately, the next read hits the right buffer and “succeeds after failing 1 time”. This also explains why WS2022 looks fine (different Storport timing), and why it shows up only with newer drivers.

OK, but is there a clear problem visible in that commit? Simply said: yes!

At least for me, and even more for my virtual coworker (GPT-5 Thinking), these two lines (1bbc422, vioscsi/vioscsi.c) are highly suspicious:

Code:
        srbExt->id = adaptExt->last_srb_id;
        adaptExt->last_srb_id++;

It's clear that the assignment and the increment are not one atomic operation (no lock, no interlocked op), so if this sequence is occasionally interleaved between two requesters, a read attempt can fail.

More details: on modern Storport (as in Windows Server 2025), StartIo can run on multiple CPUs/queues much more aggressively than on older builds, so two threads can actually read the same last_srb_id and hand out duplicate IDs. When the host later completes those I/Os, the driver can mismatch which SRB it completes, or fail to find one (“No SRB found for ID”), causing a transient bad read that SQL Server retries, which is exactly the “succeeded after failing 1 time” message mentioned above.

But please note: although it's not relevant to this specific bug, version 0.1.292 contains a fresh Sep 26, 2025 revert (commit a6d690a, Revert "NO-SDV [vioscsi] Reduce spinlock management complexity") of @benyamin's Nov 20, 2024 initial commit 15e64ac, which is already included in 0.1.271. The corresponding PRs are 1175 and 1293. So this commit/revert pair is also generally suspicious (check the threads), but as the initial commit is already present in the very well working 0.1.271, I can safely rule out any potential negative impact on this read-retry problem. And although this commit/revert is mainly about potentially buggy driver initialization (without expected runtime influence), you always have to be cautious about similar commits/reverts.


Long story short: I've analyzed, isolated, solved, validated and described the root cause of this read-retry issue on Windows Server 2025.
And I hope this long post will serve as an educational or motivational kick for you: bug hunting, iterating and driver building is definitely not rocket science, as I proved to myself by going from zero to hero out of necessity. My GPT-5 Thinking buddy further generated a probably viable code diff for the buggy commit 1bbc422, so this (my) revert is just a temporary solution.


Bottom line: I had absolutely zero previous experience with the Windows driver build process. But with the help of GPT-5, it was really easy, thanks also to this clear virtio-win driver guide. However, even though I have an EV code-signing certificate, it's still not possible to sign Windows Server drivers this way. This is really not a simple process for an individual, but it is quite feasible for a medium-sized company like Proxmox.

Technically, you will need to install and run the Windows Hardware Lab Kit (HLK), run the required driver tests, submit everything to the Microsoft Partner Center (this is where your EV cert is required, both for account authentication and for signing the submission CAB file), and then Microsoft returns a dashboard/WHQL signature if all is clear. As you can see, this is definitely a no-go for any bug hunting and testing.

Instead, there's a quite straightforward path: the shipped virtio source (I mean the *.src.rpm and the internal-kvm-guest-drivers-windows folder) already includes a test-signing certificate (and the corresponding signing batch files), so all you need to do is disable Secure Boot for the test Windows VM (in the UEFI/boot phase), then enable test signing (via bcdedit /set testsigning on), and at this point you can install your fresh driver build.
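The test-signing switch mentioned above, run from an elevated prompt inside the VM (a reboot is required for it to take effect, and Secure Boot must already be disabled or bcdedit will refuse the change):

```bat
bcdedit /set testsigning on
shutdown /r /t 0
rem when done testing, switch it back:
rem bcdedit /set testsigning off
```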

In my case, I had the complete 0.1.285 virtio drivers installed, and was only changing/testing the vioscsi driver (compiling a separate build for every single commit/revert). Here you should be careful to version the builds correctly (in the INF file). For instance, the baseline for 0.1.292 is 100.102.104.29200, so I created multiple subsequent driver versions: 100.102.104.29201, 100.102.104.29202, 100.102.104.29203, etc. It's really important to retain correct numbering, otherwise your driver list will become a complete mess (and you always need the correct upstream version binding).
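For illustration, the version lives on the DriverVer directive in the INF's [Version] section; the date and surrounding layout below are placeholders, only the last field of the four-part version string needs bumping per local build:

```ini
[Version]
; hypothetical excerpt of vioscsi.inf - bump the last field for each local build
DriverVer = 07/02/2025,100.102.104.29201
```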