OK, I'm quite confident that I've found and isolated the problem (and will drop it on GitHub soon).
So let's start with the analysis. First, be aware that virtio release dates have nothing to do with the commit dates. I mean that if you see e.g. version 0.1.271 released on fedorapeople.org on 2025-04-07, it was definitely not committed on that date.
If you're really interested in what is inside, you have to download the *.src.rpm package from the same path (here e.g. virtio-win-0.1.271-1).
This can be done within the Windows VM itself, no Linux needed. It's just a doubly nested archive that 7-Zip can open, so keep unpacking until you see the final folder (e.g. virtio-win-prewhql-0.1-271-sources -> internal-kvm-guest-drivers-windows). This is the real source folder.
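If you prefer the command line over clicking through the 7-Zip GUI, the peeling looks roughly like this (file names after the first step are illustrative - just keep extracting whatever pops out):
Code:
:: A .src.rpm is a compressed cpio archive in disguise; 7-Zip peels it in layers.
7z x virtio-win-0.1.271-1.src.rpm
7z x virtio-win-0.1.271-1.cpio
:: ...then extract the inner sources archive (virtio-win-prewhql-0.1-271-sources)
:: until you reach internal-kvm-guest-drivers-windows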
Next, check the status.txt file immediately. The first surprise is that these Fedora releases are not based on the public GitHub repo. This is clearly visible because this file refers to Red Hat's internal git://git.engineering.redhat.com/users/vrozenfe/internal-kvm-guest-drivers-windows/.git
"vrozenfe", aka Vadim Rozenfeld, is the key maintainer of the whole virtio project, especially dedicated to storage + balloon + serial + gpu.
So in this file you'll see the latest commit first; for 0.1.271 it's something like this:
Date 14 Jan 2025
repo git://git.engineering.redhat.com/users/vrozenfe/internal-kvm-guest-drivers-windows/.git
tag mm291
Fixed issues:
RHEL-69076: Formatting viomem driver with clang-format
RHEL-69073: broken style fix for viofs driver
RHEL-69079: clang-format for vioserial folder
...
Now you have to check the official repo for the corresponding commits around the date:
virtio-commits-master
After checking the content, we can decide to go with the Jan 13, 2025 commits, as they contain all the RHEL issues mentioned in status.txt.
So we can conclude that the 0.1.271 build commit is 0e263be. And FYI, Jan 13, 2025 is also the driver date visible in Windows Device Manager.
You can also see the tag "mm291" - these are incremental internal RHEL tags, and I worked out that the virtio release number is probably the mm-tag number minus 20: here 291 - 20 = 271, i.e. 0.1.271. I don't know why, but it's not so important. The same holds for mm312, which corresponds to 0.1.292, etc.
And here you can see another problem - release timing. It's not driven by community needs, but by RHEL releases / milestones. And this is why I'm saying that Proxmox Server Solutions GmbH should fork the repo and maintain regular updates, i.e. based on user reports and/or bug-resolution importance. In the case of 0.1.271, there was an almost 3-month pointless delay.
The same delay is true for 0.1.285 - the corresponding commit bd965ef is from Jul 2, 2025, the version was built 7 days later (2025-07-09), but it was released on fedorapeople.org only on 2025-09-12, so more than 2 months later.
OK, back to the core problem. I was sure that in 0.1.271 everything was working well (in terms of vioscsi with SQL Server on Windows Server 2025), and in the next published version, 0.1.285, it was already crippled.
So we have to check the commits > 0e263be (Jan 13) and <= bd965ef (Jul 2), and hope that limiting the search to vioscsi changes will be enough.
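If you want to reproduce this search yourself, a simple range query in a clone of the public repo does the trick - git's A..B notation means "after A, up to and including B", which exactly matches the > / <= logic above:
Code:
:: List only the commits between the two builds that touched the vioscsi folder
git log --oneline 0e263be..bd965ef -- vioscsi/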
Looking at those vioscsi commits, there's only one suspicious one, related to the symptoms I've described earlier:
| Date (UTC on GitHub) | Commit | What it changed | Why it’s suspicious for WS 2025 |
|---|---|---|---|
| 2025-03-05 | 1bbc422 – “Address possible memory management issues when receiving interrupts for already completed requests” | Reworks how completions are matched: introduces an SRB ID counter, uses that ID (cast to pointer) as the virtqueue cookie, adds a free-list for SRB extensions, and guards against interrupts for “already completed” SRBs. | This is a hot path (ISR/DPC ↔ completion mapping). If any path leaves id uninitialized / reused, or a race slips in, you can get a one-off bad read that succeeds on retry—very similar to your “succeeded after failing once” SQL Server messages. WS 2025 uses newer Storport and can expose timing/locking bugs that WS 2022 never hits. |
Sounds like rocket science? Maybe, but that's not the point. SRB is the omnipresent buzzword across the whole vioscsi stack, as it's the "SCSI Request Block". Almost all the bugs in vioscsi are related to SRB calls and their processing. And last year, when we were all fighting the "Reset to device, \Device\RaidPort[X], was issued." bugs, it was all about SRBs as well. In that time, after a brief period of hesitation and doubts (regarding my strict confidence that the problem was in the driver itself), @benyamin quickly became one of the greatest vioscsi contributors. Based on his deep analysis and the corresponding changes, that issue was successfully resolved, which is why 0.1.266 was a bug-free version (kudos to him, but it seems he's no longer active here).
Originally I was going to hunt for the bug between 0.1.271 and 0.1.285, but I finally decided to simply work on the latest master (checked out a few days ago), which corresponds to the not-yet-released (and latest) 0.1.292. Ideally, I wanted one simple revert that would clearly mitigate the issue.
Spoiler: after almost 48 hours in production, I can say that the revert of 1bbc422 is the only mitigation needed. So really, reverting this commit solves the problem entirely. And while it's probably not your use case, the same bug was introduced into viostor (aka virtio-block) as well, so for a universal mitigation, the revert of c09af90 is needed too.
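For reference, the mitigation itself is a plain git revert on current master (expect to resolve small conflicts by hand if the surrounding code has moved in the meantime):
Code:
:: Revert the vioscsi change, and optionally its viostor twin
git revert 1bbc422
git revert c09af90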
OK, we can now revert and test the buggy commit, but what's the root cause? As I wrote in the first post: "some kind of race conditions (my expectation)". And really, that's more than valid here. While it's true that on Windows Server 2022 there are no such issues so far (according to the reports of others), Storport in Windows Server 2025 changed towards better performance. It clearly scales much better on multi-core systems, so while in WS2022 there was a low or zero risk that the driver requests would be parallelized, the opposite is true for WS2025.
In other words, on WS2025's Storport, the timing mix can cause a rare mismatch or stale completion → one read returns the wrong bytes (page checksum/pageid "wrong"), SQL Server retries immediately, the next read hits the right buffer and "succeeds after failing 1 time". This also explains why WS2022 looks fine (different Storport timing), and why it shows up only with newer drivers.
OK, but is there a clear problem visible in that commit? Simply put: yes! At least for me, and even more for my virtual coworker (GPT-5 Thinking), these two lines (1bbc422, vioscsi/vioscsi.c) are highly suspicious:
Code:
srbExt->id = adaptExt->last_srb_id;
adaptExt->last_srb_id++;
It's clear that the assignment and the increment are not one atomic operation (no lock, no interlocked op), so whenever this sequence gets interleaved between two CPUs, the affected read attempt is bound to fail.
More details: on modern Storport (as in Windows Server 2025), StartIo can run on multiple CPUs/queues much more aggressively than on older builds, so two threads can actually read the same last_srb_id and hand out duplicate IDs. When the host later completes those I/Os, the driver can mismatch which SRB it completes, or fail to find one ("No SRB found for ID"), causing a transient bad read that SQL Server retries - exactly the "succeeded after failing 1 time" messages mentioned above.
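For illustration only, here is a minimal sketch of the kind of fix I'd expect (this is my/GPT-5's assumption, not the actual upstream patch; names follow the snippet above, and I'm assuming last_srb_id is a LONG-sized field):
Code:
/* Allocate the next SRB ID atomically, so two CPUs running StartIo
 * concurrently can never hand out the same ID. InterlockedIncrement
 * returns the incremented value, so subtract 1 to keep the original
 * post-increment semantics (the ID handed out is the value *before*
 * the increment). */
srbExt->id = (ULONG)(InterlockedIncrement((volatile LONG *)&adaptExt->last_srb_id) - 1);
Whether a real fix would also need changes on the completion-lookup side is a separate question - the revert sidesteps all of it.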
But please note: although it's not relevant to this specific bug, version 0.1.292 contains a fresh Sep 26, 2025 revert (commit a6d690a - Revert "NO-SDV [vioscsi] Reduce spinlock management complexity") of @benyamin's Nov 20, 2024 initial commit 15e64ac, which is already included in 0.1.271. The corresponding PRs are 1175 and 1293. This commit/revert pair is also generally suspicious (check the threads), but as the initial commit is already present in the very well working 0.1.271, I can safely rule out any negative impact on this read-retry problem. And although this commit/revert is mainly about potentially buggy driver initialization (without expected runtime influence), you always have to be cautious about similar commits/reverts.
Long story short: I've analyzed, isolated, solved, validated and described the root cause of this
read-retry issue on Windows Server 2025.
And I hope this long post will serve as an educational or motivational kick for you, to show that bug hunting, iterating and driver building is definitely not nuclear science - I proved that to myself going from zero to hero out of necessity. My GPT-5 Thinking buddy also generated a probably viable code diff for the buggy commit 1bbc422, so this (my) revert is just a temporary solution.
Bottom line: I had completely zero previous experience with the Windows driver build process. But with the help of GPT-5 it was really easy, on top of this clear virtio-win driver guide. However, even though I have an EV code-signing certificate, it's still not possible to sign Windows Server drivers this way. The full signing process is really not simple for an individual, but it's quite feasible for a medium-sized company like Proxmox.
Technically, you need to install and run the Windows Hardware Lab Kit (HLK), run the required driver tests, submit everything to the Microsoft Partner Center (this is where your EV cert is required, both for account authentication and for signing the submission CAB file), and Microsoft then returns a dashboard/WHQL signature if all is clear. As you can see, this is definitely a no-go for any bug hunting and testing.
Instead, there's a quite straightforward path: the shipped virtio source (I mean the *.src.rpm and the internal-kvm-guest-drivers-windows folder) already includes a test-signing certificate (and the corresponding signing batch files), so all you need to do is disable Secure Boot for the testing Windows VM (in the UEFI/boot phase), then enable test signing (via bcdedit /set testsigning on), and at that point you can install your fresh driver build.
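For completeness, the guest-side switch is just this, in an elevated command prompt inside the test VM (Secure Boot must already be off, otherwise the setting is rejected or ignored):
Code:
:: Enable test-signing mode and reboot; a "Test Mode" watermark appears afterwards
bcdedit /set testsigning on
shutdown /r /t 0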
In my case, I had the complete virtio drivers for 0.1.285 installed, and was just changing/testing the vioscsi drivers (compiling a build for every single commit/revert). Here you should be careful to version the builds correctly, in the INF file (see the sketch below). For instance, the baseline for 0.1.292 is 100.102.104.29200, so I created multiple subsequent driver versions: 100.102.104.29201, 100.102.104.29202, 100.102.104.29203, etc. It's really important to retain correct numbering, otherwise your driver list will be a complete mess (and you always need the correct upstream version binding).
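As an illustration of the versioning, the relevant INF directive is DriverVer. The excerpt below is schematic (the exact [Version] section of the real vioscsi.inf may differ, and the date is just an example), but the DriverVer syntax itself is standard:
Code:
; vioscsi.inf - [Version] section (schematic excerpt)
[Version]
Signature = "$WINDOWS NT$"
Class     = SCSIAdapter
ClassGUID = {4d36e97b-e325-11ce-bfc1-08002be10318}
; DriverVer = <mm/dd/yyyy>,<w.x.y.z> - bump the last field for every local build:
DriverVer = 10/03/2025,100.102.104.29201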