Redhat VirtIO developers would like to coordinate with Proxmox devs re: "[vioscsi] Reset to device ... system unresponsive"

We’ve seen similar behavior but we only use cache=writeback in our environment for high performance. In our testing, only virtio-scsi versions 0.1.204 and earlier remain stable under high I/O. Anything from 0.1.208 onward becomes unstable during stress and we can easily reproduce instability when running multiple synthetic workloads with DiskSpd. Without cache=writeback, 0.1.208, 0.1.266 and 0.1.271 appear stable but we don't want to give up writeback caching.
 
This has been my anecdotal experience as well. I installed 0.1.285 on a Windows/SQL host and it caused a lot of suspect virtio-related messages in Event Viewer. Rolling back to the previous version (0.1.271) seems to have cleared it up. The older, established bug-free versions (0.1.204 and, I think, 0.1.208) also still work well, though like the rest of us, I'm sure, I worry a little about missing the other unrelated bugfixes etc. in the newer versions.

Thanks for continuing your efforts here!


In our environment, we run dozens of Windows Server 2019/2022 VMs, and we don't see any issues with the 0.1.285 drivers. However, only one server runs the MSSQL database engine (2019, if I'm not mistaken), and I'm not entirely sure whether it has been updated to virtio drivers version 0.1.285 or not.


P.S. @RoCE-geek
Maybe you can write a T-SQL script to reproduce the problem? I'm sure this can greatly speed up the solution.
 
P.S. @RoCE-geek
Maybe you can write a T-SQL script to reproduce the problem? I'm sure this can greatly speed up the solution.
This can be hard, because there's no clear initial condition. We have another two VMs with WS2025 and SQL2022, but they see low traffic/load, so these errors/issues are quite rare there and there have been no service hangs so far.

But I'm quite sure that the majority of setups with WS2025 and SQL Server are affected. Even with a low SQL load, there are similar reports/errors, at least a few dozen per week, but probably no one is aware of them. Only if you dig into the SQL logs or Windows Application logs can you find these reports. And still, they are all "just" informational: no warnings, no errors. Just a sign that there's something buggy inside the storage/VM stack.
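For anyone who wants to check whether their own VMs show the same "few dozen per week" pattern, here is a minimal sketch that counts matching entries per ISO week in an exported log. It assumes the log was exported to CSV with `TimeCreated` (ISO 8601 timestamp) and `Message` columns; those column names, the search string and the sample rows are placeholders invented for illustration, not the actual entries from the affected systems.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def weekly_counts(csv_text, needle, time_col="TimeCreated", msg_col="Message"):
    """Count rows whose message contains `needle`, grouped by ISO year-week.

    Column names are assumptions about how the log was exported.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Case-insensitive substring match on the message text.
        if needle.lower() not in row[msg_col].lower():
            continue
        ts = datetime.fromisoformat(row[time_col])
        year, week, _ = ts.isocalendar()
        counts[f"{year}-W{week:02d}"] += 1
    return dict(counts)


# Invented sample data, only to show the expected input shape.
sample = """TimeCreated,Level,Message
2025-01-06T10:00:00,Information,placeholder storage notice
2025-01-07T11:30:00,Information,some unrelated entry
2025-01-08T09:15:00,Information,Placeholder Storage Notice again
2025-01-14T02:00:00,Information,placeholder storage notice
"""

if __name__ == "__main__":
    for week, n in sorted(weekly_counts(sample, "placeholder storage notice").items()):
        print(week, n)
```

Replace the search string with whatever text your informational entries actually contain; comparing the weekly numbers before and after a driver rollback gives a rough picture of whether the change helped.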

Last but not least, given that it's not just about storage, as the network may be affected as well, there's probably something more general in how the Windows Server 2025 stack differs, i.e. maybe some general abstraction layer is the root cause.

But as always, a deep and comprehensive diff between 0.1.271 and 0.1.285 should be the starting point for the initial analysis.
 
I did some new tests, this time with an unreleased, Pre-WHQL (i.e. non-certified) vioscsi driver 0.1.292, found here: attestation-virtio-win-prewhql-0.1-292.zip

It didn't take long: the bug is still present, and still with a high incidence, so even the new, not-yet-released driver is buggy. Those affected by the network issues can still try the newer NetKVM driver.

So for the WS2025 and SQL Server combo, still stick with 0.1.271 only; it's the "safe" rollback solution for the problems described here.