Red Hat VirtIO developers would like to coordinate with Proxmox devs re: "[vioscsi] Reset to device ... system unresponsive"

Hi! For now, I'm focusing on the VirtIO SCSI (and apparently also Block) problems with 0.1.285 reported here.

@RoCE-geek, thank you for your in-depth investigation of this bug, reporting this issue upstream, and providing a simple fio reproducer [1] as well. Your debugging efforts are much appreciated.
Your fio reproducer reports the same verification failures on Windows Server 2025 with virtio-win 0.1.285 for me. I still need to try with older virtio-win versions as well as a direct revert of the commit you pointed out.
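
For reference, here is a minimal sketch of the kind of fio data-verification workload that surfaces such failures. This is illustrative only, NOT the exact reproducer from [1]; the file path, size, and queue settings are placeholders:

```python
# Minimal sketch of an fio data-verification run inside a Windows guest.
# NOT the exact reproducer from [1]; path, size, and queue settings are
# placeholders -- adjust for your environment.
import subprocess

cmd = [
    "fio",                            # assumes fio.exe is on PATH
    "--name=vioscsi-verify",
    "--filename=D\\:\\fio-testfile",  # fio wants the drive colon escaped
    "--size=4g",
    "--ioengine=windowsaio",          # native async I/O engine on Windows
    "--direct=1",                     # bypass the guest page cache
    "--rw=randwrite",
    "--bs=64k",
    "--iodepth=16",
    "--numjobs=4",
    "--verify=crc32c",                # checksum every block on readback
    "--do_verify=1",
    "--verify_fatal=1",               # abort on the first mismatch
]
subprocess.run(cmd, check=True)       # raises if fio reports verify errors
```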

As you can imagine, our expertise in Windows driver development is a bit more limited compared to Linux. So, between not being able to quickly reproduce the issue with a similar fio workload a few weeks ago and our main focus being on preparing the Proxmox VE 9.1 release over the last few weeks, this issue did indeed receive less attention than it deserves.
We'll run some more tests and, if needed or useful, join the upstream discussion.

For everyone encountering this issue with virtio-win 0.1.285, downgrading to 0.1.271 seems to be a valid workaround for now, as already suggested by @RoCE-geek [2].
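
For anyone wanting to confirm which vioscsi driver version a guest is actually running before and after the downgrade, a quick sketch (pnputil ships with Windows; the text-output parsing below is deliberately approximate and may need adjusting per Windows build):

```python
# Sketch: list installed driver packages and print the entries that
# mention vioscsi.inf, so the driver version can be confirmed after a
# downgrade. pnputil's text output format may vary between builds.
import subprocess

out = subprocess.run(
    ["pnputil", "/enum-drivers"],
    capture_output=True, text=True, check=True,
).stdout

block = []
for line in out.splitlines() + [""]:   # trailing "" flushes the last block
    if line.strip():
        block.append(line)
    else:
        if any("vioscsi.inf" in entry.lower() for entry in block):
            print("\n".join(block), "\n")
        block = []
```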

With regard to the other Windows Server 2025 issues you mention (1 and 2): I have not yet had time to look at the threads you referenced in detail, but will try to do that soon. As this thread is already quite large, I'd suggest we keep it dedicated to the VirtIO SCSI/Block issues to avoid it becoming too confusing. Let's continue the discussion of the other Windows Server 2025 issues in the threads you linked.

[1] https://github.com/virtio-win/kvm-guest-drivers-windows/issues/1453#issuecomment-3527322212
[2] https://forum.proxmox.com/threads/r...device-system-unresponsive.139160/post-812442
 
Hi @fweber, thanks for the response, but to be clear, no cooperation is needed on the vioscsi bug anymore.

I've proposed 3 patches, @benyamin will add his own, so we HAVE a solution, and we're just benchmarking them. The final resolution and virtio PR will come soon.

And based on this, I've moved on to another problem, i.e. the crippled idle CPU state on WS2025/Win24H2+, and will report my new findings here: High VM-EXIT and Host CPU usage on idle with Windows Server 2025. It seems I've already isolated the problem (as with the vioscsi one). At least on my side (all-EPYC infra), it's all about excessive Hyper-V calls, specifically:
  • STIMER0_CONFIG (0x400000b0)
  • STIMER0_COUNT (0x400000b1)
  • HV_X64_MSR_EOI (0x40000070)
  • HV_X64_MSR_ICR (0x40000071)
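
For anyone who wants to check whether their host shows the same pattern, a rough sketch of how these MSR exits can be counted on a Linux/Proxmox host (assumptions: root access and a perf build exposing the kvm:kvm_msr tracepoint; the trace-line matching is deliberately crude):

```python
# Rough sketch: record kvm:kvm_msr tracepoint hits for 10 seconds and
# count accesses to the Hyper-V synthetic MSRs listed above. Requires
# root and perf on the host; trace-line matching is approximate.
import subprocess
from collections import Counter

HV_MSRS = {
    "400000b0": "STIMER0_CONFIG",
    "400000b1": "STIMER0_COUNT",
    "40000070": "HV_X64_MSR_EOI",
    "40000071": "HV_X64_MSR_ICR",
}

subprocess.run(
    ["perf", "record", "-e", "kvm:kvm_msr", "-a", "--", "sleep", "10"],
    check=True,
)
trace = subprocess.run(
    ["perf", "script"], capture_output=True, text=True, check=True,
).stdout

counts = Counter()
for line in trace.splitlines():
    for index, name in HV_MSRS.items():
        if index in line:             # crude substring match on the MSR index
            counts[name] += 1

for name, n in counts.most_common():
    print(f"{name}: {n} exits in 10s")
```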

Long story short: the current vioscsi problems are "almost resolved"; no more help is needed at the moment.
 
And @fweber, of course I understand the omnipresent dev buzz, so all is OK; just please, there should be some regular "bumps" from the Proxmox staff in threads that are gaining traction. We, as a community, are quite mighty and capable, but definitely not super-mighty. And everyone just needs to know that they are not alone in their suffering and that their problems are not being ignored.
 
We faced this issue with Windows Server 2016 running SQL 2016 and virtio 0.1.285 drivers, and had to downgrade to 0.1.271.
 
Hi @fweber, thanks for the response, but to be clear, no cooperation is needed on the vioscsi bug anymore.

I've proposed 3 patches, @benyamin will add his own, so we HAVE a solution, and we're just benchmarking them. The final resolution and virtio PR will come soon.
Yes, I saw that (and thanks to @benyamin for preparing a PR!). But I think once a PR is ready, being able to contribute additional testing (with regard to the bug as well as to performance) can't hurt and might help get the PR merged faster.
And based on this, I've moved on to another problem, i.e. the crippled idle CPU state on WS2025/Win24H2+, and will report my new findings here: High VM-EXIT and Host CPU usage on idle with Windows Server 2025. It seems I've already isolated the problem (as with the vioscsi one). At least on my side (all-EPYC infra), it's all about excessive Hyper-V calls, specifically:
  • STIMER0_CONFIG (0x400000b0)
  • STIMER0_COUNT (0x400000b1)
  • HV_X64_MSR_EOI (0x40000070)
  • HV_X64_MSR_ICR (0x40000071)
I see. We'll try to reproduce the issue as well -- but let's please move further discussion of this to the dedicated thread [1].
And @fweber, of course I understand the omnipresent dev buzz, so all is OK; just please, there should be some regular "bumps" from the Proxmox staff in threads that are gaining traction. We, as a community, are quite mighty and capable, but definitely not super-mighty. And everyone just needs to know that they are not alone in their suffering and that their problems are not being ignored.
Initially, since I didn't manage to reproduce the issue, I decided to hold off posting until I had something substantial to report or ask, but I do see how a quick post, even if light on substance, might have been beneficial here.

[1] https://forum.proxmox.com/threads/h...n-idle-with-windows-server-2025.163564/page-3
 
Status update - @benyamin has (almost) finished extensive testing, but given the obstacles he found along the way (in both fio and DiskSpd), there won't be clear and clean results suitable for a simple presentation (it's more a matter of normalizing the testbed in the future). Based on this I posted my recommendation (i.e. Patch 3 or Patch 4 - in normal working conditions they're identical), so I hope the maintainers will decide. And as @fweber is also present in that GitHub thread, there's nothing pending on my side. Thanks to all who contributed.
 
We faced this issue with Windows Server 2016 running SQL 2016 and virtio 0.1.285 drivers, and had to downgrade to 0.1.271.
We are running Windows Server 2016 and SQL Server 2017 and have had the SQL service auto-stop twice, requiring a DB restart. Can you describe the error you are experiencing for me?

Is anyone facing the same issue as us? We are using virtio driver 0.1.285. Thanks, all!
 
Perhaps the Proxmox team could consider building its own version of virtio drivers with its own installer, signed by Proxmox. This would allow for faster patch deployment, as is done for other subsystems (ZFS, etc.).

@fweber

This is what I've already suggested in my original post:
And here you can see another problem - release timing. It's not driven by community needs, but by RHEL releases/milestones.
And this is why I'm saying that Proxmox Server Solutions GmbH should fork the repo and maintain regular updates, i.e. based on user reports and/or bug-resolution importance. In the case of 0.1.271, there was an almost three-month pointless delay.

Yesterday evening I made some waves to finally get the maintainers' attention (I've tagged them at least a few times along the lines of "maintainers may decide" - the last attempt was in this post), and this is the response:
I think you misunderstand how volunteer open source projects work and how the team decided priorities.

Today is just Monday, and our team has a plan for future work items and bugs. First priority, as usual, is customers' bugs. This specific bug was not reported by our customers and was not reproduced by our QE.
So, currently, we have more urgent bugs for our team.

As you already have some patches after discussing, be a volunteer and submit a PR. We will review it in our free time.

Is this a joke? Is that RHEL guy serious? What BS.

I'm the one who created the issue, did all the analysis, and suggested at least three working patches with complex and detailed info, spending ~100 hours in total, and I don't understand "how volunteer open source projects work"?

And the best-ever message is "be a volunteer and submit a PR" - HA?? What am I, then, if not a volunteer??

Why was I tagging them? All I asked for was "hey man, decide yourself", or "it all seems acceptable; we're watching you, but we prefer Patch 3 as well".
But nothing happened - I was completely ignored. I'm just a Proxmox loser who rolled his ball into the big boys' sandpit.
So the main fault is probably that I don't have the right logo on my shirt.

My only lesson for next time is that I won't start any discussion or encourage anyone to respond, especially not the maintainers. I'll just report a bug, and if I know how to solve the problem (like in this case), I'll just submit a PR and won't wait for anyone or anything. How simple, no stress. But even that will not lead to a quick resolution of the problem. We will still have to wait weeks, or rather months, for a new version to be released, until it fits into the RHEL release schedule.

Anyone can draw their own conclusions about who contributed and how, but from a pragmatic point of view, the Proxmox community's interests cannot be efficiently defended this way.

FYI @fiona, @fweber, @t.lamprecht, @fabian, @aaron
 
@RoCE-geek

I think you underestimate the scope of the virtio-win project and the efforts of the RH team and what goes on in the background, especially regarding QE.

If you think it is an easy thing to cater to millions of devices, maintain kernel-mode drivers for many targets, platforms and architectures (plus some user-mode stuff too), respond to your own customer demands, and have many, many third parties muddying the waters - you would be ill-informed.

The WHQL process is not trivial, nor is maintaining the HCK and the complex CI infrastructure that the process demands. Not to mention costly - very costly.

Also, RH is a gargantuan beast, and whilst that has advantages, it also means they move slowly. When relying on the efforts of volunteers, they can appear to move slowly too. Sometimes this is due to background tasks that you may not be aware of, sometimes it's because life is more important.

It bears remembering that these are people, whether at Proxmox or at RH, and people will respond according to the way in which they are treated. It pays to be somewhat diplomatic. Being demanding and critical certainly won't help. It just comes off as being ungrateful, and no one really wants to help a demanding ingrate - or at least someone who appears or comes across that way. Perception is reality, after all, and we are all responsible for managing our own frustrations and behaviour. Letting that spill out on others is not helpful - it just makes things harder.

I am minded to mention the Dunning-Kruger effect, and would humbly and respectfully suggest that there is perhaps much to learn, that there are indeed many good reasons why a "resolution" isn't forthcoming at a speed you find acceptable, and that the "pointless delays" are likely not pointless at all.

Good things come to those who wait.

...OR, you could write up a PR, do the clang-format checks, get rid of the code commentary, check AI isn't lying to you, do the necessary tests, do the mandatory SDV and CodeQL builds, do iterative performance CI, troubleshoot HCK failures, collaborate, take on feedback and constructive criticism, help fix other problems that affect your fixes, be mindful of your fellow volunteers and treat people with respect...
 
I don't take away anyone's view of the world, nor do I have any reason to convince anyone of my truth. All those fancy words about diplomacy and "respect" seem almost epic at times, but in reality they only underscore the current trend of political correctness, where form is elevated above content. If there is genuine respect and openness somewhere, not just declared as usual, then it applies to everyone without distinction. In a healthy and truly open environment, a "noob" does not have to convince an "expert" that something is wrong, nor does the principle of meritocracy apply.

This issue is about a severe storage problem, i.e. possible silent data corruption. Invalid data may be read, but the application (or system) usually does not know about it. This is neither a feature request nor some cosmetic issue. This isn't about begging on your knees for someone to look at it. Here, the problem was immediately analyzed by the author, not just reported, and several functional solutions were proposed within a short period of time.

And then it would have taken so little - just writing a response, especially after all the ideas and details of each step had been shared in the thread. So I'm sorry, but arguments like "everyone has it hard", "you don't know how complicated this project is", etc., are completely irrelevant. Any relativization does not change the essence of the problem, at least for me. The worst thing is when you have to convince someone that there is a bug - a bug that almost everyone actually suffers from but in most cases is unaware of. So even if they themselves are affected, they don't want to admit it. Suum cuique.
 
We are running Windows Server 2016 and SQL Server 2017 and have had the SQL service auto-stop twice, requiring a DB restart. Can you describe the error you are experiencing for me?

Is anyone facing the same issue as us? We are using virtio driver 0.1.285. Thanks, all!

The SQL service crashed and restarted randomly, probably when under high load; the event logs had messages like:

EventID 19019: "The MSSQLSERVER service terminated unexpectedly."

EventID 825: "A read of the file 'G:\Program Files\Microsoft SQL Server\MSSQL13.SQLCLU3\MSSQL\DATA\redacted.mdf' at offset 0x00003724244000 succeeded after failing 1 time(s) with error: incorrect pageid (expected 1:28909858; actual 1:28909858). Additional messages in the SQL Server error log and system event log may provide more detail. This error condition threatens database integrity and must be corrected. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online."

We noticed this when the application using that SQL database constantly logged timeouts against the database.
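
For anyone wanting to check whether their SQL guests are hitting the same events, a small sketch (wevtutil ships with Windows; the event IDs are the ones quoted above, and the channel/source assumptions may need adjusting for your setup):

```python
# Sketch: pull the most recent SQL Server read-retry (825) and
# service-crash (19019) events from the Application log via wevtutil.
import subprocess

query = "*[System[(EventID=825 or EventID=19019)]]"
out = subprocess.run(
    ["wevtutil", "qe", "Application", f"/q:{query}", "/f:text", "/c:20"],
    capture_output=True, text=True, check=True,
).stdout

print(out.strip() or "No matching events found.")
```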