Hi all, I'm the author of this ass-kicking post. After more than a year, I'm back in service, as bug hunting never ends.
I'm sorry that I've missed many of the questions and messages, but let's move forward: we have another urgent problem (or several).
In production, I have 100+ installations of virtio 0.1.266 and a few of 0.1.271. So far so good, no more scsi event log bugs.
But the latest 0.1.285 is another story. It's really crippled, especially with WS2025, and the problems are even trickier to spot.
TL;DR: don't use 0.1.285 in production. Hidden (application-level) bugs are hard to find and cause brutal instability.
For WS2025, use 0.1.271 instead. I can't say whether other Windows versions behave the same, but it's better to expect the worst.
The previously mentioned and broadly discussed "vioscsi Reset to device" event log bugs were extremely annoying, but relatively easy to find: at least they were present in the Windows System log. With 0.1.285 and WS2025, that's no longer true.
It's not one single bug, and it's not just about storage and vioscsi. It affects networking, storage, and probably some other areas.
1) Storage issues - I'm aware of SQL Server-related bugs, but I'm quite sure the problem is omnipresent. Hidden bugs cause read retries, like this:
- A read of the file '*.mdf' at offset 0x00003897472000 succeeded after failing 1 time(s) with error:
incorrect checksum (expected: 0xad4c6778; actual: 0xad4c6778) - see that actual/expected are the same, i.e. a storage problem (read retry)
- A read of the file '*.mdf' at offset 0x00003897470000 succeeded after failing 1 time(s) with error:
incorrect pageid (expected 1:29669944; actual 1:29669944) - see that actual/expected are the same, i.e. a storage problem (read retry)
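Since these retries only show up in the SQL Server log, a small script can help spot them. The following is a minimal Python sketch, assuming the log has been exported to plain text and that the messages use exactly the wording quoted above. The regex is derived only from those two samples, and the names `ReadRetry`, `find_read_retries` and `SAMPLE` are hypothetical helpers for illustration, not part of any SQL Server tooling.

```python
import re
from typing import Iterator, NamedTuple


class ReadRetry(NamedTuple):
    """One 'read succeeded after failing' message (hypothetical helper type)."""
    file: str
    offset: int
    failures: int
    kind: str       # e.g. "checksum" or "pageid"
    expected: str
    actual: str

    @property
    def values_match(self) -> bool:
        # Identical expected/actual values mean the retried read returned
        # good data, i.e. the first attempt failed somewhere below SQL Server.
        return self.expected == self.actual


# Pattern built only from the two messages quoted in this post (assumption:
# other builds or languages of SQL Server may word the message differently).
RETRY_RE = re.compile(
    r"A read of the file '(?P<file>[^']*)' at offset (?P<offset>0x[0-9a-fA-F]+) "
    r"succeeded after failing (?P<failures>\d+) time\(s\) with error:\s*"
    r"incorrect (?P<kind>\w+) "
    r"\(expected:?\s*(?P<expected>[^;]+?);\s*actual:?\s*(?P<actual>[^)]+?)\)"
)


def find_read_retries(text: str) -> Iterator[ReadRetry]:
    """Yield every read-retry message found in the given log text."""
    for m in RETRY_RE.finditer(text):
        yield ReadRetry(
            file=m.group("file"),
            offset=int(m.group("offset"), 16),
            failures=int(m.group("failures")),
            kind=m.group("kind"),
            expected=m.group("expected"),
            actual=m.group("actual"),
        )


# The two messages from this post, used as sample input.
SAMPLE = (
    "A read of the file '*.mdf' at offset 0x00003897472000 "
    "succeeded after failing 1 time(s) with error:\n"
    "incorrect checksum (expected: 0xad4c6778; actual: 0xad4c6778)\n"
    "A read of the file '*.mdf' at offset 0x00003897470000 "
    "succeeded after failing 1 time(s) with error:\n"
    "incorrect pageid (expected 1:29669944; actual 1:29669944)\n"
)
```

To use it, read the exported log into a string (the real ERRORLOG file may be UTF-16 encoded, so the encoding passed to `open()` may need adjusting), pass it to `find_read_retries`, and count the hits where `values_match` is true. A growing count of such hits is the pattern described above.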
For the last two weeks I had the feeling I was the only one affected, but that's not true - see e.g. @santiagobiali's report "Torn page error".
What's going on? My expectation is that hidden middleware/driver bugs cause some kind of race condition, especially under high pressure. The OS is not aware of anything wrong, so only application-level mechanisms catch the issue. Here, SQL Server notices the read retries, i.e. an I/O request is not processed correctly on the first try, but this surfaces as messages that look like DB corruption (even though DBCC finds nothing wrong).
After 2-3 days of such frequent SQL errors, the SQL Server service suddenly hangs. Only the SQL Server log (or the Windows Application log) is useful for debugging.
2) Networking issues - such as iSCSI and general connection drops, invalid SMB signatures, AD errors, etc.
See the mentioned thread "W2025 virtio NIC -> connection drop outs".
Other potentially related threads:
- "Proxmox VE 8.4 Extremely Poor Windows 11 VM Network Performance" by @anon314159
- "Windows & Linux VDI VMs feels So laggy" by @Hydra
Long story short: all I can say is that the virtio nightmare is back, so check your systems, applications, and workloads, and be very cautious about 0.1.285.
I really believe in community and open-source development, but one thing is clear to me:
Proxmox Server Solutions GmbH needs more subscription customers to establish a dedicated virtio testing and debugging team, as we simply cannot rely on upstream Red Hat-driven development without serious release notes, stability reports, and quick bug analysis and resolution.
CC @fiona, @fweber, @t.lamprecht, @fabian, @aaron