WS2025 virtio NIC -> connection drop-outs

SQL Server also had issues with the virtio storage driver 0.1.285-1 in heavy-load Windows Server 2016 and 2019 VMs (the disk is a ZFS vdev).

Code:
DESCRIPTION: Read from file 'H:\SQLSERVER\db01_3.mdf' at offset 0x000007023d0000 succeeded after failing 1 time with error: Torn page (expected signature: 0x00000001; actual signature: 0x6f898160). Other messages in the SQL Server error log and the operating system error log might contain more details. This error condition threatens the integrity of the database and must be corrected. Perform a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Also saw some invalid SMB signatures with the network driver 0.1.285-1 while transferring large files (randomly).

Rolling back to version 0.1.271 solved both issues.
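
In case it helps anyone hitting the same torn-page message: the error text itself asks for a full consistency check. Here is a minimal sketch of running that from PowerShell, assuming the SqlServer module is installed and that the database is named db01 (guessed from the db01_3.mdf file name; adjust instance and database to your setup):

Code:
# Sketch only: run a full consistency check on the (assumed) affected database.
# Requires the SqlServer PowerShell module (Install-Module SqlServer).
# 'localhost' and 'db01' are assumptions, not details from the post above.
Invoke-Sqlcmd -ServerInstance 'localhost' -Database 'db01' `
    -Query 'DBCC CHECKDB WITH NO_INFOMSGS, ALL_ERRORMSGS'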
 
Hi @santiagobiali, I can confirm this behavior: WS2025 with SQL 2022 and 0.1.285 in a highly stressed test environment shows the same SQL errors, but no host or guest System errors (i.e. nothing like a SCSI reset, as I mentioned earlier: scsi reset bugs). So this is even trickier: only Application errors, no System errors.

For now, I've downgraded only vioscsi to 0.1.271; it's too soon to call it stable, so I'll report back in a few days.
But there have been no incorrect pageid/checksum SQL errors since yesterday.

For newcomers: virtio 0.1.285 seems massively broken, at least with WS2025 (and at least for the NIC and vioscsi).
So use 0.1.271 instead; I've seen some reports of issues with it related to PVE 9.0 and/or WS2025 (HPET, balloon, RDS, ...), but it's still the go-to version (I'm on PVE 8.4).

FYI @fiona, @fweber, @t.lamprecht - RED ALERT!

EDIT: Adding the SQL Server log (and Windows Application Event log) errors I see with 0.1.285:

1) A read of the file '*.mdf' at offset 0x00003897472000 succeeded after failing 1 time(s) with error: incorrect checksum (expected: 0xad4c6778; actual: 0xad4c6778) - see that actual/expected are the same, i.e. a storage problem (read retry)

2) A read of the file '*.mdf' at offset 0x00003897470000 succeeded after failing 1 time(s) with error: incorrect pageid (expected 1:29669944; actual 1:29669944) - see that actual/expected are the same, i.e. a storage problem (read retry)

After 2-3 days of such frequent SQL errors, the SQL Server service hangs completely!

No ZFS, just HW RAID with SSD and SCSI Single + IO thread + io_uring (default).
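
Not part of the report above, but a rough sketch of how those read-retry entries could be pulled from the Windows Application log with PowerShell; the provider name MSSQLSERVER assumes a default SQL Server instance (named instances log under a different source), and matching on the message text is just an illustration:

Code:
# Sketch: list SQL Server read-retry messages from the Application event log.
# 'MSSQLSERVER' (default instance) and the message pattern are assumptions.
Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName='MSSQLSERVER'} |
    Where-Object { $_.Message -match 'succeeded after failing' } |
    Select-Object TimeCreated, Id, Message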
 

Have you filed a bug report on the virtio GitHub?
 
Hi @Whatever, not yet. But I've moved my focus to the thread I've been active in since last year: Redhat VirtIO developers would like to coordinate

In my view, the problem with the virtio GitHub is the lower interest of the core (RHEL) devs, and the key reason is that we are a "different gang", IMHO.

So anything potentially Proxmox-related is covered there with a kind of "maybe it's valid, maybe not" fog.
I've been following 2-3 threads there for a long time regarding the "scsi reset" errors, and even though there were some highly active community contributors like @benyamin (he seems no longer active here on PVE), nothing really changed until some of the PVE devs wrote "Hi, I'm a PVE core dev...".

But I'm quite sure @fiona and @fweber were active there, and I hope they will analyze this and ideally file the reports; given the complex broken behavior of 0.1.285, though, we should make sure that PVE really isn't the "root cause" here (although this is not specific to PVE 9.0, etc.).
 
I've analyzed and isolated the storage-related problem (on WS2025 and 0.1.285+), and reported it on the virtio GitHub as well.

But please be aware that my research was strictly limited to storage, i.e. vioscsi/viostor.

For the network-related issues, you can do the same as I did: prepare a dev environment and experiment with various commits and reverts.

This is the only direct path to the resolution of any potential driver-related problems.

In all cases, a deep-reasoning LLM may be your best buddy.

Details are here: read-retry on WS2025 with SQL resolved
 
As I still want to help, here is the output of my quick GPT-5 Thinking session regarding the reports in this thread and the corresponding NetKVM changes:

Below is a focused delta review of NetKVM changes between those two points, what most plausibly explains the Server 2025 drop-outs, and what to try next.

What changed in NetKVM (Jan 13 → Jul 2, 2025)


From the NetKVM history around late Apr–Jun 2025, several control-queue (“CX”) changes landed. The ones most likely to cause periodic link drop/reinit on Server 2025 are:

  1. New “fail-fast” on unexpected control responses

    Commit 50e7db9 (“indicate driver error on unexpected CX behavior”) now calls NdisMRemoveMiniport (via a new _PARANDIS_ADAPTER::RaiseUnrecoverableError) whenever the driver sees an unexpected error code or a wrong response length from the control queue. That forcibly tears down the NIC (yellow bang), which looks exactly like brief disconnects/resets from the guest’s perspective. This code was not present in Jan and is merged before Jul 2.

  2. Control queue timeouts & pending-command logic

    Commit e4828d1 (“allow control command timeout”) added a pending-command tracker and changed the control path to continue after a timeout and re-check later when the next command/interrupt arrives. This changes timing and error-handling in the CX path; combined with (1), it can push the driver into the new fail-fast path if a late/odd response arrives.

  3. Abort waiting for response on surprise removal

    Commit 74e214d breaks out of the wait loop if the control queue stops being “ready,” again altering how borderline conditions are handled. On some hosts that can look like intermittent resets.

  4. Indirect TX path guard

    Commit d67c492 fixed using indirect descriptors when the device doesn’t advertise the feature. Safer overall; unlikely the culprit for drop-outs.

These landed between your safe and unsafe baselines and are the strongest regression suspects for “every 1–2 hours the link blips.”

Why it seems Windows Server 2025-specific


Two 2025-only signals showed up in this timeframe:

  • NdisPoll & 2k25 variant behavior – the NetKVM “2k25” flavor enables NDIS Poll by default. There are known Server 2025-only problems (hangs or discards) tied to NdisPoll on some platforms; disabling it fixes those cases. Even if your symptoms are drop-outs (not hangs), Server 2025 runs a different NDIS stack and scheduling model, so it’s more likely to tickle the new CX fail-fast code paths.
  • Reports from similar environments – Proxmox users report 0.1.285 drop-outs on Server 2025 and say reverting to 0.1.271 helps, which aligns with the commit window above.

Quick confirmation checklist (no rebuilds yet)

  1. Event Viewer → System and Microsoft-Windows-NDIS logs: look for a miniport remove/restart around the outage timestamps. If you see NDIS "miniport removed/reinitialized," that matches the RaiseUnrecoverableError behavior added in 50e7db9.
  2. Proxmox thread alignment: users on 0.1.285 (your Jul 2 base) see drop-outs; reverting to 0.1.271 mitigates them.
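
One extra sanity check (not in the checklist above, just a sketch): confirm which NetKVM driver build the guest is actually running, since the virtio-win package version (0.1.285 / 0.1.271) shows up as a different-looking file version string inside Windows. The '*VirtIO*' match on the interface description is an assumption about the adapter name:

Code:
# Sketch: show driver version/date of the VirtIO NIC inside the guest.
Get-NetAdapter |
    Where-Object InterfaceDescription -like '*VirtIO*' |
    Select-Object Name, InterfaceDescription, DriverVersion, DriverDate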

Low-risk mitigations to try now

  • Stay on 0.1.271 (your current safe floor) for Server 2025 until a NetKVM fix lands. This matches community experience in the Proxmox thread.
  • Disable NdisPoll on Server 2025 and retest on 0.1.285: Issue #1312 confirms the 2k25 variant enables NdisPoll by default and that turning it off avoids severe 2k25 problems on some clouds. If your drop-outs vanish with NdisPoll=0, it further implicates the new CX error handling under the 2k25 runtime. (The issue describes setting NdisPoll=0 in the registry for the 2k25 variant; see the sketch after this list.)
  • Try 1 queue (no multiqueue) if you currently use multiple RX/TX queues; a Proxmox mod suggested it in the thread (not always necessary, but a quick A/B).
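
A minimal sketch of the NdisPoll=0 test mentioned above, assuming the keyword is exposed exactly as 'NdisPoll' on the 2k25 NetKVM build (check Issue #1312 or the adapter's Advanced tab for the exact name on your system) and that the adapter is called 'Ethernet':

Code:
# Sketch: disable NdisPoll on the VirtIO NIC and restart the adapter to apply it.
# Adapter name 'Ethernet' and keyword 'NdisPoll' are assumptions - verify both first.
Get-NetAdapterAdvancedProperty -Name 'Ethernet' -RegistryKeyword 'NdisPoll'
Set-NetAdapterAdvancedProperty -Name 'Ethernet' -RegistryKeyword 'NdisPoll' -RegistryValue '0'
Restart-NetAdapter -Name 'Ethernet'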

If you’re up for a very targeted bisect (fast)


You already gave us the endpoints; here’s the shortest path to identify the exact breaker without testing 140+ commits:

  1. Start at Jan 13 (0e263be) → OK; end at Jul 2 (bd965ef) → BAD.
  2. Jump straight to the CX cluster:
    • Test a build with 50e7db9 reverted (keep other changes). If drop-outs stop, that commit is the primary breaker.
    • If still BAD, also revert e4828d1 (control timeouts). If that cures it, the timeout logic is interacting badly on 2025 hosts.
  3. If you’d rather not cherry-pick, a classic git bisect restricted to NetKVM/ will converge quickly because most churn here is in NetKVM/Common/ParaNdis_*.{cpp,h} in late spring 2025. (You can ignore viostor/viogpu/etc. during the bisect; a command sketch follows below.)
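
If you go the bisect route, here is a rough sketch of the path-restricted run described in point 3, using the endpoints already given (run from a kvm-guest-drivers-windows checkout):

Code:
# Sketch: bisect only commits that touch NetKVM/, between the known-good and known-bad builds.
git bisect start bd965ef 0e263be -- NetKVM/   # bad first, then good, restricted to NetKVM/
# build and test the driver at each step, then mark the result:
git bisect good    # or: git bisect bad
git bisect reset   # when finished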

Extra: how to capture useful driver logs

  • Enable NetKVM WPP tracing (documented by the project) to catch CX warnings/errors around the blip. This needs no code changes and will show if the control path hit the unexpected-response/length branch.
  • The netkvm WMI tool is available if you need to tweak/inspect NetKVM at runtime.



Bottom line

  • Your observation (“0.1.271 good” vs “0.1.285 bad” on Server 2025) lines up with NetKVM’s new CX fail-fast behavior introduced in 50e7db9 and related CX timing changes (e4828d1, 74e214d). Those changes can force a miniport removal on certain control-queue anomalies—experienced by users as periodic drop-outs/resets.
  • There are independent Server-2025-only NetKVM problems (NdisPoll defaults, packet discards) in the same period, which further explains why 2025 is uniquely noisy.
  • Safest mitigation today: stick with 0.1.271 on 2025, or test 0.1.285 with NdisPoll=0; if you can build, try reverting 50e7db9 as a surgical fix to confirm root cause.
 
Can anyone affected check the logs?

It seems I'm in the clear... but note, I'm still on PVE 8.4.

Other WS2025 VM details (with 0.1.285): OVMF, pc-q35-8.2, VirtIO NIC (paravirtualized)

Some more GPT hints for the log analysis (based on the symptoms in this thread and some commit details, but still, reality may differ):



Here’s what to watch in Event Viewer when the NetKVM link “blips” happen (Windows Server 2025):

  1. System → Source: “Microsoft-Windows-NDIS”
    • Event ID 10400 (Warning) — “The network interface … has begun resetting… Reason: …” This is the most telling marker that the NIC driver was reset by NDIS (often because the driver requested a reset or it stopped responding). Filter to your virtio NIC name (e.g., Red Hat VirtIO Ethernet Adapter).
  2. System → Source: “Tcpip”
    • Event ID 4202 (Information) — adapter disconnected; configuration released.
    • Event ID 4201 (Information) — adapter connected; normal operation resumed.

      These typically bracket the dropout around the NDIS reset.
  3. Applications and Services Logs → Microsoft → Windows → NetworkProfile → Operational
    • Event ID 10001 = Disconnected, Event ID 10000 = Connected.

      Good corroboration that Windows re-identified the network right when the drop occurred.
  4. Applications and Services Logs → Microsoft → Windows → NCSI → Operational
    • Look for capability/Internet-reachability changes around those times (NCSI may show loss/restoration of Internet).
  5. System → Source: “Kernel-PnP”
    • Event ID 219 (Warning) or related PnP entries near the drop can indicate device/driver hiccups during install/start/stop cycles. Not required for every case, but useful context if present.
  6. (Advanced) Applications and Services Logs → Microsoft → Windows → NDIS → Operational
    • Extra per-NDIS diagnostics if enabled; handy for deeper resets/link-state tracing.

Quick filters you can paste into PowerShell

  • NDIS resets in the suspect window (after upgrade to 0.1.285):
Code:
Get-WinEvent -FilterHashtable @{LogName='System'; ProviderName='Microsoft-Windows-NDIS'; Id=10400;  StartTime='XXXX-XX-XX'; EndTime='XXXX-XX-XX'} | Select TimeCreated, Message

  • TCP/IP connect/disconnect around drops:
Code:
Get-WinEvent -FilterHashtable @{LogName='System'; ProviderName='Tcpip'; Id=4201,4202; StartTime='XXXX-XX-XX'; EndTime='XXXX-XX-XX'} | Select TimeCreated, Id, Message

  • Network profile connect/disconnect:
Code:
Get-WinEvent -LogName 'Microsoft-Windows-NetworkProfile/Operational' | Where-Object { $_.Id -in 10000,10001 } | Select TimeCreated, Id, Message

What a “match” looks like


On affected hosts you’ll usually see a Warning 10400 from Microsoft-Windows-NDIS saying the NIC began resetting (reason often “driver requested reset” or “hardware stopped responding”), closely followed by Tcpip 4202 → 4201 and NetworkProfile 10001 → 10000. Users on Proxmox have reported this pattern with virtio-win 0.1.285 on Windows Server 2025, with rollbacks to 0.1.271 avoiding the drops.
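
If useful, here is one way to pull the NDIS and Tcpip events into a single timeline so the 10400 → 4202 → 4201 pattern described above is easy to spot (a sketch only; narrow the time range as needed):

Code:
# Sketch: merge NDIS reset and Tcpip connect/disconnect events into one timeline.
$ndis  = @(Get-WinEvent -FilterHashtable @{LogName='System'; ProviderName='Microsoft-Windows-NDIS'; Id=10400} -ErrorAction SilentlyContinue)
$tcpip = @(Get-WinEvent -FilterHashtable @{LogName='System'; ProviderName='Tcpip'; Id=4201,4202} -ErrorAction SilentlyContinue)
$ndis + $tcpip | Sort-Object TimeCreated | Select-Object TimeCreated, ProviderName, Id, Message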