W2025 virtio NIC -> connection dropouts

SQL Server also had issues with storage virtio driver 0.1.285-1 in heavy-load Windows Server 2016 and 2019 VMs (the disk is a ZFS vdev).

Code:
DESCRIPTION: Read from file 'H:\SQLSERVER\db01_3.mdf' at offset 0x000007023d0000 succeeded after failing 1 time with error: Torn page (expected signature: 0x00000001; actual signature: 0x6f898160). Other messages in the SQL Server error log and the operating system error log might contain more details. This error condition threatens the integrity of the database and must be corrected. Perform a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Also saw some invalid SMB signatures with network driver 0.1.285-1 while transferring some large files (randomly).

Rolling back to version 0.1.271 solved both issues.
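For anyone wanting to try the same rollback: a rough sketch of removing the 0.1.285 driver packages and installing 0.1.271 with pnputil from an elevated Windows command prompt. The oemNN.inf numbers and the ISO path below are placeholders, not from this thread; check your own system's output.

```shell
:: List installed third-party driver packages; note the oemNN.inf entries
:: whose original names are vioscsi.inf / netkvm.inf at version 0.1.285:
pnputil /enum-drivers

:: Remove the 0.1.285 package(s) -- replace oem42.inf with your number:
pnputil /delete-driver oem42.inf /uninstall /force

:: Install the older driver from the mounted virtio-win 0.1.271 ISO
:: (path is an example; pick the folder matching your Windows version):
pnputil /add-driver E:\vioscsi\2k19\amd64\vioscsi.inf /install
```

A reboot afterwards is advisable so the storage stack reloads the downgraded driver.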
 
Hi @santiagobiali, I can confirm this behavior: WS2025 with SQL2022 and 0.1.285 in a highly stressed test environment shows the same SQL errors, but no host or guest system errors (i.e. nothing like a SCSI reset, as I mentioned earlier: scsi reset bugs). So this is even trickier: only Application errors, no System errors.

For now, I've downgraded only vioscsi to 0.1.271, but it's too soon to call it stable; I'll know more in a few days.
No incorrect pageid/checksum SQL errors since yesterday, though.

For a newcomer: virtio 0.1.285 seems to be massively crippled, at least with WS2025 (and at least for the NIC and vioscsi drivers).
So rather use 0.1.271; even though I've seen some reports about it related to PVE 9.0 and/or WS2025 (HPET, balloon, RDS, ...), it's still the go-to version (I'm on PVE 8.4).

FYI @fiona, @fweber, @t.lamprecht - RED ALERT!

EDIT: Adding the SQL Server log (and WS Application Event log) errors I see with 0.1.285:

1) A read of the file '*.mdf' at offset 0x00003897472000 succeeded after failing 1 time(s) with error: incorrect checksum (expected: 0xad4c6778; actual: 0xad4c6778) - note that actual and expected are identical, i.e. a storage problem (the read retry succeeded)

2) A read of the file '*.mdf' at offset 0x00003897470000 succeeded after failing 1 time(s) with error: incorrect pageid (expected 1:29669944; actual 1:29669944) - note that actual and expected are identical, i.e. a storage problem (the read retry succeeded)

After 2-3 days of such frequent SQL errors, the SQL Server service hangs completely!

No ZFS here, just HW RAID with SSDs, and VirtIO SCSI single + IO thread + io_uring (the default).
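For reference, that disk setup corresponds roughly to the following Proxmox configuration; a sketch only, where the VM ID 101 and the storage/volume names are placeholders:

```shell
# Dedicated VirtIO SCSI controller per disk, with an IO thread and the
# default io_uring async backend:
qm set 101 --scsihw virtio-scsi-single
qm set 101 --scsi1 local-lvm:vm-101-disk-1,iothread=1,aio=io_uring
```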
 

Have you filed a bug report on the virtio GitHub?
 
Hi @Whatever, not yet. But I've moved the focus to the thread I've been active in since last year: Redhat VirtIO developers would like to coordinate

In my view, the problem with the virtio GitHub is the low(er) interest of the core (RHEL) devs. And the key reason is that we are a "different gang", IMHO.

So anything potentially Proxmox-related gets covered there in a kind of "maybe it's valid, maybe not" fog.
I've been following 2-3 threads there for a long time regarding the "scsi reset" errors, and even though there were some highly active community contributors like @benyamin (he seems to be no longer active here on PVE), nothing really changed until some of the PVE devs wrote "Hi, I'm a PVE core dev...".

But I'm quite sure @fiona and @fweber have been active there, and I hope they will analyze this and ideally file the reports; given the complex, crippled behavior of 0.1.285, though, we should first be sure that PVE itself is not the root cause here (although this is not specific to PVE 9.0, etc.).
 