Windows Server 2019 - vioscsi Warning (129) - locking up data drive

On PVE 8 I got the "kvm: Desc next is 3" error again.
The disk didn't fail entirely, but SQL was broken until I rebooted, so same result.

I have this occurring on 2 servers. The one it occurs most on is SATA SSD based RAID 10; the other, where it has occurred only twice since April, is SAS SSD based RAID 10. The SATA system also had its PERC RAID controller in write-back mode; I've now changed this to write-through.
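For anyone wanting to make the same write-back to write-through change on a PERC controller, a rough sketch using perccli (storcli syntax); the controller index /c0 and the "all virtual disks" selector are assumptions to adjust for your own setup:

    # show the current cache policy of all virtual disks on controller 0
    perccli64 /c0/vall show
    # switch all virtual disks on controller 0 to write-through
    perccli64 /c0/vall set wrcache=wt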
 
Did anybody figure this out?

We started to see these warnings, and problems arose on VMs running any sensitive service like Exchange or standard SQL.

We are using PVE 8.0.3 with Ceph in a 7-node cluster. I've tried all the suggestions in this and other threads, without any success.

It is causing big trouble for us.
 

It's also being discussed here:
https://github.com/virtio-win/kvm-guest-drivers-windows/issues/756

RHEL is also tracking it, but I don't have access, and there was a recent PR from Nutanix to bugcheck Windows when a reset occurs:
https://github.com/virtio-win/kvm-g...mmit/eff7b43073b95c563d2e5c6b0d7b6dd954f1232a
 
I just read through the GitHub link myself before I posted here :)

I am 100% sure that this error is occurring for others as well.

We started to think that maybe it is caused by a lack of bandwidth on the backend. We are using 2x10 Gbit with 9000 MTU for the public and cluster networks. The switches are not overwhelmed and the error occurs randomly, but this and the Proxmox/Ceph version update are the only things that changed before the errors started.

@davemcl did you try these settings (a sketch of applying them follows the list):
  • HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters
    • IoTimeoutValue = 0x5a (90)
  • HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device
    • PhysicalBreaks = 0x3f (63)
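In case it helps, here is a rough sketch of setting those values from an elevated command prompt inside the guest; the decimal data match the hex values above, and the VM needs a reboot for them to take effect:

    :: 0x5a = 90 second disk I/O timeout for the vioscsi miniport
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters" /v IoTimeoutValue /t REG_DWORD /d 90 /f
    :: 0x3f = 63 maximum physical breaks (scatter/gather segments)
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device" /v PhysicalBreaks /t REG_DWORD /d 63 /f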
 

It occurs on 2 SQL servers for me with local storage (SATA & SAS enterprise SSDs in RAID 10) when there is high IO during an ETL run or processing cubes.
 
Wow, so Ceph can't be the issue, or most likely isn't. That is a surprise to me. Then it can only be virtio / QEMU? Awesome.
 
Same issue here with pve 8.1.3.


I am happy to see this conversation. I discovered the same issue when I changed the motherboard and CPU of my system. For some reason the issue was not present in my old setup, but it helps to know that this seems to be a general issue. In my case Exchange and Veeam B&R with PostgreSQL are affected.

My issue
https://forum.proxmox.com/threads/io-delay-how-to-find-out-reason-for.138990/
 
An update...
I was hitting this issue weekly on a certain server for nearly 12 months (in October it occurred 6 times) and it has been a massive pain in the arse.
It hasn't re-occurred since November 8, 2023, and workloads haven't changed.
I also evaluated XCP-ng for quite some time and the issue doesn't occur there, but XCP is unacceptably slow when using local disks and backups take forever, so I didn't migrate.
I have also found that better disk IOPS don't fix this - the issue occurred on a Genoa server with Gen4 NVMe drives (RAID 10).
 
Same here on Windows Server 2019 with SQL Server 2019, but only after the upgrade to Proxmox 8.x; the problem didn't exist before. I have to mention that the Windows VM was also updated, including SQL Server updates. It does not always occur, only every few days... I will switch the CPU type to host and install the latest VirtIO drivers; maybe this will bring a positive change.
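For reference, the CPU type switch mentioned above is the standard Proxmox setting; a sketch from the host shell, with 101 as a placeholder VM ID:

    # set the VM's CPU type to host passthrough (applies on the next VM start)
    qm set 101 --cpu host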
 
I have the latest drivers and the CPU is set to Host. The error still occurs. However, I have the feeling that it occurs much more often on systems with ZFS as the storage for the VMs than on my Ceph cluster.
 
Here we have two old DELL servers with LVM-Thin.
 
Hadn't had this occur since November 23, then got this the other day.
So not resolved, but happening a lot less.

Feb 25 08:07:30.022692 QEMU[1189154]: kvm: Desc next is 17
 
Very old problem (since PVE 3.x).

The solution/fix is simple:
- Don't use any "virtio" based emulation for Windows based VMs.

There were always problems with MS SQL Server in a VM (running backup jobs always triggered the virtio disk dropping/disappearing). I gave up on virtio for Windows a long time ago, and have had no problems since then.

For other VMs (Linux, FreeBSD) virtio is working correctly.
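For anyone who wants to test that workaround, a rough sketch of moving an existing disk off the VirtIO SCSI bus on the Proxmox side; the VM ID 101, scsi0 and the local-lvm volume name are placeholders, the same can be done in the GUI by detaching the disk and re-adding it as SATA or IDE, and the boot order may need adjusting afterwards:

    # detach the disk from the VirtIO SCSI bus; it shows up as unusedN in the VM config
    qm set 101 --delete scsi0
    # re-attach the same volume on the emulated SATA bus
    qm set 101 --sata0 local-lvm:vm-101-disk-0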
 

Dropping the VirtIO drivers would tank performance; you may as well look at another hypervisor at that point.
 
