What is "kvm: Desc next is 3" indicative of?

davemcl · Jun 11, 2023

I have had another
Jun 11 19:29:24 pve03 QEMU[2198]: kvm: Desc next is 3
on the host, this results in the following in Windows guest event log
Reset to device, \Device\RaidPort4, was issued.
and the VM gets locked up and cant write to disks till a full reboot is made. Actually, I usually have to stop the VM due to it not responding at all.

The error message seems to originate here

https://github.com/qemu/qemu/blob/c3f9aa8e488db330197c9217e38555f6772e8f07/hw/virtio/virtio.c#L1057

Anyone have a clue why this is happening?

I also get
Jun 09 19:09:50 pve03 QEMU[2389459]: kvm: virtio: zero sized buffers are not allowed
followed by the same behaviour in Windows guests.

fiona · Jun 12, 2023

Hi,
please share the output of pveversion -v and qm config <ID> with the ID of your VM. Do these errors happen when you do something specific? Can you share a bit more of the surrounding log?

davemcl · Jun 12, 2023

Code:

proxmox-ve: 7.4-1 (running kernel: 6.2.11-2-pve)
pve-manager: 7.4-13 (running version: 7.4-13/46c37d9c)
pve-kernel-6.2: 7.4-3
pve-kernel-5.15: 7.4-3
pve-kernel-6.1: 7.3-6
pve-kernel-6.2.11-2-pve: 6.2.11-2
pve-kernel-6.1.15-1-pve: 6.1.15-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-1
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.2
pve-cluster: 7.3-3
pve-container: 4.4-4
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

Code:

agent: 1
bios: ovmf
boot: order=scsi0;net0;ide0
cores: 12
cpu: Skylake-Server
description: #### Windows Server 2016%0A#### SQL Server 2019%0A
efidisk0: SSD-R10:vm-201-disk-0,efitype=4m,format=raw,pre-enrolled-keys=1,size=528K
ide0: none,media=cdrom
machine: pc-q35-7.1
memory: 61440
meta: creation-qemu=7.1.0,ctime=1676181242
name: IRIS
net0: virtio=86:DB:0D:46:C1:58,bridge=vmbr0,firewall=1,tag=25
numa: 1
onboot: 1
ostype: win10
scsi0: SSD-R10:vm-201-disk-1,discard=on,format=raw,iothread=1,size=96G
scsi1: SSD-R10:vm-201-disk-2,discard=on,iothread=1,size=320G
scsi2: SSD-R10:vm-201-disk-3,discard=on,iothread=1,size=128G
scsi3: SSD-R10:vm-201-disk-4,discard=on,iothread=1,size=64G
scsihw: virtio-scsi-single
smbios1: uuid=973a474f-45aa-4caa-ab0e-bace1c0aa76e
sockets: 1
tags: windows
vmgenid: 62178394-12fe-4df7-ae25-839471658f30

These VM's pretty much have a host all to themselves as SQL & Analysis Services is doing lots of reads/writes.

The host syslog doesnt really point to anything

Code:

Jun 09 19:06:55 pve03 pmxcfs[1672]: [status] notice: received log
Jun 09 19:07:48 pve03 pveproxy[204371]: worker exit
Jun 09 19:07:48 pve03 pveproxy[1836]: worker 204371 finished
Jun 09 19:07:48 pve03 pveproxy[1836]: starting 1 worker(s)
Jun 09 19:07:48 pve03 pveproxy[1836]: worker 267905 started
Jun 09 19:09:25 pve03 smartd[1402]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 72 to 73
Jun 09 19:09:25 pve03 smartd[1402]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 73 to 74
Jun 09 19:09:25 pve03 smartd[1402]: Device: /dev/bus/0 [megaraid_disk_02] [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 73 to 74
Jun 09 19:09:25 pve03 smartd[1402]: Device: /dev/bus/0 [megaraid_disk_03] [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 73 to 74
Jun 09 19:09:50 pve03 QEMU[2389459]: kvm: virtio: zero sized buffers are not allowed
Jun 09 19:14:26 pve03 pmxcfs[1672]: [dcdb] notice: data verification successful
Jun 09 19:17:01 pve03 CRON[274691]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 09 19:17:01 pve03 CRON[274692]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 09 19:17:01 pve03 CRON[274691]: pam_unix(cron:session): session closed for user root

I have also tried all 3 kernels 5.15, 6.1 & 6.2

It occurs quite randomly too, 2 nights in a row, then good for a few days.

Thanks

davemcl · Jun 12, 2023

Is there a straight forward way to monitor disk IO of the 4 VM drives, so I can reconcile the issue happening at a time of high IO?

fiona · Jun 12, 2023

I've searched around a bit and I found something that might be related: https://gitlab.com/qemu-project/qemu/-/commit/bbc1c327d7974261c61566cdb950cc5fa0196b41

While the symptoms described there are worse (assertion failure), it's the same code and maybe the issue manifests differently in your case. The fix is included in QEMU 8.0 which is the version used by the upcoming Proxmox VE 8. It might also land in Proxmox VE 7 as part of QEMU 7.2.3, but might take a bit of time.

davemcl · Jun 12, 2023

Thanks for looking into it.

davemcl · Jun 13, 2023

This also feels similar/relevant
https://bugzilla.kernel.org/show_bug.cgi?id=199727

fiona · Jun 13, 2023

davemcl said:
This also feels similar/relevant
https://bugzilla.kernel.org/show_bug.cgi?id=199727

You are using iothread for your disks and this issue was reported before io_uring, which further improves things, was the default, so I'd really be surprised if that issue would be relevant.

davemcl · Jun 13, 2023

Towards the bottom of the thread io_uring is mentioned quite a bit.
Theres also a comment about avoiding io_uring on lvm but its from 2021.

fiona · Jun 13, 2023

davemcl said:
Towards the bottom of the thread io_uring is mentioned quite a bit.
Theres also a comment about avoiding io_uring on lvm but its from 2021.

Yes, there were some issues with io_uring in the beginning. And for certain storages like LVM, it's still not used as the default. But otherwise it should be stable now. If you are using iothread, IO is handled in a dedicated thread and not in the main thread, so it doesn't block virtual CPUs.

davemcl · Jun 15, 2023

Got it again - does that trace confirm anything?

QEMU[1961]: kvm: virtio: zero sized buffers are not allowed

Code:

strace -c -p $(cat /var/run/qemu-server/201.pid)
strace: Process 1961 attached
^Cstrace: Process 1961 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 95.41    0.377933         330      1145           ppoll
  2.38    0.009431           2      4080           write
  1.42    0.005609           5      1098           read
  0.69    0.002730           2       998           recvmsg
  0.03    0.000138           2        53         1 futex
  0.03    0.000119           5        20           sendmsg
  0.02    0.000072           2        30           ioctl
  0.01    0.000034           8         4           close
  0.00    0.000015           3         4           accept4
  0.00    0.000010           1         8           fcntl
  0.00    0.000005           1         4           getsockname
------ ----------- ----------- --------- --------- ----------------
100.00    0.396096          53      7444         1 total

Code:

root@pve03:~# qm config 201
agent: 1
bios: ovmf
boot: order=scsi0;net0;ide0
cores: 12
cpu: Skylake-Server
description: #### Windows Server 2016%0A#### SQL Server 2019
efidisk0: SSD-R10:vm-201-disk-0,efitype=4m,format=raw,pre-enrolled-keys=1,size=528K
ide0: none,media=cdrom
machine: pc-q35-7.1
memory: 61440
meta: creation-qemu=7.1.0,ctime=1676181242
name: IRIS
net0: virtio=86:DB:0D:46:C1:58,bridge=vmbr0,firewall=1,tag=25
numa: 1
onboot: 1
ostype: win10
scsi0: SSD-R10:vm-201-disk-1,discard=on,format=raw,iothread=1,size=96G
scsi1: SSD-R10:vm-201-disk-2,discard=on,iothread=1,size=320G
scsi2: SSD-R10:vm-201-disk-3,discard=on,iothread=1,size=160G
scsi3: SSD-R10:vm-201-disk-4,discard=on,iothread=1,size=64G
scsi4: SSD-R10:vm-201-disk-5,discard=on,iothread=1,size=448G
scsihw: virtio-scsi-single
smbios1: uuid=973a474f-45aa-4caa-ab0e-bace1c0aa76e
sockets: 1
tags: ims;windows
vmgenid: 62178394-12fe-4df7-ae25-839471658f30

davemcl · Jun 28, 2023

fiona said:
I've searched around a bit and I found something that might be related: https://gitlab.com/qemu-project/qemu/-/commit/bbc1c327d7974261c61566cdb950cc5fa0196b41

While the symptoms described there are worse (assertion failure), it's the same code and maybe the issue manifests differently in your case. The fix is included in QEMU 8.0 which is the version used by the upcoming Proxmox VE 8. It might also land in Proxmox VE 7 as part of QEMU 7.2.3, but might take a bit of time.

Am now on PVE8, will see how things go...

davemcl · Jul 4, 2023

Issue still happening on PVE8. Not sure what else to try.

davemcl · Jul 27, 2023

I managed to reproduce the error QEMU[xxxx]: kvm: Desc next is 3 error using Microsofts DiskSpd tool.
https://github.com/microsoft/diskspd

Create Windows 2019 Server VM with 8 cores/1 socket and 12GB of RAM - using latest VirtIO drivers 0.1.229
Machine = pc-q35-8.0, Controller = VirtIO SCSI Single,
My Drives:
scsi0: tank:vm-110-disk-2,discard=on,iothread=1,size=48G,ssd=1,aio=io_uring
scsi1: tank:vm-110-disk-1,discard=on,iothread=1,size=112G,ssd=1, aio=io_uring
I have ballooning enabled - not sure if relevant & min/max memory set to 12GB
Create 2nd drive, format with NTFS & 64k allocation unit size.
Download Diskspd & extract x64 binaries to C:\Windows
Create 2 test files in D:\diskspd - from powershell
diskspd -c8G D:\diskspd\iotest2.dat
diskspd -c8G D:\diskspd\iotest3.dat
Fill files with random data
diskspd.exe -w100 -Zr -d460 .\iotest2.dat
diskspd.exe -w100 -Zr -d460 .\iotest3.dat
Open 2nd powershell and run next 2 commands concurrently
diskspd -b8K -d400 -Sh -L -o32 -t8 -r -w0 .\iotest2.dat
diskspd -b64K -d400 -Sh -L -o32 -t8 -r -w0 .\iotest3.dat

If last step completes and outputs stats, re-run again.
Underlying storage on my test is RAID10 ZFS pool on 4 x Samsung 883 SSD (the higher the iops, the less this error occurs for me)
From 5 attempts I got it to throw 4 times... YMMV

exerit · Nov 10, 2023

Hi,
I have a similar problem with host
Windows Server 2022 Standard
with SQL
on promox. Reporting an error in the log and corrupting random databases:"Reset to device, \Device\RaidPort2, was issued."

Code:

root@pve:~# qm config 102
agent: 1
bios: ovmf
boot: order=scsi0;net0;ide0
cores: 4
cpu: host
efidisk0: local-lvm:vm-102-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
machine: pc-q35-8.0
memory: 32768
meta: creation-qemu=7.1.0,ctime=1673047081
name: db1
net0: virtio=62:CE:10:0B:E8:EC,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win11
parent: CzteryBazyuUzkodzone
scsi0: local-lvm:vm-102-disk-1,cache=writeback,discard=on,size=150G
scsi1: local-lvm:vm-102-disk-2,cache=writeback,discard=on,size=300G
scsi2: local-lvm:vm-102-disk-3,cache=writeback,discard=on,size=400G
scsihw: virtio-scsi-single
smbios1: uuid=57f34b1c-b9a6-41fb-aa49-fd44afaec0ae
sockets: 1
startup: order=2,up=5
tpmstate0: local-lvm:vm-102-disk-4,size=4M,version=v2.0
usb0: host=090c:1000
usb1: host=058f:6387
usb2: host=096e:0201
vmgenid: 838f76d0-34e9-4faf-8a73-27a4de3f3b66
root@pve:~#

Code:

root@pve:~#
proxmox-ve: 8.0.2 (running kernel: 6.2.16-19-pve)
pve-manager: 8.0.4 (running version: 8.0.4/d258a813cfa6b390)
pve-kernel-6.2: 8.0.5
proxmox-kernel-helper: 8.0.3
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
proxmox-kernel-6.2: 6.2.16-19
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx5
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.5
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.9
libpve-guest-common-perl: 5.0.5
libpve-http-server-perl: 5.0.4
libpve-rs-perl: 0.8.5
libpve-storage-perl: 8.0.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.4-1
proxmox-backup-file-restore: 3.0.4-1
proxmox-kernel-helper: 8.0.3
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.9
pve-cluster: 8.0.4
pve-container: 5.0.5
pve-docs: 8.0.5
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.3
pve-firmware: 3.8-3
pve-ha-manager: 4.0.2
pve-i18n: 3.0.7
pve-qemu-kvm: 8.0.2-7
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.7
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.13-pve1

Can someone help me with what to do and where to look for the problem?
Thanks!

davemcl · Nov 11, 2023

This issue is also being looked at by the VirtIO guest driver people over here
https://github.com/virtio-win/kvm-guest-drivers-windows/issues/756

The only thing Ive found that mitigates it mostly (not totally) is faster storage.
SAS SSD Raid 10 - 3 times since April 2023
SATA SSD Raid 10 - 22 times since June 2023

davemcl · Nov 13, 2023

Ive been playing with dropping back the machine type from
pc-q35-8.0
to
pc-q35-5.1
Have done 5 test runs with diskspd and havent seen this or similar issues occur.
On boot I only had a C: drive - other drives had to be re-added to the OS via "Disk Management", youll need to do a bit of cleanup in device management too.
Im assuming this would cause a license re-activation too.

halt · Nov 13, 2023

davemcl said:
Ive been playing with dropping back the machine type from
pc-q35-8.0
to
pc-q35-5.1
Have done 5 test runs with diskspd and havent seen this or similar issues occur.
On boot I only had a C: drive - other drives had to be re-added to the OS via "Disk Management", youll need to do a bit of cleanup in device management too.
Im assuming this would cause a license re-activation too.

Thanks for your reply! We have the same problem. You used local storage on ext4? do you tested on lvm stogage ?

davemcl · Nov 13, 2023

halt said:
Thanks for your reply! We have the same problem. You used local storage on ext4? do you tested on lvm stogage ?

In production its RAID10 / LVM-Thin
The testing I did today is SATA SSD mirrored pair on ZFS

jose.cardoso · Nov 17, 2023

Hello. I'm new in the forum, and new also with Proxmox (I used to work with VMWare).

I have a brand new server with PVE 8.0.4, and I'm experiencing this same error, on a Windows Server 2022 VM:

Reset to device, \Device\RaidPort0, was issued.

But the issuer is storahci driver.

Also, I have a lot of errorr from ESENT:

svchost (8068,D,0) SoftwareUsageMetrics-Svc: A request to write to the file "C:\Windows\system32\LogFiles\Sum\Svc.log" at offset 524288 (0x0000000000080000) for 4096 (0x00001000) bytes succeeded, but took an abnormally long time (71 seconds) to be serviced by the OS. In addition, 0 other I/O requests to this file have also taken an abnormally long time to be serviced since the last message regarding this problem was posted 94 seconds ago. This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem.

I have a 24 TB RAID 5 storage, on an harware RAID card MegaRAID SAS 9341-4i.

The storage in PVE is set as LVM.

I'm worried because this is a brand new server, and I'm migrating the domain controller and the file server, from VMware.

Is this something with the drivers?

What is "kvm: Desc next is 3" indicative of?

Member

Proxmox Staff Member

Member

Member

Proxmox Staff Member

Member

Member

Proxmox Staff Member

Member

Proxmox Staff Member

Member

Member

Member

Member

New Member

Member

Member

New Member

Member

New Member