Redhat VirtIO developers would like to coordinate with Proxmox devs re: "[vioscsi] Reset to device ... system unresponsive"

@liim

Using IDE / SATA disks has been a workaround for some time.

The VM config isn't much use since you moved away from vioscsi...
Having said that, I presumed you were still using aio=io_uring or aio=native with vioscsi, so you were exposed to the qemu global mutex.
This means other operations will share the thread with your disk.
Perhaps the scrubbing was the root cause...?

The lack of swap might rule out memory pressure from using a relatively large ARC size.

You mentioned VirtIO block. This is the viostor driver and not vioscsi.
You needed to use VirtIO SCSI single and not VirtIO block...!

I'm still inclined to think your issue is something else... Given the pressure has dropped off, maybe we will never know...

If you had a screenshot of one of the errors that would be helpful too.
I presume you didn't get any kvm errors in the PVE journal. This is a distinctive feature of this problem.
 
Switched VM OS disk from VirtIO block to ide
Neither of those uses the fixed driver.
VirtIO Block isn't affected because it isn't qemu thread independent yet.
Only the SCSI disk type + VirtIO SCSI Single controller combination is fixed in this thread.
IDE + SATA don't use VirtIO drivers at all; they aren't impacted because they are slower than VirtIO SCSI, and they aren't qemu thread independent either.
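To illustrate (a minimal sketch; the storage name and volume are placeholders, not taken from any config in this thread), the combination covered by the fix looks like this in the VM config:

scsihw: virtio-scsi-single
scsi0: <storage>:vm-<vmid>-disk-0,iothread=1,aio=io_uring

whereas a VirtIO Block or IDE disk on the same VM would look like:

virtio0: <storage>:vm-<vmid>-disk-0
ide0: <storage>:vm-<vmid>-disk-0

and would still go through the unchanged viostor driver (virtio0) or no VirtIO driver at all (ide0).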
 
Version 100.85.104.20800 installed now.

I made a few other changes since that last post.
  • Stopped scrubbing runs on both Proxmox servers - this seemed to be having the most impact, and is probably the cause of the OS lockups
  • Switched VM OS disk from VirtIO block to ide (This has no requirement to be fast, just needs to be reliable) - There had been no reset warnings related to this
  • Downgraded the vioscsi driver to 208 as mentioned
I was still seeing errors (1/min), but interestingly general disk stability was a lot better, avoiding the 1-minute pauses seen previously. The backups are now working, so pressure to fix this has dropped off substantially.
I'd tend to agree with the other suggestions in this thread that this might rather be a performance problem of the underlying storage. I'd expect a RAIDZ2 pool with spinning disks to be quite slow, and the ahcistor warnings when attaching the disks via SATA (the VirtIO SCSI/Block guest drivers wouldn't be involved in that case), as well as the fact that the issues improve when scrubbing on the host is paused, also hint in that direction.

If you'd like to debug this further -- can you check the IO pressure in /proc/pressure/io [1] while the VM is running / while you are seeing the issues? Also, would it be possible to temporarily move the VM disks to a fast local storage (like a local SSD or NVMe), and see if you still see issues then? If you'd like to look into this further, it would be great if you could open a new thread -- feel free to reference it here.

[1] https://facebookmicrosites.github.io/psi/docs/overview
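For anyone unfamiliar with the PSI interface in [1], the check is just a read of that file on the PVE host; the values below are made-up placeholders:

cat /proc/pressure/io
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

"some" is the share of time at least one task was stalled on IO over the last 10/60/300 seconds, "full" the share of time all non-idle tasks were stalled; sustained high averages while the VM is running would point to the storage being the bottleneck.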
 
I regret to share that I am using Proxmox VE 8.3 with spinning disks and kernel 6.8, and despite varying the VM configurations, there is consistently a very high I/O delay. This issue is particularly noticeable during operations involving both network and disk activity, such as backups, restores, and snapshots.
 
I regret to share that I am using Proxmox VE 8.3 with spinning disks and kernel 6.8, and despite varying the VM configurations, there is consistently a very high I/O delay. This issue is particularly noticeable during operations involving both network and disk activity, such as backups, restores, and snapshots.
Are these Windows VMs using VirtIO SCSI, and if yes, do you also see the device resets discussed in this thread in the Windows event viewer? The issue rather sounds like the underlying storage may be the culprit. Could you please open a new thread and provide some more details, including the output of pveversion -v, the config of an affected VM (the output of qm config VMID), the storage configuration (the contents of /etc/pve/storage.cfg) and some more details on your storage setup?
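For reference, those are all commands run on the PVE host (VMID is a placeholder for the affected VM's ID):

pveversion -v
qm config <VMID>
cat /etc/pve/storage.cfg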
 
Are these Windows VMs using VirtIO SCSI, and if yes, do you also see the device resets discussed in this thread in the Windows event viewer? The issue rather sounds like the underlying storage may be the culprit. Could you please open a new thread and provide some more details, including the output of pveversion -v, the config of an affected VM (the output of qm config VMID), the storage configuration (the contents of /etc/pve/storage.cfg) and some more details on your storage setup?
These are Linux-based virtual machines (VMs) only; there are no Windows VMs. All of the VMs experienced significant I/O wait issues, but the Proxmox host itself does not appear to be affected.

After thorough testing of system procedures such as backups, restores, and snapshots, all performed on the same storage, the results were clear: Proxmox 7 does not exhibit I/O wait problems, while Proxmox 8 consistently shows these issues regardless of the kernel version. I tested multiple kernels (6.2, 6.5, 6.8, and 6.11) with Proxmox 8, all using the same storage, and the I/O wait problem persisted across all configurations.

This is deeply frustrating, as it raises concerns about the reliability of Proxmox 8 and has led me to consider abandoning Proxmox entirely.
 
If your issue is not related (you have only Linux VMs, and this topic is specifically about the VirtIO drivers for Windows VMs), please open a new thread for it.
Also post all the information requested by @fweber; without that information it is impossible to help find the cause of your issue.
 
As I understand it, the "Reset to device\Device\RaidPort" problem has been resolved in the latest update of the virtio drivers.

As far as I understand, the virtio developers solved this problem by releasing a driver update. They write about it here
https://github.com/virtio-win/kvm-guest-drivers-windows/pull/1150

I have installed virtio-win-0.1.266.iso and for 4 days now there has not been a single "Reset to device\Device\RaidPort" in the logs. Before the installation, it happened several times every day.


Configuration of my VM:
SCSI Controller: VirtIO SCSI single
HDD: SCSI
Cache: Default (No cache)
Discard enabled
IOthread enabled
SSD emulation enabled
Async IO: Default (io_uring)

Windows Server 2022

Full config VM:
agent: 1
bios: ovmf
boot: order=scsi0;net0;ide2
cores: 20
cpu: x86-64-v2-AES
efidisk0: zp-ilogy-pve3-ssd1-thin:vm-114-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide2: iso-ssd1:iso/ru-ru_windows_server_2022_updated_sep_2024_x64_dvd_cab4e960.iso,media=cdrom,size=5788770K
machine: pc-q35-9.0
memory: 20480
meta: creation-qemu=9.0.2,ctime=1730190753
name: MOSK-TS-2
net0: virtio=BC:24:11:99:F6:0F,bridge=vmbr1
net1: virtio=BC:24:11:5E:41:13,bridge=vmbr1,tag=55
numa: 1
onboot: 1
ostype: win11
parent: do_ispravleniya_1066
protection: 1
scsi0: zp-ilogy-pve3-ssd1-thin:vm-114-disk-2,discard=on,iothread=1,size=500G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=9bf504d4-c7f9-4425-b9bd-f8dcd9b7efa2
sockets: 1
tpmstate0: zp-ilogy-pve3-ssd1-thin:vm-114-disk-1,size=4M,version=v2.0
vmgenid: a782d5c9-912c-4a8d-bce1-fbd128384653
 
Has anyone tried v266 with a Ceph backing...?
yes, works flawlessly

The only thing I noticed with v266 is that HDD-backed Ceph with a very high queue kills the specific volume and it becomes unresponsive and stuck forever. The best way to trigger that is to have a file server with deduplication, start a garbage collection, and, most importantly, have cache set to writeback. Setting cache from writeback to none resolves my problem.

I can't say for sure if this is related to v266; unfortunately I can't test any other driver version at the moment.
 
I didn't manage to install v266 on dozens of my VMs. The MSI installer fails and reverts, unfortunately.
If the installer issue persists even with @_gabriel's workaround, can you open a new thread (and mention me there)? So far I couldn't reliably reproduce the installer issue, so I'd be interested in looking into this further.
 
Has anyone tried v266 with a Ceph backing...?

yes, works flawlessly

The only thing I noticed with v266 is that HDD-backed Ceph with a very high queue kills the specific volume and it becomes unresponsive and stuck forever. The best way to trigger that is to have a file server with deduplication, start a garbage collection, and, most importantly, have cache set to writeback. Setting cache from writeback to none resolves my problem.

I can't say for sure if this is related to v266; unfortunately I can't test any other driver version at the moment.

Great news..!
This is most probably due to fixes included with v266.
The change in cache makes sense. This should really be left to the "lowest" level possible, i.e. closest to bare metal.
(Oh no, I hope I didn't start a flame war with that comment..!!)
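For reference, the change described above is just the cache property on the disk line of the VM config; a sketch with placeholder storage/volume names:

scsi0: <storage>:vm-<vmid>-disk-0,cache=writeback,iothread=1
scsi0: <storage>:vm-<vmid>-disk-0,cache=none,iothread=1

Leaving the cache property unset is equivalent to cache=none (shown as "Default (No cache)" in the GUI, as in the config posted above), which keeps write caching at the lowest level, i.e. the host/storage layer.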
 
For those interested, in the last few days there have been many PRs raised for fixes in viostor (the VirtIO Block driver).
These are still pending review, but if approved and merged it will likely mean that the block driver will outperform the vioscsi driver.
It is important to remember that both drivers are MS StorPort drivers, with the vioscsi driver being ported from the MS SCSI Port driver some time ago.
Recent changes to viostor, i.e. virtio-blk devices, especially the I/O Thread Virtqueue Mapping support mentioned by @RoCE-geek earlier in this topic and by RH devs officially elsewhere, might mean this becomes the go-to offering in future storage setups. Still a bit to go though...
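For those curious what that looks like below Proxmox, a rough sketch at the plain QEMU level (the syntax is my assumption based on recent QEMU releases; none of this is exposed in the PVE config today): each virtqueue of a virtio-blk device gets pinned to its own IO thread, e.g.

-object iothread,id=iot0 \
-object iothread,id=iot1 \
-blockdev driver=file,filename=disk.img,node-name=proto0 \
-blockdev driver=raw,file=proto0,node-name=disk0 \
-device '{"driver":"virtio-blk-pci","drive":"disk0","iothread-vq-mapping":[{"iothread":"iot0"},{"iothread":"iot1"}]}'

so request processing is spread across multiple host threads instead of a single IO thread per device.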

CC: @fweber @fiona