Red Hat VirtIO developers would like to coordinate with Proxmox devs re: "[vioscsi] Reset to device ... system unresponsive"

Can anyone confirm if the 0.1.266-1 driver bundle from the Fedora Download site is supposed to fix this issue?

I'm still seeing "Reset to device, \Device\RaidPort1, was issued." with this version, and having looked at the source RPMs, they don't appear to have all of the commits mentioned on Github.
 
Oh, yes, you're right. I'm not too sure which bits are most relevant though. I switched my VMs to VirtIO Block (which is the viostor driver, right?) and all looks OK so far. Strangely, only one of my systems was affected anyway. I saw it when opening a very large event log while it was installing Windows updates.

https://github.com/virtio-win/kvm-guest-drivers-windows/pull/1196#issuecomment-2490465751

benyamin-codez commented last week

@vrozenfe Thanks for updating the tags, Vadim.
@MaxXor
I think it should at least have the SendSRB() and DPC fixes, refactoring, tracing improvements, etc.
From the tags it looks like the cut-off was in the ides of October:
https://github.com/virtio-win/kvm-g...its/96b358e3137139d54459107ee8d76baa1a7401d3/
Notably, the viostor back-ported fixes will be missing.
Perhaps a prudent risk mitigation strategy should vioscsi prove to have regressions.
 
switched my VMs to VirtIO Block (which is the viostor driver, right?)
No, VirtIO Block is a different driver (viostor), which is currently missing I/O thread support.
The fix here is for the VirtIO SCSI driver (vioscsi), which is used when the vdisk is attached as SCSI and the SCSI controller is set to VirtIO SCSI single.
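For reference, the difference shows up in the VM configuration file. A minimal sketch (the VMID, storage name and disk size are placeholders, not taken from this thread):

Code:
# /etc/pve/qemu-server/<vmid>.conf

# Alternative 1: disk attached as VirtIO Block -> viostor driver in the guest
virtio0: local-zfs:vm-100-disk-0,size=64G

# Alternative 2: disk attached as SCSI on a VirtIO SCSI single controller -> vioscsi driver
scsihw: virtio-scsi-single
scsi0: local-zfs:vm-100-disk-0,iothread=1,size=64G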
 
Hi, sorry for the question, but was the fix for vioscsi included in 0.1.266-1?
I seem to see conflicting answers about whether it is included and the problem is solved. From the corresponding tag in the GitHub repository, the fix appears not to be included (if I checked correctly), so even if someone here wrote that they have no problems with the latest version, the problem may remain. Do I therefore have to stay on version 0.1.208 at most for vioscsi to be sure of avoiding the issue?
I'm trying to figure out which virtio version I should install (to avoid the problem) on some Windows servers with critical services that I'll be migrating to Proxmox soon.
 
Just to clear up any confusion:
  • The fixes for the viostor (VirtIO SCSI Controller) driver are NOT in v266.
  • The fixes for the vioscsi (VirtIO SCSI pass-through controller) driver are in v266.
See my comment here.
Per Vadim's comment above that post, the next stable public build will be released in mid-to-late January 2025.
This should include the fixes for viostor.
 
Furthermore:

viostor is for Proxmox VirtIO Block hard disk devices.
vioscsi is for disks attached via the Proxmox VirtIO SCSI and VirtIO SCSI single controllers.
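If in doubt, you can check which controller and bus a given VM uses from the PVE host. A quick sketch (VMID 101 is just an example):

Code:
# Show the SCSI controller type and all disk attachments for a VM
qm config 101 | grep -E '^(scsihw|scsi[0-9]+|virtio[0-9]+|sata[0-9]+|ide[0-9]+):'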
 
Thanks for the reply and details.
I started updating to virtio 0.1.266-1 on two servers with low usage, one Windows 2019 and one Windows 2022. On both, the network interface partially lost its configuration after the reboot. This is easy to fix if you note down the configuration before the update, then check it afterwards and re-add whatever is missing. Still, knowing about the problem could help anyone who updates virtio without checking (e.g. mass updates done remotely, or with a reboot afterwards).
Has this issue been seen by others?
 
That's a different bug, in the installer; updating only the SCSI driver from Windows Device Manager is enough to avoid it.
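If you'd rather script it than click through Device Manager, something like this should work from an elevated prompt in the guest (a sketch: it assumes the virtio-win ISO is mounted as E: and a Windows Server 2022 guest, so adjust the path to your OS):

Code:
pnputil /add-driver E:\vioscsi\2k22\amd64\vioscsi.inf /install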
 
The reproducer (using VirtIO SCSI disks) I've been using has not yet triggered any hangs (or vioscsi event ID 129 "Reset to device [...] issued" messages) when using virtio-win 0.1.266. This looks promising, thanks to @benyamin's upstream fixes!

Has anyone running virtio-win 0.1.266 with VirtIO SCSI disks seen any more hangs or "Reset to device [...] issued" messages? If yes, please post here.
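For anyone checking, a quick way to look for these events in the guest is something like the following PowerShell (my sketch, not an official reproducer; event ID 129 from the vioscsi provider is the "Reset to device" warning discussed here, and the command errors out harmlessly if no matching events exist):

Code:
# List recent vioscsi "Reset to device" (event ID 129) warnings from the System log
Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'vioscsi'; Id = 129 } -MaxEvents 20 |
    Format-Table TimeCreated, Id, Message -AutoSize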

Can anyone confirm if the 0.1.266-1 driver bundle from the Fedora Download site is supposed to fix this issue?

I'm still seeing "Reset to device, \Device\RaidPort1, was issued." with this version, and having looked at the source RPMs, they don't appear to have all of the commits mentioned on Github.
Can you double-check that you are using VirtIO SCSI (not VirtIO Block) disks, and that the vioscsi driver is at 0.1.266?
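One way to verify the driver version actually in use inside the guest (a sketch; going by the versioning pattern mentioned later in this thread, where 0.1.208 shows as 100.85.104.20800, the 0.1.266 bundle should report a DriverVersion ending in 26600):

Code:
# Show the installed VirtIO storage drivers and their versions
Get-CimInstance Win32_PnPSignedDriver |
    Where-Object { $_.DeviceName -like '*VirtIO*SCSI*' } |
    Select-Object DeviceName, InfName, DriverVersion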
 
So far, on the servers I've updated to 0.1.266 I haven't seen the problem any more, including one where I could previously reproduce it easily with a recent virtio version (but older than 0.1.266) and had to downgrade as a workaround.

For those who still have problems, I also recommend checking in Device Manager which driver version is actually in use by the VirtIO SCSI controller, and checking that the disks are actually attached as SCSI (in the VM configuration).
 
Edit: Updated to PVE version 8.3.1 and rebooted into the latest kernel - warnings appear to have reduced significantly, although they are still present.


Unfortunately, a backup server that we are running has been able to reproduce the vioscsi errors following an upgrade to version 266 (as linked above from the Fedora repos).
Interestingly, the reason we tried VirtIO SCSI again was that we began running into storahci warnings when the disks were attached in SATA mode. I have not seen much discussion of storahci, but it is similar to vioscsi in that the same "Reset to device..." warning appears in Event Viewer. The vioscsi errors appear more frequently (disk speeds seem slower?), but they have a much less dramatic effect - the storahci reset caused a one-minute pause in availability.

Information on server/guest:
- Physical disks: ZFS RAID-Z2 (mechanical HDDs)
- VE version: 8.2.7
- Guest OS: Windows Server 2022 Std (latest build 21H2 20348.2849)
 
Unfortunately, a backup server that we are running has been able to reproduce the vioscsi errors following an upgrade to version 266 (as linked above from the Fedora repos). [...]
I believe this is not the same issue and you should create a new topic about this...
 
I believe this is not the same issue and you should create a new topic about this...
It does seem different.

@liim, from your description it seems vioscsi is not generating "Reset to device..." errors and the system remains somewhat responsive.

I would urgently review your logs for SMART errors, and provided the data is backed up, consider running some tests with smartctl...
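For example, something along these lines on the PVE host (a sketch; replace /dev/sda with each member disk of the pool in turn):

Code:
# Overall SMART health and error counters for one disk
smartctl -a /dev/sda

# Kick off a short self-test, then check the result a few minutes later
smartctl -t short /dev/sda
smartctl -l selftest /dev/sda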
 
It does seem different.

@liim, from your description it seems vioscsi is not generating "Reset to device..." errors and the system remains somewhat responsive.
Thanks for the replies. I think I may have slightly misled everyone by getting my hopes up on a Sunday night. The vioscsi device reset warnings have been coming in regularly (once every 2-3 minutes) again now that the I/O load has ramped up, causing all the same backup failures that got me to start looking into this.

I also have another server that replicates this issue identically - same hardware/software configuration, same errors - which makes me less inclined to think it is a physical disk error.

I will try to dig up the old 208 driver version to see if I can reproduce on that, which would at least assure us that it's a separate issue.
 
I will try to dig up the old 208 driver version to see if I can reproduce on that, which would at least assure us that it's a separate issue.
Did you get anywhere with this?

Given you are using spinning iron, I was thinking it could also be memory pressure or QEMU global mutex I/O issues.
Can you share your zfs config - including ARC - and your command line, i.e. qm showcmd <VMID> --pretty?
What do you have your "swappiness" set to? You can check with: sysctl vm.swappiness
Do you have Ceph configured?
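For convenience, something like the following should gather all of the above on the PVE host (my sketch, assuming VMID 101 as in the output below; skip the last command if Ceph isn't installed):

Code:
zpool status                 # pool layout and scrub state
arc_summary | head -40       # ARC sizing and hit rates
qm showcmd 101 --pretty      # the full QEMU command line for the VM
sysctl vm.swappiness         # current swappiness
pveceph status               # only relevant if Ceph is configured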
 
I will try to dig up the old 208 driver version to see if I can reproduce on that, which would at least assure us that it's a separate issue.
Version 100.85.104.20800 installed now.

I made a few other changes since that last post.
  • Stopped scrub runs on both Proxmox servers - this seemed to have the most impact and is probably the cause of the OS lockups
  • Switched the VM OS disk from VirtIO Block to IDE (this has no requirement to be fast, it just needs to be reliable) - there have been no reset warnings related to this disk
  • Downgraded the vioscsi driver to 208 as mentioned
I was still seeing errors (about one per minute), but interestingly general disk stability was much better, avoiding the one-minute pauses seen previously. The backups are now working, so the pressure to fix this has dropped off substantially.

Further to this, I have now moved the backup drive back onto a SATA controller - there have been no storahci errors since the ZFS scrub finished.

There is no swap configured on this VE (swappiness is the default 60, for the record).

I would be interested to hear if @dsjfshdfjklsdjfkkjv experienced anything similar since moving to SATA controller.

ZFS Config and VM config:
Code:
  pool: veeam3-raidz2
 state: ONLINE
  scan: scrub repaired 0B in 1 days 08:04:52 with 0 errors on Mon Dec  9 08:28:55 2024
config:

        NAME                                    STATE     READ WRITE CKSUM
        veeam3-raidz2                           ONLINE       0     0     0
          raidz2-0                              ONLINE       0     0     0
            ata-WDC_WD161KRYZ-01AGBB0_2JK0PDEB  ONLINE       0     0     0
            ata-WDC_WD161KRYZ-01AGBB0_2KG941JW  ONLINE       0     0     0
            ata-WDC_WD161KRYZ-01AGBB0_4BKJYZAV  ONLINE       0     0     0
            ata-WDC_WD161KRYZ-01AGBB0_6PG8AR4U  ONLINE       0     0     0
            ata-WDC_WD161KRYZ-01AGBB0_51G1PN5R  ONLINE       0     0     0
            ata-WDC_WD161KRYZ-01AGBB0_3ZG3D4WJ  ONLINE       0     0     0
            ata-WDC_WD161KRYZ-01AGBB0_3ZG18N1A  ONLINE       0     0     0
            ata-WDC_WD161KRYZ-01AGBB0_3ZG3BHVA  ONLINE       0     0     0

errors: No known data errors

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    87.0 %   10.2 GiB
        Target size (adaptive):                        87.4 %   10.2 GiB
        Min size (hard limit):                         8.3 %  1000.9 MiB
        Max size (high water):                           11:1   11.7 GiB


/usr/bin/kvm \
  -id 101 \
  -name 'Veeam-C,debug-threads=on' \
  -no-shutdown \
  -chardev 'socket,id=qmp,path=/var/run/qemu-server/101.qmp,server=on,wait=off' \
  -mon 'chardev=qmp,mode=control' \
  -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
  -mon 'chardev=qmp-event,mode=control' \
  -pidfile /var/run/qemu-server/101.pid \
  -daemonize \
  -smbios 'type=1,uuid=c807488c-8ba0-47b3-b781-f8882efb8da0' \
  -drive 'if=pflash,unit=0,format=raw,readonly=on,file=/usr/share/pve-edk2-firmware//OVMF_CODE_4M.fd' \
  -drive 'if=pflash,unit=1,id=drive-efidisk0,format=raw,file=/dev/zvol/rpool/data/vm-101-disk-0,size=540672' \
  -smp '24,sockets=2,cores=12,maxcpus=24' \
  -nodefaults \
  -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
  -vnc 'unix:/var/run/qemu-server/101.vnc,password=on' \
  -cpu 'host,hv_ipi,hv_relaxed,hv_reset,hv_runtime,hv_spinlocks=0x1fff,hv_stimer,hv_synic,hv_time,hv_vapic,hv_vpindex,+kvm_pv_eoi,+kvm_pv_unhalt' \
  -m 20480 \
  -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
  -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
  -device 'pci-bridge,id=pci.3,chassis_nr=3,bus=pci.0,addr=0x5' \
  -device 'vmgenid,guid=36e4a57f-05a6-4b23-9d77-c704ed0fd1a4' \
  -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' \
  -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' \
  -device 'VGA,id=vga,bus=pci.0,addr=0x2' \
  -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on' \
  -iscsi 'initiator-name=iqn.1993-08.org.debian:01:184bb0d19f66' \
  -drive 'file=/dev/zvol/rpool/data/vm-101-disk-1,if=none,id=drive-ide0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
  -device 'ide-hd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=100' \
  -drive 'if=none,id=drive-ide2,media=cdrom,aio=io_uring' \
  -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=101' \
  -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
  -drive 'file=/dev/zvol/veeam3-raidz2/vm-101-disk-0,if=none,id=drive-sata1,aio=native,format=raw,cache=none,detect-zeroes=on' \
  -device 'ide-hd,bus=ahci0.1,drive=drive-sata1,id=sata1' \
  -netdev 'type=tap,id=net0,ifname=tap101i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' \
  -device 'virtio-net-pci,mac=BC:24:11:5E:9F:91,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=256,bootindex=102' \
  -rtc 'driftfix=slew,base=localtime' \
  -machine 'hpet=off,type=pc-i440fx-8.1+pve0' \
  -global 'kvm-pit.lost_tick_policy=discard'
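As an aside: the ARC here is allowed to grow to ~11.7 GiB while the VM has 20 GiB assigned, so if memory pressure does turn out to be a factor, capping the ARC is a common mitigation. A sketch (the 8 GiB figure is an arbitrary example, not a recommendation for this host):

Code:
# Cap ARC at runtime (takes effect immediately, lost on reboot)
echo $((8 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max

# Make it persistent across reboots
echo "options zfs zfs_arc_max=$((8 * 1024 * 1024 * 1024))" > /etc/modprobe.d/zfs.conf
update-initramfs -u -k all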
 
