Red Hat VirtIO developers would like to coordinate with Proxmox devs re: "[vioscsi] Reset to device ... system unresponsive"

I just updated my test server's MSSQL to v271 from v208 about 3 hours ago and I've gotten zero events. I'll keep an eye on it and report if anything pops up.
I updated over the weekend, and tonight there was one error that stopped the nightly task. This task had been triggering the error pretty regularly, so after three days without a hitch I was hoping it was resolved.
 
Hi,

I updated MSSQL to driver v271 and the event 129 error is back; it is less frequent, but it is back. I am going back to v208 tonight.
Well, I also had to roll back from v266 / v271 to v208 to be stable again. :(

I have a workload to run SCSI stress tests with different KVM tunings:
It uses O&O Defrag to run the "COMPLETE/Name" hard drive optimization 4 times consecutively (htop = 100% disk I/O).
Code:
----------------------------------------------------------------------------------------------------------------
OK = STABLE
CRASH = kvm: ../block/block-backend.c:1780: blk_drain: Assertion `qemu_in_main_thread()' failed.
----------------------------------------------------------------------------------------------------------------
pve-manager/7.4-19/f98bf8d4 (running kernel: 5.15.158-2-pve)
QEMU emulator version 7.2.10
scsihw: virtio-scsi-single
----------------------------------------------------------------------------------------------------------------
v271 + cache=unsafe,discard=on,iothread=1 : CRASH  (FIO_R = 3794MBs_58,0k_0,31ms / FIO_W = 3807MBs_58,0k_0,31ms)
v271 + cache=unsafe,discard=on,iothread=0 : OK     (FIO_R = 3748MBs_70,2k_0,26ms / FIO_W = 3762MBs_70,1k_0,27ms)
v266 + cache=unsafe,discard=on,iothread=1 : CRASH  (FIO_R = 3817MBs_56,2k_0,32ms / FIO_W = 3830MBs_56,2k_0,32ms)
v266 + cache=unsafe,discard=on,iothread=0 : OK     (FIO_R = 3804MBs_71,9k_0,26ms / FIO_W = 3818MBs_71,8k_0,26ms)
v208 + cache=unsafe,discard=on,iothread=1 : OK     (FIO_R = 3922MBs_55,6k_0,32ms / FIO_W = 3937MBs_55,6k_0,32ms)
v208 + cache=unsafe,discard=on,iothread=0 : OK     (FIO_R = 3823MBs_68,6k_0,27ms / FIO_W = 3835MBs_68,5k_0,27ms)        **BEST**
v208 + cache=unsafe,discard=ignore,iothread=1 : OK (FIO_R = 3856MBs_55,7k_0,32ms / FIO_W = 3867MBs_55,6k_0,32ms)
v208 + cache=unsafe,discard=ignore,iothread=0 : OK (FIO_R = 3806MBs_68,0k_0,27ms / FIO_W = 3819MBs_68,0k_0,27ms)
v208 + discard=on,iothread=1 : OK                  (FIO_R =  234MBs_30,9k_0,95ms / FIO_W =  245MBs_30,8k_1,10ms)
v208 + discard=on,iothread=0 : OK                  (FIO_R =  239MBs_29,9k_0,85ms / FIO_W =  252MBs_29,9k_1,14ms)
----------------------------------------------------------------------------------------------------------------
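For anyone wanting to reproduce a similar matrix, here is a rough sketch of how each row can be applied and measured. The VMID, storage/volume names and fio parameters are placeholders, not the exact workload used above (inside a Windows guest, fio would use the windowsaio ioengine instead of libaio).

Code:
# Apply one combination from the matrix to the test disk
qm set 100 --scsi0 local-lvm:vm-100-disk-0,cache=unsafe,discard=on,iothread=1

# Rough sequential-write measurement against a test file
fio --name=seqwrite --filename=/mnt/test/fio.dat --size=4G --rw=write --bs=64k \
    --iodepth=16 --ioengine=libaio --direct=1 --runtime=60 --time_based --group_reporting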
 
Do you use local storage?
What about PVE version 8.4, shipped with QEMU 9.0 and kernel 6.8?
Those tests were done on a very simple machine: Ryzen 7 5700X + Crucial MX500 SATA SSD + local thin LVM.
When I have time, I will also test PVE 8.4 and QEMU 9.2.
 
No driver can improve this consumer SSD; outside of its internal DRAM cache, it writes slowly.
Moreover, cache=unsafe increases the slowness because of the double, if not triple, caching involved.
I totally agree, this kind of hardware is not for production.
What is interesting here is that the v208 Windows driver doesn't kill the VM, contrary to v266 and v271 (blk_drain assertion in block-backend.c with iothread=1).
 
Hey All,

Here are some of my tests that I hope will help.
I have upgraded the drivers to v271 on one of my production servers (non-critical) to do testing with; here's the verification of said drivers:

[Screenshot: verification of the installed v271 VirtIO drivers]

I ran that same CrystalDiskMark benchmark that last year was almost guaranteed to force a vioscsi crash, and this time the test completed successfully without ever locking up the VM or causing a vioscsi crash.

[Screenshot: CrystalDiskMark results, run completed successfully]

Here are the PVE version details:

ceph: 19.2.1-pve3
pve-qemu-kvm: 9.2.0-5
qemu-server: 8.3.12
proxmox-ve: 8.4.0 (running kernel: 6.8.12-10-pve)
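(For reference, these lines can be pulled on the host with something like the following; the grep filter is just a convenience to trim the full output.)

Code:
pveversion -v | grep -E 'proxmox-ve|pve-qemu-kvm|qemu-server|^ceph:'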

And the configuration of the Virtual Machine:

Code:
cat /etc/pve/qemu-server/106.conf
agent: 1
bios: ovmf
boot: order=scsi0;net0;scsi2
cores: 4
cpu: host
efidisk0: cluster-storage:vm-106-disk-0,efitype=4m,pre-enrolled-keys=1,size=528K
machine: pc-q35-8.1
memory: 16384
meta: creation-qemu=8.1.5,ctime=1710814878
name: <name>
net0: virtio=BC:24:11:8A:D4:F1,bridge=vmbr1,firewall=1
numa: 1
onboot: 1
ostype: win10
scsi0: cluster-storage:vm-106-disk-1,discard=on,iothread=1,size=70G
scsi1: cluster-storage:vm-106-disk-2,discard=on,iothread=1,size=60G
scsihw: virtio-scsi-single

Happy to do more testing if required, but I essentially don't see that vioscsi error on either this machine or my test MSSQL servers.
 
I'm having a real hard time with this issue across the board on my PVE clusters. All of them have Ceph as the main filestore for all the VMs, and all are on a 40G dedicated storage network. When I run nightly backups to the CephFS filesystem, my Windows machines completely freeze for up to 5-10 minutes at a time, throwing that vioscsi device reset error mentioned earlier in the thread.

I'm running the latest and greatest VirtIO drivers (v271). The Windows machines seem to be the most affected; the Linux machines not so much, although they throw a bunch of hung_task error messages on the console.

Any suggestions would be greatly appreciated!
 
Are you using Proxmox Backup Server?
Yes, we are. I had set up nightly vzdump backups for critical systems so I can take the resulting .lzo file (or whatnot) and back it up to the cloud, but these systems are being backed up to PBS as well, so I may have to disable the local backups for now and work on a good method to get our critical system backups from PBS into the cloud.
 
PBS has its own issues depending on the configuration, of course; for example, PBS over a WAN or slow link can cause the source VM to time out. PBS fleecing can help nowadays.
But to dig into this, I would start by temporarily disabling PBS and keeping vzdump; a minimal one-off run is sketched below.
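For illustration, a one-off vzdump to a non-PBS storage could look like this (the VMID and storage name here are placeholders):

Code:
# Snapshot-mode backup to a regular (non-PBS) storage, compressed with zstd
vzdump 106 --storage cephfs-backup --mode snapshot --compress zstd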
 
You really should use the fleecing option in your backup job's advanced options (you can use the same Ceph RBD storage as your main VMs).

(BTW, I hope your CephFS backup storage is not in the same Ceph cluster as your production RBD storage, right? ;)
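For the CLI equivalent, a minimal sketch of enabling fleecing, assuming PVE 8.2 or later (where vzdump gained the option); the VMID and storage names are placeholders:

Code:
# Backup to PBS with the fleecing image placed on fast RBD/local storage
vzdump 106 --storage pbs-datastore --fleecing enabled=1,storage=ceph-rbd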
 
Hello, I had the same issue with VMs locking up during backups. I managed to solve it by scheduling node-based backup jobs rather than one single backup job for all 3 nodes.

[Screenshot: per-node backup job schedule]
This way it only backs up one VM at a time and doesn't cause massive I/O spikes at backup time.
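For reference, here is a sketch of what staggered per-node jobs can look like in /etc/pve/jobs.cfg (the job IDs, times and storage name are made up, a third node would follow the same pattern, and the GUI normally writes this file for you):

Code:
vzdump: backup-node1
	schedule 21:00
	node pve1
	storage pbs-datastore
	mode snapshot
	all 1
	enabled 1

vzdump: backup-node2
	schedule 23:00
	node pve2
	storage pbs-datastore
	mode snapshot
	all 1
	enabled 1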