Redhat VirtIO developers would like to coordinate with Proxmox devs re: "[vioscsi] Reset to device ... system unresponsive"

I just updated my test server's MSSQL to v271 from v208 about 3 hours ago and I've gotten zero events. I'll keep an eye on it and report if anything pops up.
I updated over the weekend and tonight there was one error that stopped the nightly task. This task had been triggering the error pretty regularly, so I was hoping it would be resolved after it worked for three days without a hitch.
 
Hi,

I updated MSSQL to driver 271 and the 129 error is back; it is less frequent, but it is back. I am going back to 208 tonight.
Well, I also had to roll back from v266 / v271 to v208 to be stable again. :(

I have a workload that runs a SCSI stress test with different KVM tuning:
It uses O&O Defrag to run the "COMPLETE/Name" hard drive optimization 4 times consecutively (htop shows 100% disk I/O).
Code:
----------------------------------------------------------------------------------------------------------------
OK = STABLE
CRASH = kvm: ../block/block-backend.c:1780: blk_drain: Assertion `qemu_in_main_thread()' failed.
----------------------------------------------------------------------------------------------------------------
pve-manager/7.4-19/f98bf8d4 (running kernel: 5.15.158-2-pve)
QEMU emulator version 7.2.10
scsihw: virtio-scsi-single
----------------------------------------------------------------------------------------------------------------
v271 + cache=unsafe,discard=on,iothread=1 : CRASH  (FIO_R = 3794MBs_58,0k_0,31ms / FIO_W = 3807MBs_58,0k_0,31ms)
v271 + cache=unsafe,discard=on,iothread=0 : OK     (FIO_R = 3748MBs_70,2k_0,26ms / FIO_W = 3762MBs_70,1k_0,27ms)
v266 + cache=unsafe,discard=on,iothread=1 : CRASH  (FIO_R = 3817MBs_56,2k_0,32ms / FIO_W = 3830MBs_56,2k_0,32ms)
v266 + cache=unsafe,discard=on,iothread=0 : OK     (FIO_R = 3804MBs_71,9k_0,26ms / FIO_W = 3818MBs_71,8k_0,26ms)
v208 + cache=unsafe,discard=on,iothread=1 : OK     (FIO_R = 3922MBs_55,6k_0,32ms / FIO_W = 3937MBs_55,6k_0,32ms)
v208 + cache=unsafe,discard=on,iothread=0 : OK     (FIO_R = 3823MBs_68,6k_0,27ms / FIO_W = 3835MBs_68,5k_0,27ms)        **BEST**
v208 + cache=unsafe,discard=ignore,iothread=1 : OK (FIO_R = 3856MBs_55,7k_0,32ms / FIO_W = 3867MBs_55,6k_0,32ms)
v208 + cache=unsafe,discard=ignore,iothread=0 : OK (FIO_R = 3806MBs_68,0k_0,27ms / FIO_W = 3819MBs_68,0k_0,27ms)
v208 + discard=on,iothread=1 : OK                  (FIO_R =  234MBs_30,9k_0,95ms / FIO_W =  245MBs_30,8k_1,10ms)
v208 + discard=on,iothread=0 : OK                  (FIO_R =  239MBs_29,9k_0,85ms / FIO_W =  252MBs_29,9k_1,14ms)
----------------------------------------------------------------------------------------------------------------
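(Side note: the exact fio command line wasn't posted, so the following is only a sketch of how comparable sequential MB/s / IOPS / latency figures could be gathered; the target path, block size, queue depth and runtime are assumptions, not the original test.)

Code:
# WARNING: writing directly to a block device is destructive - point this at a scratch disk or file
fio --name=seqread  --filename=/dev/sdb --rw=read  --bs=64k --iodepth=32 \
    --ioengine=libaio --direct=1 --runtime=60 --time_based --group_reporting
fio --name=seqwrite --filename=/dev/sdb --rw=write --bs=64k --iodepth=32 \
    --ioengine=libaio --direct=1 --runtime=60 --time_based --group_reporting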
 
Do you use local storage?
What about PVE 8.4, shipped with QEMU 9.0 and kernel 6.8?
Those tests were done on a very simple station: Ryzen 7 5700X + Crucial MX500 SATA SSD + local thin LVM.
When I have time, I will also test PVE 8.4 and QEMU 9.2.
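(For anyone reproducing this: the relevant component versions can be read straight off the Proxmox host with standard tools; the grep pattern below just filters the packages already quoted in this thread.)

Code:
pveversion -v | grep -E 'pve-manager|pve-qemu-kvm|qemu-server'
qemu-system-x86_64 --version
uname -r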
 
No driver can improve this consumer SSD: outside of its internal DRAM cache, it writes slowly.
Moreover, cache=unsafe increases the slowness because double, if not triple, caching is involved.
I totally agree, this kind of hardware is not for production.
What is interesting here is that the v208 Windows driver doesn't kill the VM, contrary to v266 and v271 (blk_drain assertion in block-backend.c with iothread=1).
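(For reference, a minimal sketch of switching an existing disk back to the iothread=0 configuration that stayed stable above, using the Proxmox CLI; the VMID, storage and volume names are only placeholders taken from the example config posted below and need to be adapted.)

Code:
# hypothetical VMID/volume - adjust before running; re-specifying the existing volume keeps its data
qm set 106 --scsi0 cluster-storage:vm-106-disk-1,discard=on,iothread=0
# disk option changes on a running VM are typically applied on the next full stop/start
qm stop 106 && qm start 106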
 
Hey All,

Here are some of my test results that I hope will help.
I have upgraded the drivers to v271 on one of my production servers (non-critical) to do some testing; here's the verification of said drivers:

[screenshot: vioscsi driver version verification]

I ran the same CrystalDiskMark benchmark that was almost guaranteed to force a vioscsi crash last year, and this time the test completed successfully without ever locking up the VM or causing a vioscsi crash.

[screenshot: CrystalDiskMark results]

Here are the PVE version details:

ceph: 19.2.1-pve3
pve-qemu-kvm: 9.2.0-5
qemu-server: 8.3.12
proxmox-ve: 8.4.0 (running kernel: 6.8.12-10-pve)

And the configuration of the Virtual Machine:

Code:
cat /etc/pve/qemu-server/106.conf
agent: 1
bios: ovmf
boot: order=scsi0;net0;scsi2
cores: 4
cpu: host
efidisk0: cluster-storage:vm-106-disk-0,efitype=4m,pre-enrolled-keys=1,size=528K
machine: pc-q35-8.1
memory: 16384
meta: creation-qemu=8.1.5,ctime=1710814878
name: <name>
net0: virtio=BC:24:11:8A:D4:F1,bridge=vmbr1,firewall=1
numa: 1
onboot: 1
ostype: win10
scsi0: cluster-storage:vm-106-disk-1,discard=on,iothread=1,size=70G
scsi1: cluster-storage:vm-106-disk-2,discard=on,iothread=1,size=60G
scsihw: virtio-scsi-single

Happy to do more testing if required, but I essentially don't have that vioscsi error on either this machine or my test MSSQL servers.
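(If a VM does die again with iothread enabled, the blk_drain assertion quoted earlier in this thread is what to look for on the host; a quick way to check the Proxmox node's journal, using nothing vioscsi-specific:)

Code:
# search the current boot's journal for the QEMU assertion from block/block-backend.c
journalctl -b | grep -iE 'blk_drain|qemu_in_main_thread|Assertion'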
 