Red Hat VirtIO developers would like to coordinate with Proxmox devs re: "[vioscsi] Reset to device ... system unresponsive"

I just updated my test server's MSSQL to v271 from v208 about 3 hours ago and I've gotten zero events. I'll keep an eye on it and report if anything pops up.
I updated over the weekend, and tonight there was one error that stopped the nightly task. This task had been triggering the error pretty regularly, so after three days without a hitch I was hoping it was resolved.
 
Hi,

I updated MSSQL to driver v271 and the event 129 error is back; it is less frequent, but it is back. I am going back to v208 tonight.
Well, I also had to roll back from v266 / v271 to v208 to be stable again. :(

I have a workload to run SCSI stress tests with different KVM tunings:
It uses O&O Defrag to run the "COMPLETE/Name" hard drive optimization 4 times consecutively (htop = 100% disk I/O).
Code:
----------------------------------------------------------------------------------------------------------------
OK = STABLE
CRASH = kvm: ../block/block-backend.c:1780: blk_drain: Assertion `qemu_in_main_thread()' failed.
----------------------------------------------------------------------------------------------------------------
pve-manager/7.4-19/f98bf8d4 (running kernel: 5.15.158-2-pve)
QEMU emulator version 7.2.10
scsihw: virtio-scsi-single
----------------------------------------------------------------------------------------------------------------
v271 + cache=unsafe,discard=on,iothread=1 : CRASH  (FIO_R = 3794MBs_58,0k_0,31ms / FIO_W = 3807MBs_58,0k_0,31ms)
v271 + cache=unsafe,discard=on,iothread=0 : OK     (FIO_R = 3748MBs_70,2k_0,26ms / FIO_W = 3762MBs_70,1k_0,27ms)
v266 + cache=unsafe,discard=on,iothread=1 : CRASH  (FIO_R = 3817MBs_56,2k_0,32ms / FIO_W = 3830MBs_56,2k_0,32ms)
v266 + cache=unsafe,discard=on,iothread=0 : OK     (FIO_R = 3804MBs_71,9k_0,26ms / FIO_W = 3818MBs_71,8k_0,26ms)
v208 + cache=unsafe,discard=on,iothread=1 : OK     (FIO_R = 3922MBs_55,6k_0,32ms / FIO_W = 3937MBs_55,6k_0,32ms)
v208 + cache=unsafe,discard=on,iothread=0 : OK     (FIO_R = 3823MBs_68,6k_0,27ms / FIO_W = 3835MBs_68,5k_0,27ms)        **BEST**
v208 + cache=unsafe,discard=ignore,iothread=1 : OK (FIO_R = 3856MBs_55,7k_0,32ms / FIO_W = 3867MBs_55,6k_0,32ms)
v208 + cache=unsafe,discard=ignore,iothread=0 : OK (FIO_R = 3806MBs_68,0k_0,27ms / FIO_W = 3819MBs_68,0k_0,27ms)
v208 + discard=on,iothread=1 : OK                  (FIO_R =  234MBs_30,9k_0,95ms / FIO_W =  245MBs_30,8k_1,10ms)
v208 + discard=on,iothread=0 : OK                  (FIO_R =  239MBs_29,9k_0,85ms / FIO_W =  252MBs_29,9k_1,14ms)
----------------------------------------------------------------------------------------------------------------
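For anyone wanting to reproduce a similar matrix, here is a rough sketch of how each row can be applied and measured. The VMID, storage/volume names and fio parameters are placeholders, not the exact workload used above (inside a Windows guest, fio would use the windowsaio ioengine instead of libaio).

Code:
# Apply one combination from the matrix to the test disk
qm set 100 --scsi0 local-lvm:vm-100-disk-0,cache=unsafe,discard=on,iothread=1

# Rough sequential-write measurement against a test file
fio --name=seqwrite --filename=/mnt/test/fio.dat --size=4G --rw=write --bs=64k \
    --iodepth=16 --ioengine=libaio --direct=1 --runtime=60 --time_based --group_reporting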
 
Do you use local storage?
What about PVE version 8.4, shipped with QEMU 9.0 and kernel 6.8?
Those tests were done on a very simple machine: Ryzen 7 5700X + Crucial MX500 SATA SSD + local thin LVM.
When I have time, I will also test PVE 8.4 and QEMU 9.2.
 
No driver can improve this consumer SSD; outside of its internal DRAM cache, it writes slowly.
Moreover, cache=unsafe increases the slowness because of the double, if not triple, caching involved.
I totally agree, this kind of hardware is not for production.
What is interesting here is that the v208 Windows driver doesn't kill the VM, contrary to v266 and v271 (blk_drain assertion in block-backend.c with iothread=1).
 
Hey All,

Here are some of my tests that I hope will help.
I have upgraded the drivers to v271 on one of my production servers (non-critical) to do testing with; here's the verification of said drivers:

[Screenshot: verification of the installed v271 VirtIO drivers]

I ran that same CrystalDiskMark benchmark that last year was almost guaranteed to force a vioscsi crash, and this time the test completed successfully without ever locking up the VM or causing a vioscsi crash.

[Screenshot: CrystalDiskMark results, run completed successfully]

Here are the PVE version details:

ceph: 19.2.1-pve3
pve-qemu-kvm: 9.2.0-5
qemu-server: 8.3.12
proxmox-ve: 8.4.0 (running kernel: 6.8.12-10-pve)
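(For reference, these lines can be pulled on the host with something like the following; the grep filter is just a convenience to trim the full output.)

Code:
pveversion -v | grep -E 'proxmox-ve|pve-qemu-kvm|qemu-server|^ceph:'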

And the configuration of the Virtual Machine:

Code:
cat /etc/pve/qemu-server/106.conf
agent: 1
bios: ovmf
boot: order=scsi0;net0;scsi2
cores: 4
cpu: host
efidisk0: cluster-storage:vm-106-disk-0,efitype=4m,pre-enrolled-keys=1,size=528K
machine: pc-q35-8.1
memory: 16384
meta: creation-qemu=8.1.5,ctime=1710814878
name: <name>
net0: virtio=BC:24:11:8A:D4:F1,bridge=vmbr1,firewall=1
numa: 1
onboot: 1
ostype: win10
scsi0: cluster-storage:vm-106-disk-1,discard=on,iothread=1,size=70G
scsi1: cluster-storage:vm-106-disk-2,discard=on,iothread=1,size=60G
scsihw: virtio-scsi-single

Happy to do more testing if required, but I essentially don't see that vioscsi error on either this machine or my test MSSQL servers.
 
I'm having a real hard time with this issue across the board on my PVE clusters. All of them have Ceph as the main filestore for all the VMs, and all are on a 40G dedicated storage network. When I run nightly backups to the CephFS filesystem, my Windows machines completely freeze for up to 5-10 minutes at a time, throwing that vioscsi device reset error mentioned earlier in the thread.

I'm running the latest and greatest VirtIO drivers (v271). The Windows machines seem to be the most affected; the Linux machines not so much, although they throw a bunch of hung_task error messages on the console.

Any suggestions would be greatly appreciated!
 
Are you using Proxmox Backup Server?
Yes, we are. I had set up nightly vzdump backups for critical systems so I can take the resulting .lzo file (or whatnot) and back it up to the cloud, but these systems are being backed up to PBS as well, so I may have to disable the local backups for now and work on a good method to get our critical system backups from PBS into the cloud.
 
PBS has its own issues depending on the configuration, of course; for example, PBS over a WAN or slow link can cause the source VM to time out. PBS fleecing can help nowadays.
But to dig into this, I would start by temporarily disabling PBS and keeping vzdump; a minimal one-off run is sketched below.
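For illustration, a one-off vzdump to a non-PBS storage could look like this (the VMID and storage name here are placeholders):

Code:
# Snapshot-mode backup to a regular (non-PBS) storage, compressed with zstd
vzdump 106 --storage cephfs-backup --mode snapshot --compress zstd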
 
You really should use the fleecing option in your backup job's advanced options (you can use the same Ceph RBD storage as your main VMs).

(BTW, I hope your CephFS backup storage is not in the same Ceph cluster as your production RBD storage, right? ;)
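For the CLI equivalent, a minimal sketch of enabling fleecing, assuming PVE 8.2 or later (where vzdump gained the option); the VMID and storage names are placeholders:

Code:
# Backup to PBS with the fleecing image placed on fast RBD/local storage
vzdump 106 --storage pbs-datastore --fleecing enabled=1,storage=ceph-rbd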
 
Hello, I had the same issue with VMs locking up during backups. I managed to solve it by scheduling node-based backup jobs rather than one single backup job for all 3 nodes.

[Screenshot: per-node backup job schedule]
This way it only backs up one VM at a time and doesn't cause massive I/O spikes at backup time.
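For reference, here is a sketch of what staggered per-node jobs can look like in /etc/pve/jobs.cfg (the job IDs, times and storage name are made up, a third node would follow the same pattern, and the GUI normally writes this file for you):

Code:
vzdump: backup-node1
	schedule 21:00
	node pve1
	storage pbs-datastore
	mode snapshot
	all 1
	enabled 1

vzdump: backup-node2
	schedule 23:00
	node pve2
	storage pbs-datastore
	mode snapshot
	all 1
	enabled 1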