Redhat VirtIO developers would like to coordinate with Proxmox devs re: "[vioscsi] Reset to device ... system unresponsive"

Hi @fweber,

I have similar findings to @benyamin's, but not exactly the same.

Tests were done on my VM102 (https://forum.proxmox.com/threads/r...device-system-unresponsive.139160/post-691337, but note the corresponding PVE node is no longer empty).

Both SCSI drives are on aio=io_uring.

1. VirtIO 0.1.208 (vgs=6): RDP stable, IO stable/max, no issues
2. VirtIO 0.1.208 + CPU hotplug (vgs=3): huge RDP hangs, IO works, but lower Q32T16 Read performance
3. VirtIO 0.1.240 + CPU hotplug (vgs=3): medium RDP hangs, IO works, but lower Q32T16 Read performance
4. VirtIO 0.1.240 (vgs=6): buggy as reported

As always, the RND 4K Q32T16 test made it super easy to trigger the bug.
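For anyone who wants to reproduce this without CrystalDiskMark, a rough diskspd equivalent of the RND 4K Q32T16 read test could look like this (Q32T16 maps to -o32 -t16; the test file path, size and duration are just placeholders on my side):

    diskspd.exe -b4K -r -o32 -t16 -d60 -Sh -L -c8G D:\iotest.dat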

So there are no IO hangs / SCSI alerts (with CPU hotplug / vgs=3) on 0.1.240, but while the VM keeps working, it appears unresponsive for up to 10 s (on both 0.1.208 and 0.1.240), even though data flow continues. From the user's perspective it's buggy, but in reality it works (with lower IO performance) and the virtio problem is mitigated. "RDP hang" is not the exact term; the session just seemed to hang, although this could be caused by multiple factors in this setup (some general resource congestion is evident here).

Long story short: this is a good case for debugging, but not a setup suitable for production.

Update: to be clear, iothread=1, VirtIO SCSI Single, and the tested drive was scsi1.
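For reference, a minimal sketch of what this disk setup looks like in the VM config (/etc/pve/qemu-server/<vmid>.conf); the storage name, volume name and size below are only examples, not my exact values:

    scsihw: virtio-scsi-single
    scsi1: local-lvm:vm-102-disk-1,aio=io_uring,iothread=1,size=64G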
And Friedrich (@fweber),

regarding your latest GitHub post (https://github.com/virtio-win/kvm-guest-drivers-windows/issues/756#issuecomment-2293748551),
please note that I'm able to trigger the bug with a 4K request size as well (with aio=io_uring, which is the most sensitive setup together with SCSI Single and iothread=1, it's almost 100% reproducible).

While I fully accept the idea that newer kernels have an influence, there still has to be some clear hypothesis about what links it to the 0.1.215 changes.

And while I highly appreciate all of @benyamin's extensive efforts, which may lead to higher bandwidth, better stability and even the final solution, I'll never stop thinking about the exact root cause in 0.1.215, simply to prevent the same bug from being introduced again in future versions.

So in my view, there are two more or less independent "research tracks":

1. Make VirtIO better regardless of specific driver changes, i.e. resolve the bug and even add some bonus (more bandwidth, etc.)
2. Analyze the exact root cause, i.e. find the specific change(s) in 0.1.215 (compared to 0.1.208), as the "1/0 switch" behavior is still present there

Maybe I'm the only one here, but in that GitHub thread it's not always clear what the scope (or the full idea behind) each post is, and sometimes it gets messy/chaotic again. And I'm still not sure whether the VirtIO guys are fully synced with our findings and the bigger picture.

Anyway, if you both have some binaries to check out, you can send them to me (via PM or so) and I'll try to test them too.

And the same goes for @bbgeek17 as well. Thanks to you all.
 
It's not really ready for showtime ... yet.

There's clearly a problem with multi-queue. I'm getting wildly different results with iothread=1 and VirtIO SCSI Single, and not in a good way. Performance is much better with just one adapter and just one disk: for 1 MiB block size sequential reads with 4x vCPU, diskspd shows that when the multi-queue implementation is compared to a single HBA, the latter performs 5.82x better with -o8 -t4 -si and 11.55x better with -o8 -t4 -T4k -s4096k (at ~28.7 GB/s vs. ~4.7 GB/s and ~2.5 GB/s respectively).
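Spelled out, those two runs correspond to something like the following diskspd invocations; only the queue/thread/stride flags are the ones quoted above, while the duration, caching flags and target file are placeholders, not necessarily the exact command lines used:

    diskspd.exe -b1M -o8 -t4 -si -d60 -Sh D:\seqtest.dat
    diskspd.exe -b1M -o8 -t4 -T4k -s4096k -d60 -Sh D:\seqtest.dat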

Whilst one HBA might be the use case for most guests, the underlying issue affects the ability to reliably perform I/O to such an extent that performance stats require interpretation. This is true with the v208 driver too, and fixing the issue there on master results in even more unstable performance. The additional optimisation work helps stabilise the driver, but it is really just masking the underlying problem.

This underlying issue has been around for some time, well before even v208. Let's see where I can get with solving it over the next week...
 
As I already said in another thread, VirtIO emulation is not stable/reliable for Windows-based VMs.

You can easily reproduce the "disk device reset" bug at any time:
> Install MS SQL Server with a larger database (example: WSUS with an MSSQL backend),
> Add a VirtIO disk to the VM (the "backup disk"),
> Create an MSSQL backup job targeting the "backup disk",
> Run the MSSQL backup job, and BOOM -> you hit the "feature/bug: disk device reset".
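To make the backup step concrete, a minimal sketch of such a job (the database name, drive letter and path are placeholders; the point is simply a large sequential write to the VirtIO-backed backup disk):

    sqlcmd -S localhost -E -Q "BACKUP DATABASE [SUSDB] TO DISK = N'E:\SQLBackup\SUSDB.bak' WITH INIT, COMPRESSION"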
 
Hi, all of this is known to us, i.e. we have no problems simulating the bug.

A stable workaround is described e.g. here: https://forum.proxmox.com/threads/r...device-system-unresponsive.139160/post-691337

@benyamin, @fweber and others are already working on the official fix.

So please stay tuned.
 
@RoCE-geek, unfortunately I am experiencing this issue even with the workaround applied. I have tried all of the configurations, but even with the 0.1.208 drivers the reset errors still persist. I think I'm going to have to try abandoning VirtIO altogether.

Unfortunately, contrary to your experience, when the reset issue occurs it affects every other VM using the physical disk and they all start experiencing cascading errors. It brings MS SQL to a halt. I've got a couple of clients very unhappy with me at the moment and I'm scrambling to find a working solution, but have not been able to find anything. :(
 
This sounds like the scenario where using aio=threads and iothread=1 might help. You must apply this to all guests using VirtIO SCSI Single.
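In practice that means changing the aio option on every SCSI disk of the affected guests and doing a full stop/start of each VM afterwards. One way to do it (the VMID, storage and volume names below are placeholders, and the whole volume spec has to be repeated with qm set):

    qm set 102 --scsi0 local-lvm:vm-102-disk-0,aio=threads,iothread=1

or edit the disk line in /etc/pve/qemu-server/102.conf directly.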
 
I've done some more work on this.
More to follow in coming days...
Thanks, I'm going to go ahead and try that out overnight tonight and see if it helps at all. Appreciate the advice, definitely in a bad spot here and I am not at the same level as you guys when it comes to storage and controller expertise.
 
I understand your worries. But please note that this VirtIO SCSI error message is not always misleading.
Here we're focused on driver changes in > 0.1.208, but that is based on the assumption that your storage is generally (super) stable.
So there are still cases where hardware / storage / networking problems are the real initiators of this problem.
Especially if you have other VMs frozen. Do you run a cluster, and is this issue omnipresent on all nodes?
Isn't there a collision with e.g. a backup or some other maintenance task?
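For ruling out the host side, it can be worth watching the PVE node itself while a guest throws the resets; something along these lines (the device name is a placeholder, and the ZFS line only applies if the guest disks live on ZFS):

    journalctl -k --since "1 hour ago" | grep -iE "i/o error|reset|timeout|abort"
    smartctl -a /dev/sda
    zpool status -v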
 
I appreciate it, just trying to understand what is going on.

I am fairly sure this issue is related to the driver, or at least to something on the VirtIO SCSI bus. The storage is pretty stable; it's not top-of-the-line enterprise gear, but the builds are HP ProLiant using business-class SSDs. I am able to push the storage pretty hard in artificial tests, but the error always occurs during medium-size workloads. These nodes are all standalone and not clustered (small-business clients). I only have the issue on nodes running Proxmox; the issue is not present on ESXi nodes with similar hardware and workloads. It does not always affect other VMs, only on some occasions.

Regarding task collision, the issue generally occurs during medium-sized SQL loads or with scheduled backups to a separate disk array (also using VirtIO SCSI). The tasks do not collide at the same time. The SQL databases are for the most part only a couple gigabytes, and they are in a RAID1 configuration so reads should be pretty zippy. I have not been able to identify any issues with hardware or filesystem, unfortunately, which would be a much easier issue to fix.

Notably, I'm still having the issue with aio=threads and iothread=1, albeit less often.
 
Have you tried the SATA type? Slower, but it seems a good fix too.
So detaching the SCSI disk and re-adding it as SATA, but still having virtio-scsi as the controller?
 
Yes, the controller is always present, but it is not used when the disk is SATA or IDE.

Edit: Windows may require booting once into Safe Mode to re-enable the native SATA or IDE controller.
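For completeness, one way to do the detach / re-add from the CLI; the VMID, storage and volume names are placeholders, and the same can be done in the GUI by detaching the disk and then re-adding the resulting unused disk as SATA:

    qm set 102 --delete scsi1
    qm set 102 --sata0 local-lvm:vm-102-disk-1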
 