Redhat VirtIO developers would like to coordinate with Proxmox devs re: "[vioscsi] Reset to device ... system unresponsive"

Hi @fweber,

I have similar findings to @benyamin's, but not exactly the same.

Tests were done on my VM102 (https://forum.proxmox.com/threads/r...device-system-unresponsive.139160/post-691337, but the corresponding PVE node is no longer empty).

Both SCSI drives are on aio=io_uring.

1. VirtIO 0.1.208 (vgs=6): RDP is stable, IO stable/max, no issues
2. VirtIO 0.1.208 + CPU hotplug (vgs=3): huge RDP hangs, IO works, but lower Q32T16 Read performance
3. VirtIO 0.1.240 + CPU hotplug (vgs=3): medium RDP hangs, IO works, but lower Q32T16 Read performance
4. VirtIO 0.1.240 (vgs=6): buggy as reported

As always, the RND 4K Q32T16 test made it super easy to invoke the bug.
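For anyone who wants to approximate that profile with diskspd instead of CrystalDiskMark, something along these lines should do (the test file, its size and the duration below are just placeholder values I picked):

# random 4K reads, 32 outstanding I/Os per thread, 16 threads, OS/HW caching disabled
diskspd -b4K -r -o32 -t16 -d60 -Sh -c8G D:\iotest.dat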

So with CPU hotplug / vgs=3 there are no IO hangs / SCSI alerts on 0.1.240, but while the VM keeps working, it appears unresponsive for up to 10 s (on both 0.1.208 and 0.1.240) even though the data flow continues. From the user's perspective it's buggy, but in reality it works (with lower IO performance) and the virtio problem is mitigated. "RDP hang" is not the exact term - the session only seemed to hang, and that could be caused by multiple factors in this setup (some general resource congestion is evident here).

Long story short: this is a good case for debugging, but not a suitable setup for production.

Update: to be clear, iothread=1 and SCSI Single were used, and the tested drive was scsi1.
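For completeness, the relevant part of the VM config looks roughly like this (the storage name and disk size are placeholders, not my exact values):

scsihw: virtio-scsi-single
scsi1: local-lvm:vm-102-disk-1,aio=io_uring,iothread=1,size=100G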
And Friedrich (@fweber),

regarding your latest GitHub post (https://github.com/virtio-win/kvm-guest-drivers-windows/issues/756#issuecomment-2293748551),
please note that I'm able to invoke the bug with a 4K request size as well (with aio=io_uring - the most sensitive setup, together with SCSI Single and iothread=1 - it's almost 100% reproducible).

While I fully accept the idea that newer kernels have an influence, there still has to be a clear explanation of the link to the 0.1.215 changes.

And while I highly appreciate all of @benyamin's extensive efforts, which may lead to higher bandwidth, better stability and even the final solution, I won't stop thinking about the exact root cause in 0.1.215 - simply to prevent the same bug from being introduced again in future versions.

So in my view, there are two more or less independent "research rails":

1. Make VirtIO better regardless of the driver changes, i.e. resolve the bug and even add some bonus (more bandwidth, etc.)
2. Analyze the exact root cause, i.e. find the specific change(s) in 0.1.215 (compared to 0.1.208), as the "1/0 switch" is still present there

Maybe I'm the only one here, but in that GitHub thread it's not always clear what the scope (or the full idea behind) of each post is, and sometimes it gets messy/chaotic again. And I'm still not sure whether the VirtIO guys are fully synced with our findings and the bigger picture.

Anyway, if either of you has some binaries to check out, you can send them to me (via PM or so) and I'll try to check them too.

The same goes for @bbgeek17 as well. Thanks to you all.
 
It's not really ready for showtime ... yet.

There's clearly a problem with multi-queue. I'm getting wildly different results with iothread=1 and VirtIO SCSI Single - and not in a good way. Performance is much better with just one adapter and just one disk: for 1MiB block size sequential reads with 4x vCPU, diskspd shows the single HBA outperforming the multi-queue implementation by 5.82x when using -o8 -t4 -si and by 11.55x when using -o8 -t4 -T4k -s4096k (~28.7GB/s vs. ~4.7GB/s and ~2.5GB/s respectively).
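For reference, the full command lines were essentially the following (the -b1M block size matches the 1MiB reads above; the duration, caching flag and target file are just illustrative placeholders):

diskspd -b1M -o8 -t4 -si -d30 -Sh D:\testfile.dat
diskspd -b1M -o8 -t4 -T4k -s4096k -d30 -Sh D:\testfile.dat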

Whilst one HBA might be the use case for most guests, the underlying issue affects the ability to reliably perform I/O to such an extent that performance stats require interpretation. This is true with the v208 driver too, and fixing the issue there on master results in even more unstable performance. The additional optimisation work helps stabilise the driver, but it is really just masking the underlying problem.

This underlying issue has been around for some time, well before even v208. Let's see where I can get with solving it over the next week...
 
As I already said in another thread, virtio emulation is not stable/reliable for Windows-based VMs.

You can easily reproduce the "disk device reset" bug at any time:
> Install MS SQL Server with a larger database (example: WSUS with an MSSQL backend),
> Add a virtio disk to the VM (the "backup disk"),
> Create an MSSQL backup job targeting the "backup disk",
> Run the MSSQL backup job - BOOM, you hit the "feature/bug: disk device reset".
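A manual equivalent of such a backup job should trigger the same thing when run from inside the guest, e.g. (the instance, database name and backup path are just examples; WSUS typically uses the SUSDB database):

sqlcmd -S localhost -Q "BACKUP DATABASE [SUSDB] TO DISK = N'E:\Backup\SUSDB.bak' WITH STATS = 10"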
 
As I already said in another thread, virtio emulation is not stable/reliable for Windows-based VMs.

You can easily reproduce the "disk device reset" bug at any time:
> Install MS SQL Server with a larger database (example: WSUS with an MSSQL backend),
> Add a virtio disk to the VM (the "backup disk"),
> Create an MSSQL backup job targeting the "backup disk",
> Run the MSSQL backup job - BOOM, you hit the "feature/bug: disk device reset".
Hi, all of this is known to us, i.e. we have no problem reproducing the bug.

A stable workaround is described here, for example: https://forum.proxmox.com/threads/r...device-system-unresponsive.139160/post-691337

@benyamin, @fweber and others are already working on the official fix.

So please stay tuned.
 
@RoCE-geek unfortunately I am experiencing this issue even with the workaround applied. I have tried all of the configurations, but even with the 0.1.208 drivers the reset errors still persist. I think I'm going to have to try abandoning VirtIO altogether.

Unfortunately, contrary to your experience, when the reset issue occurs it affects every other VM using the physical disk, and they all start experiencing cascading errors. It brings MS SQL to a halt. I've got a couple of clients very unhappy with me at the moment, and I'm scrambling to find a working solution but have not been able to find anything. :(
 
@RoCE-geek unfortunately I am experiencing this issue even with the workaround applied. I have tried all of the configurations, but even with the 0.1.208 drivers the reset errors still persist. I think I'm going to have to try abandoning VirtIO altogether.

Unfortunately, contrary to your experience, when the reset issue occurs it affects every other VM using the physical disk, and they all start experiencing cascading errors. It brings MS SQL to a halt. I've got a couple of clients very unhappy with me at the moment, and I'm scrambling to find a working solution but have not been able to find anything. :(
This sounds like the scenario where using aio=threads and iothread=1 might help. You must apply this to all guests using VirtIO SCSI Single.
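If it helps, the change can be made per disk from the CLI, roughly like this (the VM ID, storage and volume name are placeholders; a full VM stop/start is usually needed afterwards for the new aio setting to take effect):

qm set 101 --scsi0 local-lvm:vm-101-disk-0,aio=threads,iothread=1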
 
I've done some more work on this.
More to follow in coming days...
Thanks, I'm going to go ahead and try that out overnight tonight and see if it helps at all. Appreciate the advice, definitely in a bad spot here and I am not at the same level as you guys when it comes to storage and controller expertise.
 
@RoCE-geek unfortunately I am experiencing this issue even with the workaround applied. I have tried all of the configurations, but even with the 0.1.208 drivers the reset errors still persist. I think I'm going to have to try abandoning VirtIO altogether.

Unfortunately, contrary to your experience, when the reset issue occurs it affects every other VM using the physical disk, and they all start experiencing cascading errors. It brings MS SQL to a halt. I've got a couple of clients very unhappy with me at the moment, and I'm scrambling to find a working solution but have not been able to find anything. :(
I understand your worries. But please note that this VirtIO SCSI error message is not always misleading.
Here we're focused on the driver changes after 0.1.208, but that's based on the assumption that your storage is generally (super) stable.
So there are still cases where hardware / storage / networking problems are the real initiators of this problem.
Especially if other VMs freeze as well. Do you have a cluster, and is this issue omnipresent on all nodes?
Isn't there a collision with e.g. a backup or some other maintenance task?
 
I understand your worries. But please note that this VirtIO SCSI error message is not always misleading.
Here we're focused on the driver changes after 0.1.208, but that's based on the assumption that your storage is generally (super) stable.
So there are still cases where hardware / storage / networking problems are the real initiators of this problem.
Especially if other VMs freeze as well. Do you have a cluster, and is this issue omnipresent on all nodes?
Isn't there a collision with e.g. a backup or some other maintenance task?
I appreciate it, just trying to understand what is going on.

I am fairly sure this issue is related to the driver, or minimally to something in the VirtIO SCSI bus. The storage is pretty stable; it's not top-of-the-line enterprise gear, but the builds are HP ProLiant with business-class SSDs. I am able to push the storage pretty hard in artificial tests, but the error always occurs during medium-size workloads. These nodes are all standalone and not clustered (small business clients). I only have the issue on nodes running Proxmox; it is not present on ESXi nodes with similar hardware and workloads. It does not always affect other VMs, only on some occasions.

Regarding task collision, the issue generally occurs during medium-sized SQL loads or with scheduled backups to a separate disk array (also using VirtIO SCSI). The tasks do not run at the same time. The SQL databases are for the most part only a couple of gigabytes, and they are in a RAID1 configuration, so reads should be pretty zippy. Unfortunately, I have not been able to identify any issues with the hardware or filesystem, which would be a much easier problem to fix.

Notably, I'm still having the issue with aio=threads and iothread=1, albeit less often.
 
Have you tried the SATA type? Slower, but it seems to be a good fix too.
So detaching the SCSI disk and re-adding it as SATA, but still having virtio-scsi as the controller?
 
So detaching the SCSI disk and re-adding it as SATA, but still having virtio-scsi as the controller?
Yes, the controller is always present, but it is not used when the disk is SATA or IDE.

edit: Windows may require booting once into Safe Mode to re-enable the native SATA or IDE controller.
 
So detaching the SCSI disk and re-adding it as SATA, but still having virtio-scsi as the controller?
I've found that detaching the disk, re-adding it as SATA and then switching the SCSI Controller to Default (LSI 53C895A) eliminates the issue completely.
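For anyone wanting to do the same from the CLI, it's roughly the following (VM ID, storage and volume name are placeholders; --delete only detaches the volume to "unused", it does not destroy it, and a full VM stop/start is needed afterwards):

qm set 100 --delete scsi1                        # detach the disk (it becomes unused0)
qm set 100 --sata0 local-lvm:vm-100-disk-1       # re-attach the same volume as SATA
qm set 100 --scsihw lsi                          # switch the controller to the default LSI 53C895A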
 
