Hi @fweber,
I have similar findings to @benyamin's, but not exactly the same.
Tests were done on my VM102 (https://forum.proxmox.com/threads/r...device-system-unresponsive.139160/post-691337, but the corresponding PVE node is no longer empty).
Both SCSI drives use aio=io_uring.
1. VirtIO 0.1.208 (vgs=6): RDP is stable, IO stable/max, no issues
2. VirtIO 0.1.208 + CPU hotplug (vgs=3): huge RDP hangs, IO works, but lower Q32T16 Read performance
3. VirtIO 0.1.240 + CPU hotplug (vgs=3): medium RDP hangs, IO works, but lower Q32T16 Read performance
4. VirtIO 0.1.240 (vgs=6): buggy as reported
As always, the RND 4K Q32T16 test made it super easy to trigger the bug.
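For anyone who wants to reproduce this without CrystalDiskMark, a roughly equivalent fio job run inside the Windows guest might look like the sketch below (my assumption: the RND 4K Q32T16 preset maps to random 4K reads at queue depth 32 with 16 workers; the test file path and size are placeholders, not my exact values):
[CODE]
:: hypothetical fio equivalent of RND 4K Q32T16 (random 4K reads, QD32, 16 jobs)
:: run inside the Windows guest; D:\fio-testfile and 4G are placeholders
fio --name=rnd4k-q32t16 --ioengine=windowsaio --direct=1 ^
    --filename=D\:\fio-testfile --size=4G ^
    --rw=randread --bs=4k --iodepth=32 --numjobs=16 ^
    --runtime=60 --time_based --group_reporting
[/CODE]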
So there are no IO hangs or SCSI alerts on 0.1.240 (with CPU hotplug / vgs=3), but while the VM keeps working, it appears unresponsive for up to 10 s (on both 0.1.208 and 0.1.240) even though the data flow continues. From the user's perspective it's buggy, but in reality it works (with lower IO performance) and the virtio problem is mitigated. "RDP hang" is not the exact term - the session merely seemed to hang, and this could be caused by multiple factors in this setup (some general resource congestion is evident here).
Long story short: this is a good case for debugging, but not a setup suitable for production.
Update: to be clear, the controller was VirtIO SCSI Single with iothread=1, and the tested drive was scsi1.
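For completeness, the relevant part of /etc/pve/qemu-server/102.conf looks roughly like this (a sketch; the storage name and disk sizes are placeholders, not my exact values):
[CODE]
scsihw: virtio-scsi-single
scsi0: local-lvm:vm-102-disk-0,iothread=1,aio=io_uring,size=64G
scsi1: local-lvm:vm-102-disk-1,iothread=1,aio=io_uring,size=64G
[/CODE]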
Regarding your latest GitHub post (https://github.com/virtio-win/kvm-guest-drivers-windows/issues/756#issuecomment-2293748551),
please let me note that I can trigger the bug with a 4K request size as well (with aio=io_uring, the most sensitive setup together with SCSI Single and iothread=1, it's almost 100% reproducible).
While I fully accept the idea that newer kernels have an influence, there still has to be some clear hypothesis about the link to the 0.1.215 changes.
And while I highly appreciate all of @benyamin's extreme efforts, which may lead to higher bandwidth, better stability and even the final solution, I'll never stop thinking about the exact root cause in 0.1.215 - simply to prevent the same bug from being introduced again in future versions.
So in my view, there are two more or less independent "research rails":
1. Make VirtIO better regardless of the driver changes, i.e. resolve the bug and even add some bonus (more bandwidth, etc.)
2. Analyze the exact root cause, i.e. find the specific change(s) in 0.1.215 (compared to 0.1.208), as the "1/0 switch" is still present there
Maybe I'm the only one here, but in that GitHub thread it's not always clear what the scope (or the full idea behind) each post is, and sometimes it gets messy/chaotic again. And I'm still not sure the VirtIO guys are fully synced with our findings and the bigger picture.
Anyway, if you both have some binaries to check out, you can send them to me (via PM or so) and I'll try to check them too.
The same goes for @bbgeek17 as well. Thanks to you all.