apparently, setting virtio-scsi-single & iothread & aio=threads cured all our vm freeze & hiccup issues.
i added this information to:
https://bugzilla.kernel.org/show_bug.cgi?id=199727#c8
https://bugzilla.proxmox.com/show_bug.cgi?id=1453
apparently, in ordinary/default qemu io processing, there is a chance of running into larger locking conditions which block the entire vm execution and thus completely freeze the guest cpu for a while. this also explains why ping jitters that much.
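to illustrate the kind of stall i mean, here is a toy model in python. it is only a sketch and not qemu code: `global_lock` merely stands in for a single global lock, `time.sleep` stands in for slow blocking i/o, and the function name `max_vcpu_stall` is made up for this example. real vcpu threads only need the lock at certain points, so this is a deliberate simplification.

```python
import threading
import time


def max_vcpu_stall(io_holds_global_lock, io_seconds=0.3):
    """Toy model: longest time a 'vCPU' thread waited for the global lock."""
    global_lock = threading.Lock()  # stand-in for one big global lock
    stop = threading.Event()
    worst = 0.0

    def vcpu():
        nonlocal worst
        while not stop.is_set():
            t0 = time.monotonic()
            with global_lock:  # the vCPU needs the lock only briefly
                pass
            worst = max(worst, time.monotonic() - t0)
            time.sleep(0.001)

    def slow_io():
        time.sleep(0.05)  # let the vCPU loop get going first
        if io_holds_global_lock:
            with global_lock:  # default path: I/O handled under the lock
                time.sleep(io_seconds)
        else:
            time.sleep(io_seconds)  # iothread-like path: lock is not held

    t_vcpu = threading.Thread(target=vcpu)
    t_io = threading.Thread(target=slow_io)
    t_vcpu.start()
    t_io.start()
    t_io.join()
    stop.set()
    t_vcpu.join()
    return worst


if __name__ == "__main__":
    print("i/o under global lock:", round(max_vcpu_stall(True), 3), "s stall")
    print("i/o in separate thread:", round(max_vcpu_stall(False), 3), "s stall")
```

in the first case the "vcpu" is stuck for roughly as long as the slow i/o takes, in the second case it barely notices. that is the difference i would expect between i/o handled under one big lock and i/o handled in its own thread.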
with virtio-scsi-single & iothread & aio=native, ping jitter gets cured, too, but the jitter/freeze moves into the iothread instead and i'm still getting kernel traces/oopses regarding stuck processes/cpu.
adding aio=threads solves this entirely.
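for anyone who wants to try the same combination, here is a small python sketch that only builds the `qm set` command line and does not run it. the vmid `100` and the volume name `local-lvm:vm-100-disk-0` are made-up placeholders, and the helper name `build_qm_set_args` is hypothetical. `scsihw`, `iothread` and `aio` are the option names from the proxmox vm config.

```python
import shlex


def build_qm_set_args(vmid, volume, disk_slot="scsi0", aio="threads"):
    """Build (but do not run) a `qm set` call for the settings from this post."""
    if aio not in ("threads", "native", "io_uring"):
        raise ValueError(f"unexpected aio mode: {aio!r}")
    # iothread=1 gives the disk its own I/O thread; aio selects the async I/O mode
    drive = f"{volume},iothread=1,aio={aio}"
    return [
        "qm", "set", str(vmid),
        "--scsihw", "virtio-scsi-single",  # one controller per disk
        f"--{disk_slot}", drive,
    ]


if __name__ == "__main__":
    # placeholder vmid and volume, adjust to your own setup
    print(shlex.join(build_qm_set_args(100, "local-lvm:vm-100-disk-0")))
```

note that re-specifying a disk line this way replaces its options, so anything else already set on that disk has to be carried over (check `qm config <vmid>` first). as far as i know, the vm also needs a full stop/start before the change is active.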
the following information sheds some light on the whole picture. apparently the "qemu_global_mutex" can slam hard in your face, and this seems to be largely unknown:
https://docs.openeuler.org/en/docs/.../best-practices.html#i-o-thread-configuration
"The QEMU global lock (qemu_global_mutex) is used when VM I/O requests are processed by the QEMU main thread. If the I/O processing takes a long time, the QEMU main thread will occupy the global lock for a long time. As a result, the VM vCPU cannot be scheduled properly, affecting the overall VM performance and user experience."
i have never seen a problem again with virtio-scsi-single & iothread & aio=threads. ping is absolutely stable with that, and ioping in the VM during vm migration or virtual disk move is also within a reasonable range. it's slow under high io pressure, but there are no errors in kernel dmesg inside the guests.
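besides ping and ioping, a crude way to watch for freezes from inside the guest is a loop that sleeps for a short interval and reports whenever it wakes up much later than expected. this is just a sketch under the assumption that the guest's monotonic clock keeps advancing while the vm is stalled. the names `find_stalls` and `sample` and the thresholds are my own choices for illustration.

```python
import time


def find_stalls(timestamps, interval, threshold):
    """Return (start, extra_delay) for each gap overshooting interval by > threshold."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        extra = (cur - prev) - interval
        if extra > threshold:
            stalls.append((prev, extra))
    return stalls


def sample(duration, interval=0.1):
    """Collect monotonic timestamps, sleeping `interval` seconds between samples."""
    stamps = [time.monotonic()]
    end = stamps[0] + duration
    while stamps[-1] < end:
        time.sleep(interval)
        stamps.append(time.monotonic())
    return stamps


if __name__ == "__main__":
    # short demo run; for real use, run it during a migration or disk move
    stamps = sample(duration=2.0, interval=0.1)
    for start, extra in find_stalls(stamps, interval=0.1, threshold=0.5):
        print(f"stall of about {extra:.2f}s after t={start:.2f}")
```

on a healthy guest this prints nothing. a guest that hangs for some seconds should show up as one large gap.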
i'm really curious why this problem doesn't affect more people and why it is so hard to find information on it, to the point that even the proxmox folks don't give a hint in this direction (at least i didn't find one, and i searched really long and hard).
I'm still searching for deeper information/knowledge on what exactly happens in qemu/kvm and what is going on in detail such that freezes of several tens of seconds occur. even in the qemu project, detailed information along the lines of "virtio dataplane is curing vm hiccup/freezing and removing big qemu locking problem" is close to non-existent. the main context there is "it improves performance and user experience".
anyway, i consider this finding important enough to be added to the docs/faqs. for us, this finding is sort of essential for survival; our whole xen to proxmox migration was delayed for months because of those vm hiccup/freeze issues.
what do you think, @proxmox-team?