@hybridencounter Thanks for your data. With data from both of our cases, we may be able to narrow the issue down enough for the developers to finally fix it.
Update — further investigation, more data for developers
After more testing today, I can refine the diagnosis considerably.
The bug requires NFS in the path. I tried to reproduce the issue with sustained write load inside the NAS VM itself (copying from the NAS VM's local ZFS pool to a secondary virtual disk backed by the host's NVMe, and 4× parallel dd) and wrote over 80 GB without a single issue. The VM handles local I/O fine under heavy load; the deadlock only triggers when NFS is involved. So this is not a pure I/O regression: it sits at the intersection of pve-qemu-kvm 11.0.0-4 + NFS traffic + kernel 6.17.13.
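For anyone who wants to repeat the local-only load test, the parallel dd part could look roughly like the sketch below. It is not necessarily the exact set of commands used above: `parallel_dd` is a hypothetical helper name, and the target directory, writer count and sizes are placeholders to adapt.

```shell
#!/bin/sh
# Hypothetical helper: start N dd writers in parallel and wait for all of them.
# Usage: parallel_dd TARGET_DIR [WRITERS] [MIB_PER_WRITER]
parallel_dd() {
    target_dir=$1
    writers=${2:-4}     # default: 4 writers, as in the test described above
    mib=${3:-1024}      # MiB written per writer (placeholder default)
    i=1
    while [ "$i" -le "$writers" ]; do
        # conv=fsync flushes each file to disk before dd exits.
        # Note: zeros compress to almost nothing on compressing filesystems.
        dd if=/dev/zero of="$target_dir/ddtest.$i" bs=1M count="$mib" conv=fsync &
        i=$((i + 1))
    done
    # Block until every background writer has finished.
    wait
}

# Example (path is a placeholder): parallel_dd /mnt/scratch 4 20480
```

Run it inside the VM against the secondary disk's mount point; dd prints its own throughput summary per writer when it finishes.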
Kernel 7.0.6 is not a universal workaround. It resolved my situation, but @hybridencounter reports that it did not help him; only the pve-qemu-kvm downgrade did. So the reliable fix is the downgrade, not the kernel upgrade.
Confirmed fix (both of us):
Code:
apt install pve-qemu-kvm=11.0.0-3
apt-mark hold pve-qemu-kvm
Then restart the affected VMs so they run under the downgraded QEMU (a full system reboot is not strictly required).
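To double-check that the downgrade and the hold actually took effect, something like the following may help. It is only a sketch: `classify_qemu_version` is a hypothetical helper, and its labels simply encode the two versions reported in this thread, not an official list.

```shell
#!/bin/sh
# Hypothetical helper: classify a pve-qemu-kvm version string based on
# the two versions reported in this thread.
classify_qemu_version() {
    case "$1" in
        11.0.0-3) echo "known-good" ;;
        11.0.0-4) echo "known-bad" ;;
        *)        echo "untested" ;;
    esac
}

# On the Proxmox host (commented out so the sketch runs anywhere):
#   classify_qemu_version "$(dpkg-query -W -f='${Version}' pve-qemu-kvm)"
#   apt-mark showhold | grep -x pve-qemu-kvm   # prints the name if the hold is active
```

Remember to release the hold with `apt-mark unhold pve-qemu-kvm` once a fixed package is available, otherwise later updates will be skipped.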
For developers: The regression is tightly bracketed and should be easy to bisect: 11.0.0-3 = stable, 11.0.0-4 = broken, confirmed by at least two independent users with different setups. The 11.0.0-4 changelog mentions backported fixes for aio=native, virtio-blk and aio. One of those commits appears to affect the NFS path specifically, possibly through how the guest's virtual network or disk stack interacts with sustained NFS write traffic. Local VM I/O is unaffected, which suggests the trigger involves the virtual network layer, not just disk I/O.
Call trace from the host dmesg at the moment of deadlock (kernel 6.17.13-13, pve-qemu-kvm 11.0.0-4):
Code:
nfs: server nas not responding, still trying
INFO: task CPU x/KVM blocked for more than 122 seconds.
nfs_wb_folio → folio_wait_writeback
→ __folio_split → migrate_pages_batch
→ kvm_mmu_faultin_pfn → npf_interception [kvm_amd]
INFO: task worker:xxxx is blocked on an rw-semaphore,
but the owner is not found. (×N workers)
The same trace reproduces consistently on kernels 6.17.13-1, -11, and -13, with zero occurrences on pve-qemu-kvm 11.0.0-3 under identical load.
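If others want to check whether their host shows the same signature, a quick scan of the kernel log along these lines might be enough. This is a sketch under assumptions: `scan_for_deadlock` is a hypothetical helper, and it only looks for the two message patterns quoted in the trace above, so it can miss or over-match on other setups.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and report whether
# it contains the two markers seen in the trace above.
scan_for_deadlock() {
    log=$(cat)
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    hung=$(printf '%s\n' "$log" | grep -c 'blocked for more than [0-9]* seconds' || true)
    nfs=$(printf '%s\n' "$log" | grep -c 'nfs: server .* not responding' || true)
    echo "hung_tasks=$hung nfs_timeouts=$nfs"
    # Exit status is 0 only when both markers are present.
    [ "$hung" -gt 0 ] && [ "$nfs" -gt 0 ]
}

# On the host (commented out so the sketch runs anywhere):
#   dmesg | scan_for_deadlock && echo "signature matches this thread"
```

Posting the counts together with the kernel and pve-qemu-kvm versions would make reports from different setups easier to compare.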
Hope this helps narrow it down.