VM Freeze - vCPU stuck in kvm_vcpu_block - Only Docker VMs affected

Apr 14, 2026
Hi everyone,

I've been dealing with a frustrating issue for a while now and after extensive debugging I've gathered enough data to hopefully get some expert input.
I'm aware that there are already several threads about VM freezes on this forum, but none of them seem to match this specific case, particularly the combination of only Docker VMs being affected and a live capture of one vCPU stuck in kvm_vcpu_block while the others spin at nearly 100%.

I have two VMs that randomly and completely freeze: no warnings, no errors, nothing.
I've spent a lot of time systematically ruling out the usual suspects, and I've managed to capture the exact state of the QEMU process during an active freeze, which I hope gives enough information to point me in the right direction.

Environment:
  • Proxmox VE Version: 9 (Kernel 6.17.13-2-pve)
  • Host CPU: Intel Xeon E-2288G (16 Cores)
  • Host RAM: 128 GiB
  • Storage Backend: NFS (NFSv4.1)
  • Affected VMs: 2x Debian Linux, both running Docker



Problem Description:
Two VMs freeze randomly, and completely without any warning or error messages, on two identical Proxmox environments. The freeze is characterized by:
  • No SSH access possible
  • No logs written in the guest at the time of the freeze
  • Proxmox Guest Agent stops responding
  • No Kernel Panic, no OOM, no D-state processes
  • Only a hard reset via Proxmox resolves the issue
  • Both affected VMs run Docker with active containers
  • Multiple other VMs on the same host and same NFS storage are completely unaffected
The freezes occur randomly with no connection to load, time of day, or specific triggers. Uptime before a freeze ranges from 1 hour to several days.



What we have already investigated and ruled out:
  • RAM: Sufficient free memory at time of freeze (4+ GiB available, Swap unused)
  • Ballooning: Min = Max RAM, effectively disabled
  • KSM: Disabled on host (cat /sys/kernel/mm/ksm/run = 0)
  • Storage IO: iostat shows no unusual latency or utilization
  • Network: Multiple other VMs on same bridge/VLAN are stable
  • Docker FD-Leak: dockerd has only 75-85 open file descriptors
  • Kernel Panic: kdump configured and ready, but never triggers. Guest kernel does not notice the freeze
  • MCE/Hardware errors: No Machine Check Exceptions on host or guest
  • CPU type: Changed from Skylake-Client-v4 to x86-64-v3; problem persists
  • iothread: Disabled; problem persists
  • NFS timeouts: Present in dmesg, but they refer to a different NAS than the one holding the VM disks



Key finding (live capture during an active freeze):
During an active freeze, while connected via SSH to the host, we captured the QEMU process state:

Bash:
top -H -p <PID> -b -n 1

PID       COMMAND      %CPU
3944327   CPU 0/KVM    90.9%
3944329   CPU 2/KVM    90.9%
3944330   CPU 3/KVM    90.9%
3944328   CPU 1/KVM     0.0%   ← stuck
Bash:
cat /proc/<PID>/task/*/wchan
kvm_vcpu_block        ← CPU 1 stuck here
kvm_nx_huge_page_recovery_worker
vhost_task_fn
futex_do_wait
io_wq_worker
poll_schedule_timeout.constprop.0
One vCPU (CPU 1) is stuck in kvm_vcpu_block while the other three vCPUs spin at close to 100%. The QEMU main process itself is in a normal sleep state (S).
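Since the freezes are unpredictable, the per-thread capture above can be wrapped in a small function so the state is easy to dump the moment a freeze is noticed. A minimal sketch in plain POSIX shell; the demo call at the end uses the current shell's own PID purely for illustration, and the QEMU PID would be substituted during a real freeze:

```shell
#!/bin/sh
# Dump TID, thread name and kernel wait channel for every thread of a PID.
# Works on any Linux host with /proc mounted; wchan reads "0" for a
# thread that is currently runnable rather than sleeping in the kernel.

dump_threads() {
    pid="$1"
    for task in /proc/"$pid"/task/*; do
        tid=${task##*/}
        comm=$(cat "$task/comm" 2>/dev/null)
        wchan=$(cat "$task/wchan" 2>/dev/null)
        printf '%s\t%s\t%s\n' "$tid" "$comm" "$wchan"
    done
}

# Demo on the current shell; during a freeze, pass the QEMU PID instead:
dump_threads "$$"
```

Run repeatedly during a freeze, this also shows whether the stuck vCPU ever leaves kvm_vcpu_block or is wedged there permanently.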



VM Configuration:
Code:
cpu: x86-64-v3
cores: 4
memory: 8192
balloon: 0
net0: virtio=XX:XX:XX:XX:XX:XX,bridge=vmbr112
scsi0: nfs-storage:vmid/vm-disk-0.qcow2,discard=on,iothread=1,size=20G,ssd=1
scsihw: virtio-scsi-single



Guest details:
Code:
OS: Debian Linux
Kernel: 6.19.6+deb13-amd64
Workload: Docker with multiple containers (overlay2 storage driver)



NFS Mount options on host:
Code:
nfs4 (rw,relatime,vers=4.1,rsize=131072,wsize=131072,hard,
fatal_neterrors=none,proto=tcp,timeo=600,retrans=2,sec=sys)
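Beyond dmesg, the kernel keeps cumulative per-operation RPC counters for every NFS mount in /proc/self/mountstats (nfsiostat from nfs-utils reports the same data), so retransmissions or timeouts on the VM storage mount would show up there even without log messages. Below is a rough sketch of the idea on a made-up sample in the mountstats per-op format; real input would be the relevant mount's section of /proc/self/mountstats, and the exact field layout can vary slightly between kernel versions:

```shell
#!/bin/sh
# Hypothetical excerpt in /proc/self/mountstats per-op format
# (all numbers are made up for illustration):
cat > /tmp/mountstats.sample <<'EOF'
device 192.168.0.10:/export mounted on /mnt/pve/nfs-storage with fstype nfs4 statvers=1.1
        per-op statistics
        READ: 1200 1200 0 19660800 39321600 12 340 360
        WRITE: 800 812 12 26214400 104000 30 950 1010
EOF

# Per-op lines are "OP: ops trans timeouts ...". Flag any operation with
# retransmissions (trans > ops) or RPC major timeouts (3rd counter > 0):
awk '/^[[:space:]]+[A-Z_]+:/ {
    if ($3 > $2 || $4 > 0)
        printf "%s ops=%d retrans=%d timeouts=%d\n", $1, $2, $3 - $2, $4
}' /tmp/mountstats.sample
```

If the VM storage mount shows zero retransmits and timeouts across a freeze, that would strengthen the case that NFS is not the trigger.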



NFS timeouts in dmesg (different NAS than VM storage):
Code:
[867876.167796] nfs: server 192.168.xxx.xxx not responding, timed out
[868836.729777] nfs: server 192.168.xxx.xxx not responding, timed out
[872640.708836] nfs: server 192.168.xxx.xxx not responding, timed out
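To confirm that these timeouts only ever name the unrelated NAS and never the VM storage server, the kernel log can be tallied per server address. A small sketch using the three lines above as sample input; on the host, the input would come from `dmesg` (or `journalctl -k`) filtered for "nfs: server" instead:

```shell
#!/bin/sh
# Sample input copied from the dmesg excerpt above:
cat > /tmp/nfs-timeouts.log <<'EOF'
[867876.167796] nfs: server 192.168.xxx.xxx not responding, timed out
[868836.729777] nfs: server 192.168.xxx.xxx not responding, timed out
[872640.708836] nfs: server 192.168.xxx.xxx not responding, timed out
EOF

# In these messages, field 4 is the server address; count events per server:
awk '/not responding/ { count[$4]++ }
     END { for (s in count) printf "%s: %d timeouts\n", s, count[s] }' /tmp/nfs-timeouts.log
```

Comparing the resulting per-server counts and their timestamps against the freeze times would show whether the two events ever correlate.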



Summary:
The core finding is an apparent vCPU deadlock: one vCPU is permanently stuck in kvm_vcpu_block while the others spin at close to 100%.
The guest kernel never notices and therefore never panics or logs anything, making this extremely hard to debug from the guest side alone.

The most notable pattern is that only Docker VMs are affected while all other VMs on identical hardware, network and storage remain completely stable.
No changes to CPU type, iothread settings or memory configuration have had any effect.



Question:
Has anyone seen this specific pattern of one vCPU stuck in kvm_vcpu_block with others spinning at 100%?
Is there a known KVM bug or workaround for this scenario with Docker workloads on NFS-backed storage?
Are there any kernel parameters, QEMU options or NFS tuning options worth trying?

Any help or pointers would be greatly appreciated. Thanks in advance!

Sincerely,
Philipp