Hi everyone,
I've been dealing with a frustrating issue for a while now and after extensive debugging I've gathered enough data to hopefully get some expert input.
I'm aware that there are already several threads about VM freezes in this forum, but none of them seem to match this specific case, particularly the combination of only Docker VMs being affected and the live capture of a vCPU stuck in
kvm_vcpu_block while others spin at 100%.
I have two VMs that randomly freeze completely, no warnings, no errors, nothing.
I've spent a lot of time systematically ruling out the usual suspects and I've managed to capture the exact state of the QEMU process during an active freeze which I hope gives enough information to point me in the right direction.
Environment:
- Proxmox VE Version: 9 (Kernel 6.17.13-2-pve)
- Host CPU: Intel Xeon E-2288G (16 Cores)
- Host RAM: 128 GiB
- Storage Backend: NFS (NFSv4.1)
- Affected VMs: 2x Debian Linux, both running Docker
Problem Description:
Two VMs freeze randomly on two identical Proxmox environments, completely without any warning or error messages. The freeze is characterized by:
- No SSH access possible
- No logs written in the guest at the time of the freeze
- Proxmox Guest Agent stops responding
- No Kernel Panic, no OOM, no D-state processes
- Only a hard reset via Proxmox resolves the issue
- Both affected VMs run Docker with active containers
- Multiple other VMs on the same host and same NFS storage are completely unaffected
What we have already investigated and ruled out:
- RAM: Sufficient free memory at time of freeze (4+ GiB available, Swap unused)
- Ballooning: Min = Max RAM, effectively disabled
- KSM: Disabled on host (cat /sys/kernel/mm/ksm/run returns 0)
- Storage IO: iostat shows no unusual latency or utilization
- Network: Multiple other VMs on same bridge/VLAN are stable
- Docker FD-Leak: dockerd has only 75-85 open file descriptors
- Kernel Panic: kdump configured and ready, but never triggers. Guest kernel does not notice the freeze
- MCE/Hardware errors: No Machine Check Exceptions on host or guest
- CPU type: Changed from Skylake-Client-v4 to x86-64-v3, problem persists
- iothread: Disabled, problem persists
- NFS timeouts: Present in dmesg but on a different NAS than where VM disks reside
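For reference, the D-state check from the list above boils down to a one-liner we ran periodically inside the guests (a minimal sketch of what we used, not the full monitoring script):

```shell
#!/bin/sh
# Sketch of the D-state check: list any processes in uninterruptible
# sleep (state "D"), which would indicate threads stuck on I/O.
# An empty result (header line only) matches the "no D-state
# processes" finding above.
ps -eo pid,stat,comm | awk 'NR == 1 || $2 ~ /^D/'
```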
Key finding (live capture during freeze):
During an active freeze, while connected via SSH to the host, we captured the QEMU process state:
Bash:
top -H -p <PID> -b -n 1
PID COMMAND %CPU
3944327 CPU 0/KVM 90.9%
3944329 CPU 2/KVM 90.9%
3944330 CPU 3/KVM 90.9%
3944328 CPU 1/KVM 0.0% ← stuck
Bash:
cat /proc/<PID>/task/*/wchan
kvm_vcpu_block ← CPU 1 stuck here
kvm_nx_huge_page_recovery_worker
vhost_task_fn
futex_do_wait
io_wq_worker
poll_schedule_timeout.constprop.0
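For anyone wanting to reproduce this capture with thread names attached, a small loop over /proc does it (a sketch; for a frozen VM substitute the QEMU PID for $$, which is only used here so the snippet runs self-contained):

```shell
#!/bin/sh
# Sketch of the capture above: print each thread's ID, name, and kernel
# wait channel for a given process. For a frozen VM, set PID to the
# QEMU process ID; $$ (this shell itself) is used only as a runnable
# stand-in.
PID=$$
for t in /proc/"$PID"/task/*; do
    # comm is the thread name (e.g. "CPU 1/KVM"); wchan is the kernel
    # function the thread is currently blocked in ("0" when runnable).
    printf '%-8s %-16s %s\n' "${t##*/}" "$(cat "$t/comm")" "$(cat "$t/wchan")"
done
```

This pairs each wchan entry with its thread name, so the stuck "CPU 1/KVM" thread is identifiable directly instead of by position.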
VM Configuration:
Code:
cpu: x86-64-v3
cores: 4
memory: 8192
balloon: 0
net0: virtio=XX:XX:XX:XX:XX:XX,bridge=vmbr112
scsi0: nfs-storage:vmid/vm-disk-0.qcow2,discard=on,iothread=1,size=20G,ssd=1
scsihw: virtio-scsi-single
Guest details:
Code:
OS: Debian Linux
Kernel: 6.19.6+deb13-amd64
Workload: Docker with multiple containers (overlay2 storage driver)
NFS Mount options on host:
Code:
nfs4 (rw,relatime,vers=4.1,rsize=131072,wsize=131072,hard,
fatal_neterrors=none,proto=tcp,timeo=600,retrans=2,sec=sys)
NFS timeouts in dmesg (different NAS than VM storage):
Code:
[867876.167796] nfs: server 192.168.xxx.xxx not responding, timed out
[868836.729777] nfs: server 192.168.xxx.xxx not responding, timed out
[872640.708836] nfs: server 192.168.xxx.xxx not responding, timed out
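To rule out a time correlation between these timeouts and the freezes, the dmesg timestamps (seconds since boot) can be converted to wall-clock time (a sketch, assuming Linux /proc/uptime and GNU date; dmesg -T gives the same information directly where supported):

```shell
#!/bin/sh
# Sketch: convert a dmesg timestamp (seconds since boot) to wall-clock
# time so NFS-timeout events can be compared against the freeze window.
ts=867876.167796   # timestamp from the first dmesg line above
epoch=$(awk -v now="$(date +%s)" -v ts="$ts" \
    '{ printf "%d", now - $1 + ts }' /proc/uptime)
date -d "@$epoch"
```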
Summary:
The core finding is a vCPU deadlock where one vCPU gets permanently stuck in kvm_vcpu_block while the others spin at 100%.
The guest kernel never notices and therefore never panics or logs anything, making this extremely hard to debug from the guest side alone.
The most notable pattern is that only Docker VMs are affected while all other VMs on identical hardware, network and storage remain completely stable.
No changes to CPU type, iothread settings or memory configuration have had any effect.
Question:
- Has anyone seen this specific pattern of one vCPU stuck in kvm_vcpu_block while the others spin at 100%?
- Is there a known KVM bug or workaround for this scenario with Docker workloads on NFS-backed storage?
- Are there any kernel parameters, QEMU options or NFS tuning options worth trying?
Any help or pointers would be greatly appreciated. Thanks in advance!
Sincerely,
Philipp