Hi everyone,
I've been dealing with a frustrating issue for a while now and after extensive debugging I've gathered enough data to hopefully get some expert input.
I'm aware that there are already several threads about VM freezes in this forum, but none of them seem to match this specific case, particularly the combination of only Docker VMs being affected and the live capture of a vCPU stuck in
kvm_vcpu_block while others spin at 100%.
I have two VMs that randomly freeze completely, no warnings, no errors, nothing.
I've spent a lot of time systematically ruling out the usual suspects and I've managed to capture the exact state of the QEMU process during an active freeze which I hope gives enough information to point me in the right direction.
Environment:
- Proxmox VE Version: 9 (Kernel 6.17.13-2-pve)
- Host CPU: Intel Xeon E-2288G (16 Cores)
- Host RAM: 128 GiB
- Storage Backend: NFS (NFSv4.1)
- Affected VMs: 2x Debian Linux, both running Docker
Problem Description:
Two VMs freeze randomly on two identical Proxmox environments, completely without any warning or error messages. The freeze is characterized by:
- No SSH access possible
- No logs written in the guest at the time of the freeze
- Proxmox Guest Agent stops responding
- No Kernel Panic, no OOM, no D-state processes
- Only a hard reset via Proxmox resolves the issue
- Both affected VMs run Docker with active containers
- Multiple other VMs on the same host and same NFS storage are completely unaffected
What we have already investigated and ruled out:
- RAM: Sufficient free memory at time of freeze (4+ GiB available, Swap unused)
- Ballooning: Min = Max RAM, effectively disabled
- KSM: Disabled on host (cat /sys/kernel/mm/ksm/run returns 0)
- Storage IO: iostat shows no unusual latency or utilization
- Network: Multiple other VMs on same bridge/VLAN are stable
- Docker FD-Leak: dockerd has only 75-85 open file descriptors
- Kernel Panic: kdump configured and ready, but never triggers. Guest kernel does not notice the freeze
- MCE/Hardware errors: No Machine Check Exceptions on host or guest
- CPU type: Changed from Skylake-Client-v4 to x86-64-v3, problem persists
- iothread: Disabled, problem persists
- NFS timeouts: Present in dmesg but on a different NAS than where VM disks reside
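For reference, the D-state check from the list above boils down to a one-liner we ran periodically inside the guests (a minimal sketch of what we used, not the full monitoring script):

```shell
#!/bin/sh
# Sketch of the D-state check: list any processes in uninterruptible
# sleep (state "D"), which would indicate threads stuck on I/O.
# An empty result (header line only) matches the "no D-state
# processes" finding above.
ps -eo pid,stat,comm | awk 'NR == 1 || $2 ~ /^D/'
```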
Key finding (live capture during freeze):
During an active freeze, while connected via SSH to the host, we captured the QEMU process state:
Bash:
top -H -p <PID> -b -n 1
PID COMMAND %CPU
3944327 CPU 0/KVM 90.9%
3944329 CPU 2/KVM 90.9%
3944330 CPU 3/KVM 90.9%
3944328 CPU 1/KVM 0.0% ← stuck
Bash:
cat /proc/<PID>/task/*/wchan
kvm_vcpu_block ← CPU 1 stuck here
kvm_nx_huge_page_recovery_worker
vhost_task_fn
futex_do_wait
io_wq_worker
poll_schedule_timeout.constprop.0
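For anyone wanting to reproduce this capture with thread names attached, a small loop over /proc does it (a sketch; for a frozen VM substitute the QEMU PID for $$, which is only used here so the snippet runs self-contained):

```shell
#!/bin/sh
# Sketch of the capture above: print each thread's ID, name, and kernel
# wait channel for a given process. For a frozen VM, set PID to the
# QEMU process ID; $$ (this shell itself) is used only as a runnable
# stand-in.
PID=$$
for t in /proc/"$PID"/task/*; do
    # comm is the thread name (e.g. "CPU 1/KVM"); wchan is the kernel
    # function the thread is currently blocked in ("0" when runnable).
    printf '%-8s %-16s %s\n' "${t##*/}" "$(cat "$t/comm")" "$(cat "$t/wchan")"
done
```

This pairs each wchan entry with its thread name, so the stuck "CPU 1/KVM" thread is identifiable directly instead of by position.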
VM Configuration:
Code:
cpu: x86-64-v3
cores: 4
memory: 8192
balloon: 0
net0: virtio=XX:XX:XX:XX:XX:XX,bridge=vmbr112
scsi0: nfs-storage:vmid/vm-disk-0.qcow2,discard=on,iothread=1,size=20G,ssd=1
scsihw: virtio-scsi-single
Guest details:
Code:
OS: Debian Linux
Kernel: 6.19.6+deb13-amd64
Workload: Docker with multiple containers (overlay2 storage driver)
NFS Mount options on host:
Code:
nfs4 (rw,relatime,vers=4.1,rsize=131072,wsize=131072,hard,
fatal_neterrors=none,proto=tcp,timeo=600,retrans=2,sec=sys)
NFS timeouts in dmesg (different NAS than VM storage):
Code:
[867876.167796] nfs: server 192.168.xxx.xxx not responding, timed out
[868836.729777] nfs: server 192.168.xxx.xxx not responding, timed out
[872640.708836] nfs: server 192.168.xxx.xxx not responding, timed out
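To rule out a time correlation between these timeouts and the freezes, the dmesg timestamps (seconds since boot) can be converted to wall-clock time (a sketch, assuming Linux /proc/uptime and GNU date; dmesg -T gives the same information directly where supported):

```shell
#!/bin/sh
# Sketch: convert a dmesg timestamp (seconds since boot) to wall-clock
# time so NFS-timeout events can be compared against the freeze window.
ts=867876.167796   # timestamp from the first dmesg line above
epoch=$(awk -v now="$(date +%s)" -v ts="$ts" \
    '{ printf "%d", now - $1 + ts }' /proc/uptime)
date -d "@$epoch"
```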
Summary:
The core finding is a vCPU deadlock where one vCPU gets permanently stuck in kvm_vcpu_block while the others spin at 100%.
The guest kernel never notices and therefore never panics or logs anything, making this extremely hard to debug from the guest side alone.
The most notable pattern is that only Docker VMs are affected while all other VMs on identical hardware, network and storage remain completely stable.
No changes to CPU type, iothread settings or memory configuration have had any effect.
Question:
- Has anyone seen this specific pattern of one vCPU stuck in kvm_vcpu_block while the others spin at 100%?
- Is there a known KVM bug or workaround for this scenario with Docker workloads on NFS-backed storage?
- Are there any kernel parameters, QEMU options or NFS tuning options worth trying?
Any help or pointers would be greatly appreciated. Thanks in advance!
Sincerely,
Philipp