Hi there,
I recently upgraded my PVE host to v9.2.3 and since then just one of my VMs is in trouble.
It stalls frequently, for up to half an hour at a time. During a stall it does not answer a guest agent ping and the Windows heartbeat stops. The Windows event log records nothing and I cannot access the VM during that time, which makes troubleshooting difficult. During a stall the guest's CPU load is sometimes pinned at 100%, sometimes a bit below. RAM usage, unlike on the other guests, always shows as 100% from the host, but only about 3 of 8 GB are in use inside the guest. It also stalls with virtually no applications running, apart from some standard startup services.
All other guests on the host, including identically configured ones, are fine, and the host itself shows no sign of trouble, overload or resource pressure. I already restored from a recent backup to no avail, so it does not seem to be corruption of the disk image.
Any ideas how to troubleshoot this? Below are some specs and the actions I took.
Thanks!
VM config:
- Windows 11 guest Update 05/26, OVMF/UEFI + TPM 2.0
- machine: pc-q35-8.1
- cpu: host
- sockets: 1, cores: 4
- memory: 8192 MB
- balloon: 0
- scsihw: virtio-scsi-single
- disk: scsi0 on ZFS storage, discard=on, iothread=1, ssd=1, 60G
- network: 2x virtio NICs, later disabled for testing
- qemu guest agent enabled
- storage backend: ZFS dataset
Symptoms:
- VM randomly stalls/hangs, initially noticed via RDP disconnects.
- During hang, Windows heartbeat logging stops.
- qm guest ping fails during the stall.
- PVE/top sometimes shows all 4 KVM vCPU threads at ~100% CPU during the stall.
- Host itself is not under pressure: no swap, plenty RAM, no relevant iowait.
- Other guests on same host do not show the issue.
Things already excluded/tested:
- Not just RDP: other tools and the console also disconnect/freeze.
- Not network-related: NICs disabled and VM still stalled.
- Not obvious storage stall: Windows disk queue did not spike.
- No relevant Windows System/Application events around the hang.
- No clear process culprit before stall; only moderate CPU users seen.
- Restored PBS backup from before PVE upgrade to separate VMID; it shows same failure.
- Therefore current VM disk corruption after upgrade seems unlikely.
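To pin down exactly when the stalls start and end, something like the following could run on the host. This is only a rough sketch, not something I have validated: it assumes the agent is probed with `qm guest cmd <vmid> ping`, the function names (`agent_alive`, `stall_windows`, `watch`) are made up for illustration, and the VMID, interval and timeout are placeholders.

```python
import subprocess
import time
from datetime import datetime


def agent_alive(vmid, timeout_s=5):
    """Return True if the QEMU guest agent answers a ping in time.

    Assumption: the host provides `qm guest cmd <vmid> ping`.
    """
    try:
        result = subprocess.run(
            ["qm", "guest", "cmd", str(vmid), "ping"],
            capture_output=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0


def stall_windows(samples):
    """Collapse (timestamp, alive) samples into (start, end) stall intervals.

    A stall begins at the first failed probe and ends at the next
    successful one. A stall still open at the end of the samples is
    closed at the last timestamp seen.
    """
    windows = []
    start = None
    last = None
    for ts, alive in samples:
        if not alive and start is None:
            start = ts
        elif alive and start is not None:
            windows.append((start, ts))
            start = None
        last = ts
    if start is not None:
        windows.append((start, last))
    return windows


def watch(vmid, interval_s=10, probe=agent_alive):
    """Probe forever and print a line whenever the agent state flips."""
    previous = True
    while True:
        alive = probe(vmid)
        if alive != previous:
            state = "responding again" if alive else "STALLED"
            print(f"{datetime.now().isoformat(timespec='seconds')} VM {vmid} {state}")
            previous = alive
        time.sleep(interval_s)


# Example (hypothetical VMID): watch(101, interval_s=10)
```

The printed timestamps could then be lined up against the host's journal and the per-thread CPU view in top to see what the host was doing when each stall began.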