Windows KVM freezes - qmp socket - timeout after 599 retries

encore

Well-Known Member
May 4, 2018
Hi,

We have been facing issues on all KVM VMs (qcow2, VirtIO SCSI, directory storage on ext4) for months now.
These VMs freeze suddenly, meaning we can no longer reach them via RDP or even via the console (VNC).
After such a freeze, syslog shows:
May 23 20:30:47 captive020-74050-bl13 qm[8866]: VM 1200247 qmp command failed - VM 1200247 qmp command 'change' failed - unable to connect to VM 1200247 qmp socket - timeout after 599 retries
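To see which guests were hit, the syslog message quoted above can be matched mechanically. This is only a minimal sketch: it assumes the messages keep exactly the wording shown above, and the function name `qmp_timeout_vmids` is made up for illustration.

```shell
# Print the unique VM IDs that appear in "qmp socket - timeout" messages.
# Reads log lines from stdin; the pattern mirrors the syslog line quoted above.
qmp_timeout_vmids() {
    sed -n 's/.*unable to connect to VM \([0-9][0-9]*\) qmp socket - timeout.*/\1/p' | sort -u
}
```

It could be fed with something like `qmp_timeout_vmids < /var/log/syslog`, assuming that is where the node writes these messages.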

Stopping a frozen VM and then trying to start it again leads to:
TASK ERROR: start failed: org.freedesktop.systemd1.UnitExists: Unit 1200192.scope already exists.
Even after the stop, a QEMU process for the VM is still running, so the VM cannot be started again because of the error above.
After 10-15 minutes the process is gone and I can start the VM again.
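A leftover QEMU process like this can be checked for before retrying the start. The following is a rough sketch, not an official procedure: it assumes the QEMU PID is stored in /var/run/qemu-server/<vmid>.pid, and `qemu_pid_alive` is a hypothetical helper name.

```shell
# Return 0 if the PID stored in the given pidfile belongs to a live process.
# Assumption: the pidfile is /var/run/qemu-server/<vmid>.pid.
# Run as root; kill -0 also fails for processes owned by other users.
qemu_pid_alive() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    [ -n "$pid" ] || return 1
    kill -0 "$pid" 2>/dev/null
}
```

For example, `qemu_pid_alive /var/run/qemu-server/1200192.pid && ps -o pid,stat,wchan:32,cmd -p "$(cat /var/run/qemu-server/1200192.pid)"` would show the process state. A `D` in the STAT column means uninterruptible sleep, which would be one possible explanation for a process that only disappears minutes after the stop.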

When it happens, several Windows VMs on that node freeze at the same time.
We use Win16 Datacenter R2 and Win19 Datacenter. We tried both the stable and the latest virtio drivers, with the same result.
Our cluster has 32 nodes with different hardware. All nodes are affected.

root@captive020-74050-bl13:~# pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-14-pve)
pve-manager: 5.4-5 (running version: 5.4-5/c6fdb264)
pve-kernel-4.15: 5.4-2
pve-kernel-4.15.18-14-pve: 4.15.18-38
pve-kernel-4.15.18-12-pve: 4.15.18-36
pve-kernel-4.13.13-2-pve: 4.13.13-33
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-9
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-51
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-42
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-26
pve-cluster: 5.0-37
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-20
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-51
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

Any ideas?
 
We had a script running that did an "strace" on every guest. Running strace against a qemu-kvm process causes the freezes.
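Whether something is currently tracing a QEMU process can be read from the `TracerPid` field in /proc/<pid>/status on Linux. A minimal sketch, with `tracer_pid_from_status` as a made-up helper name:

```shell
# Print the TracerPid field from a /proc/<pid>/status style file.
# 0 means no process is currently attached via ptrace (strace, gdb, ...).
tracer_pid_from_status() {
    awk '/^TracerPid:/ { print $2 }' "$1"
}
```

Called as `tracer_pid_from_status /proc/<qemu-pid>/status`, a non-zero result is the PID of the tracer. Note that strace can attach to individual threads, so the per-thread files under /proc/<qemu-pid>/task/<tid>/status may need the same check.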
 
Same problem here. Haven't found any fixes.
In my case, I found that a virtual disk on a CIFS pool was intermittently unavailable, causing the VM to get I/O errors. Such VMs used to be flagged with "I/O error" in situations like this, but now the VM just continues to be shown as normal.
Removing that disk made the machine stable. Connecting to that pool using NFS also solved the issue.
I ended up using a separate physical hard disk for the job.
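An intermittent network storage like this can be probed from the host with a time limit. This is only a sketch under assumptions: the storage is taken to be mounted under /mnt/pve/<storage-name>, `storage_responds` is a hypothetical helper, and a single probe only catches an outage while it is happening.

```shell
# Return 0 if the path can be stat'ed within the given number of seconds
# (default 5). Uses timeout(1) from coreutils.
# Caveat: on a mount that is hung hard, the probe itself may block.
storage_responds() {
    path=$1
    secs=${2:-5}
    timeout "$secs" stat -- "$path" >/dev/null 2>&1
}
```

Run periodically, for example `storage_responds /mnt/pve/<storage-name> 5 || logger "storage probe failed"`, it could help correlate freezes with storage outages.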
 
