QMP Socket timeout on console, backups etc... randomly

Jema

New Member
Jun 3, 2019
Once again we have a strange problem: random VMs get a 'qmp socket - timeout' when using the console, or time out during backups. The issue started when we upgraded Proxmox to the latest version.

We have all the latest updates installed and run an NVMe Ceph cluster.
Some VMs (KVM) work fine and their console is accessible, but others on exactly the same hypervisor do not. It is not related to the guest OS, as we currently see the issue on both Linux and Windows guests.

Does anyone have any clue how to debug this and find the cause?

Starting a VM sometimes doesn't work either:

Code:
TASK ERROR: start failed: command '/usr/bin/kvm -id 192 -name telegram.rare.com -chardev 'socket,id=qmp,path=/var/run/qemu-server/192.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/192.pid -daemonize -smbios 'type=1,uuid=c4e1b094-d09c-49df-bae7-fa2c64fb848f' -smp '2,sockets=1,cores=2,maxcpus=2' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/192.vnc,password -no-hpet -cpu 'kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,hv_synic,hv_stimer,hv_ipi,enforce' -m 4000 -object 'memory-backend-ram,id=ram-node0,size=4000M' -numa 'node,nodeid=0,cpus=0-1,memdev=ram-node0' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:9617bc5c7589' -drive 'file=rbd:nvme01/vm-192-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/nvme01.keyring,if=none,id=drive-ide0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'ide-hd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=200' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap192i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' -device 'e1000,mac=1A:3A:22:6F:3A:FB,netdev=net0,bus=pci.0,addr=0x12,id=net0' -rtc 'driftfix=slew,base=localtime' -machine 'type=pc' -global 'kvm-pit.lost_tick_policy=discard'' failed: got timeout

Trying to access the console also gives the timeout:

Code:
VM 155 qmp command 'change' failed - unable to connect to VM 155 qmp socket - timeout after 599 retries
TASK ERROR: Failed to run vncproxy.
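
To check whether a given VM's QMP socket still answers, independently of the GUI, one option is to connect to it directly. The following is only a minimal sketch: it assumes the socket path pattern /var/run/qemu-server/<vmid>.qmp seen in the kvm command line above and the standard QMP handshake (a greeting, then qmp_capabilities). The function name probe_qmp and the 3-second timeout are made up for illustration.

```python
import json
import socket


def probe_qmp(path, timeout=3.0):
    """Return 'ok', 'timeout', or 'error: ...' for the QMP socket at `path`."""
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect(path)
            with sock.makefile("r", encoding="utf-8") as reader:
                # QEMU sends a greeting object with a "QMP" key on connect.
                greeting = json.loads(reader.readline())
                if "QMP" not in greeting:
                    return "error: unexpected greeting"
                # Leave capabilities negotiation mode; success is {"return": {}}.
                sock.sendall(b'{"execute": "qmp_capabilities"}\n')
                reply = json.loads(reader.readline())
                if "return" in reply:
                    return "ok"
                return "error: %s" % json.dumps(reply)
    except socket.timeout:
        # Connected (or tried to) but QEMU never answered in time.
        return "timeout"
    except (OSError, ValueError) as exc:
        # Missing socket, refused connection, or a non-JSON reply.
        return "error: %s" % exc
```

For example, print(probe_qmp("/var/run/qemu-server/192.qmp")) on the hypervisor. As far as I know a QMP monitor serves one client at a time, so a probe like this can collide with Proxmox's own commands and should only be used for a quick manual check.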


Backups also fail randomly:


Code:
VMID    NAME    STATUS    TIME    SIZE    FILENAME
112    my.server1.com    OK    00:00:54    1.46GB    /mnt/pve/hyp08-backup/dump/vzdump-qemu-112-2020_02_18-05_00_02.vma.lzo
117    my.server2.com    FAILED    00:00:10    got timeout
121    my.server3.com    FAILED    00:00:13    got timeout
126    my.server4.com    OK    00:03:13    16.46GB    /mnt/pve/hyp08-backup/dump/vzdump-qemu-126-2020_02_18-05_01_19.vma.lzo
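
To pull just the failed guests out of a summary like the one above, a small parser may help. This is a sketch under assumptions: the columns are whitespace-separated in the order shown (VMID, NAME, STATUS, TIME, then either size and filename or an error message), and VM names contain no spaces. The function name failed_backups is hypothetical.

```python
def failed_backups(summary):
    """Return [(vmid, name, reason)] for rows whose STATUS column is FAILED."""
    failed = []
    for line in summary.splitlines():
        # Split into at most five fields so the error message stays intact.
        fields = line.split(None, 4)
        if len(fields) >= 4 and fields[0].isdigit() and fields[2] == "FAILED":
            reason = fields[4] if len(fields) > 4 else ""
            failed.append((int(fields[0]), fields[1], reason))
    return failed
```

Fed the table above, it would return the rows for VMs 117 and 121, each with the reason "got timeout", which makes it easier to compare failing VMIDs across nights.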
 
Hi,

Do you see any errors in the log?
 
Once again we have a strange problem: random VMs get a 'qmp socket - timeout' when using the console, or time out during backups. The issue started when we upgraded Proxmox to the latest version.
So this happens not only during backups but also when no backup job is running? Is the backup target an NFS share?

We have some fixes for QMP timeouts, with improved locking, in pve-qemu-kvm 4.1.1-3. It has just become available through our pvetest repository: https://pve.proxmox.com/wiki/Package_Repositories
 
Backups time out on some VMs but succeed on others. It also happens in the GUI, for example when trying to open the console. Copying the VM to another hypervisor doesn't work either; we have to shut the VM down before the copy succeeds. And then the same behavior occurs on the new hypervisor.

If we re-create the VM or restore a backup, everything works again, but this is not a solution.

Which log files should I be checking exactly?

@t.lamprecht

I will update the pve-qemu-kvm version and update you if it solves anything.

Update: I have updated the package and restarted the VM, and the console works now. So this seems to solve it!
 
@t.lamprecht

Unfortunately I got excited too quickly. It worked on one VM, but I just tested another and its console display keeps timing out. It stays on "Connecting..." and nothing else is displayed.

Code:
VM 192 qmp command 'change' failed - unable to connect to VM 192 qmp socket - timeout after 598 retries
TASK ERROR: Failed to run vncproxy.
 
Does anyone have any clue please? We experience this on multiple hypervisors.
 
We still don't have a solution, and it now starts to happen randomly on other VMs as well. They also become unavailable and the console gives a timeout, so there is no way to debug. The logs don't show anything.
 
The issue seems to mostly affect Windows VPSes, which are acting quite unstable.

A VPS might be marked as "Started" yet behave like a ghost VPS: not accessible or manageable.
Setting "KVM Hardware Virtualization" to "No" seems to help slightly and stops the timeouts, but the VMs still act as if they're overloaded and can't really be used.

Any ideas? Any and all help will be greatly appreciated.

Thank you.
 
Same problem here...
Stop
TASK ERROR: VM 102 qmp command 'system_reset' failed - unable to connect to VM 102 qmp socket - timeout after 31 retries

It happens with Debian and Windows VMs, mostly while the backup runs. We back up to a remote PBS (Proxmox Backup Server).
 
similar problem, no solution?

Code:
root@glsv-px-2:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-4 (running version: 6.3-4/0a38c56f)
pve-kernel-5.4: 6.3-6
pve-kernel-helper: 6.3-6
pve-kernel-5.4.101-1-pve: 5.4.101-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-4.15: 5.4-19
pve-kernel-4.15.18-30-pve: 4.15.18-58
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 15.2.8-pve2
ceph-fuse: 15.2.8-pve2
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.0.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-5
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.2.0-2
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.3-pve2

When I try to restart the VM:

(Attached screenshot: Снимок.JPG)
Only kill -9 on the PID of the kvm process helps, but this is not a solution...
 
Hmm, interestingly, this behavior shows up in Linux guests when a Windows VM with an incorrectly set OS type is started on the same host. The Windows VM behaves the same way: it freezes, and every action in the Proxmox GUI times out. After shutting down the Windows guest, the problem disappeared. I'll set the correct OS type, start it again, and keep watching.
 
Here's what we found out:
When a Windows VM is running with the guest agent enabled in its options but not installed in the guest, access to the agent sockets of other virtual machines on the same host starts to fail.
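
To find candidates for this situation, one could list the VMs on a node whose configuration enables the guest agent and then check each of them by hand. The sketch below rests on assumptions that are not from this thread: that VM configs live in /etc/pve/qemu-server/<vmid>.conf, that the option is written as "agent: 1" or "agent: enabled=1,...", and that sections starting with "[" are snapshots. Both function names are made up for illustration.

```python
import glob
import os


def agent_enabled(conf_text):
    """True if the main section of a VM config enables the guest agent."""
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            break  # assumed snapshot sections start here; ignore them
        key, sep, value = line.partition(":")
        if sep and key.strip() == "agent":
            options = [opt.strip() for opt in value.split(",")]
            return "1" in options or "enabled=1" in options
    return False


def vms_with_agent_enabled(conf_dir="/etc/pve/qemu-server"):
    """Return the sorted VMIDs under conf_dir whose config enables the agent."""
    vmids = []
    for path in glob.glob(os.path.join(conf_dir, "*.conf")):
        with open(path, encoding="utf-8") as handle:
            if agent_enabled(handle.read()):
                vmids.append(os.path.basename(path)[: -len(".conf")])
    return sorted(vmids)
```

This only tells you where the option is switched on. Whether an agent actually answers inside each guest still has to be checked separately, for example with an agent ping from the host or by looking inside the guest.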
 
Dredging up this old thread, as I am also experiencing similar issues. Have been for over a year, and only on a specific node.

This node is unlike the rest of my hardware, with 48 cores, 96 threads (hyperthreading active), and ~700GB of memory.

Because this issue only appears to occur on this single node, I've historically assumed that it is related to the high CPU count or large amount of memory.

In a similar vein, we ran into issues adding more than 30 drives to a single VM on this node, but that was a while back.

Code:
proxmox-ve: 7.0-2 (running kernel: 5.11.22-3-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-6
pve-kernel-helper: 7.0-6
pve-kernel-5.4: 6.4-3
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.11.22-2-pve: 5.11.22-4
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 16.2.5-pve1
ceph-fuse: 16.2.5-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-10
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.2-4
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

Code:
qm start 130

Code:
# journalctl -xe

Sep 04 05:35:56 server pvestatd[2185365]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Sep 04 05:35:57 server pvestatd[2185365]: status update time (6.238 seconds)
Sep 04 05:36:06 server pvestatd[2185365]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Sep 04 05:36:06 server pvestatd[2185365]: status update time (6.215 seconds)
Sep 04 05:36:07 server pvedaemon[25173]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Sep 04 05:36:16 server pvestatd[2185365]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Sep 04 05:36:16 server pvestatd[2185365]: status update time (6.219 seconds)
...
Sep 04 05:36:27 server pvedaemon[25175]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Sep 04 05:36:29 server pmxcfs[24485]: [status] notice: received log
Sep 04 05:36:30 server pmxcfs[24485]: [status] notice: received log
Sep 04 05:36:36 server pvestatd[2185365]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Sep 04 05:36:37 server pvestatd[2185365]: status update time (6.202 seconds)
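
To see which VMs and which QMP commands are affected over a longer period, log lines like these can be tallied. This is a rough sketch that assumes the message wording stays exactly as in the excerpt above; the names QMP_TIMEOUT and count_qmp_timeouts are hypothetical.

```python
import re
from collections import Counter

# Matches the "unable to connect ... timeout after N retries" messages above.
QMP_TIMEOUT = re.compile(
    r"VM (\d+) qmp command '([^']+)' failed - "
    r"unable to connect to VM \d+ qmp socket - timeout after (\d+) retries"
)


def count_qmp_timeouts(lines):
    """Count QMP socket timeouts per (vmid, command) pair."""
    counts = Counter()
    for line in lines:
        match = QMP_TIMEOUT.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts
```

Passing it the lines of saved journalctl output (for instance via sys.stdin) gives a quick picture of whether the timeouts cluster on a few VMIDs or are spread across the whole node.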

In my case, I do not have the QEMU guest agent running, and this is an Ubuntu VM.

Rebooting the node has historically resolved the issue, but only temporarily, so that's not really a great solution.
 
This sorta stuff pops up a lot on this node too:
Code:
400 Parameter verification failed. scsi10: hotplug problem - VM 130 qmp command 'query-pci' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries at /usr/share/perl5/PVE/API2/Qemu.pm line 1459. (500)
 
