QMP Socket timeout on console, backups etc... randomly

Jema · Feb 18, 2020

Once again we have a strange problem: random VMs get a 'qmp socket - timeout' error when using the console, or time out on backups. The issue started when we upgraded Proxmox to the latest version.

We run all the latest updates on an NVMe Ceph cluster.
Some VMs (KVM) work fine and their console is accessible, but others on exactly the same hypervisor do not. It has nothing to do with the guest OS, as we currently see this issue on both Linux and Windows.

Does anyone have any clue how to debug this and find the cause?
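One way to narrow this down is to probe each VM's QMP socket directly and see which ones answer. The following is a minimal Python sketch, assuming the socket paths follow the /var/run/qemu-server/<vmid>.qmp pattern visible in the kvm command line below; probe_qmp is a hypothetical helper, not a Proxmox tool. It performs the standard QMP handshake (read the greeting, send qmp_capabilities) and then runs query-status with a short timeout. Keep in mind that QEMU's QMP socket normally serves one client at a time, so if pvestatd or pvedaemon is holding the connection, the probe may wait and time out as well.

```python
import glob
import json
import socket


def probe_qmp(path, timeout=3.0):
    """Open a QMP socket, negotiate capabilities and return query-status.

    Raises an OSError subclass (e.g. TimeoutError) if QEMU does not answer
    within `timeout` seconds on any single socket operation.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(path)
        with sock.makefile("r", encoding="utf-8") as reader:

            def read_reply():
                # QEMU sends one JSON object per line; skip async events.
                while True:
                    line = reader.readline()
                    if not line:
                        raise ConnectionError("QMP socket closed by peer")
                    message = json.loads(line)
                    if "event" not in message:
                        return message

            greeting = read_reply()
            if "QMP" not in greeting:
                raise ValueError(f"unexpected QMP greeting: {greeting}")
            reply = {}
            for command in ("qmp_capabilities", "query-status"):
                request = json.dumps({"execute": command}) + "\n"
                sock.sendall(request.encode("utf-8"))
                reply = read_reply()
                if "error" in reply:
                    raise RuntimeError(f"{command}: {reply['error']}")
            return reply["return"]


if __name__ == "__main__":
    # Socket path pattern taken from the kvm command line in this thread.
    for qmp_path in sorted(glob.glob("/var/run/qemu-server/*.qmp")):
        try:
            print(qmp_path, probe_qmp(qmp_path)["status"])
        except (OSError, ValueError, RuntimeError) as exc:
            print(qmp_path, "FAILED:", repr(exc))
```

VMs that consistently fail the probe while their neighbours answer are the ones worth looking at more closely (storage, process state, guest agent).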

Starting a VM sometimes doesn't work either:

Code:

TASK ERROR: start failed: command '/usr/bin/kvm -id 192 -name telegram.rare.com -chardev 'socket,id=qmp,path=/var/run/qemu-server/192.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/192.pid -daemonize -smbios 'type=1,uuid=c4e1b094-d09c-49df-bae7-fa2c64fb848f' -smp '2,sockets=1,cores=2,maxcpus=2' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/192.vnc,password -no-hpet -cpu 'kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,hv_synic,hv_stimer,hv_ipi,enforce' -m 4000 -object 'memory-backend-ram,id=ram-node0,size=4000M' -numa 'node,nodeid=0,cpus=0-1,memdev=ram-node0' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:9617bc5c7589' -drive 'file=rbd:nvme01/vm-192-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/nvme01.keyring,if=none,id=drive-ide0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'ide-hd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=200' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap192i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' -device 'e1000,mac=1A:3A:22:6F:3A:FB,netdev=net0,bus=pci.0,addr=0x12,id=net0' -rtc 'driftfix=slew,base=localtime' -machine 'type=pc' -global 'kvm-pit.lost_tick_policy=discard'' failed: got timeout

Trying to access the console also gives the timeout:

Code:

()
VM 155 qmp command 'change' failed - unable to connect to VM 155 qmp socket - timeout after 599 retries
TASK ERROR: Failed to run vncproxy.

Backups also fail randomly:

Code:

VMID  NAME            STATUS  TIME      SIZE     FILENAME
112   my.server1.com  OK      00:00:54  1.46GB   /mnt/pve/hyp08-backup/dump/vzdump-qemu-112-2020_02_18-05_00_02.vma.lzo
117   my.server2.com  FAILED  00:00:10  got timeout
121   my.server3.com  FAILED  00:00:13  got timeout
126   my.server4.com  OK      00:03:13  16.46GB  /mnt/pve/hyp08-backup/dump/vzdump-qemu-126-2020_02_18-05_01_19.vma.lzo
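Since the failures move between VMs from run to run, it can help to collect the failed VMIDs from each nightly report and compare them with the VMs whose console times out. A small Python sketch, assuming the summary layout shown above and VM names without spaces (failed_backups is a hypothetical helper):

```python
import sys


def failed_backups(report):
    """Map VMID -> error message for every FAILED row of a vzdump summary."""
    failed = {}
    for line in report.splitlines():
        # VMID, NAME, STATUS, TIME, then either "SIZE FILENAME" or the error text.
        parts = line.split(None, 4)
        if len(parts) == 5 and parts[0].isdigit() and parts[2] == "FAILED":
            failed[int(parts[0])] = parts[4]
    return failed


if __name__ == "__main__":
    for vmid, error in sorted(failed_backups(sys.stdin.read()).items()):
        print(vmid, error)
```

For the report above this would list 117 and 121, both with "got timeout".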

wolfgang (Proxmox Retired Staff) · Feb 19, 2020

Hi,

Do you see any errors in the log?

t.lamprecht (Proxmox Staff Member) · Feb 19, 2020

Jema said:
We have once again a strange problem where random VM's get a 'qmp socket - timeout' when using the Console, or 'time out' on backups. The issue started to occur when we upgraded proxmox to the latest version.

So this does not only happen during backups, but also when no backup job is running? Is the backup target an NFS share?

We have some fixes for QMP timeouts, with improved locking, in pve-qemu-kvm 4.1.1-3; it is available through our pvetest repository as of just now. https://pve.proxmox.com/wiki/Package_Repositories
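To answer the NFS question on the node itself, the filesystem type of the backup path can be read from /proc/mounts. A minimal Python sketch; the default path /mnt/pve/hyp08-backup comes from the backup report above, and mount_fstype is a hypothetical helper. On most systems findmnt --target <path> gives the same answer.

```python
import os
import sys
from typing import Optional


def mount_fstype(path: str, mounts_text: str) -> Optional[str]:
    """Return the filesystem type of the mount that contains `path`."""
    path = os.path.normpath(path)
    best_mountpoint, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        # /proc/mounts escapes spaces in mount points as \040.
        mountpoint = fields[1].replace("\\040", " ")
        fstype = fields[2]
        prefix = mountpoint.rstrip("/") + "/"
        covers = path == mountpoint or path.startswith(prefix)
        # ">=" so that a later mount stacked on the same point wins.
        if covers and len(mountpoint) >= len(best_mountpoint):
            best_mountpoint, best_type = mountpoint, fstype
    return best_type


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "/mnt/pve/hyp08-backup"
    with open("/proc/mounts", encoding="utf-8") as f:
        print(target, mount_fstype(target, f.read()))
```

A result such as nfs or nfs4 would confirm that the backup target is an NFS share.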

Jema · Feb 19, 2020

Backups time out on some VMs but succeed on others. It also happens in the GUI, for example when trying to open the console. Copying the VM over to another hypervisor doesn't work either; we have to shut the VM down and then the copy succeeds, but then the same behavior occurs on the new hypervisor.

If we re-create the VM or restore a backup, everything works again, but this is no solution.

Which log files should I be checking exactly?

@t.lamprecht

I will update the pve-qemu-kvm package and let you know if it solves anything.

Update: I have updated the package and restarted the VM, and the console works now. So this seems to solve it!

Jema · Feb 19, 2020

@t.lamprecht

Unfortunately I got excited too soon. It worked OK on one VM, but I just tested another and it keeps timing out; I'm referring to the console display. It stays on "Connecting..." and nothing further is displayed.

Code:

()
VM 192 qmp command 'change' failed - unable to connect to VM 192 qmp socket - timeout after 598 retries
TASK ERROR: Failed to run vncproxy.

Jema · Feb 20, 2020

Does anyone have any clue, please? We are experiencing this on multiple hypervisors.

Jema · Feb 24, 2020

We still don't have a solution, and now it has started to happen randomly on other VMs as well. They become unavailable and the console times out, so there is no way to debug. The logs don't show anything.

Mor H. · Feb 24, 2020

The issue seems to be mostly confined to Windows VPSs, which are acting quite unstable.

The VPS itself may be marked as "Started", but it behaves like a ghost VPS: not accessible or manageable.
Setting "KVM hardware virtualization" to "No" seems to help slightly and stops the timeouts from happening, but the VMs still act as if they're overloaded and can't really be used.

Any ideas? Any and all help will be greatly appreciated.

Thank you.

Tom7320 · Jul 5, 2020

Same problem here. Started after the update from PVE 6.1 to 6.2.

Regards

Swifty.hu · Jan 23, 2021

Same problem here...

Stop
TASK ERROR: VM 102 qmp command 'system_reset' failed - unable to connect to VM 102 qmp socket - timeout after 31 retries

It happens with Debian and Windows VMs, mostly while the backup is running. We back up to a remote PBS (Proxmox Backup Server).

sid777 · Mar 9, 2021

Similar problem here. Is there still no solution?

Code:

root@glsv-px-2:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-4 (running version: 6.3-4/0a38c56f)
pve-kernel-5.4: 6.3-6
pve-kernel-helper: 6.3-6
pve-kernel-5.4.101-1-pve: 5.4.101-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-4.15: 5.4-19
pve-kernel-4.15.18-30-pve: 4.15.18-58
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 15.2.8-pve2
ceph-fuse: 15.2.8-pve2
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.0.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-5
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.2.0-2
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.3-pve2

When I want to restart the VM, the only thing that helps is kill -9 on the PID of its kvm process, but this is not a solution...

Hmm, interestingly, this behavior shows up in the Linux guests when a Windows VM with an incorrectly set OS type is started on the same host. The Windows VM behaves the same way: it freezes, and any action on it in the Proxmox GUI times out. After shutting down the Windows guest, the problem disappeared. I'll set the correct OS type, start it again and watch.

sid777 · Mar 9, 2021

sid777 said:
Hmm, interestingly, this behavior shows up in the Linux guests when a Windows VM with an incorrectly set OS type is started on the same host. The Windows VM behaves the same way: it freezes, and any action on it in the Proxmox GUI times out. After shutting down the Windows guest, the problem disappeared. I'll set the correct OS type, start it again and watch.

Here's what we found out:
when the Windows VM is running and the guest agent is enabled in its options but not installed in the guest, there are problems accessing the agent sockets of the other VMs on the same host.
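If that observation holds, a first step is to list the Windows VMs on a host that have the agent option enabled, and then confirm inside each guest that the QEMU guest agent is actually installed and running (for example with qm guest cmd <vmid> ping). A minimal Python sketch, assuming the usual /etc/pve/qemu-server/<vmid>.conf layout (key: value lines, snapshot sections starting with '['); the helper names are hypothetical, and it can only catch VMs whose ostype is already set to a Windows type (all of which start with "w", e.g. win10, w2k8, wxp):

```python
from pathlib import Path

TRUE_VALUES = {"1", "yes", "on", "true"}


def read_vm_config(text):
    """Parse the current (non-snapshot) part of a VM config into a dict."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("["):
            break  # snapshot sections follow the current config
        if not line or line.startswith("#"):
            continue  # empty line or description comment
        key, sep, value = line.partition(":")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def agent_enabled(value):
    """Interpret 'agent: 1' as well as 'agent: enabled=1,fstrim_cloned_disks=1'."""
    enabled = "0"
    for item in value.split(","):
        key, sep, val = item.partition("=")
        if not sep:
            enabled = key  # positional first item is the enabled flag
        elif key.strip() == "enabled":
            enabled = val
    return enabled.strip().lower() in TRUE_VALUES


def windows_vms_with_agent(conf_dir="/etc/pve/qemu-server"):
    hits = []
    for path in sorted(Path(conf_dir).glob("*.conf")):
        conf = read_vm_config(path.read_text())
        if conf.get("ostype", "").startswith("w") and agent_enabled(conf.get("agent", "0")):
            hits.append(int(path.stem))
    return hits


if __name__ == "__main__":
    for vmid in windows_vms_with_agent():
        print(f"VM {vmid}: Windows guest with agent enabled, check that the agent is installed")
```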

jw6677 · Sep 4, 2021

Dredging up this old thread, as I am also experiencing similar issues. I have been for over a year, and only on one specific node.

This node is unlike the rest of my hardware, with 48 cores, 96 threads (hyperthreading active), and ~700GB of memory.

Because this issue only appears to occur on this single node, I've historically assumed it is related to the machine's high CPU count or large amount of memory.

In a similar vein, we ran into issues adding more than 30 drives to a single VM on this node, but that was a while back.

Code:

proxmox-ve: 7.0-2 (running kernel: 5.11.22-3-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-6
pve-kernel-helper: 7.0-6
pve-kernel-5.4: 6.4-3
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.11.22-2-pve: 5.11.22-4
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 16.2.5-pve1
ceph-fuse: 16.2.5-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-10
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.2-4
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1
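Since the problem is limited to one node, comparing this output with pveversion -v from a healthy node can show whether the difference lies in packages or kernels rather than in the hardware. A minimal Python sketch (parse_pveversion and diff_versions are hypothetical helpers); it keeps the "running kernel" and "running version" notes as separate entries, because those can differ even when the package lists match:

```python
import re
import sys

# Notes such as "(running kernel: 5.11.22-3-pve)" after the package version.
NOTE = re.compile(r"\((running \w+): ([^)]+)\)")


def parse_pveversion(text):
    """Turn `pveversion -v` output into {package: version}."""
    versions = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue
        name = name.strip()
        versions[name] = rest.split()[0]
        for label, value in NOTE.findall(rest):
            versions[f"{name} ({label})"] = value.strip()
    return versions


def diff_versions(text_a, text_b):
    """Return {package: (version_a, version_b)} for entries that differ."""
    a, b = parse_pveversion(text_a), parse_pveversion(text_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    # Usage: python3 pvediff.py node-a.txt node-b.txt (files saved from pveversion -v)
    with open(sys.argv[1]) as fa, open(sys.argv[2]) as fb:
        for name, (va, vb) in diff_versions(fa.read(), fb.read()).items():
            print(f"{name}: {va} -> {vb}")
```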

Code:

qm start 130

Code:

# journalctl -xe

Sep 04 05:35:56 server pvestatd[2185365]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Sep 04 05:35:57 server pvestatd[2185365]: status update time (6.238 seconds)
Sep 04 05:36:06 server pvestatd[2185365]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Sep 04 05:36:06 server pvestatd[2185365]: status update time (6.215 seconds)
Sep 04 05:36:07 server pvedaemon[25173]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Sep 04 05:36:16 server pvestatd[2185365]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Sep 04 05:36:16 server pvestatd[2185365]: status update time (6.219 seconds)
...
Sep 04 05:36:27 server pvedaemon[25175]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Sep 04 05:36:29 server pmxcfs[24485]: [status] notice: received log
Sep 04 05:36:30 server pmxcfs[24485]: [status] notice: received log
Sep 04 05:36:36 server pvestatd[2185365]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Sep 04 05:36:37 server pvestatd[2185365]: status update time (6.202 seconds)
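To see which VMs and which QMP commands are affected over time, the timeout lines can be counted straight from the journal, e.g. journalctl --since today | python3 qmp_timeouts.py (the script name is hypothetical). A minimal Python sketch; the regular expression is written against the message format quoted in this thread and may need adjusting for other versions:

```python
import re
import sys
from collections import Counter

# Matches e.g. "VM 130 qmp command 'query-proxmox-support' failed - unable to
# connect to VM 130 qmp socket - timeout after 31 retries"
QMP_TIMEOUT = re.compile(
    r"VM (\d+) qmp command '([^']+)' failed - "
    r"unable to connect to VM \d+ qmp socket - timeout after \d+ retries"
)


def qmp_timeouts(lines):
    """Count QMP socket timeouts per (VMID, QMP command)."""
    counts = Counter()
    for line in lines:
        match = QMP_TIMEOUT.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    for (vmid, command), n in qmp_timeouts(sys.stdin).most_common():
        print(f"VM {vmid}: {n} x {command}")
```

If only one VM shows up while pvestatd's status update time stays around 6 seconds, that VM is a likely candidate for what is slowing the status loop down.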

In my case, the QEMU guest agent is not running, and this is an Ubuntu VM.

Historically, rebooting the node has resolved this issue, but only temporarily, which is not really a great solution.

jw6677 · Sep 4, 2021

This sort of thing pops up a lot on this node too:

Code:

400 Parameter verification failed. scsi10: hotplug problem - VM 130 qmp command 'query-pci' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries at /usr/share/perl5/PVE/API2/Qemu.pm line 1459. (500)

jw6677 · Sep 4, 2021

Probably Relevant:
https://forum.proxmox.com/threads/h...st-startup-hangs-and-whole-pc-restarts.86593/
https://forum.proxmox.com/threads/w...vm-100-socket-timeout-after-31-retries.20350/
https://forum.proxmox.com/threads/c...s-stop-working-with-qmp-command-failed.86343/
https://forum.proxmox.com/threads/certain-vms-from-a-cluster-cannot-be-backed-up-and-managed.57016/
https://forum.proxmox.com/threads/proxmox-server-aus-nach-backup.94532/post-411413
https://forum.proxmox.com/threads/l...chnical-expertise-with-gpu-passthrough.94333/
https://forum.proxmox.com/threads/vm-hanging-freeze-after-update-from-pve-6-to-7.93139/

jw6677 · Sep 4, 2021

Very Probably Relevant:
https://pve-devel.pve.proxmox.narkive.com/DOctgOiR/regression-with-latest-qemu-and-iothreads-option

Going to test with the iothread option disabled.
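To know which disks such a test affects, the disks with iothread enabled can be listed from the VM configs. A minimal Python sketch, assuming the usual config layout where a disk line looks like scsi0: <volume>,iothread=1,...; iothread_disks is a hypothetical helper, and only the current config (not snapshot sections) is checked:

```python
import re
from pathlib import Path

# Disk keys in a VM config: ide0, sata1, scsi10, virtio0, ...
DISK_KEY = re.compile(r"(ide|sata|scsi|virtio)\d+")
TRUE_VALUES = {"1", "on", "yes", "true"}


def iothread_disks(config_text):
    """Return the disk keys of the current config that have iothread enabled."""
    disks = []
    for line in config_text.splitlines():
        if line.startswith("["):
            break  # snapshot sections start here
        key, sep, value = line.partition(":")
        if not sep or not DISK_KEY.fullmatch(key.strip()):
            continue
        # value = "<volume>,opt=val,opt=val,..."; the first item is the volume.
        for item in value.strip().split(",")[1:]:
            opt, _, val = item.partition("=")
            if opt == "iothread" and val.lower() in TRUE_VALUES:
                disks.append(key.strip())
    return disks


if __name__ == "__main__":
    for conf in sorted(Path("/etc/pve/qemu-server").glob("*.conf")):
        disks = iothread_disks(conf.read_text())
        if disks:
            print(conf.stem, ", ".join(disks))
```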

jw6677 · Sep 28, 2021

~1 month bump
