Critical: qemu-server / qmp issues - VMs stop working with: qmp command [...] failed

optymale

When VMs are backed up by PBS, some of them often end up with QMP errors.

All VMs with QMP errors stop working sooner or later. Graceful reboot/shutdown and the console no longer work. The only solution is a hard power-off, boot, and migration.
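
Our current workaround looks roughly like this (VMID 170 as in the log below; the target node name is just a placeholder):

Bash:
# remove a stale lock left behind by the failed backup, if any
qm unlock 170

# graceful shutdown no longer works, so hard-stop the VM
qm stop 170

# start it again and move it away from the affected node
qm start 170
qm migrate 170 <target-node> --online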

Log:
Code:
Mar 22 00:53:08 2670-010 pvedaemon[3114999]: VM 170 qmp command failed - VM 170 qmp command 'query-machines' failed - unable to connect to VM 170 qmp socket - timeout after 31 retries
Mar 22 00:53:15 2670-010 pvestatd[2203]: VM 170 qmp command failed - VM 170 qmp command 'query-proxmox-support' failed - unable to connect to VM 170 qmp socket - timeout after 31 retries
Mar 22 00:53:16 2670-010 pvedaemon[3115109]: VM 170 qmp command failed - VM 170 qmp command 'quit' failed - unable to connect to VM 170 qmp socket - timeout after 31 retries
Mar 22 00:53:18 2670-010 pvedaemon[2972890]: VM 170 qmp command failed - VM 170 qmp command 'query-proxmox-support' failed - unable to connect to VM 170 qmp socket - timeout after 31 retries
Mar 22 00:53:25 2670-010 pvestatd[2203]: VM 170 qmp command failed - VM 170 qmp command 'query-proxmox-support' failed - unable to connect to VM 170 qmp socket - timeout after 31 retries
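
To check whether the QMP socket of an affected VM is still responsive at all, a minimal probe looks roughly like this (assuming the default socket path /var/run/qemu-server/<VMID>.qmp and that socat is installed):

Bash:
# QMP requires a capabilities handshake before any other command;
# a healthy VM answers with the greeting and two "return" objects,
# on an affected VM this hangs just like pvedaemon/pvestatd do
printf '%s\n' '{"execute":"qmp_capabilities"}' '{"execute":"query-status"}' \
  | socat -t 5 - UNIX-CONNECT:/var/run/qemu-server/170.qmp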

This is extremely critical. Similar problems occurred in previous versions and were always reported as fixed, see here:

https://forum.proxmox.com/threads/w...vm-100-socket-timeout-after-31-retries.20350/

https://forum.proxmox.com/threads/error-vm-103-qmp-command-query-backup-failed-got-timeout.76554/

https://forum.proxmox.com/threads/certain-vms-from-a-cluster-cannot-be-backed-up-and-managed.57016/

A real, long-term bugfix is urgently needed, as VMs keep crashing randomly at night and cannot be migrated by HA because of the same problem.

With this setup, we currently cannot run a working virtualization cluster on Proxmox. The problem occurs sporadically and unpredictably in several independent clusters.

Bash:
root@pvenode1:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.98-1-pve)
pve-manager: 6.3-4 (running version: 6.3-4/0a38c56f)
pve-kernel-5.4: 6.3-5
pve-kernel-helper: 6.3-5
pve-kernel-5.4.98-1-pve: 5.4.98-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.78-1-pve: 5.4.78-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 15.2.8-pve2
ceph-fuse: 15.2.8-pve2
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.0.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-5
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.2.0-1
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.3-pve1


Information about the cluster/nodes:

3-node cluster with Ceph; exclusively QEMU/KVM VMs, all stored on Ceph.

The cluster is backed up daily at night by local Proxmox backups and additionally by Proxmox Backup Server.

The backup jobs do not overlap or interfere with each other.
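
To illustrate the scheduling (storage IDs and times below are placeholders, not the real configuration), the jobs in /etc/pve/vzdump.cron are staggered roughly like this:

Code:
# /etc/pve/vzdump.cron (sketch with placeholder storage IDs and times)
PATH="/usr/sbin:/usr/bin:/sbin:/bin"

# nightly vzdump backup to a local backup storage
0 1 * * *           root vzdump --all 1 --mode snapshot --storage local-backup --quiet 1

# nightly backup to the Proxmox Backup Server, scheduled after the local job has finished
0 3 * * *           root vzdump --all 1 --mode snapshot --storage pbs01 --quiet 1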

Most VMs have the QEMU guest agent installed and enabled.
 
With this setup, we currently cannot run a working virtualization cluster on Proxmox

Just run the recommended packages from the enterprise repository and you are not affected by this issue.

But yes, the issue has been found and you will see a fix soon.
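
For PVE 6.x that means roughly the following (a valid subscription key must be registered on the node, otherwise apt update will fail against the enterprise repository):

Bash:
# PVE 6.x is based on Debian Buster
echo "deb https://enterprise.proxmox.com/debian/pve buster pve-enterprise" \
  > /etc/apt/sources.list.d/pve-enterprise.list

# disable or remove any pve-no-subscription entry, then update
apt update
apt full-upgrade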
 
Just run the recommended packages from the enterprise repository and you are not affected by this issue.

But yes, the issue has been found and you will see a fix soon.

@tom
If I buy a subscription now and use the enterprise repository right away, will the bug be solved immediately after updating? Or do I have to wait for the next minor/major update?
 
apt will not downgrade packages, so you have to downgrade manually.
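
A manual downgrade looks roughly like this (the packages and the target version are placeholders; check first which versions your configured repositories actually provide):

Bash:
# show which versions are available from the configured repositories
apt-cache policy pve-qemu-kvm qemu-server

# install a specific older version explicitly, e.g.
apt install pve-qemu-kvm=<older-version> qemu-server=<older-version>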
 
Same problem with 6.4:

Code:
May 31 09:31:13 pm-cal-56-02 pvestatd[3589]: VM 134 qmp command failed - VM 134 qmp command 'query-proxmox-support' failed - unable to connect to VM 134 qmp socket - timeout after 31 retries
May 31 09:31:16 pm-cal-56-02 pvestatd[3589]: VM 138 qmp command failed - VM 138 qmp command 'query-proxmox-support' failed - got timeout
May 31 09:31:19 pm-cal-56-02 pvestatd[3589]: status update time (13.761 seconds)
May 31 09:31:25 pm-cal-56-02 pvestatd[3589]: VM 138 qmp command failed - VM 138 qmp command 'query-proxmox-support' failed - unable to connect to VM 138 qmp socket - timeout after 31 retries
May 31 09:31:28 pm-cal-56-02 pvestatd[3589]: VM 134 qmp command failed - VM 134 qmp command 'query-proxmox-support' failed - unable to connect to VM 134 qmp socket - timeout after 31 retries
May 31 09:31:33 pm-cal-56-02 pvestatd[3589]: got timeout
May 31 09:31:33 pm-cal-56-02 pvestatd[3589]: status update time (14.259 seconds)
May 31 09:31:42 pm-cal-56-02 pvestatd[3589]: VM 138 qmp command failed - VM 138 qmp command 'query-proxmox-support' failed - got timeout
May 31 09:31:45 pm-cal-56-02 pvestatd[3589]: VM 134 qmp command failed - VM 134 qmp command 'query-proxmox-support' failed - unable to connect to VM 134 qmp socket - timeout after 31 retries
May 31 09:31:45 pm-cal-56-02 pvestatd[3589]: status update time (12.259 seconds)
May 31 09:31:54 pm-cal-56-02 pvestatd[3589]: VM 138 qmp command failed - VM 138 qmp command 'query-proxmox-support' failed - unable to connect to VM 138 qmp socket - timeout after 31 retries
May 31 09:31:57 pm-cal-56-02 pvestatd[3589]: VM 134 qmp command failed - VM 134 qmp command 'query-proxmox-support' failed - unable to connect to VM 134 qmp socket - timeout after 31 retries
May 31 09:32:00 pm-cal-56-02 systemd[1]: Starting Proxmox VE replication runner...
May 31 09:32:01 pm-cal-56-02 systemd[1]: pvesr.service: Succeeded.
May 31 09:32:01 pm-cal-56-02 systemd[1]: Started Proxmox VE replication runner.
May 31 09:32:06 pm-cal-56-02 pvestatd[3589]: status update time (20.287 seconds)
May 31 09:32:12 pm-cal-56-02 pvestatd[3589]: VM 134 qmp command failed - VM 134 qmp command 'query-proxmox-support' failed - unable to connect to VM 134 qmp socket - timeout after 31 retries

Bash:
root@pm-cal-56-02:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.4-4 (running version: 6.4-4/337d6701)
pve-kernel-5.4: 6.4-1
pve-kernel-helper: 6.4-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 14.2.20-pve1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-2
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-1
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-3
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-1
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1
 
Same here on 6.4, we can't move VMs between nodes or from local to shared storage... :(

Code:
Jun 21 14:17:54 cloud09 pvestatd[2481]: VM 303 qmp command failed - VM 303 qmp command 'query-proxmox-support' failed - got timeout
Jun 21 14:17:54 cloud09 pvedaemon[28078]: VM 303 qmp command failed - VM 303 qmp command 'query-block-jobs' failed - got wrong command id '2481:485164' (expected 28078:1251)
Jun 21 14:17:56 cloud09 pvestatd[2481]: status update time (7.549 seconds)
Jun 21 14:18:00 cloud09 systemd[1]: Starting Proxmox VE replication runner...
Jun 21 14:18:01 cloud09 systemd[1]: pvesr.service: Succeeded.
Jun 21 14:18:01 cloud09 systemd[1]: Started Proxmox VE replication runner.
Jun 21 14:18:16 cloud09 pmxcfs[2344]: [status] notice: received log
Jun 21 14:18:42 cloud09 pvedaemon[28078]: storage migration failed: block job (mirror) error: VM 303 qmp command 'query-block-jobs' failed - got wrong command id '2481:485164' (expected 28078:1251)
migration failed: block job (mirror) error: VM 303 qmp command 'query-block-jobs' failed - got wrong command id '2481:485164' (expected 28078:1251)
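
For reference, the CLI equivalent for moving a disk from local to shared storage is roughly (VMID 303 from the log above; disk name and target storage are placeholders):

Bash:
# move a disk of the running VM to shared storage
# (roughly what the GUI "Move disk" action does)
qm move_disk 303 scsi0 <shared-storage> --delete 1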
 
