Error with backup (when backing up) - qmp command 'query-backup' failed - got wrong command id

FYI: There's a newer version of the pve-qemu-kvm package (6.0.0-4) available through the pvetest repository for Proxmox VE 7. It includes a proposed fix for a race found in the QMP stack of QEMU that would explain this issue happening occasionally under heavy load.
 
FYI: There's a newer version of the pve-qemu-kvm package (6.0.0-4) available through the pvetest repository for Proxmox VE 7. It includes a proposed fix for a race found in the QMP stack of QEMU that would explain this issue happening occasionally under heavy load.
I updated to 6.0.0-4. It just broke a CentOS VM.

INFO: resuming VM again
ERROR: VM 9327 qmp command 'cont' failed - got timeout
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 9327 failed - VM 9327 qmp command 'cont' failed - got timeout

The machine was not under a large load and it was the first backup of the set!

Detailed trace below!

[screenshot attachment]

VM_PID=541196
root@v1:~# gdb attach $VM_PID -ex='thread apply all bt' -ex='quit'
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
attach: No such file or directory.
Attaching to process 541196
[New LWP 541197]
[New LWP 541260]
[New LWP 541261]
[New LWP 541262]
[New LWP 541263]
[New LWP 541266]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fb3508694f6 in __ppoll (fds=0x561d0fca0390, nfds=77, timeout=<optimized out>,
timeout@entry=0x7ffc4a71c370, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
44 ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.

Thread 7 (Thread 0x7fb33c9ff700 (LWP 541266) "kvm"):
#0 futex_wait_cancelable (private=0, expected=0, futex_word=0x561d10376748) at ../sysdeps/nptl/futex-internal.h:186
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x561d10376758, cond=0x561d10376720) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=cond@entry=0x561d10376720, mutex=mutex@entry=0x561d10376758) at pthread_cond_wait.c:638
#3 0x0000561d0d95bdfb in qemu_cond_wait_impl (cond=0x561d10376720, mutex=0x561d10376758, file=0x561d0da3f6f2 "../ui/vnc-jobs.c", line=248) at ../util/qemu-thread-posix.c:174
#4 0x0000561d0d55bbc3 in vnc_worker_thread_loop (queue=0x561d10376720) at ../ui/vnc-jobs.c:248
#5 0x0000561d0d55c888 in vnc_worker_thread (arg=arg@entry=0x561d10376720) at ../ui/vnc-jobs.c:361
#6 0x0000561d0d95b6b9 in qemu_thread_start (args=0x7fb33c9fa3f0) at ../util/qemu-thread-posix.c:521
#7 0x00007fb350943ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8 0x00007fb350873def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7fb33effd700 (LWP 541263) "kvm"):
#0 0x00007fb35086acc7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1 0x0000561d0d7e9ac7 in kvm_vcpu_ioctl (cpu=cpu@entry=0x561d0f21ffc0, type=type@entry=44672) at ../accel/kvm/kvm-all.c:2630
#2 0x0000561d0d7e9c31 in kvm_cpu_exec (cpu=cpu@entry=0x561d0f21ffc0) at ../accel/kvm/kvm-all.c:2467
#3 0x0000561d0d787dad in kvm_vcpu_thread_fn (arg=arg@entry=0x561d0f21ffc0) at ../accel/kvm/kvm-accel-ops.c:49
#4 0x0000561d0d95b6b9 in qemu_thread_start (args=0x7fb33eff83f0) at ../util/qemu-thread-posix.c:521
#5 0x00007fb350943ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6 0x00007fb350873def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7fb33f7fe700 (LWP 541262) "kvm"):
#0 0x00007fb35086acc7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1 0x0000561d0d7e9ac7 in kvm_vcpu_ioctl (cpu=cpu@entry=0x561d0f2124f0, type=type@entry=44672) at ../accel/kvm/kvm-all.c:2630
#2 0x0000561d0d7e9c31 in kvm_cpu_exec (cpu=cpu@entry=0x561d0f2124f0) at ../accel/kvm/kvm-all.c:2467
#3 0x0000561d0d787dad in kvm_vcpu_thread_fn (arg=arg@entry=0x561d0f2124f0) at ../accel/kvm/kvm-accel-ops.c:49
#4 0x0000561d0d95b6b9 in qemu_thread_start (args=0x7fb33f7f93f0) at ../util/qemu-thread-posix.c:521
#5 0x00007fb350943ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6 0x00007fb350873def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7fb33ffff700 (LWP 541261) "kvm"):
#0 0x00007fb35086acc7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1 0x0000561d0d7e9ac7 in kvm_vcpu_ioctl (cpu=cpu@entry=0x561d0f203c80, type=type@entry=44672) at ../accel/kvm/kvm-all.c:2630
#2 0x0000561d0d7e9c31 in kvm_cpu_exec (cpu=cpu@entry=0x561d0f203c80) at ../accel/kvm/kvm-all.c:2467
#3 0x0000561d0d787dad in kvm_vcpu_thread_fn (arg=arg@entry=0x561d0f203c80) at ../accel/kvm/kvm-accel-ops.c:49
#4 0x0000561d0d95b6b9 in qemu_thread_start (args=0x7fb33fffa3f0) at ../util/qemu-thread-posix.c:521
#5 0x00007fb350943ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6 0x00007fb350873def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7fb345a47700 (LWP 541260) "kvm"):
#0 0x00007fb35086acc7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1 0x0000561d0d7e9ac7 in kvm_vcpu_ioctl (cpu=cpu@entry=0x561d0f1cde30, type=type@entry=44672) at ../accel/kvm/kvm-all.c:2630
#2 0x0000561d0d7e9c31 in kvm_cpu_exec (cpu=cpu@entry=0x561d0f1cde30) at ../accel/kvm/kvm-all.c:2467
#3 0x0000561d0d787dad in kvm_vcpu_thread_fn (arg=arg@entry=0x561d0f1cde30) at ../accel/kvm/kvm-accel-ops.c:49
#4 0x0000561d0d95b6b9 in qemu_thread_start (args=0x7fb345a423f0) at ../util/qemu-thread-posix.c:521
#5 0x00007fb350943ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6 0x00007fb350873def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fb346349700 (LWP 541197) "kvm"):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x0000561d0d95c64a in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /build/pve-qemu/pve-qemu-kvm-6.0.0/include/qemu/futex.h:29
#2 qemu_event_wait (ev=ev@entry=0x561d0deed148 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:460
#3 0x0000561d0d98680a in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:258
#4 0x0000561d0d95b6b9 in qemu_thread_start (args=0x7fb3463443f0) at ../util/qemu-thread-posix.c:521
#5 0x00007fb350943ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6 0x00007fb350873def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fb3464b31c0 (LWP 541196) "kvm"):
#0 0x00007fb3508694f6 in __ppoll (fds=0x561d0fca0390, nfds=77, timeout=<optimized out>, timeout@entry=0x7ffc4a71c370, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1 0x0000561d0d95d521 in ppoll (__ss=0x0, __timeout=0x7ffc4a71c370, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=1999362413) at ../util/qemu-timer.c:348
#3 0x0000561d0d9697e5 in os_host_main_loop_wait (timeout=1999362413) at ../util/main-loop.c:249
#4 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:530
#5 0x0000561d0d7b5b91 in qemu_main_loop () at ../softmmu/runstate.c:725
#6 0x0000561d0d476c0e in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:50
A debugging session is active.

Inferior 1 [process 541196] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/bin/qemu-system-x86_64, process 541196
[Inferior 1 (process 541196) detached]
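For reference, attaching by PID directly avoids the "attach: No such file or directory" line above and dumps all thread backtraces non-interactively (a sketch of the same command):

Bash:
# Attach to the running QEMU process by PID and dump all thread backtraces,
# then exit automatically (--batch runs the given commands and quits)
VM_PID=541196
gdb --batch -p "$VM_PID" -ex 'thread apply all bt'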
 
ERROR: Backup of VM 9327 failed - VM 9327 qmp command 'cont' failed - got timeout
Hm, that seems to be a different error though. Did it work on repeat attempts, or is the error 100% reproducible? We've seen the "timeout on cont" error before too, but couldn't reproduce that one yet. The "got wrong command id" one at least *should* be fixed by the "pvetest" version.
 
Hm, that seems to be a different error though. Did it work on repeat attempts, or is the error 100% reproducible? We've seen the "timeout on cont" error before too, but couldn't reproduce that one yet. The "got wrong command id" one at least *should* be fixed by the "pvetest" version.
It is not 100% reproducible. Some machines do it more often than others, but it could be CentOS, Windows, Ubuntu, etc.

Sometimes for 2 days no errors happen, then the next day 2 or 3 happen...

It just happened on a CentOS VM and broke it; I am performing a reboot now.
 
Hello,

To add to this, we're also encountering the "qmp cont timeout" error. Seems perfectly random, but appears to be somewhat linked with CPU usage or the node activity. Had a job fail on a big VM constantly, and the workaround was to schedule the backup to be on its own, and not between two other backups. We've had, however, a crash on a relatively small VM this morning, meaning that the VM size on disk might not be the only culprit, contrary to what I suspected before.

I wonder if this can be replicated when stressing a node manually and attempting a backup.

EDIT : I tried doing a backup with a 100% CPU load on both the PBS & the PVE, all went well on a small VM (10G). Will try with a bigger one.
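For anyone who wants to try reproducing it the same way, this is roughly what I did (a sketch; stress-ng has to be installed, and the VMID and storage name are placeholders for your own setup):

Bash:
# Generate full CPU load on the PVE node (run the same on the PBS host);
# '--cpu 0' spawns one worker per core, here for 10 minutes
stress-ng --cpu 0 --timeout 600 &

# While the load is running, start a backup of a test VM
# (VMID 100 and storage 'pbs' are placeholders)
vzdump 100 --storage pbs --mode snapshot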
 
Hello,

To add to this, we're also encountering the "qmp cont timeout" error. Seems perfectly random, but appears to be somewhat linked with CPU usage or the node activity. Had a job fail on a big VM constantly, and the workaround was to schedule the backup to be on its own, and not between two other backups. We've had, however, a crash on a relatively small VM this morning, meaning that the VM size on disk might not be the only culprit, contrary to what I suspected before.

I wonder if this can be replicated when stressing a node manually and attempting a backup.

EDIT : I tried doing a backup with a 100% CPU load on both the PBS & the PVE, all went well on a small VM (10G). Will try with a bigger one.
From my experience, I don't think the assumption that it is tied to high load is correct. I have done tons of huge VMs that had to start a new bitmap, and they normally work. The problem most commonly occurs even on the first VM of a set, under light load. Again, this started with Proxmox 7 and Proxmox Backup Server 2.0.
 
I agree. I've tested with extreme synthetic load and the backup starts fine. This was a hypothesis because the load on our server was always a bit high.
 
Hi, no change with the latest QEMU 6.0.0-4 update as of 09/16/21.
One server with one VM, without any load during the backup.
Any idea?
I have different Proxmox hosts and clusters; only this one (1 host / 1 PVE / 1 VM / 1 PBS) produces the problem :(

[screenshot attachments]
 
One server with one VM, without any load during the backup.
Any idea?
You not only need to update the package but also fully restart the VM(s), or live-migrate them to an already upgraded node if in a cluster. Otherwise the VM still runs with the old QEMU version and the fix naturally cannot work.
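If unsure, one way to check whether a running VM still uses a pre-update QEMU binary (a sketch; it assumes the default pidfile location and an example VMID of 9327):

Bash:
VMID=9327
# If the link target ends in '(deleted)', the binary on disk was replaced after
# the VM started, i.e. the VM is still running the old QEMU version
ls -l /proc/$(cat /var/run/qemu-server/${VMID}.pid)/exe

# Alternatively, ask the running instance itself:
qm monitor ${VMID}
# then type 'info version' at the monitor prompt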
 
The nodes and VMs have been restarted to ensure they are on the latest version. I even have a node that is running the current test repository updates, and I've run tests since that update yesterday. I am continuing to get qmp command 'cont' failed - got timeout errors that sometimes break the file system, forcing a reboot and sometimes a manual fsck.

Again this started with Proxmox 7 and was not an issue in 6.x
 
The nodes and VMs have been restarted to ensure they are on the latest version. I even have a node that is running the current test repository updates, and I've run tests since that update yesterday. I am continuing to get qmp command 'cont' failed - got timeout errors that sometimes break the file system, forcing a reboot and sometimes a manual fsck.

Again this started with Proxmox 7 and was not an issue in 6.x
+

I started experiencing problems with backups and 'qmp command 'cont' failed - got timeout' with one of my VMs (Ubuntu 20.03.04, all updates, kernel 5.11.0-36, btrfs LVM filesystem, MS SQL Server 2019 installed). It doesn't matter whether the QEMU agent is used or not.

Proxmox 7 - clean install on btrfs raid1. Previously, there were no such problems with this VM, even after the upgrade from 6 to 7.
 
Hi, I have solved my problem!
I have just upgraded my very slow CPU (Xeon 2407, 2.2 GHz, no turbo) to a small but better 2450 at 2.9 GHz!
A buffer/timeout issue, or something like that?

[screenshot attachment]
 
+

I started experiencing problems with backups and 'qmp command 'cont' failed - got timeout' with one of my VMs (Ubuntu 20.03.04, all updates, kernel 5.11.0-36, btrfs LVM filesystem, MS SQL Server 2019 installed). It doesn't matter whether the QEMU agent is used or not.

Proxmox 7 - clean install on btrfs raid1. Previously, there were no such problems with this VM, even after the upgrade from 6 to 7.
I did some experiments (PVE 7, all updates to date, pvetest):

1. Rebooted the PVE host (all VMs are turned off after a reboot; there is no automatic VM power-on)
2. apt reinstall qemu-server*
3. apt reinstall pve-qemu-kvm*
4. launched a backup (VM 1005, with which I am experiencing a backup problem):
vzdump 1005 --storage iomegapbs --mode snapshot --node theta --remove 0
5. A miracle happened - the backup was successful
6. turned on VM 1005
7. launched a backup, the backup was successful again

Let's see what happens next.
 
I did some experiments (PVE 7, all updates to date, pvetest):

1. Rebooted the PVE host (all VMs are turned off after a reboot; there is no automatic VM power-on)
2. apt reinstall qemu-server*
3. apt reinstall pve-qemu-kvm*
4. launched a backup (VM 1005, with which I am experiencing a backup problem):
vzdump 1005 --storage iomegapbs --mode snapshot --node theta --remove 0
5. A miracle happened - the backup was successful
6. turned on VM 1005
7. launched a backup, the backup was successful again

Let's see what happens next.
After the latest updates to 5.11.22-5-pve (#1 SMP PVE 5.11.22-10) and a reboot, the host system caught the same error (qmp cont timeout).

I reinstalled pve-qemu-kvm / qemu-server again and the backups work again.
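For reference, the workaround condensed into commands (just the steps from the posts above; VMID 1005, storage 'iomegapbs' and node 'theta' are specific to my setup):

Bash:
# Reinstall the two packages, then retry the backup
apt reinstall qemu-server pve-qemu-kvm
vzdump 1005 --storage iomegapbs --mode snapshot --node theta --remove 0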
 
Latest updates: pve-kernel-5.13.14-1-pve & qemu-server 7.0-16.
Rebooted the host, started a backup of the VM (it was stopped after the reboot).

Bash:
INFO: starting new backup job: vzdump 1005 --node theta --storage iomegapbs --mode snapshot --remove 0
INFO: Starting Backup of VM 1005 (qemu)
INFO: Backup started at 2021-10-07 01:58:30
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: mssql-bh
INFO: include disk 'scsi0' 'local-btrfs:1005/vm-1005-disk-0.raw' 110G
INFO: creating Proxmox Backup Server archive 'vm/1005/2021-10-06T22:58:30Z'
INFO: starting kvm to execute backup task
INFO: started backup task 'df7de857-23fc-478f-b870-8e4a297fcc0a'
ERROR: VM 1005 qmp command 'query-pbs-bitmap-info' failed - got timeout
INFO: aborting backup job
INFO: stopping kvm after backup task
ERROR: Backup of VM 1005 failed - VM 1005 qmp command 'query-pbs-bitmap-info' failed - got timeout
INFO: Failed at 2021-10-07 01:58:51
INFO: Backup job finished with errors
TASK ERROR: job errors

Ok....

Rebooted the host, ran apt reinstall qemu-server.
Started a backup of the VM (it was stopped after the reboot) - OK

Bash:
INFO: starting new backup job: vzdump 1005 --node theta --storage iomegapbs --remove 0 --mode snapshot
INFO: Starting Backup of VM 1005 (qemu)
INFO: Backup started at 2021-10-07 02:12:58
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: mssql-bh
INFO: include disk 'scsi0' 'local-btrfs:1005/vm-1005-disk-0.raw' 110G
INFO: creating Proxmox Backup Server archive 'vm/1005/2021-10-06T23:12:58Z'
INFO: starting kvm to execute backup task
INFO: started backup task '800855c3-2185-4ad4-b7ce-33ae7d21684c'
INFO: scsi0: dirty-bitmap status: created new
INFO:   1% (1.3 GiB of 110.0 GiB) in 4s, read: 333.0 MiB/s, write: 75.0 MiB/s
INFO:   2% (3.0 GiB of 110.0 GiB) in 15s, read: 156.0 MiB/s, write: 69.1 MiB/s
INFO:   3% (3.4 GiB of 110.0 GiB) in 18s, read: 146.7 MiB/s, write: 109.3 MiB/s
...
...
...
INFO:  95% (104.6 GiB of 110.0 GiB) in 7m 37s, read: 146.0 MiB/s, write: 141.5 MiB/s
INFO: 100% (110.0 GiB of 110.0 GiB) in 7m 44s, read: 793.1 MiB/s, write: 110.3 MiB/s
INFO: backup is sparse: 15.02 GiB (13%) total zero data
INFO: backup was done incrementally, reused 81.07 GiB (73%)
INFO: transferred 110.00 GiB in 464 seconds (242.8 MiB/s)
INFO: stopping kvm after backup task
INFO: Finished Backup of VM 1005 (00:07:46)
INFO: Backup finished at 2021-10-07 02:20:44
Result: {
  "data": null
}
INFO: Backup job finished successfully
TASK OK

What "Result: { "data": null }" means, I don't know...
 
Good day,

Has there been any news from The Proxmox Team regarding this issue?

Cheers.
 
Hello!

I have the same problem, but only with one VM.

I will try a workaround and post the results. Sorry for my bad English!

Regards,

Fransuêlo
 
Hi everybody!



My workaround works fine!

For example, you move the VM's disk to local storage with 'delete source' checked, and then you move the disk back to the original storage, again with 'delete source' checked.
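On the CLI that would be roughly (a sketch; the VMID, disk name and storage names are placeholders, and --delete 1 corresponds to 'delete source'):

Bash:
# Move the disk to local storage, deleting the source ...
qm move_disk 100 scsi0 local --delete 1
# ... then move it back to the original storage, again deleting the source
qm move_disk 100 scsi0 original-storage --delete 1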
 
