Error with backup (when backing up) - qmp command 'query-backup' failed - got wrong command id

FYI: There's a newer version of the pve-qemu-kvm package (6.0.0-4) available through the pvetest repository for Proxmox VE 7. It includes a proposed fix for a race found in the QMP stack of QEMU that would explain this issue happening occasionally under heavy load.
 
FYI: There's a newer version of the pve-qemu-kvm package (6.0.0-4) available through the pvetest repository for Proxmox VE 7. It includes a proposed fix for a race found in the QMP stack of QEMU that would explain this issue happening occasionally under heavy load.
I updated to 6.0.0-4. It just broke a CentOS VM.

INFO: resuming VM again
ERROR: VM 9327 qmp command 'cont' failed - got timeout
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 9327 failed - VM 9327 qmp command 'cont' failed - got timeout

The machine was not under a large load and it was the first backup of the set!

Detailed trace below!

[screenshot attachment]

VM_PID=541196
root@v1:~# gdb attach $VM_PID -ex='thread apply all bt' -ex='quit'
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
attach: No such file or directory.
Attaching to process 541196
[New LWP 541197]
[New LWP 541260]
[New LWP 541261]
[New LWP 541262]
[New LWP 541263]
[New LWP 541266]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fb3508694f6 in __ppoll (fds=0x561d0fca0390, nfds=77, timeout=<optimized out>,
timeout@entry=0x7ffc4a71c370, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
44 ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.

Thread 7 (Thread 0x7fb33c9ff700 (LWP 541266) "kvm"):
#0 futex_wait_cancelable (private=0, expected=0, futex_word=0x561d10376748) at ../sysdeps/nptl/futex-internal.h:186
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x561d10376758, cond=0x561d10376720) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=cond@entry=0x561d10376720, mutex=mutex@entry=0x561d10376758) at pthread_cond_wait.c:638
#3 0x0000561d0d95bdfb in qemu_cond_wait_impl (cond=0x561d10376720, mutex=0x561d10376758, file=0x561d0da3f6f2 "../ui/vnc-jobs.c", line=248) at ../util/qemu-thread-posix.c:174
#4 0x0000561d0d55bbc3 in vnc_worker_thread_loop (queue=0x561d10376720) at ../ui/vnc-jobs.c:248
#5 0x0000561d0d55c888 in vnc_worker_thread (arg=arg@entry=0x561d10376720) at ../ui/vnc-jobs.c:361
#6 0x0000561d0d95b6b9 in qemu_thread_start (args=0x7fb33c9fa3f0) at ../util/qemu-thread-posix.c:521
#7 0x00007fb350943ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8 0x00007fb350873def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7fb33effd700 (LWP 541263) "kvm"):
#0 0x00007fb35086acc7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1 0x0000561d0d7e9ac7 in kvm_vcpu_ioctl (cpu=cpu@entry=0x561d0f21ffc0, type=type@entry=44672) at ../accel/kvm/kvm-all.c:2630
#2 0x0000561d0d7e9c31 in kvm_cpu_exec (cpu=cpu@entry=0x561d0f21ffc0) at ../accel/kvm/kvm-all.c:2467
#3 0x0000561d0d787dad in kvm_vcpu_thread_fn (arg=arg@entry=0x561d0f21ffc0) at ../accel/kvm/kvm-accel-ops.c:49
#4 0x0000561d0d95b6b9 in qemu_thread_start (args=0x7fb33eff83f0) at ../util/qemu-thread-posix.c:521
#5 0x00007fb350943ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6 0x00007fb350873def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7fb33f7fe700 (LWP 541262) "kvm"):
#0 0x00007fb35086acc7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1 0x0000561d0d7e9ac7 in kvm_vcpu_ioctl (cpu=cpu@entry=0x561d0f2124f0, type=type@entry=44672) at ../accel/kvm/kvm-all.c:2630
#2 0x0000561d0d7e9c31 in kvm_cpu_exec (cpu=cpu@entry=0x561d0f2124f0) at ../accel/kvm/kvm-all.c:2467
#3 0x0000561d0d787dad in kvm_vcpu_thread_fn (arg=arg@entry=0x561d0f2124f0) at ../accel/kvm/kvm-accel-ops.c:49
#4 0x0000561d0d95b6b9 in qemu_thread_start (args=0x7fb33f7f93f0) at ../util/qemu-thread-posix.c:521
#5 0x00007fb350943ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6 0x00007fb350873def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7fb33ffff700 (LWP 541261) "kvm"):
#0 0x00007fb35086acc7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1 0x0000561d0d7e9ac7 in kvm_vcpu_ioctl (cpu=cpu@entry=0x561d0f203c80, type=type@entry=44672) at ../accel/kvm/kvm-all.c:2630
#2 0x0000561d0d7e9c31 in kvm_cpu_exec (cpu=cpu@entry=0x561d0f203c80) at ../accel/kvm/kvm-all.c:2467
#3 0x0000561d0d787dad in kvm_vcpu_thread_fn (arg=arg@entry=0x561d0f203c80) at ../accel/kvm/kvm-accel-ops.c:49
#4 0x0000561d0d95b6b9 in qemu_thread_start (args=0x7fb33fffa3f0) at ../util/qemu-thread-posix.c:521
#5 0x00007fb350943ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6 0x00007fb350873def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7fb345a47700 (LWP 541260) "kvm"):
#0 0x00007fb35086acc7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1 0x0000561d0d7e9ac7 in kvm_vcpu_ioctl (cpu=cpu@entry=0x561d0f1cde30, type=type@entry=44672) at ../accel/kvm/kvm-all.c:2630
#2 0x0000561d0d7e9c31 in kvm_cpu_exec (cpu=cpu@entry=0x561d0f1cde30) at ../accel/kvm/kvm-all.c:2467
#3 0x0000561d0d787dad in kvm_vcpu_thread_fn (arg=arg@entry=0x561d0f1cde30) at ../accel/kvm/kvm-accel-ops.c:49
#4 0x0000561d0d95b6b9 in qemu_thread_start (args=0x7fb345a423f0) at ../util/qemu-thread-posix.c:521
#5 0x00007fb350943ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6 0x00007fb350873def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fb346349700 (LWP 541197) "kvm"):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x0000561d0d95c64a in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /build/pve-qemu/pve-qemu-kvm-6.0.0/include/qemu/futex.h:29
#2 qemu_event_wait (ev=ev@entry=0x561d0deed148 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:460
#3 0x0000561d0d98680a in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:258
#4 0x0000561d0d95b6b9 in qemu_thread_start (args=0x7fb3463443f0) at ../util/qemu-thread-posix.c:521
#5 0x00007fb350943ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6 0x00007fb350873def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fb3464b31c0 (LWP 541196) "kvm"):
#0 0x00007fb3508694f6 in __ppoll (fds=0x561d0fca0390, nfds=77, timeout=<optimized out>, timeout@entry=0x7ffc4a71c370, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1 0x0000561d0d95d521 in ppoll (__ss=0x0, __timeout=0x7ffc4a71c370, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=1999362413) at ../util/qemu-timer.c:348
#3 0x0000561d0d9697e5 in os_host_main_loop_wait (timeout=1999362413) at ../util/main-loop.c:249
#4 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:530
#5 0x0000561d0d7b5b91 in qemu_main_loop () at ../softmmu/runstate.c:725
#6 0x0000561d0d476c0e in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:50
A debugging session is active.

Inferior 1 [process 541196] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/bin/qemu-system-x86_64, process 541196
[Inferior 1 (process 541196) detached]
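For reference, attaching by PID directly avoids the "attach: No such file or directory" line above and dumps all thread backtraces non-interactively (a sketch of the same command):

Bash:
# Attach to the running QEMU process by PID and dump all thread backtraces,
# then exit automatically (--batch runs the given commands and quits)
VM_PID=541196
gdb --batch -p "$VM_PID" -ex 'thread apply all bt'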
 
ERROR: Backup of VM 9327 failed - VM 9327 qmp command 'cont' failed - got timeout
Hm, that seems to be a different error though. Did it work on repeat attempts, or is the error 100% reproducible? We've seen the "timeout on cont" error before too, but couldn't reproduce that one yet. The "got wrong command id" one at least *should* be fixed by the "pvetest" version.
 
Hm, that seems to be a different error though. Did it work on repeat attempts, or is the error 100% reproducible? We've seen the "timeout on cont" error before too, but couldn't reproduce that one yet. The "got wrong command id" one at least *should* be fixed by the "pvetest" version.
It is not 100% reproducible. Some machines do it more often than others, but it could be CentOS, Windows, Ubuntu, etc.

Sometimes for 2 days no errors happen, then the next day 2 or 3 happen...

It just happened on a CentOS VM and broke it; I am performing a reboot now.
 
Hello,

To add to this, we're also encountering the "qmp cont timeout" error. Seems perfectly random, but appears to be somewhat linked with CPU usage or the node activity. Had a job fail on a big VM constantly, and the workaround was to schedule the backup to be on its own, and not between two other backups. We've had, however, a crash on a relatively small VM this morning, meaning that the VM size on disk might not be the only culprit, contrary to what I suspected before.

I wonder if this can be replicated when stressing a node manually and attempting a backup.

EDIT : I tried doing a backup with a 100% CPU load on both the PBS & the PVE, all went well on a small VM (10G). Will try with a bigger one.
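For anyone who wants to try reproducing it the same way, this is roughly what I did (a sketch; stress-ng has to be installed, and the VMID and storage name are placeholders for your own setup):

Bash:
# Generate full CPU load on the PVE node (run the same on the PBS host);
# '--cpu 0' spawns one worker per core, here for 10 minutes
stress-ng --cpu 0 --timeout 600 &

# While the load is running, start a backup of a test VM
# (VMID 100 and storage 'pbs' are placeholders)
vzdump 100 --storage pbs --mode snapshot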
 
Hello,

To add to this, we're also encountering the "qmp cont timeout" error. Seems perfectly random, but appears to be somewhat linked with CPU usage or the node activity. Had a job fail on a big VM constantly, and the workaround was to schedule the backup to be on its own, and not between two other backups. We've had, however, a crash on a relatively small VM this morning, meaning that the VM size on disk might not be the only culprit, contrary to what I suspected before.

I wonder if this can be replicated when stressing a node manually and attempting a backup.

EDIT : I tried doing a backup with a 100% CPU load on both the PBS & the PVE, all went well on a small VM (10G). Will try with a bigger one.
From my experience, I don't think the assumption that it is tied to high load is correct. I have done tons of huge VMs that had to start a new bitmap, and they normally work. The problem most commonly occurs even on the first VM of a set, under light load. Again, this started with Proxmox 7 and Proxmox Backup Server 2.0.
 
I agree. I've tested with extreme synthetic load and the backup starts fine. This was a hypothesis because the load on our server was always a bit high.
 
Hi, no change with the latest QEMU 6.0.0-4 update as of 09/16/21.
One server with one VM, without any load during the backup.
Any idea?
I have different Proxmox hosts and clusters; only this one (1 host / 1 PVE / 1 VM / 1 PBS) produces the problem :(

[screenshot attachments]
 
One server with one VM, without any load during the backup.
Any idea?
You not only need to update the package but also fully restart the VM(s), or live-migrate them to an already upgraded node if in a cluster. Otherwise the VM still runs with the old QEMU version and the fix naturally cannot work.
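If unsure, one way to check whether a running VM still uses a pre-update QEMU binary (a sketch; it assumes the default pidfile location and an example VMID of 9327):

Bash:
VMID=9327
# If the link target ends in '(deleted)', the binary on disk was replaced after
# the VM started, i.e. the VM is still running the old QEMU version
ls -l /proc/$(cat /var/run/qemu-server/${VMID}.pid)/exe

# Alternatively, ask the running instance itself:
qm monitor ${VMID}
# then type 'info version' at the monitor prompt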
 
The nodes and VMs have been restarted to ensure they are on the latest version. I even have a node that is running the current test repository updates, and I've run tests since that update yesterday. I am continuing to get qmp command 'cont' failed - got timeout errors that sometimes break the file system, forcing a reboot and sometimes a manual fsck.

Again this started with Proxmox 7 and was not an issue in 6.x
 
The nodes and VMs have been restarted to ensure they are on the latest version. I even have a node that is running the current test repository updates, and I've run tests since that update yesterday. I am continuing to get qmp command 'cont' failed - got timeout errors that sometimes break the file system, forcing a reboot and sometimes a manual fsck.

Again this started with Proxmox 7 and was not an issue in 6.x
+

I started experiencing problems with backups and 'qmp command 'cont' failed - got timeout' with one of my VMs (Ubuntu 20.03.04, all updates, kernel 5.11.0-36, btrfs LVM filesystem, MS SQL Server 2019 installed). It doesn't matter whether the QEMU agent is used or not.

Proxmox 7 - clean install on btrfs raid1. Previously, there were no such problems with this VM, even after the upgrade from 6 to 7.
 
Hi, I have solved my problem!
I have just upgraded my very slow CPU (Xeon 2407, 2.2 GHz, no turbo) to a small but better 2450 at 2.9 GHz!
A buffer/timeout issue, or something like that?

[screenshot attachment]
 
+

I started experiencing problems with backups and 'qmp command 'cont' failed - got timeout' with one of my VMs (Ubuntu 20.03.04, all updates, kernel 5.11.0-36, btrfs LVM filesystem, MS SQL Server 2019 installed). It doesn't matter whether the QEMU agent is used or not.

Proxmox 7 - clean install on btrfs raid1. Previously, there were no such problems with this VM, even after the upgrade from 6 to 7.
I did some experiments (PVE 7, all updates to date, pvetest):

1. Rebooted the PVE host (all VMs are turned off after a reboot; there is no automatic VM power-on)
2. apt reinstall qemu-server*
3. apt reinstall pve-qemu-kvm*
4. launched a backup (VM 1005, with which I am experiencing a backup problem):
vzdump 1005 --storage iomegapbs --mode snapshot --node theta --remove 0
5. A miracle happened - the backup was successful
6. turned on VM 1005
7. launched a backup, the backup was successful again

Let's see what happens next.
 
I did some experiments (PVE 7, all updates to date, pvetest):

1. Rebooted the PVE host (all VMs are turned off after a reboot; there is no automatic VM power-on)
2. apt reinstall qemu-server*
3. apt reinstall pve-qemu-kvm*
4. launched a backup (VM 1005, with which I am experiencing a backup problem):
vzdump 1005 --storage iomegapbs --mode snapshot --node theta --remove 0
5. A miracle happened - the backup was successful
6. turned on VM 1005
7. launched a backup, the backup was successful again

Let's see what happens next.
After the latest updates to 5.11.22-5-pve (#1 SMP PVE 5.11.22-10) and a reboot, the host system caught the same error (qmp cont timeout).

I reinstalled pve-qemu-kvm / qemu-server again and the backups work again.
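For reference, the workaround condensed into commands (just the steps from the posts above; VMID 1005, storage 'iomegapbs' and node 'theta' are specific to my setup):

Bash:
# Reinstall the two packages, then retry the backup
apt reinstall qemu-server pve-qemu-kvm
vzdump 1005 --storage iomegapbs --mode snapshot --node theta --remove 0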
 
Latest updates: pve-kernel-5.13.14-1-pve & qemu-server 7.0-16.
Rebooted the host, started a backup of the VM (it was stopped after the reboot).

Bash:
INFO: starting new backup job: vzdump 1005 --node theta --storage iomegapbs --mode snapshot --remove 0
INFO: Starting Backup of VM 1005 (qemu)
INFO: Backup started at 2021-10-07 01:58:30
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: mssql-bh
INFO: include disk 'scsi0' 'local-btrfs:1005/vm-1005-disk-0.raw' 110G
INFO: creating Proxmox Backup Server archive 'vm/1005/2021-10-06T22:58:30Z'
INFO: starting kvm to execute backup task
INFO: started backup task 'df7de857-23fc-478f-b870-8e4a297fcc0a'
ERROR: VM 1005 qmp command 'query-pbs-bitmap-info' failed - got timeout
INFO: aborting backup job
INFO: stopping kvm after backup task
ERROR: Backup of VM 1005 failed - VM 1005 qmp command 'query-pbs-bitmap-info' failed - got timeout
INFO: Failed at 2021-10-07 01:58:51
INFO: Backup job finished with errors
TASK ERROR: job errors

Ok....

Rebooted the host, ran apt reinstall qemu-server.
Started a backup of the VM (it was stopped after the reboot) - OK

Bash:
INFO: starting new backup job: vzdump 1005 --node theta --storage iomegapbs --remove 0 --mode snapshot
INFO: Starting Backup of VM 1005 (qemu)
INFO: Backup started at 2021-10-07 02:12:58
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: mssql-bh
INFO: include disk 'scsi0' 'local-btrfs:1005/vm-1005-disk-0.raw' 110G
INFO: creating Proxmox Backup Server archive 'vm/1005/2021-10-06T23:12:58Z'
INFO: starting kvm to execute backup task
INFO: started backup task '800855c3-2185-4ad4-b7ce-33ae7d21684c'
INFO: scsi0: dirty-bitmap status: created new
INFO:   1% (1.3 GiB of 110.0 GiB) in 4s, read: 333.0 MiB/s, write: 75.0 MiB/s
INFO:   2% (3.0 GiB of 110.0 GiB) in 15s, read: 156.0 MiB/s, write: 69.1 MiB/s
INFO:   3% (3.4 GiB of 110.0 GiB) in 18s, read: 146.7 MiB/s, write: 109.3 MiB/s
...
...
...
INFO:  95% (104.6 GiB of 110.0 GiB) in 7m 37s, read: 146.0 MiB/s, write: 141.5 MiB/s
INFO: 100% (110.0 GiB of 110.0 GiB) in 7m 44s, read: 793.1 MiB/s, write: 110.3 MiB/s
INFO: backup is sparse: 15.02 GiB (13%) total zero data
INFO: backup was done incrementally, reused 81.07 GiB (73%)
INFO: transferred 110.00 GiB in 464 seconds (242.8 MiB/s)
INFO: stopping kvm after backup task
INFO: Finished Backup of VM 1005 (00:07:46)
INFO: Backup finished at 2021-10-07 02:20:44
Result: {
  "data": null
}
INFO: Backup job finished successfully
TASK OK

What "Result: { "data": null }" means, I don't know...
 
Good day,

Has there been any news from The Proxmox Team regarding this issue?

Cheers.
 
Hello!

I have the same problem, but only with one VM.

I will try a workaround and post the results. Sorry for my bad English!

Regards,

Fransuêlo
 
Hi everybody!



My workaround works fine!

For example, you move the VM's disk to local storage with 'delete source' checked, and then you move the disk back to the original storage, again with 'delete source' checked.
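On the CLI that would be roughly (a sketch; the VMID, disk name and storage names are placeholders, and --delete 1 corresponds to 'delete source'):

Bash:
# Move the disk to local storage, deleting the source ...
qm move_disk 100 scsi0 local --delete 1
# ... then move it back to the original storage, again deleting the source
qm move_disk 100 scsi0 original-storage --delete 1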
 
