PVE7 / PBS2 - Backup Timeout (qmp command 'cont' failed - got timeout)

adoII · Oct 27, 2021

I have changed

Code:

      } else {
            $timeout = 3; # default
to
      } else {
            $timeout = 8; # default

in line 134 of /usr/share/perl5/PVE/QMPClient.pm, restarted the PVE daemons

Code:

for service in pvedaemon.service pveproxy.service pvestatd.service; do
    echo "systemctl restart $service"
    systemctl restart "$service"
done

and now the backups to Proxmox Backup Server are working.
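For anyone repeating this workaround, here is a minimal sketch of the same edit as a guarded shell helper. The function name `set_qmp_default_timeout` is made up for illustration. It assumes the file still contains the stock `$timeout = 3; # default` line quoted above and refuses to touch the file otherwise. Because the file is shipped by a package, an upgrade will most likely revert the change.

```shell
# Hypothetical helper (not part of PVE): raise the default QMP timeout.
#   $1 = path to QMPClient.pm, $2 = new timeout in seconds
set_qmp_default_timeout() {
    local file="$1" new="$2"
    # Only proceed if the stock default line from the post above is present.
    if ! grep -q '\$timeout = 3; # default' "$file"; then
        echo "default timeout line not found in $file" >&2
        return 1
    fi
    cp -p "$file" "$file.orig"    # keep the original for rollback
    # GNU sed in-place edit; only the numeric value changes.
    sed -i 's/\$timeout = 3; # default/$timeout = '"$new"'; # default/' "$file"
}

# Example (as root), followed by the daemon restart loop shown above:
#   set_qmp_default_timeout /usr/share/perl5/PVE/QMPClient.pm 30
```

To roll back, copy the `.orig` file over the edited one and restart the same services.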

nielsnl · Oct 27, 2021

adoII said:
I have changed

Code:

      } else {
            $timeout = 3; # default
to
      } else {
            $timeout = 8; # default

in line 134 of /usr/share/perl5/PVE/QMPClient.pm, restarted the PVE daemons

Code:

for service in pvedaemon.service pveproxy.service pvestatd.service; do
    echo "systemctl restart $service"
    systemctl restart "$service"
done

and now the backups to Proxmox Backup Server are working.

Looks like this works for me as well. (I've set the timeout to 30 to be on the safe side.)

leen15 · Oct 28, 2021

I can confirm that @adoII's suggestion works for me as well (30s timeout).

Let's hope that a Proxmox engineer will see this topic...

fabian · Oct 28, 2021

see https://bugzilla.proxmox.com/show_bug.cgi?id=3693

nielsnl · Oct 30, 2021

fabian said:
see https://bugzilla.proxmox.com/show_bug.cgi?id=3693

Their report suggests their issue is related to using NFS - which I guess may or may not be a coincidence. In any case, my issue was unrelated to NFS. I don't use it.

henrikh1998 · Oct 31, 2021

I'm not using PBS, but I can report that I had the same issue described here with NFS backup.
The fix from @adoII finally resolved my backup problems.

Maybe it should be a permanent change. I can't think of a drawback to increasing the default timeout a bit.

fabian · Nov 2, 2021

see the linked bug report, and the patch linked there

adoII · Nov 2, 2021

I applied the patch, now I get another error message.
Yes, the Backup Server at Hetzner is a little slower than my servers in my own datacenter. But it is not really slow.

Code:

INFO: starting new backup job: vzdump 104 --node zw-pm-1 --mode snapshot --storage backups21 --remove 0
INFO: Starting Backup of VM 104 (qemu)
INFO: Backup started at 2021-11-02 10:06:22
INFO: status = running
INFO: VM Name: zwv05
INFO: include disk 'scsi0' 'local-btrfs:104/vm-104-disk-0.raw' 140G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/104/2021-11-02T09:06:22Z'
INFO: started backup task '30f2a01b-556b-4ec5-867c-ab0f37d1cc48'
INFO: resuming VM again
ERROR: VM 104 qmp command 'query-pbs-bitmap-info' failed - got timeout
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 104 failed - VM 104 qmp command 'query-pbs-bitmap-info' failed - got timeout
INFO: Failed at 2021-11-02 10:06:37
INFO: Backup job finished with errors
TASK ERROR: job errors
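Different QMP commands time out across the logs in this thread ('cont', 'query-pbs-bitmap-info', 'backup'), so it can help to tally which one fails in a given log. A rough sketch, assuming the log lines have the same wording as above; the function name `qmp_timeout_summary` and the file name `task.log` are placeholders:

```shell
# Hypothetical helper: count "got timeout" failures per QMP command.
# Reads the files given as arguments, or stdin when none are given.
qmp_timeout_summary() {
    grep -oh "qmp command '[^']*' failed - got timeout" "$@" \
        | sed "s/^qmp command '\([^']*\)'.*/\1/" \
        | sort | uniq -c | sort -rn
}

# Example: qmp_timeout_summary task.log
```

Each failure is usually logged twice (once when it happens and once in the "Backup of VM ... failed" summary line), so the counts are relative rather than exact.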

sztanpet · Nov 10, 2021

The problem still occurs with the newest backup client/server, but now the error is

INFO: resuming VM again
ERROR: VM 103 qmp command 'query-pbs-bitmap-info' failed - got timeout

Code:

proxmox-backup-client-dbgsym/stable,stable,now 2.0.13-1 amd64 [installed]
proxmox-backup-client/stable,stable,now 2.0.13-1 amd64 [installed]
proxmox-backup-docs/stable,now 2.0.13-1 all [installed,automatic]
proxmox-backup-file-restore/stable,now 2.0.13-1 amd64 [installed,automatic]
proxmox-backup-restore-image/stable,now 0.3.1 amd64 [installed,automatic]
proxmox-backup-server-dbgsym/stable,now 2.0.13-1 amd64 [installed]
proxmox-backup-server/stable,now 2.0.13-1 amd64 [installed]

sztanpet · Nov 18, 2021

With the latest version (2.0.14-1) it now seems to be working for me (or at least for two days straight now).
EDIT: 4 days and going

raistlinkell · Nov 20, 2021

Hello All
I upgraded my Proxmox cluster of 3 PVE nodes from v6.13 to v7.1.5, then the Proxmox Backup Server from v1.x to v2.0.14. I'm now seeing a high number of VM and CT backups fail. Before the upgrades, backup failures were almost nonexistent.

3 x PVE nodes: Proxmoxbox1, Proxmoxbox2, Proxmoxbox3. Running on the 3 nodes are 5 x VMs and 2 x CTs. Each PVE node hosts its own 6 TB ZFS storage. There is also an NFS share which is used for non-critical and test VMs.

It appears that backups fail for any VM or CT that has one or more disks on the NFS share. The NFS share is an old Thecus NAS. NFS shares are defined on the cluster using the v6.13 default NFS type.
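To check which guests have a volume on a particular storage, the guest config files can be searched for the storage ID. This is only a sketch: the function name `guests_on_storage` is invented here, and it assumes the usual config layout where a disk line looks like `scsi0: <storage>:<volume>,...`. It also matches CD-ROM and unused-disk entries on that storage, so treat the result as a list of candidates.

```shell
# Hypothetical helper: list guest config files that reference a storage ID.
#   $1 = storage ID, remaining arguments = config files to search
guests_on_storage() {
    local storage="$1"
    shift
    # Match lines such as "scsi0: ThecusRAID5:301/vm-301-disk-0.raw,size=12G"
    grep -l -E "^[a-z]+[0-9]*: ${storage}:" "$@" 2>/dev/null
}

# Example on a PVE node (VMs and containers of the local node):
#   guests_on_storage ThecusRAID5 /etc/pve/qemu-server/*.conf /etc/pve/lxc/*.conf
```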

Code:

pveversion -v  (same for all 3 nodes)

proxmox-ve: 7.1-1 (running kernel: 5.13.19-1-pve)
pve-manager: 7.1-5 (running version: 7.1-5/6fe299a0)
pve-kernel-5.13: 7.1-4
pve-kernel-helper: 7.1-4
pve-kernel-5.4: 6.4-7
pve-kernel-5.13.19-1-pve: 5.13.19-2
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.14-1
proxmox-backup-file-restore: 2.0.14-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.4-2
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-1
pve-qemu-kvm: 6.1.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-3
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

A copy of the (truncated) logs from one node is below (I've added a little formatting to try to make it more readable).
Node Proxmoxbox3 hosts only 1 VM. This VM has its storage on an NFS share.

email subject: vzdump backup status (proxmoxbox3.xxxxxxxx.lan) : backup failed

Code:

vzdump --storage ProxBUP-14Tb --mailto systemadmin@xxxxxxxx.org --quiet 1 --pool TestVMs --mailnotification failure --mode snapshot

301: 2021-11-19 12:00:02 INFO: Starting Backup of VM 301 (qemu)
301: 2021-11-19 12:00:02 INFO: status = running
301: 2021-11-19 12:00:02 INFO: VM Name: UbuUPS
301: 2021-11-19 12:00:02 INFO: include disk 'scsi0' 'ThecusRAID5:301/vm-301-disk-0.raw' 12G
301: 2021-11-19 12:00:02 INFO: include disk 'scsi1' 'ThecusRAID5:301/vm-301-disk-1.raw' 12G
301: 2021-11-19 12:00:02 INFO: backup mode: snapshot
301: 2021-11-19 12:00:02 INFO: ionice priority: 7
301: 2021-11-19 12:00:02 INFO: creating Proxmox Backup Server archive 'vm/301/2021-11-19T04:00:02Z'
301: 2021-11-19 12:00:02 INFO: issuing guest-agent 'fs-freeze' command
301: 2021-11-19 12:00:04 INFO: issuing guest-agent 'fs-thaw' command
301: 2021-11-19 12:00:06 INFO: started backup task '088b0b0e-f952-4575-8978-44244530b3ab'
301: 2021-11-19 12:00:06 INFO: resuming VM again
301: 2021-11-19 12:00:13 ERROR: VM 301 qmp command 'query-pbs-bitmap-info' failed - got timeout
301: 2021-11-19 12:00:13 INFO: aborting backup job
301: 2021-11-19 12:00:21 INFO: resuming VM again
301: 2021-11-19 12:00:21 ERROR: Backup of VM 301 failed - VM 301 qmp command 'query-pbs-bitmap-info' failed - got timeout

All VMs and CTs with local ZFS storage backed up successfully.

If any other information is needed, I'll update ASAP!

raistlinkell · Nov 21, 2021

I've completed migrating all my VM and CT storage devices from the NFS share to the Local ZFS storage and all backups function without error.

adoII · Nov 22, 2021

I still have the query-pbs-bitmap-info error even with my timeout modification and the newest 2.0.14 packages.
My backup server is not slow either: it is a ZFS RAID 10 of 8 disks in 4 mirrors, and it even has ZFS special devices on NVMe.
The server is also idle during backup time and not doing anything else.
Failing backups are one problem. The second problem is that VMs sometimes crash at the start of a backup because the backup client halts I/O on the VM for too long.
Any ideas what I can try? Is the problem known, and will someone work on it?
And yes, backup to local storage is no problem; only the PBS server is too slow for what the Proxmox backup client expects.
The backup output is:

Code:

INFO: starting new backup job: vzdump 104 --storage backups21 --mode snapshot --node zw-pm-1 --remove 0
INFO: Starting Backup of VM 104 (qemu)
INFO: Backup started at 2021-11-22 10:56:11
INFO: status = running
INFO: VM Name: xxxxx
INFO: include disk 'scsi0' 'local-btrfs:104/vm-104-disk-0.raw' 140G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/104/2021-11-22T09:56:11Z'
INFO: started backup task '125bc1c8-b078-4f83-9e64-ac7c9e29b587'
INFO: resuming VM again
ERROR: VM 104 qmp command 'query-pbs-bitmap-info' failed - got timeout
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 104 failed - VM 104 qmp command 'query-pbs-bitmap-info' failed - got timeout
INFO: Failed at 2021-11-22 10:56:28
INFO: Backup job finished with errors
TASK ERROR: job errors

Taledo · Nov 24, 2021

I'm on the latest version available from the repos for both my PBS and my Proxmox nodes, and I'm still getting errors:

Code:

INFO: include disk 'scsi0' 'local:20314/vm-20314-disk-0.qcow2' 100G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/20314/2021-11-24T03:03:34Z'
ERROR: VM 20314 qmp command 'backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 20314 qmp command 'backup-cancel' failed - unable to connect to VM 20314 qmp socket - timeout after 5980 retries
INFO: resuming VM again
ERROR: Backup of VM 20314 failed - VM 20314 qmp command 'cont' failed - unable to connect to VM 20314 qmp socket - timeout after 450 retries
INFO: Failed at 2021-11-24 04:16:24
INFO: Backup job finished with errors
TASK ERROR: job errors

The issue with that is that it froze the VM and I had to hard reboot it.

PVE version : pve-manager/7.1-5/6fe299a0 (running kernel: 5.13.19-1-pve)
PBS version : proxmox-backup-server 2.1.2-1 running version: 2.1.2

Both the PVE nodes and the PBS are running on Dell servers with hardware RAID. I am NOT running ZFS on top of RAID; I'm using ext4 partitions.

I haven't encountered this issue on servers running ZFS directly, but there are other factors that could affect it (performance, networking...)

SDEc · Dec 2, 2021

Same issue here after upgrading from 6.3.x to 7.0.x (now running 7.1-6 and Backup Server 2.1-2): some backups are failing due to "timeout" issues.

All Proxmox servers have FC NVMe storage; the Backup Server writes to an NFS mount (QNAP).

Note: there was no issue at all with this setup on Proxmox 6.x.

Code:

pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.11.22-7-pve)
pve-manager: 7.1-6 (running version: 7.1-6/4e61e21c)
pve-kernel-5.13: 7.1-4
pve-kernel-helper: 7.1-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-1-pve: 5.13.19-3
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph: 16.2.6-pve2
ceph-fuse: 16.2.6-pve2
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.4-3
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

Failure reported on Proxmox :

Code:

INFO: Starting Backup of VM 183 (qemu)
INFO: Backup started at 2021-12-02 03:00:27
INFO: status = running
INFO: VM Name: xxx-yyyy
INFO: include disk 'scsi0' 'datastore_pve01:vm-183-disk-0' 20G
INFO: include disk 'scsi1' 'datastore_pve01:vm-183-disk-1' 100G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/183/2021-12-02T02:00:27Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 183 qmp command 'backup' failed - got timeout
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 183 failed - VM 183 qmp command 'backup' failed - got timeout
INFO: Failed at 2021-12-02 03:04:02

Failure reported on PBS:

Code:

2021-12-02T03:00:31+01:00: starting new backup on datastore 'PVE1Backup': "vm/183/2021-12-02T02:00:27Z"
2021-12-02T03:00:31+01:00: download 'index.json.blob' from previous backup.
2021-12-02T03:03:29+01:00: register chunks in 'drive-scsi0.img.fidx' from previous backup.
2021-12-02T03:03:29+01:00: download 'drive-scsi0.img.fidx' from previous backup.
2021-12-02T03:03:29+01:00: created new fixed index 1 ("vm/183/2021-12-02T02:00:27Z/drive-scsi0.img.fidx")
2021-12-02T03:03:43+01:00: register chunks in 'drive-scsi1.img.fidx' from previous backup.
2021-12-02T03:03:43+01:00: download 'drive-scsi1.img.fidx' from previous backup.
2021-12-02T03:03:50+01:00: created new fixed index 2 ("vm/183/2021-12-02T02:00:27Z/drive-scsi1.img.fidx")
2021-12-02T03:04:02+01:00: add blob "/mnt/nfs/pve1/vm/183/2021-12-02T02:00:27Z/qemu-server.conf.blob" (302 bytes, comp: 302)
2021-12-02T03:04:02+01:00: backup ended and finish failed: backup ended but finished flag is not set.
2021-12-02T03:04:02+01:00: removing unfinished backup
2021-12-02T03:04:02+01:00: TASK ERROR: removing backup snapshot "/mnt/nfs/pve1/vm/183/2021-12-02T02:00:27Z" failed - Directory not empty (os error 39)

(What was it waiting for between 03:00:31 and 03:03:29? I assume this is the "timeout" of ~180 sec. On the QNAP, the above directory has been created and is empty.)
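Long pauses like this are easier to spot when the gap between consecutive log lines is computed automatically. A small sketch, assuming the PBS task log format shown above (ISO 8601 timestamp, then ": ") and GNU `date`; the function name `log_gaps` and the 10-second default threshold are arbitrary choices:

```shell
# Hypothetical helper: report pauses between consecutive PBS task-log lines.
#   $1 = minimum gap in seconds to report (default 10); log is read from stdin
log_gaps() {
    local threshold="${1:-10}" prev="" ts now gap line
    while IFS= read -r line; do
        ts="${line%%: *}"                    # text before the first ": "
        # Skip lines that do not start with a parsable timestamp.
        now=$(date -d "$ts" +%s 2>/dev/null) || continue
        if [ -n "$prev" ]; then
            gap=$((now - prev))
            if [ "$gap" -ge "$threshold" ]; then
                printf '%s s before: %s\n' "$gap" "$line"
            fi
        fi
        prev="$now"
    done
}

# Example: log_gaps 60 < task.log    (task.log is a placeholder file name)
```

Run against the PBS log above with a threshold of 60, this would flag the 178-second pause before the first "register chunks" line.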

from syslog

Code:

.
.
Dec  2 03:03:50 pvebackup proxmox-backup-proxy[672]: created new fixed index 2 ("vm/183/2021-12-02T02:00:27Z/drive-scsi1.img.fidx")
Dec  2 03:04:02 pvebackup proxmox-backup-proxy[672]: add blob "/mnt/nfs/pve1/vm/183/2021-12-02T02:00:27Z/qemu-server.conf.blob" (302 bytes, comp: 302)
Dec  2 03:04:02 pvebackup proxmox-backup-proxy[672]: backup ended and finish failed: backup ended but finished flag is not set.
Dec  2 03:04:02 pvebackup proxmox-backup-proxy[672]: removing unfinished backup
Dec  2 03:04:02 pvebackup proxmox-backup-proxy[672]: removing backup snapshot "/mnt/nfs/pve1/vm/183/2021-12-02T02:00:27Z"
Dec  2 03:04:02 pvebackup proxmox-backup-proxy[672]: TASK ERROR: removing backup snapshot "/mnt/nfs/pve1/vm/183/2021-12-02T02:00:27Z" failed - Directory not empty (os error 39)

adoII · Dec 3, 2021

I can confirm I still have the same problem here: qmp command 'query-pbs-bitmap-info' failed - got timeout.

felipe · Dec 9, 2021

Should the error "qmp command 'cont' failed - got timeout" be fixed by now (in Proxmox 7.1)?
I know it only occurs on heavily loaded storage backends, but that was not a problem in Proxmox 6.4.

ioanv · Jan 3, 2022

Bug still present with latest version.
NFS storage on a Synology drive.
Increasing from 3 to 8 seconds did NOT solve the problem.
Increasing to 30 seconds DID solve the problem.

Kernel Version Linux 5.13.19-2-pve #1 SMP PVE 5.13.19-4 (Mon, 29 Nov 2021 12:10:09 +0100)

PVE Manager Version pve-manager/7.1-8/5b267f33

sztanpet · Jan 19, 2022

This has also re-occurred for me, but now with the "qmp command 'query-pbs-bitmap-info' failed - got timeout" error.
I noticed that when it occurs the iowait is through the roof, which it shouldn't be since the target drive is on NVMe. Based on pidstat -dl 5, the culprits are arc_prune kernel threads, and googling that took me to https://github.com/openzfs/zfs/issues/6223
So in my case the root cause is a ZFS issue, and it causes long delays that PBS isn't expecting.

Taledo · Feb 7, 2022

I'm also seeing the "query-pbs-bitmap-info" issue now.

We're not running ZFS so that can't be our issue.

The PVE is running on kernel 5.11.22-3-pve.
