Intermittent PVE → PBS Backup Timeouts Despite Successful Completion on PBS

May 23, 2025
Hello experts!

We’re experiencing an intermittent and unpredictable issue with backups in our Proxmox cluster, and I’m hoping the community can help us understand what's going wrong — or advise on better tuning and scheduling.
Setup Overview:

  • PVE Cluster: ~30 nodes
  • PBS Server: 1 server (bare metal)
    • OS Disk: SSD
    • Datastore: HDD with 1x SSD configured as a special device for metadata
  • Backup mode: snapshot
  • Verification: Enabled after backups
  • Garbage Collection: atime-based, runs regularly
  • Storage: Ceph-backed VM disks
  • Concurrency: Backups run nightly via scheduled jobs across nodes
Problem Summary:

  • PVE backup jobs occasionally fail with a timeout, but the corresponding backup actually completes successfully on PBS minutes (or even hours) later.
  • The failure message is usually (the VM IDs identify the corresponding attachments):
ERROR: VM 1514 qmp command 'backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 1514 qmp command 'backup-cancel' failed - got wrong command id
  • PBS log shows backup fully streamed and finalized.
VM 3462 (~2 TiB):
  • Similar pattern — PVE marked it failed due to timeout, PBS has a completed and verifiable backup.
Observations and Suspicions:

  • PBS server I/O load may be the root issue:
    • We suspect that backup verification, running immediately after the backup job, increases PBS load and delays responses back to PVE.
    • Garbage collection may also be compounding the issue if overlapping.
  • PVE backup timeout seems to occur at the archive creation point, not during the bulk of the data transfer.
Questions for the Community:

  1. Has anyone else seen this pattern — timeouts at archive finalization but successful backups on PBS?
  2. Any way to increase the QMP timeout for the backup operation in PVE?
  3. Are there known best practices for scheduling verification and GC to avoid overlap with backups?
  4. Is there a recommended I/O monitoring approach to confirm PBS load at time of failure?
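Regarding question 4, here is a minimal sketch of what we could run ourselves (assuming a Linux PBS host; the device name is a placeholder): it diffs two samples of /proc/diskstats to get per-device bytes read/written, which, sampled every few seconds across the backup/verify window, would show whether the datastore HDD is saturated when the timeouts hit.

```python
# Sketch: estimate datastore disk throughput by diffing /proc/diskstats samples.
# /proc/diskstats field layout (after major/minor/name): reads, reads merged,
# sectors read, ms reading, writes, writes merged, sectors written, ...
# Sectors are always 512 bytes, regardless of the device's logical block size.

SECTOR_BYTES = 512

def parse_diskstats(text, device):
    """Return cumulative (bytes_read, bytes_written) for one device."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[2] == device:
            sectors_read = int(fields[5])       # sectors read since boot
            sectors_written = int(fields[9])    # sectors written since boot
            return sectors_read * SECTOR_BYTES, sectors_written * SECTOR_BYTES
    raise ValueError(f"device {device!r} not found")

def delta(before, after, device):
    """Bytes read/written between two samples of /proc/diskstats."""
    r0, w0 = parse_diskstats(before, device)
    r1, w1 = parse_diskstats(after, device)
    return r1 - r0, w1 - w0

# Usage on the PBS host (device name is an example):
#   sample = open("/proc/diskstats").read()   # take one now, one N seconds later
#   read_bytes, write_bytes = delta(sample1, sample2, "sda")
```

Correlating those deltas with the PVE failure timestamps (or just running `iostat -x` from sysstat over the same window) would confirm or rule out the I/O-load theory.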
I have attached the logs for the PVE and PBS tasks.
I’d appreciate any input, tuning advice, or even confirmation that this is a known limitation or bug we should account for.
 


Hi,
please share the full task logs for the backup jobs in PVE, and also post the output of qm config <VMID> --current for the affected VMIDs as well as cat /etc/pve/storage.cfg. From your comments and log outputs, it doesn't seem that PBS is the issue here.
 
Thanks for the response.
Sorry I thought I attached the pve task logs, but here they are.

qm config 1514 --current

autostart: 1
boot: dcn
bootdisk: sata0
cores: 12
ipconfig0: gw=xxx.xxx.xxx.xxx,xxx.xxx.xxx.xxx
memory: 360448
name: TOPSQL01
net0: virtio=xx:xx:xx:xx,bridge=vmbr0,firewall=1,tag=103
onboot: 1
ostype: win10
sata0: vmdata-32K-2:vm-1514-disk-0,iops_rd=5000,iops_wr=5000,mbps_rd=160,mbps_wr=160,size=250G
sata1: vmdata-32K-2:vm-1514-disk-1,iops_rd=5000,iops_wr=5000,mbps_rd=160,mbps_wr=160,size=190G
sata2: vmdata-32K-2:vm-1514-disk-4,iops_rd=5000,iops_wr=5000,mbps_rd=250,mbps_wr=250,size=350G
sata3: vmdata-32K-2:vm-1514-disk-2,iops_rd=6000,iops_wr=6000,mbps_rd=250,mbps_wr=250,size=4000G
sata4: vmdata-32K:vm-1514-disk-0,iops_rd=5000,iops_wr=5000,mbps_rd=160,mbps_wr=160,size=1000G
sata5: vmdata-32K-2:vm-1514-disk-3,iops_rd=5000,iops_wr=5000,mbps_rd=160,mbps_wr=160,size=1000G
scsihw: virtio-scsi-pci
smbios1: uuid=68ff8667-60e7-464c-9dad-fed87c8f2cfa
vmgenid: e510a01e-22f8-42d6-a066-d09dfc994325



qm config 3462 --current

agent: 1
bootdisk: scsi0
cores: 16
ide2: pve-cloudinit:3462/vm-3462-cloudinit.qcow2,media=cdrom
memory: 49152
name: TRLS-FUSION-01
net0: virtio=xx:xx:xx:xx,bridge=vmbr0,firewall=1,tag=2401
numa: 0
onboot: 1
ostype: win8
scsi0: vmdata-16K-3:vm-3462-disk-0,iops_rd=5000,iops_wr=5000,mbps_rd=80,mbps_wr=80,size=512G
scsi1: vmdata-32K-2:vm-3462-disk-0,iops_rd=5000,iops_wr=5000,mbps_rd=80,mbps_wr=80,size=2000G
scsihw: virtio-scsi-pci
smbios1: uuid=c60ffa5d-df89-491e-b5fd-8de91819a388
sockets: 1
vmgenid: 05085b06-6d59-4d45-8f9e-44d39312e265



cat /etc/pve/storage.cfg


dir: local
path /var/lib/vz
content iso,vztmpl,backup

zfspool: local-zfs
pool rpool/data
content images,rootdir
sparse 1

zfspool: vmdata-16K
pool rpool/vmdata-16K
blocksize 16k
content rootdir,images
sparse 1

zfspool: vmdata-32K
pool rpool/vmdata-32K
blocksize 32k
content images,rootdir
sparse 1

zfspool: vmdata-16K-2
pool rpool-2/vmdata-16K-2
blocksize 16k
content images,rootdir
mountpoint /rpool-2/vmdata-16K-2
nodes pve-110-XXX,pve-301-XXX,pve-127-XXX,pve-119-XXX,pve-129-XXX,pve-115-XXX,pve-113-XXX,pve-restore-XXX,pve-102-XXX,pve-22-XXX,pve-119-XXX,pve-124-XXX,pve-111-XXX,pve-105-XXX,pve-23-XXX,pve-16-XXX,c12-XXX,pve-103-XXX,pve-128-XXX,pve-112-XXX,pve-125-XXX,pve-120-XXX,pve-109-XXX,pve-108-XXX,pve-126-XXX,pve-107-XXX,pve-114-XXX,pve-104-XXX,pve-106-XXX,pve-15-XXX,pve-12-XXX,pve-101-XXX
sparse 1

zfspool: vmdata-16K-3
pool rpool-3/vmdata-16K-3
blocksize 16k
content rootdir,images
mountpoint /rpool-3/vmdata-16K-3
nodes c12-XXX,pve-15-XXX
sparse 1

zfspool: vmdata-32K-2
pool rpool-2/vmdata-32K-2
blocksize 32k
content images,rootdir
mountpoint /rpool-2/vmdata-32K-2
nodes pve-127-XXX,pve-119-XXX,pve-129-XXX,pve-115-XXX,pve-113-XXX,pve-110-XXX,pve-301-XXX,pve-23-XXX,pve-16-XXX,c12-XXX,pve-103-XXX,pve-restore-XXX,pve-22-XXX,pve-119-XXX,pve-124-XXX,pve-102-XXX,pve-111-XXX,pve-105-XXX,pve-109-XXX,pve-108-XXX,pve-120-XXX,pve-107-XXX,pve-126-XXX,pve-114-XXX,pve-128-XXX,pve-112-XXX,pve-125-XXX,pve-101-XXX,pve-104-XXX,pve-106-XXX,pve-15-XXX
sparse 1

glusterfs: images
path /mnt/pve/images
volume images
content images,iso
prune-backups keep-all=1
server xxx.xxx.xxx.xxx
server2 xxx.xxx.xxx.xxx

pbs: PBS01-Datastore
datastore PBS01-Datastore
server xxx.xxx.xxx.xxx
content backup
fingerprint *****
prune-backups keep-all=1
username ***

zfspool: vmdata-32K-3
pool rpool-3/vmdata-32K-3
blocksize 32k
content images,rootdir
mountpoint /rpool-3/vmdata-32K-3
nodes pve-115-XXX, c12-XXX, pve-124-XXX
sparse 1

dir: cloudinit-local
path ***
content images
prune-backups keep-all=1
shared 1

glusterfs: pve-cloudinit
path ***
volume vol_main
content images,iso
prune-backups keep-all=1
server xxx.xxx.xxx.xxx
server2 xxx.xxx.xxx.xxx

zfspool: rpool-3
pool rpool-3
content images,rootdir
mountpoint /rpool-3
nodes pve-124-XXX, pve-115-XXX
sparse 1

zfspool: rpool-2
pool rpool-2
content images,rootdir
mountpoint /rpool-2
nodes pve-124-XXX, pve-126-XXX, pve-125-XXX

rbd: ceph-replicated-3-dh1-oadc-jhb1
content rootdir,images
krbd 0
nodes pve-125-XXX, pve-126-XXX, pve-124-XXX
pool ceph-replicated-XXXX

cifs: ca-pve-images
path ****
server xxx.xxx.xxx.xxx
share IMAGES-SMB
content iso,images
prune-backups keep-all=1
username ***

cifs: ca-pve-cloudinit
path ***
server xxx.xxx.xxx.xxx
share CLOUDINIT-SMB
content iso,images
prune-backups keep-all=1
username ***



PVE task logs:


INFO: trying to get global lock - waiting...
INFO: got global lock
INFO: starting new backup job: vzdump 3462 --compress 1 --remove 0 --storage PBS01-Datastore --lockwait 2880 --prune-backups 'keep-last=99999' --node pve-15-XXX
INFO: Starting Backup of VM 3462 (qemu)
INFO: Backup started at 2025-05-30 00:17:56
INFO: status = running
INFO: VM Name: TRLS-FUSION-01
INFO: include disk 'scsi0' 'vmdata-16K-3:vm-3462-disk-0' 512G
INFO: include disk 'scsi1' 'vmdata-32K-2:vm-3462-disk-0' 2000G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/3462/2025-05-29T22:17:50Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 3462 qmp command 'backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 3462 qmp command 'backup-cancel' failed - got wrong command id '91418:6' (expected 91418:7)
INFO: resuming VM again
ERROR: Backup of VM 3462 failed - VM 3462 qmp command 'backup' failed - got timeout
INFO: Failed at 2025-05-30 00:21:52
INFO: Backup job finished with errors
TASK ERROR: job errors



INFO: trying to get global lock - waiting...
INFO: got global lock
INFO: starting new backup job: vzdump 1514 --lockwait 2880 --storage PBS01-Datastore --remove 0 --node pve-16-XXX --prune-backups 'keep-last=99999' --compress 1
INFO: Starting Backup of VM 1514 (qemu)
INFO: Backup started at 2025-05-22 00:18:37
INFO: status = running
INFO: VM Name: TOPSQL01
INFO: include disk 'sata0' 'vmdata-32K-2:vm-1514-disk-0' 250G
INFO: include disk 'sata1' 'vmdata-32K-2:vm-1514-disk-1' 190G
INFO: include disk 'sata2' 'vmdata-32K-2:vm-1514-disk-4' 350G
INFO: include disk 'sata3' 'vmdata-32K-2:vm-1514-disk-2' 4000G
INFO: include disk 'sata4' 'vmdata-32K:vm-1514-disk-0' 1000G
INFO: include disk 'sata5' 'vmdata-32K-2:vm-1514-disk-3' 1000G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/1514/2025-05-21T22:18:37Z'
ERROR: VM 1514 qmp command 'backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 1514 qmp command 'backup-cancel' failed - got wrong command id '62570:6' (expected 62570:7)
INFO: resuming VM again
ERROR: Backup of VM 1514 failed - VM 1514 qmp command 'backup' failed - got timeout
INFO: Failed at 2025-05-22 00:21:09
INFO: Backup job finished with errors
TASK ERROR: job errors
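One detail worth noting from the timestamps above: both jobs abort only a few minutes after starting, long before a multi-TiB transfer could plausibly finish, which fits a fixed QMP timeout at archive creation rather than anything transfer-length dependent. A quick check of the gaps from the pasted log lines:

```python
# Compute the gap between the 'Backup started at' and 'Failed at' log lines.
from datetime import datetime

def elapsed(start, failed, fmt="%Y-%m-%d %H:%M:%S"):
    """Seconds between two vzdump task-log timestamps."""
    return int((datetime.strptime(failed, fmt)
                - datetime.strptime(start, fmt)).total_seconds())

# VM 3462: started 00:17:56, failed 00:21:52 -> 236 s
# VM 1514: started 00:18:37, failed 00:21:09 -> 152 s
```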