Intermittent PVE → PBS Backup Timeouts Despite Successful Completion on PBS

May 23, 2025
Hello experts!

We’re experiencing an intermittent and unpredictable issue with backups in our Proxmox cluster, and I’m hoping the community can help us understand what's going wrong — or advise on better tuning and scheduling.
Setup Overview:

  • PVE Cluster: ~30 nodes
  • PBS Server: 1 server (bare metal)
    • OS Disk: SSD
    • Datastore: HDD with 1x SSD configured as a special device for metadata
  • Backup mode: snapshot
  • Verification: Enabled after backups
  • Garbage Collection: atime-based, runs regularly
  • Storage: Ceph-backed VM disks
  • Concurrency: Backups run nightly via scheduled jobs across nodes
Problem Summary:

  • PVE backup jobs occasionally fail with a timeout, but the corresponding backup actually completes successfully on PBS minutes (or even hours) later.
  • The failure message is usually (the VM IDs identify the corresponding attachments):
ERROR: VM 1514 qmp command 'backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 1514 qmp command 'backup-cancel' failed - got wrong command id
  • PBS log shows backup fully streamed and finalized.
VM 3462 (~2 TiB):
  • Similar pattern — PVE marked it failed due to timeout, PBS has a completed and verifiable backup.
Observations and Suspicions:

  • PBS server I/O load may be the root issue:
    • We suspect that backup verification, running immediately after the backup job, increases PBS load and delays responses back to PVE.
    • Garbage collection may also be compounding the issue if overlapping.
  • PVE backup timeout seems to occur at the archive creation point, not during the bulk of the data transfer.
Questions for the Community:

  1. Has anyone else seen this pattern — timeouts at archive finalization but successful backups on PBS?
  2. Any way to increase the QMP timeout for the backup operation in PVE?
  3. Are there known best practices for scheduling verification and GC to avoid overlap with backups?
  4. Is there a recommended I/O monitoring approach to confirm PBS load at time of failure?
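Regarding question 4, here is a minimal sketch of what we could run ourselves (assuming a Linux PBS host; the device name is a placeholder): it diffs two samples of /proc/diskstats to get per-device bytes read/written, which, sampled every few seconds across the backup/verify window, would show whether the datastore HDD is saturated when the timeouts hit.

```python
# Sketch: estimate datastore disk throughput by diffing /proc/diskstats samples.
# /proc/diskstats field layout (after major/minor/name): reads, reads merged,
# sectors read, ms reading, writes, writes merged, sectors written, ...
# Sectors are always 512 bytes, regardless of the device's logical block size.

SECTOR_BYTES = 512

def parse_diskstats(text, device):
    """Return cumulative (bytes_read, bytes_written) for one device."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[2] == device:
            sectors_read = int(fields[5])       # sectors read since boot
            sectors_written = int(fields[9])    # sectors written since boot
            return sectors_read * SECTOR_BYTES, sectors_written * SECTOR_BYTES
    raise ValueError(f"device {device!r} not found")

def delta(before, after, device):
    """Bytes read/written between two samples of /proc/diskstats."""
    r0, w0 = parse_diskstats(before, device)
    r1, w1 = parse_diskstats(after, device)
    return r1 - r0, w1 - w0

# Usage on the PBS host (device name is an example):
#   sample = open("/proc/diskstats").read()   # take one now, one N seconds later
#   read_bytes, write_bytes = delta(sample1, sample2, "sda")
```

Correlating those deltas with the PVE failure timestamps (or just running `iostat -x` from sysstat over the same window) would confirm or rule out the I/O-load theory.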
I have attached the logs for the PVE and PBS tasks.
I’d appreciate any input, tuning advice, or even confirmation that this is a known limitation or bug we should account for.
 


Hi,
please share the full task logs for the backup jobs in PVE, and also post the output of qm config <VMID> --current for the affected VMIDs as well as cat /etc/pve/storage.cfg. From your comments and log outputs, it doesn't seem that PBS is the issue here.
 
Thanks for the response.
Sorry I thought I attached the pve task logs, but here they are.

qm config 1514 --current

autostart: 1
boot: dcn
bootdisk: sata0
cores: 12
ipconfig0: gw=xxx.xxx.xxx.xxx,xxx.xxx.xxx.xxx
memory: 360448
name: TOPSQL01
net0: virtio=xx:xx:xx:xx,bridge=vmbr0,firewall=1,tag=103
onboot: 1
ostype: win10
sata0: vmdata-32K-2:vm-1514-disk-0,iops_rd=5000,iops_wr=5000,mbps_rd=160,mbps_wr=160,size=250G
sata1: vmdata-32K-2:vm-1514-disk-1,iops_rd=5000,iops_wr=5000,mbps_rd=160,mbps_wr=160,size=190G
sata2: vmdata-32K-2:vm-1514-disk-4,iops_rd=5000,iops_wr=5000,mbps_rd=250,mbps_wr=250,size=350G
sata3: vmdata-32K-2:vm-1514-disk-2,iops_rd=6000,iops_wr=6000,mbps_rd=250,mbps_wr=250,size=4000G
sata4: vmdata-32K:vm-1514-disk-0,iops_rd=5000,iops_wr=5000,mbps_rd=160,mbps_wr=160,size=1000G
sata5: vmdata-32K-2:vm-1514-disk-3,iops_rd=5000,iops_wr=5000,mbps_rd=160,mbps_wr=160,size=1000G
scsihw: virtio-scsi-pci
smbios1: uuid=68ff8667-60e7-464c-9dad-fed87c8f2cfa
vmgenid: e510a01e-22f8-42d6-a066-d09dfc994325



qm config 3462 --current

agent: 1
bootdisk: scsi0
cores: 16
ide2: pve-cloudinit:3462/vm-3462-cloudinit.qcow2,media=cdrom
memory: 49152
name: TRLS-FUSION-01
net0: virtio=xx:xx:xx:xx,bridge=vmbr0,firewall=1,tag=2401
numa: 0
onboot: 1
ostype: win8
scsi0: vmdata-16K-3:vm-3462-disk-0,iops_rd=5000,iops_wr=5000,mbps_rd=80,mbps_wr=80,size=512G
scsi1: vmdata-32K-2:vm-3462-disk-0,iops_rd=5000,iops_wr=5000,mbps_rd=80,mbps_wr=80,size=2000G
scsihw: virtio-scsi-pci
smbios1: uuid=c60ffa5d-df89-491e-b5fd-8de91819a388
sockets: 1
vmgenid: 05085b06-6d59-4d45-8f9e-44d39312e265



cat /etc/pve/storage.cfg


dir: local
path /var/lib/vz
content iso,vztmpl,backup

zfspool: local-zfs
pool rpool/data
content images,rootdir
sparse 1

zfspool: vmdata-16K
pool rpool/vmdata-16K
blocksize 16k
content rootdir,images
sparse 1

zfspool: vmdata-32K
pool rpool/vmdata-32K
blocksize 32k
content images,rootdir
sparse 1

zfspool: vmdata-16K-2
pool rpool-2/vmdata-16K-2
blocksize 16k
content images,rootdir
mountpoint /rpool-2/vmdata-16K-2
nodes pve-110-XXX,pve-301-XXX,pve-127-XXX,pve-119-XXX,pve-129-XXX,pve-115-XXX,pve-113-XXX,pve-restore-XXX,pve-102-XXX,pve-22-XXX,pve-119-XXX,pve-124-XXX,pve-111-XXX,pve-105-XXX,pve-23-XXX,pve-16-XXX,c12-XXX,pve-103-XXX,pve-128-XXX,pve-112-XXX,pve-125-XXX,pve-120-XXX,pve-109-XXX,pve-108-XXX,pve-126-XXX,pve-107-XXX,pve-114-XXX,pve-104-XXX,pve-106-XXX,pve-15-XXX,pve-12-XXX,pve-101-XXX
sparse 1

zfspool: vmdata-16K-3
pool rpool-3/vmdata-16K-3
blocksize 16k
content rootdir,images
mountpoint /rpool-3/vmdata-16K-3
nodes c12-XXX,pve-15-XXX
sparse 1

zfspool: vmdata-32K-2
pool rpool-2/vmdata-32K-2
blocksize 32k
content images,rootdir
mountpoint /rpool-2/vmdata-32K-2
nodes pve-127-XXX,pve-119-XXX,pve-129-XXX,pve-115-XXX,pve-113-XXX,pve-110-XXX,pve-301-XXX,pve-23-XXX,pve-16-XXX,c12-XXX,pve-103-XXX,pve-restore-XXX,pve-22-XXX,pve-119-XXX,pve-124-XXX,pve-102-XXX,pve-111-XXX,pve-105-XXX,pve-109-XXX,pve-108-XXX,pve-120-XXX,pve-107-XXX,pve-126-XXX,pve-114-XXX,pve-128-XXX,pve-112-XXX,pve-125-XXX,pve-101-XXX,pve-104-XXX,pve-106-XXX,pve-15-XXX
sparse 1

glusterfs: images
path /mnt/pve/images
volume images
content images,iso
prune-backups keep-all=1
server xxx.xxx.xxx.xxx
server2 xxx.xxx.xxx.xxx

pbs: PBS01-Datastore
datastore PBS01-Datastore
server xxx.xxx.xxx.xxx
content backup
fingerprint *****
prune-backups keep-all=1
username ***

zfspool: vmdata-32K-3
pool rpool-3/vmdata-32K-3
blocksize 32k
content images,rootdir
mountpoint /rpool-3/vmdata-32K-3
nodes pve-115-XXX, c12-XXX, pve-124-XXX
sparse 1

dir: cloudinit-local
path ***
content images
prune-backups keep-all=1
shared 1

glusterfs: pve-cloudinit
path ***
volume vol_main
content images,iso
prune-backups keep-all=1
server xxx.xxx.xxx.xxx
server2 xxx.xxx.xxx.xxx

zfspool: rpool-3
pool rpool-3
content images,rootdir
mountpoint /rpool-3
nodes pve-124-XXX, pve-115-XXX
sparse 1

zfspool: rpool-2
pool rpool-2
content images,rootdir
mountpoint /rpool-2
nodes pve-124-XXX, pve-126-XXX, pve-125-XXX

rbd: ceph-replicated-3-dh1-oadc-jhb1
content rootdir,images
krbd 0
nodes pve-125-XXX, pve-126-XXX, pve-124-XXX
pool ceph-replicated-XXXX

cifs: ca-pve-images
path ****
server xxx.xxx.xxx.xxx
share IMAGES-SMB
content iso,images
prune-backups keep-all=1
username ***

cifs: ca-pve-cloudinit
path ***
server xxx.xxx.xxx.xxx
share CLOUDINIT-SMB
content iso,images
prune-backups keep-all=1
username ***



PVE task logs:


INFO: trying to get global lock - waiting...
INFO: got global lock
INFO: starting new backup job: vzdump 3462 --compress 1 --remove 0 --storage PBS01-Datastore --lockwait 2880 --prune-backups 'keep-last=99999' --node pve-15-XXX
INFO: Starting Backup of VM 3462 (qemu)
INFO: Backup started at 2025-05-30 00:17:56
INFO: status = running
INFO: VM Name: TRLS-FUSION-01
INFO: include disk 'scsi0' 'vmdata-16K-3:vm-3462-disk-0' 512G
INFO: include disk 'scsi1' 'vmdata-32K-2:vm-3462-disk-0' 2000G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/3462/2025-05-29T22:17:50Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 3462 qmp command 'backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 3462 qmp command 'backup-cancel' failed - got wrong command id '91418:6' (expected 91418:7)
INFO: resuming VM again
ERROR: Backup of VM 3462 failed - VM 3462 qmp command 'backup' failed - got timeout
INFO: Failed at 2025-05-30 00:21:52
INFO: Backup job finished with errors
TASK ERROR: job errors



INFO: trying to get global lock - waiting...
INFO: got global lock
INFO: starting new backup job: vzdump 1514 --lockwait 2880 --storage PBS01-Datastore --remove 0 --node pve-16-XXX --prune-backups 'keep-last=99999' --compress 1
INFO: Starting Backup of VM 1514 (qemu)
INFO: Backup started at 2025-05-22 00:18:37
INFO: status = running
INFO: VM Name: TOPSQL01
INFO: include disk 'sata0' 'vmdata-32K-2:vm-1514-disk-0' 250G
INFO: include disk 'sata1' 'vmdata-32K-2:vm-1514-disk-1' 190G
INFO: include disk 'sata2' 'vmdata-32K-2:vm-1514-disk-4' 350G
INFO: include disk 'sata3' 'vmdata-32K-2:vm-1514-disk-2' 4000G
INFO: include disk 'sata4' 'vmdata-32K:vm-1514-disk-0' 1000G
INFO: include disk 'sata5' 'vmdata-32K-2:vm-1514-disk-3' 1000G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/1514/2025-05-21T22:18:37Z'
ERROR: VM 1514 qmp command 'backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 1514 qmp command 'backup-cancel' failed - got wrong command id '62570:6' (expected 62570:7)
INFO: resuming VM again
ERROR: Backup of VM 1514 failed - VM 1514 qmp command 'backup' failed - got timeout
INFO: Failed at 2025-05-22 00:21:09
INFO: Backup job finished with errors
TASK ERROR: job errors
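One detail worth noting from the timestamps above: both jobs abort only a few minutes after starting, long before a multi-TiB transfer could plausibly finish, which fits a fixed QMP timeout at archive creation rather than anything transfer-length dependent. A quick check of the gaps from the pasted log lines:

```python
# Compute the gap between the 'Backup started at' and 'Failed at' log lines.
from datetime import datetime

def elapsed(start, failed, fmt="%Y-%m-%d %H:%M:%S"):
    """Seconds between two vzdump task-log timestamps."""
    return int((datetime.strptime(failed, fmt)
                - datetime.strptime(start, fmt)).total_seconds())

# VM 3462: started 00:17:56, failed 00:21:52 -> 236 s
# VM 1514: started 00:18:37, failed 00:21:09 -> 152 s
```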