Hello experts!
We’re experiencing an intermittent and unpredictable issue with backups in our Proxmox cluster, and I’m hoping the community can help us understand what's going wrong, or advise on better tuning and scheduling.

Setup Overview:
- PVE Cluster: ~30 nodes
- PBS Server: 1 server (bare metal)
- OS Disk: SSD
- Datastore: HDD with 1x SSD configured as a special device for metadata
- Backup mode: snapshot
- Verification: Enabled after backups
- Garbage Collection: Managed using atime, runs regularly
- Storage: Ceph-backed VM disks
- Concurrency: Backups run nightly via scheduled jobs across nodes

Problem Summary:
- PVE backup jobs occasionally fail with a timeout, but the corresponding backup actually completes successfully on PBS minutes (or even hours) later.
- The failure message is usually as follows (the VM IDs are included only to identify the attachments):
INFO: aborting backup job
ERROR: VM 1514 qmp command 'backup-cancel' failed - got wrong command id
- The PBS log shows the backup fully streamed and finalized.
- Other cases show a similar pattern: PVE marked the job as failed due to a timeout, but PBS has a completed and verifiable backup.
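For anyone who wants to check their own task logs for the same signature, here is a minimal Python sketch. It only assumes the error wording quoted above; the function name `find_cancel_failures` is my own, and the sample text is just the two log lines from this post.

```python
import re

# Pattern built from the error line quoted in this post; other PVE versions
# may word the message differently (assumption).
CANCEL_ERROR = re.compile(
    r"ERROR: VM (?P<vmid>\d+) qmp command 'backup-cancel' failed - (?P<reason>.+)"
)


def find_cancel_failures(log_text):
    """Return a list of (vmid, reason) for each backup-cancel failure line."""
    failures = []
    for line in log_text.splitlines():
        match = CANCEL_ERROR.search(line)
        if match:
            failures.append((int(match.group("vmid")), match.group("reason").strip()))
    return failures


if __name__ == "__main__":
    sample = (
        "INFO: aborting backup job\n"
        "ERROR: VM 1514 qmp command 'backup-cancel' failed - got wrong command id\n"
    )
    print(find_cancel_failures(sample))  # [(1514, 'got wrong command id')]
```

The resulting VM IDs can then be compared by hand against the snapshots that PBS lists as completed.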

Observations and Suspicions:
- PBS server I/O load may be the root issue:
  - We suspect that backup verification running after the backup job increases PBS load and delays responses back to PVE.
  - Garbage collection may also compound the issue if it overlaps.
- PVE backup timeout seems to occur at the archive creation point, not during the bulk of the data transfer.
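To test the overlap suspicion on paper, a small sketch like the one below can flag which daily job windows collide. The start times and durations here are made-up placeholders, not our real schedule, and the helper names are my own; it also assumes no single job runs longer than 24 hours.

```python
DAY = 24 * 60  # minutes in a day


def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def overlaps(a, b):
    """a and b are (start_minute, duration_minutes) in a repeating daily schedule."""
    a_start, a_len = a
    b_start, b_len = b
    # Shift b by one day in each direction so windows crossing midnight are caught.
    for shift in (-DAY, 0, DAY):
        if a_start < b_start + shift + b_len and b_start + shift < a_start + a_len:
            return True
    return False


def find_overlaps(windows):
    """Return every pair of job names whose windows overlap."""
    names = sorted(windows)
    return [
        (first, second)
        for i, first in enumerate(names)
        for second in names[i + 1:]
        if overlaps(windows[first], windows[second])
    ]


if __name__ == "__main__":
    # Hypothetical schedule: (start, duration in minutes).
    schedule = {
        "backup": (to_minutes("22:00"), 300),  # runs until 03:00
        "verify": (to_minutes("02:00"), 180),
        "gc": (to_minutes("06:00"), 120),
    }
    print(find_overlaps(schedule))  # [('backup', 'verify')]
```

Plugging in the observed durations from the PBS task list, rather than the scheduled ones, should show whether verification or GC really runs while backups are still finalizing.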

Questions for the Community:
- Has anyone else seen this pattern: timeouts at archive finalization but successful backups on PBS?
- Is there any way to increase the QMP timeout for the backup operation in PVE?
- Are there known best practices for scheduling verification and GC to avoid overlap with backups?
- Is there a recommended I/O monitoring approach to confirm PBS load at time of failure?
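On the monitoring question, one low-tech option I'm considering is sampling `/proc/diskstats` on the PBS host and computing how busy each disk was between two samples. The sketch below assumes the standard Linux layout, where the 13th field is the time spent doing I/O in milliseconds; the function names are my own, and it is not a replacement for a proper tool such as `iostat`.

```python
import time


def read_io_ticks(diskstats_text):
    """Map device name -> milliseconds spent doing I/O (13th field per line)."""
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 13:
            ticks[fields[2]] = int(fields[12])
    return ticks


def utilization(before, after, interval_seconds):
    """Percent of the interval each device was busy, from two samples."""
    interval_ms = interval_seconds * 1000
    return {
        dev: 100.0 * (after[dev] - before[dev]) / interval_ms
        for dev in after
        if dev in before
    }


def sample_utilization(interval_seconds=5.0, path="/proc/diskstats"):
    """Take two samples interval_seconds apart and return per-device busy percent."""
    with open(path) as handle:
        before = read_io_ticks(handle.read())
    time.sleep(interval_seconds)
    with open(path) as handle:
        after = read_io_ticks(handle.read())
    return utilization(before, after, interval_seconds)
```

Calling `sample_utilization()` in a loop from cron during the backup window and logging the result with a timestamp would at least show whether the HDD datastore is saturated at the moment PVE reports the timeout.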
I’d appreciate any input, tuning advice, or even confirmation that this is a known limitation or bug we should account for.