[SOLVED] PBS - multiple backup failed VM every day

plnt

Hi all,

We have been using your products in our company for many years.
Not long ago, we completely renovated our entire server infrastructure. So far there are 9 PVE nodes, 3 PMG clusters running on VMs, and a standalone physical server running PBS 3.0-1.

We have a problem with backups. Every day we need to back up a large number of VMs. Many of them are mail servers with 2 TB disks. All of the failed backups show the same log:
Bash:
2018: 2023-07-27 20:12:47 INFO: Starting Backup of VM 2018 (qemu)
2018: 2023-07-27 20:12:47 INFO: status = running
2018: 2023-07-27 20:12:47 INFO: VM Name: webmail01.wh.local
2018: 2023-07-27 20:12:47 INFO: include disk 'virtio0' 'nvme_pool:vm-2018-disk-1' 15G
2018: 2023-07-27 20:12:47 INFO: include disk 'virtio1' 'nvme_pool:vm-2018-disk-2' 16G
2018: 2023-07-27 20:12:47 INFO: include disk 'efidisk0' 'nvme_pool:vm-2018-disk-0' 528K
2018: 2023-07-27 20:12:48 INFO: backup mode: snapshot
2018: 2023-07-27 20:12:48 INFO: ionice priority: 7
2018: 2023-07-27 20:12:48 INFO: snapshots found (not included into backup)
2018: 2023-07-27 20:12:48 INFO: creating Proxmox Backup Server archive 'vm/2018/2023-07-27T18:12:47Z'
2018: 2023-07-27 20:12:48 INFO: issuing guest-agent 'fs-freeze' command
2018: 2023-07-27 20:14:53 INFO: issuing guest-agent 'fs-thaw' command
2018: 2023-07-27 20:14:53 ERROR: VM 2018 qmp command 'backup' failed - got timeout
2018: 2023-07-27 20:14:53 INFO: aborting backup job
2018: 2023-07-27 20:20:27 INFO: resuming VM again
2018: 2023-07-27 20:20:27 ERROR: Backup of VM 2018 failed - VM 2018 qmp command 'backup' failed - got timeout
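
The timestamps in the log show how long the guest stayed frozen and how long the abort took. The following is only a rough Python sketch for reading those gaps from such a log. The helper names (`parse_timestamp`, `seconds_between`) are made up for illustration, and the line layout is assumed to match the excerpt above.

```python
from datetime import datetime

# Excerpt copied from the task log above.
LOG = """\
2018: 2023-07-27 20:12:48 INFO: issuing guest-agent 'fs-freeze' command
2018: 2023-07-27 20:14:53 INFO: issuing guest-agent 'fs-thaw' command
2018: 2023-07-27 20:14:53 ERROR: VM 2018 qmp command 'backup' failed - got timeout
2018: 2023-07-27 20:14:53 INFO: aborting backup job
2018: 2023-07-27 20:20:27 INFO: resuming VM again
"""


def parse_timestamp(line):
    # Assumed layout: "<vmid>: <date> <time> <LEVEL>: <message>"
    _, date, time, _ = line.split(" ", 3)
    return datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")


def seconds_between(log, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the
    next later line containing end_marker."""
    start = None
    for line in log.splitlines():
        if start is None:
            if start_marker in line:
                start = parse_timestamp(line)
        elif end_marker in line:
            return int((parse_timestamp(line) - start).total_seconds())
    raise ValueError("markers not found in the expected order")


frozen = seconds_between(LOG, "fs-freeze", "fs-thaw")
abort = seconds_between(LOG, "aborting backup job", "resuming VM")
print(f"guest frozen for {frozen} s, abort took {abort} s")
```

For this log the freeze lasted 125 seconds before the 'backup' command timed out, and the VM was resumed a further 334 seconds after the abort started.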

Moreover, while the VMs are being backed up, PBS is unavailable most of the time.
Everything is backed up over a 25 Gb network that is reserved for backups and Ceph. The PVE nodes have NVMe and SSD Ceph pools, and PBS has Seagate Exos 4 TB 7200 rpm drives in a ZFS raidz2.

I tried stopping the prune job and the garbage collector, as well as the verify task, but nothing helped. I also tried splitting the backup jobs... The VMs run CentOS Stream 9 and are up to date, including qemu-guest-agent version 8.0.0.8.

Thank you very much for any help or suggestions.
Andrej Lacho

Attachments: pbs.png, pbs_graph.png, pve.png, pve_backup_jobs.png, pve_tasks.png

Hi,
what does the load on your PBS look like? Maybe the spinning disks just can't keep up with all the data coming in over the network. Did you already try setting bandwidth limits (this can be done on either the Proxmox VE or the PBS side)?
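
On the Proxmox VE side, one way to apply such a limit is the `--bwlimit` option of `vzdump`, which takes a value in KiB/s. Below is a minimal Python sketch that only builds the command line and does not run it. The storage name `pbs-store` and the 100 MiB/s figure are placeholders for illustration, not recommendations.

```python
def vzdump_command(vmid, storage, limit_mib_per_s):
    """Build a vzdump command line with a bandwidth limit.

    vzdump expects --bwlimit in KiB/s, so the MiB/s value is converted.
    """
    if limit_mib_per_s <= 0:
        raise ValueError("limit must be positive")
    limit_kib_per_s = limit_mib_per_s * 1024
    return [
        "vzdump", str(vmid),
        "--storage", storage,          # hypothetical PBS storage ID
        "--mode", "snapshot",
        "--bwlimit", str(limit_kib_per_s),
    ]


# Placeholder values: VM 2018 from the log, an assumed storage name,
# and an arbitrary 100 MiB/s cap.
cmd = vzdump_command(2018, "pbs-store", 100)
print(" ".join(cmd))
```

A suitable limit depends on what the raidz2 pool can actually sustain under load, so it is best chosen by watching the PBS I/O graphs while a test job runs.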
 
Hi Fiona,

yes, it is caused by the disks.
I recommend adding an NVMe PCIe card for the ZFS data and dividing the backup into several jobs. This helped me, because the SATA disks could not handle 15 jobs at once.
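
As a rough illustration of dividing the work, here is a small Python sketch that splits a list of VM IDs into smaller groups, each of which could become its own backup job with its own schedule. The VM IDs and the group size of 5 are invented for the example. The actual jobs would still be created in the PVE backup job configuration.

```python
def split_into_jobs(vmids, per_job):
    """Split VM IDs into consecutive groups of at most per_job entries."""
    if per_job < 1:
        raise ValueError("per_job must be at least 1")
    return [vmids[i:i + per_job] for i in range(0, len(vmids), per_job)]


# Hypothetical VM IDs; only 2018 appears in the log above.
vmids = list(range(2010, 2025))  # 15 VMs
for number, group in enumerate(split_into_jobs(vmids, 5), start=1):
    print(f"job {number}: {','.join(map(str, group))}")
```

Scheduling the resulting jobs at different times keeps fewer backup streams hitting the spinning disks at once.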

Attachments: backup_jobs_new.png