Split backup jobs for large-ish cluster

Apr 29, 2021
Hello.
I run a cluster with 7 nodes, currently hosting 60 VMs and a few LXC containers. Since we are moving from VMware, the number of VMs will grow. For now the goal is about 200-250 VMs, but it will keep growing. I have 6 more hosts potentially joining.
I occasionally get timeouts from the PBS (it hits random VMs each time):

INFO: Starting Backup of VM 159 (qemu)
INFO: Backup started at 2024-04-03 21:13:50
INFO: status = running
INFO: VM Name: blabla
INFO: include disk 'scsi0' 'storage:159/vm-159-disk-0.qcow2' 100G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/159/2024-04-03T19:13:50Z'
ERROR: VM 159 qmp command 'backup' failed - got timeout
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 159 failed - VM 159 qmp command 'backup' failed - got timeout
INFO: Failed at 2024-04-03 21:17:30

I wonder if it's possible and/or a smart thing to split the backup jobs, so node 1 backs up at 21:00, node 2 at 22:00, and so on? Would that hurt the dirty bitmap and dedup if VMs are migrated from one host to another? Any other drawbacks?
Is it possible to exclude VMs if I go this route? VMs will not be on the same host at all times, since live migrations will take place. For example, if I exclude VM 106 on node 1, then I guess it won't be excluded if it is migrated to node 2...

Or should I just investigate why it's timing out...

The PBS is a 24 x Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz (2 Sockets) with 280GB RAM. The storage used for backups is an NFS-mounted Synology NAS.
PBS version: 3.1-2

Best regards
--
Markus
 
NFS will not perform properly for PBS, especially with that many VMs to back up [1]. I'm pretty sure that's why you are getting those timeouts. Sooner rather than later you should use local drives for PBS to get proper performance.

Meanwhile:

- You can create a separate backup job for each host and run them at different times. Live migration will hurt neither the dirty bitmap nor dedup, as long as you back up to the same storage/PBS datastore. The backup window will need to be bigger and you will have to manually estimate the times properly.
- You can exclude VMs, but AFAIK you will have to exclude them on every host where that VM may run. If you don't use resource pools [2] for any other purpose, maybe you can create a backup job that selects a resource pool, e.g. "VMs2backup", and for VMs that you don't want backups for, just remove the VM from that pool.
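
To get a feel for how big the window gets with per-host jobs, here is a rough Python sketch that just lays out staggered start times. The node names, the 21:00 first start and the 60-minute gap are assumptions taken from the example in the question, not measured values; you would replace the gap with your own estimate of how long each host's job really takes.

```python
from datetime import datetime, timedelta


def staggered_starts(nodes, first_start="21:00", gap_minutes=60):
    """Return {node: "HH:MM"}, each node starting gap_minutes after the previous one."""
    start = datetime.strptime(first_start, "%H:%M")
    return {
        node: (start + timedelta(minutes=i * gap_minutes)).strftime("%H:%M")
        for i, node in enumerate(nodes)
    }


# Hypothetical node names for the 7 hosts mentioned in the thread.
nodes = [f"node{i}" for i in range(1, 8)]
schedule = staggered_starts(nodes)

for node, start_time in schedule.items():
    print(f"{node}: backup job starts at {start_time}")
```

With 7 hosts and a one-hour gap the last job only starts at 03:00, and if the 6 extra hosts join, the same spacing stretches well into the morning, so the gap is the number worth estimating carefully.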


[1] https://pbs.proxmox.com/docs/system-requirements.html
[2] https://pve.proxmox.com/wiki/User_Management#pveum_resource_pools