Backup PVE cluster with 15 nodes and over 650 VMs

Apr 2, 2018
Hi guys,

I have a PVE cluster with 15 nodes and over 650 virtual machines, plus some containers, and it is growing every day. Until now I have been doing backups with one PBS for all nodes, using a schedule that starts the backup on each node with a 15 minute delay. For a while now I have been getting timeout errors because the PBS can't handle all the jobs.
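
(For illustration only, not my exact configuration: the staggering looks roughly like this in /etc/pve/jobs.cfg, with placeholder node, storage and job names.)

vzdump: backup-node01
	schedule 21:00
	storage pbs-main
	node node01
	all 1
	mode snapshot

vzdump: backup-node02
	schedule 21:15
	storage pbs-main
	node node02
	all 1
	mode snapshot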

I have thought about creating a virtual PBS for each node and centralizing all the datastores in CephFS folders. I have made some tests and found that when I migrate a virtual machine between nodes the dirty bitmap is lost and the next backup has to read the whole disk again instead of only the changed blocks.
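
(In the test the datastores were simply created on the shared CephFS mount, something like this, with placeholder names and paths:)

# one datastore per virtual PBS, on the shared CephFS mount (placeholder path)
proxmox-backup-manager datastore create node01 /mnt/pve/cephfs/pbs/node01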

To avoid this dirty bitmap problem I could sync the backups between the PBS instances, but that implies duplicating the backup space on each PBS server.

Is there a configuration where I could have more than one PBS to distribute the backup load, and still migrate virtual machines between nodes without duplicating all the backup space?

Thanks in advance for your help.
 
Upgrading your PBS sounds like the simplest solution here.
If it is already able to keep up most of the time, a little extra headroom should let it get the job done.

What are the hardware specs on your PBS machine and how is it connected to the network? 10G, 25G? Bonding?
 
Thanks for your answer @SINOS
I think getting better hardware is only a patch; I need horizontal scalability. The number of virtual machines grows every day, it's impossible to keep growing with brute force, and that would need a big budget.
If PBS can't scale horizontally, it is only for small business clusters. PVE can grow horizontally by adding more nodes, so the backups also need to grow horizontally.
Before using PBS I did backups with Corsinvest's Ceph backup script (all my machines are on various Ceph storages), which I modified to work with more than one Ceph cluster at the same time. It was reliable, fast and horizontally scalable because it ran on each PVE node; on the other hand, it needs a good sysadmin to operate it and to restore backups, and it has neither an API nor a web panel.
I would prefer to use PBS, but I need a way to grow horizontally.
I can't believe I'm the only one with this problem.
Thanks for your answers!
 
To prevent confusion: @SINOS is my account at work, this is my private account.

Did you check if there was any other job running on PBS during the timeouts? Like a GC or Verify Task?
I experienced massive problems when running GC and backups in parallel (on HDDs though) - changing the GC schedule fixed/avoided the issue.
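
If that's happening in your case too, the GC schedule can be moved outside the backup window; roughly something like this (datastore name and calendar event are just examples):

# run garbage collection outside the backup window (example values)
proxmox-backup-manager datastore update yourstore --gc-schedule 'sat 05:00'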

PVE can grow horizontally by adding more nodes, so the backups also need to grow horizontally.
To a certain limit, yes. You can build pretty big clusters, but at some point you leave the area of "well-known and tested" setups.

I think getting better hardware is only a patch; I need horizontal scalability.
I get your point. The current specs would be interesting anyway. I'm very curious!
Especially interesting is the storage setup here - are you using HDDs, SSDs? Metadata devices?
Sometimes a minor adjustment like a metadata special device can unlock a lot of performance.
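
For example, if the datastore sits on ZFS, a mirrored special device can be added to an existing pool roughly like this (pool and disk names are placeholders, and note that only newly written metadata lands on it):

# add a mirrored special vdev for metadata (placeholder pool/disks)
zpool add tank special mirror /dev/disk/by-id/nvme-A /dev/disk/by-id/nvme-B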

About scalability:
You could use multiple PBS servers for different clusters or different hosts, of course, but that brings additional management overhead as well.
The virtualized PBS instances with CephFS underneath sound interesting, but I think it'd be inefficient for a single cluster (same encryption key) because you'd be sacrificing deduplication rate with 15 separate datastores instead of one big datastore for the whole cluster.

I'm curious whether it may be possible to use the same CephFS from multiple PBS instances... but iirc there is a lock file in the datastore that would prevent parallel access, and there would (possibly) be lots of other problems because of concurrent file access, like simultaneous garbage collections.

If you are willing to sacrifice "some" storage space, you could stay with the virtualized PBS instance per host and just set up remote sync jobs, so that all their datastore content ends up on your current (primary) PBS instance and is centralized again.
What makes me curious is how exactly your PBS is struggling, because most of the heavy lifting is done on the PVE side. If your storage is at its limit, the sync jobs could time out / overwhelm the primary PBS as well.
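
A sketch of what that could look like on the central PBS (sync jobs pull, so they'd be configured on the central side): one remote plus one sync job per virtual PBS, all names, fingerprint, password and schedule being placeholders:

# register one virtual PBS as a remote on the central instance
proxmox-backup-manager remote create pbs-node01 \
    --host pbs-node01.example.com \
    --auth-id sync@pbs \
    --password 'SECRET' \
    --fingerprint '64:d3:...:aa'

# pull its datastore into the central datastore on a schedule
proxmox-backup-manager sync-job create sync-node01 \
    --remote pbs-node01 --remote-store node01 \
    --store central --schedule '03:00'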
 
@tuxis is offering PBS as a Service, so I assume they found some way to scale-out PBS.
Maybe they can help?
 
We are saturating a 20 Gbit link with backups from the cluster, and on large clusters one needs to be able to limit the number of nodes that run backup tasks concurrently.
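
In the meantime, a per-node bandwidth cap in /etc/vzdump.conf can at least keep individual nodes from flooding the link (value in KiB/s, the number here is just an example):

# /etc/vzdump.conf (set per node) -- cap backup bandwidth, example value
bwlimit: 250000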
 
