Backup PVE cluster with 15 nodes and over 650 VMs

Apr 2, 2018
Hi guys,

I have a PVE cluster with 15 nodes and over 650 virtual machines, plus some containers, and it is growing every day. Until now I have been doing backups with one PBS for all nodes, using a schedule that starts the backup on each node with a 15 minute delay. For a while now I have been getting timeout errors because the PBS can't handle all the jobs.
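
(For illustration only, not my exact configuration: the staggering looks roughly like this in /etc/pve/jobs.cfg, with placeholder node, storage and job names.)

vzdump: backup-node01
	schedule 21:00
	storage pbs-main
	node node01
	all 1
	mode snapshot

vzdump: backup-node02
	schedule 21:15
	storage pbs-main
	node node02
	all 1
	mode snapshot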

I have thought about creating a virtual PBS for each node and centralizing all the datastores in CephFS folders. I have made some tests and found that when I migrate a virtual machine between nodes the dirty bitmap is lost and the next backup has to read the whole disk again instead of only the changed blocks.
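
(In the test the datastores were simply created on the shared CephFS mount, something like this, with placeholder names and paths:)

# one datastore per virtual PBS, on the shared CephFS mount (placeholder path)
proxmox-backup-manager datastore create node01 /mnt/pve/cephfs/pbs/node01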

To avoid this dirty bitmap problem I could sync the backups between the PBS instances, but that implies duplicating the backup space on each PBS server.

Is there a configuration where I could have more than one PBS to distribute the backup load, and still migrate virtual machines between nodes without duplicating all the backup space?

Thanks in advance for your help.
 
Upgrading your PBS sounds like the simplest solution here.
If it is already able to keep up most of the time, a little extra headroom should let it get the job done.

What are the hardware specs on your PBS machine and how is it connected to the network? 10G, 25G? Bonding?
 
Thanks for your answer @SINOS
I think getting better hardware is only a patch; I need horizontal scalability. The number of virtual machines grows every day, it's impossible to keep growing with brute force, and that would need a big budget.
If PBS can't scale horizontally, it is only for small business clusters. PVE can grow horizontally by adding more nodes, so the backups also need to grow horizontally.
Before using PBS I did backups with Corsinvest's Ceph backup script (all my machines are on various Ceph storages), which I modified to work with more than one Ceph cluster at the same time. It was reliable, fast and horizontally scalable because it ran on each PVE node; on the other hand, it needs a good sysadmin to operate it and to restore backups, and it has neither an API nor a web panel.
I would prefer to use PBS, but I need a way to grow horizontally.
I can't believe I'm the only one with this problem.
Thanks for your answers!
 
To prevent confusion: @SINOS is my account at work, this is my private account.

Did you check if there was any other job running on PBS during the timeouts? Like a GC or Verify Task?
I experienced massive problems when running GC and backups in parallel (on HDDs though) - changing the GC schedule fixed/avoided the issue.
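
If that's happening in your case too, the GC schedule can be moved outside the backup window; roughly something like this (datastore name and calendar event are just examples):

# run garbage collection outside the backup window (example values)
proxmox-backup-manager datastore update yourstore --gc-schedule 'sat 05:00'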

PVE can grow horizontally by adding more nodes, so the backups also need to grow horizontally.
To a certain limit, yes. You can build pretty big clusters, but at some point you leave the area of "well-known and tested" setups.

I think getting better hardware is only a patch; I need horizontal scalability.
I get your point. The current specs would be interesting anyway. I'm very curious!
Especially interesting is the storage setup here - are you using HDDs, SSDs? Metadata devices?
Sometimes a minor adjustment like a metadata special device can unlock a lot of performance.
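
For example, if the datastore sits on ZFS, a mirrored special device can be added to an existing pool roughly like this (pool and disk names are placeholders, and note that only newly written metadata lands on it):

# add a mirrored special vdev for metadata (placeholder pool/disks)
zpool add tank special mirror /dev/disk/by-id/nvme-A /dev/disk/by-id/nvme-B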

About scalability:
You could use multiple PBS servers for different clusters or different hosts, of course, but that brings additional management overhead as well.
The virtualized PBS instances with CephFS underneath sound interesting, but I think it'd be inefficient for a single cluster (same encryption key) because you'd be sacrificing deduplication rate with 15 separate datastores instead of one big datastore for the whole cluster.

I'm curious whether it may be possible to use the same CephFS from multiple PBS instances... but iirc there is a lock file in the datastore that would prevent parallel access, and there would (possibly) be lots of other problems because of concurrent file access, like simultaneous garbage collections.

If you are willing to sacrifice "some" storage space, you could stay with the virtualized PBS instance per host and just set up remote sync jobs, so that all their datastore content ends up on your current (primary) PBS instance and is centralized again.
What makes me curious is how exactly your PBS is struggling, because most of the heavy lifting is done on the PVE side. If your storage is at its limit, the sync jobs could time out / overwhelm the primary PBS as well.
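
A sketch of what that could look like on the central PBS (sync jobs pull, so they'd be configured on the central side): one remote plus one sync job per virtual PBS, all names, fingerprint, password and schedule being placeholders:

# register one virtual PBS as a remote on the central instance
proxmox-backup-manager remote create pbs-node01 \
    --host pbs-node01.example.com \
    --auth-id sync@pbs \
    --password 'SECRET' \
    --fingerprint '64:d3:...:aa'

# pull its datastore into the central datastore on a schedule
proxmox-backup-manager sync-job create sync-node01 \
    --remote pbs-node01 --remote-store node01 \
    --store central --schedule '03:00'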
 
@tuxis is offering PBS as a Service, so I assume they found some way to scale-out PBS.
Maybe they can help?
 
We are saturating a 20 Gbit link with backups from the cluster, and on large clusters one needs to be able to limit the number of nodes that run backup tasks concurrently.
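
In the meantime, a per-node bandwidth cap in /etc/vzdump.conf can at least keep individual nodes from flooding the link (value in KiB/s, the number here is just an example):

# /etc/vzdump.conf (set per node) -- cap backup bandwidth, example value
bwlimit: 250000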
 
