PBS scaling out storage

tjk · Oct 11, 2021

How do you scale out storage once a PBS node starts to run out of space?

For example, I set up a PBS server with 20TB of backup space. In 2 years I start to hit 80-90% of that capacity. What is the best way to add more storage to the PBS server and use it for existing backup jobs?

It would be cool if you had a concept of scale out repositories - where you just present a bunch of datastores to a set of backup jobs and it load balances/uses them across jobs.

tuxis · Oct 25, 2021

We use ZFS pools and more disks to scale out, which works like a charm.

tjk · Oct 25, 2021

tuxis said:
We use ZFS pools and more disks to scale out, which works like a charm.

I assume you are presenting datastores as NFS mount points then? You just grow the NFS mount point as you need more space on a datastore?

tuxis · Nov 16, 2021

tjk said:
I assume you are presenting datastores as NFS mount points then? You just grow the NFS mount point as you need more space on a datastore?

No. We create a filesystem per datastore, locally on the local ZFS pool. We set a quota per user, each user has its own main filesystem and a per-datastore sub-filesystem.
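
A minimal sketch of that layout, assuming hypothetical names (pool `tank`, user `alice`, datastores `store1` and `store2`) and a hypothetical 2T quota. It only builds the `zfs create` command lines and does not run them, and it is not necessarily the exact procedure used here:

```python
def zfs_layout_commands(pool, user, datastores, user_quota):
    """Build zfs commands for one user filesystem plus one child per datastore.

    All names and the quota value are illustrative assumptions.
    """
    user_fs = f"{pool}/{user}"
    # The quota property on the parent also limits everything below it.
    commands = [["zfs", "create", "-o", f"quota={user_quota}", user_fs]]
    for name in datastores:
        # One sub-filesystem per datastore under the user's main filesystem.
        commands.append(["zfs", "create", f"{user_fs}/{name}"])
    return commands


if __name__ == "__main__":
    for cmd in zfs_layout_commands("tank", "alice", ["store1", "store2"], "2T"):
        print(" ".join(cmd))
```

Because the quota sits on the user's main filesystem, all of that user's datastores share it, and growing the pool later does not require touching the per-datastore filesystems.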

tjk · Nov 16, 2021

tuxis said:
No. We create a filesystem per datastore, locally on the local ZFS pool. We set a quota per user, each user has its own main filesystem and a per-datastore sub-filesystem.

Yea, except this is exactly the problem I am describing. How do you keep adding disk to a server that has fixed disk capacity?

Also, adding disk to a ZFS pool isn't the most efficient since it doesn't rebalance the existing data on the disk.

tuxis · Nov 16, 2021

You add external cabinets where you can add disks.

Indeed, growing a ZFS pool does not rebalance data, but that has not caused any issues so far. Some data is deleted and some new data is added all the time. In the long run, this balances out well enough.

Felix. · Nov 16, 2021

Growing a ZFS pool sounds like a good solution to me, for quite some time. One can add a hell of a lot of disks using some JBODs.
And nowadays there are pretty large HDDs, too.

If that doesn't fit, building a Ceph cluster could work for further scaling, but if you expect to reach 20TB in 2-3 years, that'd be overkill imho.
In some years, there are probably even larger HDDs and you can simply replace some of your current disks and gain storage that way - pro: no rebalance necessary in that case.

If you want to go crazy and keep it all in one shelf no matter what for a very long time, there are several 90-bay cases available from Supermicro for example and manufacturers like NimbusData offer SSDs up to 100TB each.

PBS also supports LTO tape libraries, those can also store lots of data in a single chassis.

tjk · Nov 16, 2021

We don't have the luxury of having empty cabs to just put up more JBODs and chain them together, plus that only scales so far.

I think we'll stick to using NFS mounts for datastores.

Felix. · Nov 16, 2021

You can buy and deploy those JBODs on demand, no need to keep empty cabs around.
If you are fine with multiple data sources (one NFS share per datastore), you have lots of additional possibilities anyway.
It sounded like you really wanted one single machine to serve all your datastores, at all times.

Besides that, 20TB is not that much. I'm getting a fresh PBS machine with 8x 10TB HDDs. Using striped mirrors (RAID-10) and reserving 20% to not overload the ZFS pool, I still end up with about 30 TiB of usable storage.
So I don't really see a problem for your use case, as HDDs already go up to 16 or 18TB per disk nowadays.
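
A rough sketch of that capacity arithmetic, assuming decimal terabytes per disk, two-way mirrors, and no allowance for ZFS metadata overhead (the function name and defaults are illustrative):

```python
TB = 10**12   # decimal terabyte, as printed on drive labels
TIB = 2**40   # binary tebibyte, as reported by most tools


def usable_tib(disks, disk_tb, reserve=0.20):
    """Estimate usable space of striped two-way mirrors with a free-space reserve."""
    raw_bytes = disks * disk_tb * TB
    mirrored_bytes = raw_bytes / 2          # each mirror pair stores one copy of the data
    usable_bytes = mirrored_bytes * (1 - reserve)
    return usable_bytes / TIB


if __name__ == "__main__":
    print(f"{usable_tib(8, 10):.1f} TiB")   # 8x 10TB HDDs, 20% reserved
```

For 8x 10TB this comes out at roughly 29 TiB, which matches the "about 30 TiB" figure above.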

tjk · Nov 16, 2021

Buying JBODs isn't the issue. It's having an empty cab nearby 2 years later to add more JBODs to the existing pool. In 2 years when I have to deploy another JBOD, that cab might be in another aisle and not something I can SAS connect to the existing pool.

Also, building out huge pools with spinning disk is a bad thing. I have a datastore today with 30TB active on it, and the verifies take a long time to finish, blocking backup jobs from running.

That is why the Proxmox folks recommend building out datastores using SSDs, which I disagree with btw. Spinning disks are still cheaper than SSDs and last a lot longer than SSDs.

If I had a 5 TB pool, sure I'd do all SSD, but when you are planning 50 to 100TB of backup data and growing for PVE, SSD is a non-starter.

Felix. · Nov 16, 2021

tjk said:
Buying JBODs isn't the issue. It's having an empty cab nearby 2 years later to add more JBODs to the existing pool. In 2 years when I have to deploy another JBOD, that cab might be in another aisle and not something I can SAS connect to the existing pool.

Ah, I understand. Well, building multiple smaller servers that provide NFS shares may be a more appropriate solution then.
Or, as long as it's feasible, upgrading the existing HDDs to bigger ones.

tjk said:
Also, building out huge pools with spinning disk is a bad thing.

Why? ZFS can handle dozens of disks properly, and with new features like dRAID even rebuilds can be quite painless, given that a proper RAID level (RAIDZ-2 or RAIDZ-3) is chosen for that number of disks.

tjk said:
I have a datastore today with 30TB active on it, and the verifies take a long time to finish, blocking backup jobs from running.

The more HDDs, the better the verify jobs should run, because those are basically streaming data and comparing checksums (which may be CPU-heavy).
For garbage collection etc., ZFS special devices could do a good job.

As for the verify job specifically, I remember seeing some git commits about removing chunk locks for read-only operations; the latest updates should allow you to run backup jobs in parallel with a verify job.

tjk said:
That is why the Proxmox folks recommend building out datastores using SSDs, which I disagree with btw. Spinning disks are still cheaper than SSDs and last a lot longer than SSDs.

I agree, SSDs are still way more expensive than HDDs, but I have also seen the massive advantage of an all-SSD PBS.
Whether one needs that kind of performance depends, in the end, on your RTO requirements.
I've had good experiences using L2ARC drives (4MB recordsize on my pool, so it's cheap on RAM), even with cheap consumer SSDs.

tjk said:
If I had a 5 TB pool, sure I'd do all SSD, but when you are planning 50 to 100TB of backup data and growing for PVE, SSD is a non-starter.

If you plan to go 100+ TB, you will probably switch the storage implementation at some point, anyway.
ZFS for some time, and then maybe an (erasure-coded) Ceph cluster when you start scaling out massively.

tjk · Nov 16, 2021

Felix. said:
Why? ZFS can handle dozens of disks properly, and with new features like dRAID even rebuilds can be quite painless, given that a proper RAID level (RAIDZ-2 or RAIDZ-3) is chosen for that number of disks.

Good question for the PVE team, they seem to think SSD for datastores is the way to go, and that doesn't make sense at scale for sure.

Which is interesting: they support tape just fine, but anything on disk is I/O intensive for verifies and such.

I hope improvements come for verify and such, my verify time is blocking backups right now. We are on a subscription plan and I haven't seen this patch hit the sub version yet.

Felix. · Nov 16, 2021

Ah, it looks like I misremembered.
I've found the commit I was thinking about: https://git.proxmox.com/?p=proxmox-...it;h=6eade0ebb76acfa262069575d7d1eb122f8fc2e2
But that is about backup restores, not verifies.

tjk said:
I hope improvements come for verify and such

Overall, I don't see any magical performance "fix" coming, because the verify operation is simply disk intensive: it reads each whole 4MB chunk.
I don't see much room for optimizations there, as this already is a quite simple operation.
If the Verify really takes too long, cache drives may help with that.
On the other hand, HDDs and Tapes are very good at those streaming read workloads, usually.

Why the verify blocks backup jobs is unclear to me, as verify should be a read-only operation anyway.
What is preventing new backups from being created in the background then? It should not interfere.

tuxis · Nov 17, 2021

When using ZFS, the verify-process IMHO is very overrated. It protects you from bitrot, which ZFS already has protection for.

Our largest PBS is currently about 90TB, filled with about 73TB of data. Verify as we speak reads about 2GB per second, and we try to limit the amount of verifies.
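
As a back-of-the-envelope check of those numbers, here is a small sketch. It assumes decimal units and that a full verify reads all stored data exactly once at a sustained rate, which is a simplification (the function name is illustrative):

```python
def verify_hours(data_tb, read_gb_per_s):
    """Estimate hours needed to read data_tb terabytes at read_gb_per_s GB/s."""
    seconds = (data_tb * 1e12) / (read_gb_per_s * 1e9)
    return seconds / 3600


if __name__ == "__main__":
    # 73TB of data at about 2GB per second, as quoted above
    print(f"{verify_hours(73, 2):.1f} hours")
```

At that rate a full pass over 73TB takes on the order of 10 hours, which is one reason to limit how often everything is re-verified.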

One thing that improves performance greatly is adding NVMe 'special' devices for ZFS. That mirror set handles thousands of IOPS, which then don't go to your spinning disks.

Also read https://forum.proxmox.com/threads/zfs-pool-io-when-idle.89911/#post-393317 and https://forum.proxmox.com/threads/is-verify-task-needed-with-zfs-backed-datastore.84081/#post-369460. A note on the latter: @t.lamprecht does not advise disabling verification, but he opens a window for making educated decisions about the use of verification.
