I've set up my clusters so that my backups are offsite, meaning that DC1 backs up to DC3
So you have no onsite backup and it backs up live over the WAN?
Or do you have dark fiber there? What is the bandwidth and latency between the datacenters?
Now, first and foremost, PBS performance across the network isn't good. That's a fact.
I'm surprised this works at all without backup jobs failing or machines freezing frequently.
I did direct backups over WAN before, though not at terabyte scale, and it always had some quirks from time to time.
I'm surprised by the prices given by Dunuin for SSDs, but those would make an all-SSD PBS way more affordable.
Take a look at Mindfactory, for example. Ignoring the 19% VAT, I get around 14.5k€ net for 12x PM9A3 15.36 TB there.
Asking a distributor for an offer will probably yield even better pricing.
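Just to put that figure into perspective (my own rough numbers, not an actual quote):

```bash
# Rough sanity check of the ballpark figure above (assumed numbers, not a quote):
# 14,500 € net for 12 drives -> price per drive and per TB of raw capacity
echo "scale=0; 14500 / 12" | bc             # ~1208 € per PM9A3 15.36 TB
echo "scale=1; 14500 / (12 * 15.36)" | bc   # ~78.6 € per TB raw
```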
But in general, NVMe is getting cheaper and cheaper, to the point where SATA SSDs hardly make sense anymore.
HDDs are still cheaper at the same capacity, but the performance difference can't be overstated.
Modern datacenter NVMe drives just outrun whatever you throw at them and let you do things that would be madness on spinning disks.
Worth it!
About the RAID level, I'd personally go with RAID50 (2x 6 disks, i.e. two striped raidz1 vdevs) or a single RAID6 (1x 12-disk raidz2), depending on your level of paranoia.
A rebuild is no big deal when your drives shovel gigabytes per second around, and technology like dRAID reduces rebuild time even further.
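To illustrate in ZFS terms what those layouts could look like (pool name, disk paths and the dRAID group tuning below are placeholders, adjust to your hardware):

```bash
# RAID50-style: two striped 6-disk raidz1 vdevs (disk names are placeholders)
zpool create -o ashift=12 backup \
  raidz1 /dev/disk/by-id/nvme-disk{1..6} \
  raidz1 /dev/disk/by-id/nvme-disk{7..12}

# RAID6-style: one 12-disk raidz2 vdev
zpool create -o ashift=12 backup raidz2 /dev/disk/by-id/nvme-disk{1..12}

# dRAID2 variant with one distributed spare (ZFS 2.1+) for faster resilvers
# (group layout: 4 data + 2 parity, 12 children, 1 spare - tune to taste)
zpool create -o ashift=12 backup draid2:4d:12c:1s /dev/disk/by-id/nvme-disk{1..12}
```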
two Intel 10c/20t CPUs) with 64 GB of DDR4 RAM. If I do go with ZFS, I'll probably go for 128 GB instead.
Try to avoid Intel CPUs. I don't know whether this has improved with the latest generation, but in general they are notoriously slow in PBS benchmarks; see here:
https://forum.proxmox.com/threads/how-fast-is-your-backup-datastore-benchmark-tool.72750/
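You can get comparable numbers for your own hardware with the benchmark tool from that thread:

```bash
# Local CPU benchmark (SHA-256, compression, AES-256-GCM, verification speed)
proxmox-backup-client benchmark

# Optionally also measure TLS upload speed to an existing datastore
# (the repository string is just an example, adjust user/host/datastore)
proxmox-backup-client benchmark --repository backup@pbs@pbs.example.com:datastore1
```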
To really utilize such high-performing drives, you'll need serious single-core performance as well as plenty of threads.
I'd recommend something like the AMD EPYC 9274F, which is the latest generation and has high clock speeds as well as the most modern instruction sets. PBS relies heavily on cryptography, and so does ZFS: checksums everywhere.
As the backup chunks are encrypted and hashed on the client, your backup speed probably won't improve much, depending on the load of the current machines (I mean, it's an Opteron), but verify jobs, for example, will greatly benefit from a modern, powerful CPU.
About RAM, the rule of thumb with ZFS is 1 GB per 1 TB of pool capacity, so for a pool in the ~100 TB range 128 GB should be safe, but since additional services like PBS itself need memory too, 192 GB wouldn't hurt and won't break the bank.
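One thing I'd add on my side: you can cap the ARC so ZFS doesn't compete with PBS for memory. The values below are only an example for a 128 GB machine:

```bash
# Cap the ZFS ARC at e.g. 96 GiB so PBS and the OS keep ~32 GiB of headroom
# (example value; persists across reboots after rebuilding the initramfs)
echo "options zfs zfs_arc_max=$((96 * 1024**3))" > /etc/modprobe.d/zfs.conf
update-initramfs -u

# Or apply it immediately without a reboot
echo $((96 * 1024**3)) > /sys/module/zfs/parameters/zfs_arc_max
```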
DC1 would back up to the SSD PBS in DC1, then that PBS would sync to the long-term storage in DC2 and DC3 for offsite backups
I wouldn't build 2 servers per DC.
And 2 offsite backups (so 3 copies in total) increase your storage requirements.
While HDDs are cheaper on their own, once you add the hardware and operational costs of a whole separate box on top, I see the economic gains dwindling.
I'm thinking about 2 big PBS that remote-sync each other, with DC3 and DC4 backing up directly to whichever is available.
Eh, I'll attach a diagram. That way you have 2 backups of each DC, at 2 different locations.
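Wiring up that cross-sync is basically one remote plus one pull sync job on each of the two PBS; a rough sketch (hostnames, auth ID, store names and the schedule are placeholders, not from your setup):

```bash
# On PBS "A": register PBS "B" as a remote and pull its backups over
proxmox-backup-manager remote create pbs-b \
  --host pbs-b.example.com \
  --auth-id sync@pbs \
  --password 'SECRET' \
  --fingerprint '00:11:22:...'

proxmox-backup-manager sync-job create pull-from-b \
  --remote pbs-b --remote-store main --store main \
  --schedule 'hourly'

# Mirror the same setup on PBS "B", pointing back at PBS "A"
```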
If you don't already, consider using a single datastore with one namespace per DC, instead of one datastore per DC.
That way you can share the chunk store across all datacenter backups and further improve the deduplication rate.
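On the PVE side that just means every cluster points at the same datastore with its own namespace; roughly like this (storage ID, hostnames, namespace names and fingerprint are only examples, double-check the options against man pvesm on your version):

```bash
# On the PVE cluster in DC1: shared datastore "main", one namespace per DC
pvesm add pbs pbs-main \
  --server pbs1.example.com \
  --datastore main \
  --namespace DC1 \
  --username backup@pbs \
  --password 'SECRET' \
  --fingerprint '00:11:22:...'

# The clusters in DC2 and DC3 do the same with --namespace DC2 / DC3
```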
This concept could work with 3 PBS (DC1, DC2, DC3) as well, and with 2 offsite backups like you suggested, but at considerably higher cost.
With 100 TB each, I'd roughly calculate about 20-30k€ for each PBS, depending on the margin added by your distributor.