Speed up Backup Server by adding a ZFS special device?

adoII

I like Proxmox Backup Server, but the budget only allows magnetic disks and raidz2.
So the verify and garbage collection procedures in particular take a hell of a lot of time.
For 8 TB of backup space, verify currently needs 2 days and garbage collection needs 1 day.
I think with verify there is not much I can do: either do it or not.
But garbage collection should be much faster with a ZFS special device on SSD, right?
Now my questions:
Can I add a mirrored special device to a ZFS raidz2 pool? I read somewhere that the special device must have the same redundancy level as the pool it is on. Would this mean I need a special device that is itself raidz2?
If I add a special device, how do I populate the ZFS metadata onto the special device? Will I have to delete and recreate all the backups, or will the metadata magically move to the special device?
 
I like Proxmox Backup Server, but the budget only allows magnetic disks and raidz2.
So the verify and garbage collection procedures in particular take a hell of a lot of time.
For 8 TB of backup space, verify currently needs 2 days and garbage collection needs 1 day.
I think with verify there is not much I can do: either do it or not.

Yes, verify will not benefit much from a special device; the bulk of the time is spent reading the data (not the metadata) from disk and checksumming it (CPU-intensive).

But garbage collection should be much faster with a ZFS special device on SSD, right?
It will definitely benefit from it, since GC is almost 100% metadata access and modification (except for deleting the garbage chunks at the end ;))
Now my questions:
Can I add a mirrored special device to a ZFS raidz2 pool? I read somewhere that the special device must have the same redundancy level as the pool it is on. Would this mean I need a special device that is itself raidz2?

A mirrored special vdev should work; it might need -f (you can easily test this with a VM).
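For reference, a minimal sketch of what adding such a mirror could look like; the pool and device names are made up, and -f is only needed because the mirror's redundancy type differs from the raidz2 data vdevs:

zpool add -f tank special mirror /dev/disk/by-id/nvme-ssd-A /dev/disk/by-id/nvme-ssd-B
zpool status tank   # the special mirror should show up as its own vdev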
If I add a special device, how do I populate the ZFS metadata onto the special device? Will I have to delete and recreate all the backups, or will the metadata magically move to the special device?
Only newly written metadata is stored on the special vdev. There is no need to delete and recreate the backups: if you have the space to temporarily duplicate the data, you can zfs send/recv into a new dataset, then "disable" the datastore, do another incremental send/recv, and switch the datasets by renaming. The last two steps should preferably be done when no backups or other tasks are running or scheduled.
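A rough sketch of that send/recv dance, assuming a pool called tank with the datastore on tank/datastore (all names are placeholders):

zfs snapshot tank/datastore@move1
zfs send tank/datastore@move1 | zfs receive tank/datastore-new
# disable the datastore / pause scheduled tasks in PBS, then send only the delta:
zfs snapshot tank/datastore@move2
zfs send -i @move1 tank/datastore@move2 | zfs receive -F tank/datastore-new
zfs rename tank/datastore tank/datastore-old
zfs rename tank/datastore-new tank/datastore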
 
Two important things you should keep in mind though:
- you need fast, high-endurance SSDs (or preferably NVMe devices); no consumer/prosumer hardware
- if your special vdev mirror is toast, the whole pool is gone
 
Thanks for your help Fabian, the idea of duplicating the ZFS data and renaming it later seems like a really good approach.
BTW: we learned the lesson to always use datacenter SSDs for datacenter purposes many years ago.
 
Hi @adoII and @fabian

A special device is very useful, but at the same time it is dangerous, as @fabian wrote, even with DC SSDs!

IMHO there is a high probability that both SSDs will fail at the same time, because both see the same load.

So I would try to minimize the chance of that happening:

- use a different supplier for each SSD
- before using them as special devices, I would load them with very different workloads for 2-3 weeks (I would use one of them as a ZFS L2ARC and the other as a SLOG; see the sketch below)
- run a long smartctl self-test on both

This way, after 2-3 weeks you will have the following:

- both SSDs are tested
- each SSD will have a different number of written sectors, so the probability of both failing at the same time will be lower


Then you can add these SSDs as special devices.
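A sketch of those steps with made-up pool and device names; the L2ARC and SLOG are only temporary here and get removed again before the SSDs are re-added as the special mirror:

zpool add tank cache /dev/disk/by-id/ssd-A    # one SSD as L2ARC for a few weeks
zpool add tank log /dev/disk/by-id/ssd-B      # the other one as SLOG
smartctl -t long /dev/disk/by-id/ssd-A        # long self-test on both drives
smartctl -t long /dev/disk/by-id/ssd-B
# afterwards, detach them again:
zpool remove tank /dev/disk/by-id/ssd-A
zpool remove tank /dev/disk/by-id/ssd-B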

Good luck / Bafta
 
It will definitely benefit from it, since GC is almost 100% metadata access and modification

Hi @fabian,

I know this is not the right place to make suggestions, so please be kind (lack of time), and maybe I am wrong:

- maybe before any GC task it would be useful to do something (in code, or as an option in the web interface) like ls -lR /path/datastore, so all the metadata is in the PBS cache?
- as I remember (I do not have a PBS in front of me right now), the datastore is a huge tree with a folder for each chunk checksum, so for each backup PBS will need to read this tree at least:

- why is it not something like ipset?
- maybe it would be better to have this huge folder on a separate device only (I have a ZFS dataset in mind) where I can set metadata-only caching


Good luck / Bafta!
 
Hi @fabian,

I know this is not the right place to make suggestions, so please be kind (lack of time), and maybe I am wrong:

- maybe before any GC task it would be useful to do something (in code, or as an option in the web interface) like ls -lR /path/datastore, so all the metadata is in the PBS cache?

That would not help much, but you can try it yourself by running that ls before starting GC.
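If you want to try that warm-up manually, something along these lines (the datastore path is just an example) should pull the directory metadata into the cache:

ls -lR /mnt/datastore/backup > /dev/null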

- as I remember (I do not have a PBS in front of me right now), the datastore is a huge tree with a folder for each chunk checksum, so for each backup PBS will need to read this tree at least:

- why is it not something like ipset?

What do you mean with "like ipset"?

- maybe it would be better to have this huge folder on a separate device only (I have a ZFS dataset in mind) where I can set metadata-only caching
That huge folder IS the whole chunk store containing all your backup DATA. You definitely don't want metadata-only caching there, else repeated hits for the same chunks won't be cached properly.
 
Can I add a mirrored special device to a ZFS raidz2 pool? I read somewhere that the special device must have the same redundancy level as the pool it is on. Would this mean I need a special device that is itself raidz2?
If you have raidz2 you have protection against two failed disks. Adding the special device as a mirrored vdev is fine, but to keep the same level of protection you could use a three-way mirror.
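A sketch of that variant with placeholder names; -f is still needed because a three-way mirror is a different layout than the raidz2 data vdevs:

zpool add -f tank special mirror /dev/disk/by-id/ssd-A /dev/disk/by-id/ssd-B /dev/disk/by-id/ssd-C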
 
Wow, I added a ZFS special device on 2 mirrored datacenter SSDs and populated it by duplicating the data into a new ZFS dataset.
Now my garbage collection on 8 TB of backup data on a 9-disk spinning raidz2 only took 20 minutes instead of the 18 hours it needed before.
The special device now holds 30 GB of metadata, and since not much is written to it, the SSDs should live nearly forever.
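For anyone curious how full their special mirror is, zpool can break the allocation down per vdev (the pool name is an example):

zpool list -v tank      # shows ALLOC/FREE per vdev, including the special mirror
zpool iostat -v tank 5  # shows how much I/O the special vdev actually sees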
 
Looks like I have to do this too, because I'm facing the same issue: >24 hours of garbage collection for 9 TB of backup data on a 12-disk raidz3 pool.

Now on the hunt for SSDs/NVMe to do the job.
 
Does anybody know how to properly calculate the special device storage requirements?
E.g., how many TB of metadata-only storage (no small files stored there, metadata only) would the special devices need for a 100 TB raidz2 with 10 disks, with log and read cache already on NVMe?
Is there a formula one can use here?
 
It depends on your recordsize/volblocksize, and the former is just an upper limit, not the actual value used ;) You can do a scan of an existing pool with 'zdb' to get more exact numbers, but as always, planning with an extra buffer is sensible anyway. E.g., I have a pool with 8 TB of data and 20 GB of metadata on a pair of special vdevs (out of 15 TB / 128 GB capacity), but that is with lots of bigger files and a higher-than-default recordsize (so the data-to-metadata ratio is higher than with other workloads). If you ever run out of space on your special vdev, it will transparently spill over into your regular vdevs (performance will suffer, but the data will still be there and accessible).
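The zdb scan mentioned above can be done with something along these lines (the pool name is a placeholder, and the scan can take quite a while on a big pool); its block statistics include a per-type breakdown from which the metadata share can be read off:

zdb -Lbbbs tank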
 
Hi, is there any reason why Proxmox does not recommend setting primarycache=metadata (instead of all) on a PBS with ZFS?

Filling the ARC with (backup) data you will hopefully never read again is pretty much useless, and it reduces the cache available for metadata.
By setting primarycache=metadata, I've managed to increase backup throughput from 60 to 180 MiB/s and to reduce garbage collection from hours to seconds.
The problem with cache misses on metadata is that the missing data has to be fetched synchronously from disk (suspending write operations), which in the case of slow HDDs may completely throttle the writes.
Of course you can use special devices (SSD), but that is IMHO a waste of money.
Give primarycache=metadata a try.

Our PBS is a 6-core HT Intel (seen as 12 CPUs), 16 GB RAM, 6x4 TB raidz2 HDDs with 10 Gb NICs. No L2ARC, no SLOG, no special devices.
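For anyone who wants to try this, a minimal sketch (the dataset name is a placeholder; the property is inherited by child datasets):

zfs set primarycache=metadata tank/datastore
zfs get primarycache tank/datastore
# optional sanity check of ARC hit/miss behaviour before and after:
grep -E '^(hits|misses)' /proc/spl/kstat/zfs/arcstats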
 
Thanks for the hint!

How did you configure the cache exactly? Globally or per pool?
Until now I used a good amount of RAM, L2ARC and SLOG to speed things up when configuring a PBS. Special devices are quite expensive depending on the amount of metadata required (e.g. 30 large HDDs + 2 fast DC NVMes).
 
It would be more useful to use the SSDs as special devices storing only the metadata. An ARC with primarycache=metadata will only help with reads, while special devices will also help with writes (even async writes, which a SLOG won't). The only benefit of ARC or L2ARC is that you won't lose any data when the SSD dies.

And enterprise SSDs are still cheaper than the same amount of RAM ;)

And Proxmox products come with the default ZFS values; there is not a single ZFS optimization out of the box. You are supposed to optimize it yourself, because ZFS isn't meant to work optimally out of the box. You always have to change the configuration before using it.
 
How did you configure the cache exactly? Globally or per pool?
zfs set primarycache=metadata <datastore pool name>
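To verify the setting later, or to undo it and fall back to the default (all), something like this should do (same placeholder as above):

zfs get primarycache <datastore pool name>
zfs inherit primarycache <datastore pool name>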
 
Using primarycache=metadata, operations like /usr/bin/proxmox-backup-client snapshots --repository 'repo' --output-format text take much longer, in the 8x range (from 10 seconds to around 80). I suppose such a command needs to read something from the repo files themselves, which benefits from primarycache=all.

We use that command to monitor backup existence and freshness; maybe there is an alternative that I have overlooked.

I'm still trying to find out how my metadata-only-cached repos perform for other operations like verify or even restore.
 
Yes, that one (and many others) will read more than just metadata (well, they'll read metadata in PBS terms, but that is not 100% metadata on the filesystem level).
 
