Speed up Backup Server by adding a ZFS special device?

adoII

I like Proxmox Backup Server, but the budget only allows magnetic disks and raidz2.
So especially the verify and garbage collection procedures need a hell of a lot of time.
For 8 TB of backup space, verify currently needs 2 days and garbage collection needs 1 day.
I think with verify there is not much I can do: either do it or not.
But garbage collection should be much faster with a ZFS special device on SSD, right?
Now my questions:
Can I add a mirrored special device to a ZFS raidz2 pool? I read somewhere that the special device must have the same redundancy level as the pool it is on. So this would mean I would need a special device that is itself a raidz2?
If I add a special device, how do I populate the ZFS metadata to the special device? Will I have to delete and recreate all the backups, or will the metadata magically move to the special device?
 
I like Proxmox Backup Server, but the budget only allows magnetic disks and raidz2.
So especially the verify and garbage collection procedures need a hell of a lot of time.
For 8 TB of backup space, verify currently needs 2 days and garbage collection needs 1 day.
I think with verify there is not much I can do: either do it or not.

yes, verify will not benefit much from a special device; the bulk of the time is spent reading data (not metadata) from disk and checksumming it (CPU intensive).

But garbage collection should be much faster with a ZFS special device on SSD, right?
it will definitely benefit from it, since GC is almost 100% metadata access and modification (except for deleting the garbage chunks at the end ;))
Now my questions:
Can I add a mirrored special device to a ZFS raidz2 pool? I read somewhere that the special device must have the same redundancy level as the pool it is on. So this would mean I would need a special device that is itself a raidz2?

a mirrored special vdev should work; it might need -f (you can easily test this with a VM)
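For reference, a rough sketch of what that could look like on the command line (the pool name "tank" and the device paths are placeholders, not taken from this thread):

# add a mirrored special vdev to an existing raidz2 pool;
# -f overrides the warning about the mismatched replication level (mirror vs. raidz2)
zpool add -f tank special mirror /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B
# confirm the special vdev shows up in the pool layout
zpool status tank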
If I add a special device, how do I populate the ZFS metadata to the special device? Will I have to delete and recreate all the backups, or will the metadata magically move to the special device?
only newly written metadata nodes are stored on the special vdev. There is no need to delete and recreate the backups: if you have the space to temporarily duplicate the data, you can zfs send/recv into a new dataset, then "disable" the datastore, do another incremental send/recv, and switch the datasets by renaming. The last two steps should preferably be done when no backups or other tasks are running or scheduled.
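A minimal sketch of that migration, assuming the datastore lives on a dataset called tank/pbs (all names here are illustrative, not from this thread):

# 1) full copy into a new dataset; the newly written metadata lands on the special vdev
zfs snapshot tank/pbs@migrate1
zfs send tank/pbs@migrate1 | zfs receive tank/pbs-new
# 2) "disable" the datastore, then send only the changes since the first snapshot
zfs snapshot tank/pbs@migrate2
zfs send -i tank/pbs@migrate1 tank/pbs@migrate2 | zfs receive -F tank/pbs-new
# 3) swap the datasets by renaming
zfs rename tank/pbs tank/pbs-old
zfs rename tank/pbs-new tank/pbs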
 
two important things you should keep in mind though:
- you need fast, high-endurance SSDs (or preferably, NVMe devices); no consumer/prosumer hardware
- if your special vdev mirror is toast, the whole pool is gone
 
Thanks for your help, Fabian. The idea to duplicate the ZFS data and rename afterwards seems like a really good approach.
BTW: we learned the lesson to always use datacenter SSDs for datacenter purposes many years ago.
 
Hi @adoII and @fabian

A special device is very useful, but at the same time it is dangerous, as @fabian wrote, even with DC SSDs!

IMHO, there is a high probability that both SSDs will fail at the same time, because both see the same load.

So I would try to minimize the chance of this event:

- use a different supplier for each SSD
- before using them as special devices, load them with very different workloads for 2-3 weeks (I would use one of them as a ZFS L2ARC and the second as a SLOG); a command sketch follows below
- run a smartctl long test

In that case, after 2-3 weeks you will get this:

- both SSDs are tested
- each SSD will have a different number of written sectors, so the probability of both failing at the same time will be lower


Then you can add these SSDs as special devices.
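A rough sketch of that burn-in idea, assuming the pool is called tank and using placeholder device paths (both are assumptions, not from this thread):

# phase 1: give each SSD a different workload for 2-3 weeks
zpool add tank cache /dev/disk/by-id/ata-SSD_A     # first SSD as L2ARC
zpool add tank log /dev/disk/by-id/ata-SSD_B       # second SSD as SLOG
smartctl -t long /dev/disk/by-id/ata-SSD_A         # long self-test on both drives
smartctl -t long /dev/disk/by-id/ata-SSD_B
# phase 2: remove them again and re-add both as the mirrored special vdev
zpool remove tank /dev/disk/by-id/ata-SSD_A
zpool remove tank /dev/disk/by-id/ata-SSD_B
zpool add -f tank special mirror /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B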

Good luck / Bafta
 
it will definitely benefit from it, since GC is almost 100% metadata access and modification

Hi @fabian ,

I know this is not the right place to make suggestions, so please be kind ... (lack of time), and maybe I am wrong:

- maybe before any GC task it would be useful to do something (in code, or as an option in the web interface) like ls -lR /path/datastore, so all the metadata will be in the PBS cache?
- as I remember (I do not have a PBS in front of me right now ...), the datastore contains a huge tree with folders for each chunk checksum, so for each backup PBS will need to read this tree at least:

- why is it not something like ipset?
- maybe it would be better to have this huge folder on a separate device (I have a ZFS dataset in mind) where I can set metadata-only caching


Good luck / Bafta !
 
Hi @fabian ,

I know this is not the right place to make suggestions, so please be kind ... (lack of time), and maybe I am wrong:

- maybe before any GC task it would be useful to do something (in code, or as an option in the web interface) like ls -lR /path/datastore, so all the metadata will be in the PBS cache?

that would not help much, but you can try it yourself by doing that ls before starting GC
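For anyone who wants to try that experiment, something along these lines should do (the datastore path is a placeholder):

# walk the whole datastore once so its filesystem metadata ends up in the ARC,
# then start GC and compare the runtime
ls -lR /path/to/datastore > /dev/null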

- as I remember (I do not have a PBS in front of me right now ...), the datastore contains a huge tree with folders for each chunk checksum, so for each backup PBS will need to read this tree at least:

- why is it not something like ipset?

what do you mean by "like ipset"?

- maybe it would be better to have this huge folder on a separate device (I have a ZFS dataset in mind) where I can set metadata-only caching
that huge folder IS the whole chunk store containing all your backup DATA. You definitely don't want to do metadata-only caching there, else repeated hits for the same chunks won't be cached properly.
 
I like Proxmox Backup Server, but the budget only allows magnetic disks and raidz2.
So especially the verify and garbage collection procedures need a hell of a lot of time.
For 8 TB of backup space, verify currently needs 2 days and garbage collection needs 1 day.
I think with verify there is not much I can do: either do it or not.
But garbage collection should be much faster with a ZFS special device on SSD, right?
Now my questions:
Can I add a mirrored special device to a ZFS raidz2 pool? I read somewhere that the special device must have the same redundancy level as the pool it is on. So this would mean I would need a special device that is itself a raidz2?
If I add a special device, how do I populate the ZFS metadata to the special device? Will I have to delete and recreate all the backups, or will the metadata magically move to the special device?
if you have a raidz2, you have protection against two failed disks. A special device as a mirrored vdev is cool, but to keep the same level of protection you could use a three-way mirror.
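A sketch of what such a three-way special mirror could look like (pool name and device paths are placeholders):

# a three-way mirror survives two device failures, matching the raidz2 data vdevs
zpool add -f tank special mirror /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B /dev/disk/by-id/ata-SSD_C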
 
Wow, I added a ZFS special device on 2 mirrored datacenter SSDs and populated it by duplicating the data into a new ZFS dataset.
Now my garbage collection on 8 TB of backup data on a 9-disk spinning raidz2 only took 20 minutes instead of 18 hours before.
The special device with the metadata has 30 GB in use now, and since not much is written to it, the SSDs should live nearly forever ...
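In case anyone wants to check the same numbers on their own pool, the per-vdev allocation can be listed like this (pool name is a placeholder):

# -v lists every vdev separately, so the ALLOC column of the special mirror
# shows how much metadata actually lives on the SSDs
zpool list -v tank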
 
Looks like I have to do this too, because I'm facing the same issue: >24 hours of garbage collection for 9 TB of backup data on a 12-disk raidz3 pool.

Now on the hunt for SSDs/NVMe to do the job.
 
Does anybody know how to properly calculate the special device storage requirements?
E.g., a 100 TB raidz2 with 10 disks, with log and read cache already on NVMe: how many TB of storage would the metadata alone (no small files stored there, metadata only) require on the special devices?
Is there a formula one can use here?
 
it depends on your recordsize/volblocksize, and the former is just an upper limit and not the actual value used ;) You can do a scan of an existing pool with 'zdb' to get more exact numbers, but as always, planning with an extra buffer is sensible anyway. E.g., I have a pool with 8 TB of data and 20 G of metadata on a pair of special vdevs (out of 15 T / 128 G capacity), but that is with lots of bigger files and a higher than default recordsize (so the data/metadata ratio is higher than with other workloads). If you ever run out of space on your special vdev, it will transparently spill over into your regular vdevs (so performance will suffer, but the data will still be there and accessible).
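A sketch of the zdb scan mentioned above (pool name is a placeholder, and the exact flags may vary between OpenZFS versions):

# -bbb prints detailed per-type block statistics, -L skips leak checking to speed things up;
# the ASIZE of the metadata categories gives a rough lower bound for the special vdev size
zdb -Lbbb tank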
 
Hi, is there any reason why Proxmox does not recommend setting primarycache=metadata (instead of all) on a PBS with ZFS?

Filling the ARC with (backup) data you will hopefully never read again is pretty much useless, and it reduces the cache available for metadata.
By setting primarycache=metadata, I've managed to increase backup throughput from 60 to 180 MiB/s and reduce garbage collection from hours to seconds.
The problem with cache misses on metadata is that the missing data has to be fetched synchronously from disk (hence suspending write operations), which in the case of slow HDDs may completely throttle the writes.
Of course you can use special devices (SSDs), but this is IMHO a waste of money.
Give primarycache=metadata a try.

Our PBS is a 6-core HT Intel (seen as 12 CPUs), 16 GB RAM, 6x4TB raidz2 HDDs, with 10 Gb NICs. No L2ARC, no SLOG, no special devices.
 
Thanks for the hint!

How did you configure the cache exactly? Globally or per pool?
Until now I used a good amount of RAM, L2ARC and SLOG to speed things up when configuring a PBS. Special devices are quite expensive depending on the amount of metadata required (e.g. 30 large HDDs + 2 fast DC NVMes).
 
It would be more useful to use the SSDs as special devices storing only metadata. The thing is that an ARC with primarycache=metadata will only help with reads, while special devices will also help with writes (even async writes, which a SLOG won't). The only benefit of ARC or L2ARC is that you won't lose any data when the SSD dies.

And enterprise SSDs are still cheaper than more RAM of the same size ;)

And Proxmox products come with default ZFS values. There is not a single ZFS optimization out of the box; you are supposed to optimize it yourself, and ZFS isn't meant to work optimally out of the box.
You always have to adjust the configuration before using it.
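To make that a bit more concrete, these are examples of the kind of per-pool adjustments typically meant here (illustrative suggestions only, not an official recommendation; the pool/dataset names are placeholders and every value should be checked against your own workload):

zfs set atime=off tank             # skip access-time updates on every read
zfs set compression=lz4 tank       # cheap compression, usually a net win
zfs set recordsize=1M tank/pbs     # larger records for the mostly large PBS chunk files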
 
Thanks for the hint!

How did you configure the cache exactly? Globally or per pool?
Until now I used a good amount of RAM, L2ARC and SLOG to speed things up when configuring a PBS. Special devices are quite expensive depending on the amount of metadata required (e.g. 30 large HDDs + 2 fast DC NVMes).
zfs set primarycache=metadata <datastore pool name>
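The property is inherited, so setting it on the pool's top-level dataset covers everything below it, and it can also be limited to just the datastore dataset. A small sketch (dataset names are placeholders):

zfs set primarycache=metadata tank                 # whole pool, inherited by all children
zfs set primarycache=metadata tank/pbs-datastore   # or only the dataset holding the datastore
zfs get -r primarycache tank                       # verify what is in effect and where it comes from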
 
Using primarycache=metadata, operations like /usr/bin/proxmox-backup-client snapshots --repository 'repo' --output-format text take much longer, roughly 8x (from 10 seconds to around 80). I suppose such a command needs to read something from the repo files themselves, which benefits from having primarycache=all.

We use that command to monitor backup existence and freshness; maybe there is an alternative that I have overlooked.
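One thing that might help is limiting the query to a single backup group and using the JSON output, so less has to be read per check. A hedged sketch (the group name, repository, and the jq dependency are assumptions, not from this thread):

# newest backup-time (unix epoch) of one backup group, assuming jq is installed
proxmox-backup-client snapshots vm/100 --repository 'repo' --output-format json \
  | jq 'map(."backup-time") | max'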

I'm still trying to find out how my repos perform with metadata-only caching for other operations like verify or even restore.
 
yes, that one (and many others) will read more than just metadata (well, they'll read metadata in PBS terms, but that is not 100% metadata at the filesystem level).
 
