Garbage collection speed

flai_hv

New Member
Nov 6, 2022
Hi devs and users,

Personally, I use Proxmox in multiple domains (private and semi-professional).
For this reason, I also have two backup servers running to handle the backups. All servers do daily backups, some even more frequently, in total ~5 TB with ~200 GB of changes per week.
One server runs as a VM on a ZFS filesystem with data on HDDs and metadata cached on SSD. The other is "a bit weird": a cloud instance using remotely mounted storage, for financial reasons.
Both of them run really well, with decent speed for backup creation and restore, even the remote-storage one.

But one thing came up: the speed of garbage collection.
On the remote storage server, one run of garbage collection takes almost 2 days...
On the local storage server, one run takes 10 hours.

I know why this takes so long. PBS goes through all index files, "touches" every chunk referenced by them, and later deletes all untouched chunks. This touching (even if you use features like relatime) involves the operating system and the filesystem.
The required calls into the OS scale with the amount of data (i.e. the number of chunk references) and with the number of index files.
That means if you want to keep higher-frequency backups, this quickly scales into billions of touches per garbage collection cycle - for example, a few thousand index files each referencing a few hundred thousand chunks already adds up to around a billion touch calls per run.

So, for all the users out there: What is your garbage collection duration? Anybody else out there having an "issue" like that?

But I didn't come here without an idea to change this.
I read (actually in the code) that there were ideas to do this whole process in memory. But that comes with a decent risk and memory footprint for large deployments.
The idea I have been trying on a replica of my backup server is deduplication of the touch requests.
There is no benefit in touching a chunk twice within one garbage collection run. And the benefit is that this deduplication happens within PBS, with no involvement of the OS. That can make this process magnitudes quicker, especially for non-enterprise SSD deployments. Even for those, the internal lookup seems to be quicker than the touch, even on the NVMe SSD I tried it with.
But that means a list of chunks needs to be kept in memory during garbage collection. In numbers: 32 MB per 1 million chunks (one 32-byte digest per chunk), and 1 million chunks correspond to roughly 4 TB of VM data with default settings. It makes sense to truncate this set once it reaches a certain limit; that doesn't undermine the basic functionality, it just slows things down a bit. But maybe that is a fair trade-off if you have a single VM referencing 100 TB+...
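To illustrate the idea, here is a rough sketch of the concept only - this is not how PBS implements GC, and it assumes the referenced chunk digests have already been dumped from the index files into plain text files somehow:
Code:
# Deduplicate the chunk references in memory first, then issue exactly one
# atime update ("touch") per unique chunk instead of one per reference.
# A PBS chunk store keeps chunks as .chunks/<first 4 hex chars>/<digest>.
DATASTORE=/path/to/datastore        # placeholder path
cat /tmp/digest-dumps/*.txt | sort -u | while read -r digest; do
    touch -a "$DATASTORE/.chunks/${digest:0:4}/${digest}"
done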

For the devs: what about the memory footprint? The Proxmox guides recommend 1 GB of RAM per 1 TB of storage, so orders of magnitude more than this set would need.
Do you think such a solution has a chance?


BR

Florian
 
What is your garbage collection duration? Anybody else out there having an "issue" like that?

Dell gen12 box, mostly SSD with an HDD vdev for capacity.
3 TB of data, 57 Groups, 281 Snapshots

2024-08-25T18:00:42-04:00: starting garbage collection on store
2024-08-25T18:00:42-04:00: Start GC phase1 (mark used chunks)
2024-08-25T18:00:48-04:00: marked 1% (6 of 555 index files)
<snip>
2024-08-25T18:05:15-04:00: processed 99% (2197834 chunks)
2024-08-25T18:05:17-04:00: Removed garbage: 71.507 GiB
2024-08-25T18:05:17-04:00: Removed chunks: 73074
2024-08-25T18:05:17-04:00: Pending removals: 377.241 GiB (in 355233 chunks)
2024-08-25T18:05:17-04:00: Original data usage: 64.99 TiB
2024-08-25T18:05:17-04:00: On-Disk usage: 2.351 TiB (3.62%)
2024-08-25T18:05:17-04:00: On-Disk chunks: 1791641
2024-08-25T18:05:17-04:00: Deduplication factor: 27.65
2024-08-25T18:05:17-04:00: Average chunk size: 1.376 MiB
2024-08-25T18:05:17-04:00: TASK OK

Five minutes.
 
That first one was the primary, baremetal, made to perform.

This might be more fair.
Here's my secondary. It runs as a TrueNAS guest virtual machine, so it's right on top of the storage.
3 TB, 57 Groups, 299 Snapshots

2024-08-26T02:30:00-04:00: starting garbage collection on store
2024-08-26T02:30:00-04:00: task triggered by schedule '2,22:30'
2024-08-26T02:30:00-04:00: Start GC phase1 (mark used chunks)
2024-08-26T02:30:11-04:00: marked 1% (6 of 555 index files)
<snip>
2024-08-26T03:12:53-04:00: processed 99% (2154742 chunks)
2024-08-26T03:13:12-04:00: Removed garbage: 0 B
2024-08-26T03:13:12-04:00: Removed chunks: 0
2024-08-26T03:13:12-04:00: Pending removals: 384.927 GiB (in 355062 chunks)
2024-08-26T03:13:12-04:00: Original data usage: 64.99 TiB
2024-08-26T03:13:12-04:00: On-Disk usage: 2.393 TiB (3.68%)
2024-08-26T03:13:12-04:00: On-Disk chunks: 1821382
2024-08-26T03:13:12-04:00: Deduplication factor: 27.16
2024-08-26T03:13:12-04:00: Average chunk size: 1.377 MiB
2024-08-26T03:13:12-04:00: TASK OK

43 minutes.

BTW, this is not a typical day; I'm moving data around.
The secondary was just deployed, and I'm purging data from the primary and shuffling it over to the secondary.
 
Just another random datapoint: this is an HP MicroServer Gen10, turned on once a week, with rotating rust in a single vdev = 4 drives in RaidZ2 = worst case possible. No Special Device involved. PBS is co-installed on PVE in an LXC, using a standard mount point for storage.

3.2 TB; 167 Groups, 2842 Snapshots

Code:
2024-08-25T07:06:00+02:00: starting garbage collection on store pbsc
2024-08-25T07:06:00+02:00: task triggered by schedule 'daily'
2024-08-25T07:06:00+02:00: Start GC phase1 (mark used chunks)
2024-08-25T07:07:43+02:00: marked 1% (41 of 4005 index files)
...
2024-08-25T07:36:48+02:00: marked 100% (4005 of 4005 index files)
2024-08-25T07:36:48+02:00: Start GC phase2 (sweep unused chunks)
2024-08-25T07:36:48+02:00: processed 1% (23338 chunks)
...
2024-08-25T07:36:59+02:00: processed 99% (2293891 chunks)
2024-08-25T07:36:59+02:00: Removed garbage: 0 B
2024-08-25T07:36:59+02:00: Removed chunks: 0
2024-08-25T07:36:59+02:00: Original data usage: 74.813 TiB
2024-08-25T07:36:59+02:00: On-Disk usage: 2.738 TiB (3.66%)
2024-08-25T07:36:59+02:00: On-Disk chunks: 2317035
2024-08-25T07:36:59+02:00: Deduplication factor: 27.33
2024-08-25T07:36:59+02:00: Average chunk size: 1.239 MiB
2024-08-25T07:36:59+02:00: TASK OK

30 minutes; I am surprised about zero chunks removed... obviously there was no backup in between...

The week before:
Code:
2024-08-18T09:53:00+02:00: starting garbage collection on store pbsc
2024-08-18T09:53:00+02:00: task triggered by schedule 'daily'
2024-08-18T09:53:00+02:00: Start GC phase1 (mark used chunks)
2024-08-18T09:53:47+02:00: marked 1% (42 of 4150 index files)
...
2024-08-18T10:21:36+02:00: marked 100% (4150 of 4150 index files)
2024-08-18T10:21:36+02:00: Start GC phase2 (sweep unused chunks)
2024-08-18T10:21:49+02:00: processed 1% (23701 chunks)
...
2024-08-18T11:05:09+02:00: processed 99% (2409651 chunks)
2024-08-18T11:05:36+02:00: Removed garbage: 109.197 GiB
2024-08-18T11:05:36+02:00: Removed chunks: 116999
2024-08-18T11:05:36+02:00: Original data usage: 74.813 TiB
2024-08-18T11:05:36+02:00: On-Disk usage: 2.738 TiB (3.66%)
2024-08-18T11:05:36+02:00: On-Disk chunks: 2317035
2024-08-18T11:05:36+02:00: Deduplication factor: 27.33
2024-08-18T11:05:36+02:00: Average chunk size: 1.239 MiB
2024-08-18T11:05:36+02:00: TASK OK

Okay, the same ~30 minutes for phase1 plus 44 minutes for the actual removal.
 
With rotating rust in a single vdev = 4 drives in Raidz2 = worst case possible. No Special Device involved.
...
~30 minutes for phase1

Just to complement my experience from above: some weeks ago I added a Special Device consisting of three mirrored USB thingies. This is absolutely NOT recommended, but hey - it's a Homelab and I have multiple PBS instances here :-)

Last night it did its weekly job as usual, so it was "warmed up" when I manually started an additional GC:
Code:
2025-02-23T09:04:11+01:00: starting garbage collection on store pbsc
2025-02-23T09:04:11+01:00: Start GC phase1 (mark used chunks)
2025-02-23T09:04:11+01:00: marked 1% (31 of 3063 index files)
...
2025-02-23T09:05:42+01:00: marked 100% (3063 of 3063 index files)
2025-02-23T09:05:42+01:00: Start GC phase2 (sweep unused chunks)
2025-02-23T09:05:43+01:00: processed 1% (20921 chunks)
...
2025-02-23T09:06:27+01:00: processed 99% (2055914 chunks)
2025-02-23T09:06:28+01:00: Removed garbage: 0 B
2025-02-23T09:06:28+01:00: Removed chunks: 0
2025-02-23T09:06:28+01:00: Original data usage: 57.061 TiB
2025-02-23T09:06:28+01:00: On-Disk usage: 2.511 TiB (4.40%)
2025-02-23T09:06:28+01:00: On-Disk chunks: 2076568
2025-02-23T09:06:28+01:00: Deduplication factor: 22.72
2025-02-23T09:06:28+01:00: Average chunk size: 1.268 MiB
2025-02-23T09:06:28+01:00: TASK OK

That run was in a "warm" state and it took only 2:17!

Too fast? I don't know without another test: I rebooted and ran the same thing again, so the ARC is basically cold now:
Code:
2025-02-23T09:14:23+01:00: starting garbage collection on store pbsc
...
2025-02-23T09:18:35+01:00: Start GC phase2 (sweep unused chunks)
...
2025-02-23T09:18:46+01:00: TASK OK

So 4:23 this time, which makes sense! (Later I noticed that another container had started in parallel, so this measured duration is probably too high.) I ran it again immediately, a third time:
Code:
2025-02-23T09:20:31+01:00: starting garbage collection on store pbsc
...
2025-02-23T09:22:05+01:00: TASK OK
Third run: 1:34

It looks like in my case this not-recommended construct has brought GC duration from ~30 minutes down to two to four minutes.

I am fine with this :-)
 
@UdoB what's so unrecommended about that?
Very similar config here.
Spinners in a raidz2. Mirror special vdev.

Sub-2 minute GC.


[screenshot attached]

BTW, let's trade tuning notes!

Really, I'm curious where you drew the line between recordsize and special_small_blocks for PBS.
I ran the histogram prior to adding the SSDs, but making a judgement from it seems something of a dark art.
My workload is mixed win/lin. Lots of databases.

NAME   PROPERTY              VALUE   SOURCE
rpool  recordsize            1M      local
rpool  special_small_blocks  512K    local
rpool  compression           on      local
rpool  atime                 on      local
rpool  relatime              on      local
rpool  redundant_metadata    all     default
rpool  encryption            off     default
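For reference, replicating such a split comes down to two dataset properties - a rough sketch with a placeholder dataset name; any block at or below special_small_blocks lands on the special vdev, larger records stay on the spinners:
Code:
# placeholder dataset name - adjust to your own pool/dataset
zfs set recordsize=1M rpool/pbs-datastore
zfs set special_small_blocks=512K rpool/pbs-datastore
zfs get recordsize,special_small_blocks rpool/pbs-datastore   # verify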
 
@UdoB what's so unrecommended about that?
a) the pure fact that my PBS datastore is based on rotating rust - it will slow down restores with or w/o Special Device
b) using USB on a critical device
c) in my case: cheap devices

I've had trouble again and again evaluating and actually using USB for datastores in the long run - usually with consumer-grade devices. My personal impression: it works. Until it doesn't. (Remember: this is my Homelab - I would not tolerate this approach for my dayjob.)

In this specific case it is a triple mirror, not to match the redundancy level of the RaidZ2 (which would be the legitimate reason) but because the first iteration with two devices as a mirror was not stable.

BTW, let's trade tuning notes!
Sure:

My initial intention was to catch metadata only. The "small blocks" aspect was a goodie I did not plan for. My random spare devices gave me 240 GB, and that seemed large enough to also put some small blocks onto it, so I went for this:
Code:
~# zfs get  atime,encryption,relatime,redundant_metadata,recordsize,special_small_blocks  rpool/data/subvol-2004-disk-1
NAME                           PROPERTY              VALUE                 SOURCE
rpool/data/subvol-2004-disk-1  atime                 on                    inherited from rpool/data
rpool/data/subvol-2004-disk-1  encryption            off                   default
rpool/data/subvol-2004-disk-1  relatime              on                    inherited from rpool
rpool/data/subvol-2004-disk-1  redundant_metadata    all                   default
rpool/data/subvol-2004-disk-1  recordsize            256K                  inherited from rpool
rpool/data/subvol-2004-disk-1  special_small_blocks  128K                  inherited from rpool

NAME                                                SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
special                                                -      -      -        -         -      -      -      -         -
  mirror-1                                          232G   127G   105G        -         -    77%  54.6%      -    ONLINE

With 55% used I hit the sweet spot I had hoped for :-)

For aiming I used a one-liner from some post here in the forum; I did not put the URL in my notes, only the command itself:
Code:
find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'

  1k:  23554
  2k:  20848
  4k:  26084
  8k:  28332
 16k:  35175
 32k:  61425
 64k:  98496
128k: 149448
256k: 278598
Edit for clarification: that data is from November 2024, before adding the SD ;-)
Multiply each bucket's file count by its block size and add up those lines to get the total bytes stored in those ominous "small blocks".
Code:
$ echo "23554*2^10 + 20848*2^11 + 26084*2^12 + 28332*2^13 + 35175*2^14 + 61425*2^15 + 98496*2^16 + 149448*2^17 + 278598*2^18 " | bc
102071109632
This includes a large amount of 256k blocks and it is "just" 102 gigabytes, so it would fit on those 240 GB, right? But not knowing how much space the pure metadata would actually allocate, I stepped down to 128K. My current situation proves this decision was a good one :cool:

Conclusion for this one and post #6: "Special Devices" are crucial for some use cases, PBS on rotating rust being the most prominent - at least in my (small) world.

Oh... this detail is important: this configuration affects only newly written data - adding a "Special Device" to an already filled-up pool does not help existing data! To re-read and re-write every piece of data I used a script: https://github.com/markusressel/zfs-inplace-rebalancing
Meanwhile I believe there are better ways to do that!
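For completeness, attaching the special vdev itself is a single command - a rough sketch with placeholder device names (use /dev/disk/by-id paths in practice):
Code:
# add a mirrored Special Device to an existing pool; ZFS may ask for -f
# when the special vdev's redundancy differs from the pool's (mirror vs RaidZ2)
zpool add rpool special mirror /dev/disk/by-id/usb-SSD_A /dev/disk/by-id/usb-SSD_B
# only data written afterwards uses it - hence the rebalancing script above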
 
Cool. Looking at your results and the eventual data distribution of my spinner+special setup, I've concluded that I could re-tune my recordsize and special_small_blocks, as I've only got a few hundred MB in the two TB of space. I'll give it more thought. No rush. It goes really fast.

using USB for datastores
You got it mapped by guid, so it always recovers properly ... right?


To read and write again every piece of data I used a script: https://github.com/markusressel/zfs-inplace-rebalancing
Oooo. That's really interesting. I figured if things were in place, you were just screwed and needed to build a new array or something. Hrmm. I like it.
 
You got it mapped by guid, so it always recovers properly ... right?

the issue with USB (besides potentially performance and endurance ;)) is that the device can disappear, and ZFS doesn't really handle this well. it's worst for single-device pools, where you often need a reboot to get out of the mess - but the same can happen if for example all three special USB devices share a common hub that for some reason disappears and reappears. the pool will stay imported without an option to export or revive it without a full reboot, as far as I know.
 
You got it mapped by guid, so it always recovers properly ... right?
Yes, of course :)

I figured if things were in place, you were just screwed and needed to build a new array or something. Hrmm. I like it.
I had no critical problems and zero data loss doing all that. But that script is not optimal for our millions of small files. After two days I had an interruption (I don't remember the reason). While this would not damage any files, it was annoying: the script writes a log of which files have already been handled. The idea is great, but for our small chunks it does not reduce the required IO on the next run, as checking that log takes as long as the "real work" did. In the end it ran for 8 days, reaching ~90% when it stopped again - and I did not bother to restart it to handle the missing 10%.

Maybe my way of using it was sub-optimal. If I remember correctly, reading the documentation more closely would have shown how to speed up execution... :-)
 
Ok then. So my initial assessment was not inaccurate. Ur screwed.

I am not sure if there is a misunderstanding as English is not my native language. "screwed" means I am in a bad state? No!

I am absolutely fine with this sub-optimal setup in my Homelab - it works for me and I know exactly what to expect. That's why I called it "not recommended" ;-)
 
Um ... You understood me, but I was making a general statement about ZFS administration.
I meant to say that if one has a bad ZFS array build, it's better to just tear it down than to use this script to rewrite it, because 8 days is too long.

Oh ... and the USB biz ... I'm running one too.
It's a local office with no discrete storage, so I made a NAS, installed KVM on it as well, and installed PBS on that.
I don't get what folks are saying about it being hard to re-attach a disk with ZFS. I've had no issues of that sort, and I have had to do it.
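(For what it's worth, re-importing by stable IDs is roughly this, with "tank" as a placeholder pool name:)
Code:
zpool export tank                      # if the pool is still (half-)imported
zpool import -d /dev/disk/by-id tank   # re-import using persistent device IDs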
 
Maybe this is a bit too necro of a thread, but I think I am seeing some very excessive GC times. I have PBS running in a VM under Proxmox (yea yea... it's a homelab). Proxmox boots from 2 NVMe SSDs, as do the VMs it hosts, so PBS itself should be plenty performant.

Its datastore however is an NFS volume from its virtualized TrueNAS neighbor, which is running a spinning-rust vdev of 10x4TB WD Reds, no metadata special device, just pure, slow, spinning rust.

I recently put my server in a nice HL15 case with no hard-drive noise dampening and realized "wow, this thing is really loud now, the drives are always going nuts", and today I finally realized why...

[screenshot attached]


Do I just have *too many backups*?

[screenshot attached]


I guess PBS is *often* running GC tasks that take way, way too long. I only have a 500 GB NFS mount for this, and only ~300 GB of it is used. How in the world is it taking this long to do GC?

Any ideas? This seems..... broken. As far as I am aware, I am on all the latest updates, 3.1-2... aaaaand that's when he realized the non-subscription repo hadn't been enabled. Working on getting my system updated, let's hope that helps.
 
10x4TB wd reds, no metadata special device, just pure, slow, spinning rust.
Yeah, there is your reason.

When you really need to use rotating rust then you should at least build it as "striped mirrors" and add a mirrored "Special Device", from the start. Everything else might work at the beginning, but performance degrades over time as both used space and fragmentation increase.

Edit: if you add a "Special Device" now, it does not help immediately. You need to re-read and re-write all the data to shuffle the metadata to the new place. There are scripts for that...
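A rough sketch of that layout, with placeholder disk names (use /dev/disk/by-id paths in practice):
Code:
# two striped mirrors for data plus a mirrored Special Device
# for metadata and (optionally) small blocks
zpool create tank \
  mirror /dev/disk/by-id/ata-HDD_1 /dev/disk/by-id/ata-HDD_2 \
  mirror /dev/disk/by-id/ata-HDD_3 /dev/disk/by-id/ata-HDD_4 \
  special mirror /dev/disk/by-id/ata-SSD_1 /dev/disk/by-id/ata-SSD_2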

Any ideas? This seems..... broken.
Well, it works... as expected.

You knew that beforehand, right? PBS needs IOPS! https://pbs.proxmox.com/docs/installation.html#recommended-server-system-requirements
 
Yeah, there is your reason.

When you really need to use rotating rust then you should at least build it as "striped mirrors" and add a mirrored "Special Device", from the start. Everything else might work at the beginning, but performance degrades over time as both used space and fragmentation increase.


Well, it works... as expected.

You knew that beforehand, right? PBS needs IOPS! https://pbs.proxmox.com/docs/installation.html#recommended-server-system-requirements
I don't disagree that spinning rust is slow... but I can do pretty much anything else I need with this homelab without any issue. Of note, a buddy of mine with a similar setup is getting GC times in the low single-digit hours. I have not compared his jobs with mine, so it's entirely possible I have way more backups than he does... but my workload still seems fairly ordinary. I can't imagine how it can take over a day to GC a ~250 GB dataset. I definitely agree spinning rust is "slow" and has poor IOPS, but 250 GB in 24 hours is insane...

But I have considered adding a metadata special device; maybe this is the reason to finally do it.
 
Your second issue is the NFS mount. Network file systems are known to perform badly with PBS, see https://forum.proxmox.com/threads/datastore-performance-tester-for-pbs.148694/
There was another discussion with the PBS developers and the author of said thread where it was pointed out that some of the assumptions behind his performance testing tool were wrong, but the developers agreed with the conclusion (that you really don't want to use NFS or another network filesystem as a PBS datastore).

You have the following options:
  • Create a PBS LXC container on your TrueNAS, see https://forum.proxmox.com/threads/pbs-on-truenas-have-your-cake-and-eat-it-too.162860/
    Sadly, TrueNAS container support is still experimental and might be ditched or changed later, but performance-wise this is probably the best course of action. The HDD IOPS will still be bad, but you won't be hurt as much by the NFS overhead.
  • Use an iSCSI share instead of NFS: https://jrehkemper.de/content/linux/proxmox/truenas-iscsi-storage-for-proxmox-backup-server/
    The performance still won't be great but should be better than with NFS.
  • Since your backup size is still quite small, you could buy two cheap used datacenter SSDs from eBay or some reseller and use them as dedicated disks for your PBS.
  • Like Udo said, you will get better IOPS with a striped mirror out of your 4x10 TB disks. How are they configured at the moment (RAIDZ)? What is the capacity of your ZFS pool? RAIDZ1, RAIDZ2 and a striped mirror with four 10 TB disks actually don't differ much in capacity: https://www.truenas.com/docs/references/zfscapacitycalculator/ gives 18.063 TiB for a 2-way mirror, 18.045 TiB for RAIDZ1 (one spare) or 26.304 TiB (RAIDZ1 without spare), and 17.494 TiB (RAIDZ2, no spare).
  • Again, as Udo said, a special-device mirror out of used server SSDs will help a lot with garbage collection. Another benefit is that it will also speed up any operations on other files in the pool (you will need to rewrite the data though).
For the third and fourth option you would need to back up the data on your pool somewhere else, recreate the ZFS pool and afterwards restore the data to it.

Personally I would go with two used server SSDs (they go for around 50 euros per 480 GB in Germany at the moment) passed through to the PBS. Additionally I would buy two or three more used server SSDs and add them as a special device to the HDD pool, and add an iSCSI share as a second datastore for an additional copy (PBS can sync from one datastore to the other).
If this isn't feasible I would rebuild the ZFS pool as striped mirrors plus (if budget allows) a special device and use iSCSI as the datastore.
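For the iSCSI route, very roughly (hypothetical portal and target names; assumes open-iscsi on the PBS host):
Code:
iscsiadm -m discovery -t sendtargets -p 192.168.1.10
iscsiadm -m node -T iqn.2005-10.org.freenas.ctl:pbs -p 192.168.1.10 --login
# put a filesystem on the new LUN, mount it, and register it as a datastore
mkfs.ext4 /dev/sdX                               # placeholder - check lsblk first
mkdir -p /mnt/datastore/iscsi
mount /dev/sdX /mnt/datastore/iscsi
proxmox-backup-manager datastore create iscsi-store /mnt/datastore/iscsi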
 