I’ve been testing Proxmox Backup Server (PBS) more seriously over the past few months and I have a question about garbage collection. It’s mostly about confirming, with more experienced PBS users, my setup and my suspicions about what is driving up garbage collection runtimes. Details of my setup are further down.
From my understanding, garbage collection is done in two phases:
- phase 1 (mark): walk every backup index to identify the chunks still in use, including newly added chunks
- phase 2 (sweep): identify the chunks no longer referenced by any index so their space can be reclaimed (a toy sketch of the whole mark-and-sweep idea follows below).
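Just to make sure I’m picturing the mechanism correctly, here is a toy Python sketch of a two-phase mark-and-sweep over a chunk store. The layout is entirely made up for illustration (a flat chunk directory and plain-text index files with one chunk digest per line); it is not PBS’s actual implementation, but my understanding is that PBS marks referenced chunks by refreshing a timestamp on them, which is what the sketch mimics with os.utime.

```python
import os
import time

# Toy mark-and-sweep over a hypothetical chunk store. Paths and file formats
# are invented for illustration; this is not how PBS stores its data.
CHUNK_DIR = "/tmp/toy-datastore/chunks"          # hypothetical chunk directory
INDEX_FILES = ["/tmp/toy-datastore/vm-100.idx"]  # hypothetical backup indexes

def mark(index_files, chunk_dir):
    """Phase 1: read every index and 'mark' each referenced chunk by
    updating its timestamp (one small metadata write per chunk)."""
    marked = 0
    for idx in index_files:
        with open(idx) as f:
            for digest in f:                     # assume one digest per line
                os.utime(os.path.join(chunk_dir, digest.strip()))
                marked += 1
    return marked

def sweep(chunk_dir, cutoff):
    """Phase 2: delete chunk files whose timestamp was not refreshed,
    i.e. chunks that no index references any more."""
    removed = 0
    for name in os.listdir(chunk_dir):
        path = os.path.join(chunk_dir, name)
        if os.stat(path).st_mtime < cutoff:
            os.remove(path)
            removed += 1
    return removed

if __name__ == "__main__":
    start = time.time()
    marked = mark(INDEX_FILES, CHUNK_DIR)
    t1 = time.time()
    removed = sweep(CHUNK_DIR, cutoff=start)
    t2 = time.time()
    print(f"phase 1: {t1 - start:.1f}s ({marked} chunks marked), "
          f"phase 2: {t2 - t1:.1f}s ({removed} chunks removed)")
```

The point of the sketch is that phase 1 has to touch every referenced chunk individually, while phase 2 only deals with whatever is left over, which already hints at why phase 1 dominates.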
I noticed that phase 1 takes by far the longest to complete, no matter how much data is stored. If the disks are fairly empty, GC takes minutes to a few hours, but as the PBS datastore fills with backups and the number of chunks grows, GC takes longer and longer. In all cases phase 1 takes substantially more time than phase 2, roughly 30x-100x longer, regardless of the total GC runtime. The exact numbers depend on how much data was added and how much can be freed, and they vary widely.
For example, the most recent garbage collection took just over 19 hours:
phase 1: 18h33min, phase 2: 0h37min (~30x difference)
An earlier configuration with few backups and far fewer used chunks produced a GC run of 1h41min:
phase 1: 1h38min, phase 2: 0h03min (~33x difference)
There are other examples where the ratio is closer to 100x.
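My working theory is that phase 1’s runtime tracks the total number of referenced chunks and index files rather than the amount of data that can be freed, which would explain why the ratio stays so lopsided. To put a number on how many chunk files the datastore holds, something like this quick walk works; the path is an assumption, point it at the .chunks directory of the datastore (in my case an NFS mount inside the PBS VM):

```python
import os

# Count chunk files and their total on-disk size. CHUNK_DIR is an assumption:
# set it to the .chunks directory of the datastore.
CHUNK_DIR = "/mnt/datastore/.chunks"

count = 0
total_bytes = 0
for root, _dirs, files in os.walk(CHUNK_DIR):
    for name in files:
        count += 1
        total_bytes += os.path.getsize(os.path.join(root, name))

print(f"{count} chunks, {total_bytes / 1024**3:.1f} GiB on disk")
```

Comparing that count between GC runs should show whether phase 1 time really scales with the chunk count.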
I have a setup that is far from ideal and not recommended for production use:
- PBS runs as a VM on a bare-metal PVE host
- TrueNAS also runs as a VM on the same PVE host
- the datastore is actually a dataset on TrueNAS, shared to PBS via NFS
- TrueNAS runs ZFS and the default lz4 compression is enabled
- the TrueNAS pool uses hard drives in RAIDZ1 instead of enterprise SSDs, so there is a major IOPS disadvantage (a small latency probe illustrating this is sketched after this list)
- my budget was constrained for this build and it was mainly for testing.
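Since phase 1 appears to boil down to one small metadata operation per referenced chunk, my suspicion is that NFS round-trips plus HDD seek latency are what stretch it into hours. A rough way to probe that, assuming the datastore is mounted at /mnt/datastore inside the PBS VM (adjust the path): create a few thousand tiny files in a scratch directory on that mount and time per-file timestamp updates.

```python
import os
import tempfile
import time

# Rough probe of small-metadata-operation latency on the datastore mount.
# MOUNT is an assumption: set it to wherever the NFS share is mounted.
MOUNT = "/mnt/datastore"
N = 5000

with tempfile.TemporaryDirectory(dir=MOUNT) as scratch:
    paths = []
    for i in range(N):
        path = os.path.join(scratch, f"f{i}")
        with open(path, "w") as f:
            f.write("x")
        paths.append(path)

    t0 = time.time()
    for path in paths:
        os.utime(path)   # one timestamp update per file, like marking a chunk
    elapsed = time.time() - t0

    print(f"{N} timestamp updates in {elapsed:.1f}s "
          f"({elapsed / N * 1000:.2f} ms each)")
```

Even at 1 ms per operation, ten million chunks would already mean close to three hours of pure metadata updates, so with NFS-over-RAIDZ1 latencies an 18+ hour phase 1 doesn’t look that surprising to me.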
I’m wondering whether filling up the other TrueNAS datasets affects PBS garbage collection times, since the PBS datastore lives in the same pool as those datasets (and therefore draws on the pool’s shared free space). If PBS were the only data in the pool/dataset, would that change anything? Is it better to have dedicated disks/SSDs as a separate pool reserved for PBS, or even to pass some drives through directly to PBS?
I suspect that dedicating some disks/SSDs exclusively to PBS will help somewhat, and I will try this as soon as time permits. I know it won’t resolve the IOPS limitation of spinning drives unless I get large enterprise SSDs; I may go that route if I’m going to make the effort of moving the datastore to dedicated disks.
Because my configuration is non-standard, I couldn’t find recommendations in the Proxmox documentation for my specific setup, only that PBS should use SSDs because of the high IOPS load and the amount of processing required by garbage collection and pruning.
Extra details of my system/setup:
- the PVE host is installed on a single SSD formatted as ext4 (I later realized a ZFS mirror would have been better)
- PBS runs as a virtual machine on the PVE host (6 cores with CPU type host, 8-32 GB memory with ballooning)
- the TrueNAS Scale VM also runs on the PVE host (4 cores with CPU type host, 32 GB memory, balloon=0)
- I’ve passed through 5x 4 TB drives to TrueNAS, built a RAIDZ1 pool on them and created a few datasets, one of which is the PBS datastore
- the PBS datastore used for the backups is shared over NFS from the TrueNAS VM
- the other TrueNAS datasets store other data: some unrelated backups and some media files
- PVE, PBS and TrueNAS are all updated to the latest versions
Hardware:
- PVE is running on an HP Z620 workstation
- processors: 2x Xeon E5-2620 v2 (12 cores/24 threads total)
- RAM: 96 GB DDR3 ECC registered (buffered)