[SOLVED] Garbage collector too slow.

Hi,
We have PBS installed in a virtual machine; it accesses datastores located on an NFS server, mounted through the Proxmox host on which the PBS virtual machine is running.
We have several datastores in this PBS that are used for backups from different Proxmox systems.

The virtual machine backups from the Proxmox systems are working fine, but the garbage collection process that frees up space is running extremely slowly. Right now we have a garbage collection job that has been running on one of the datastores for more than 7 days and has not finished yet:
Bash:
2021-03-27T18:49:53+01:00: marked 52% (604 of 1160 index files)
2021-03-27T20:00:50+01:00: marked 53% (615 of 1160 index files)
2021-03-27T23:19:43+01:00: marked 54% (627 of 1160 index files)
2021-03-28T09:55:15+02:00: marked 55% (638 of 1160 index files)
2021-03-28T10:51:10+02:00: marked 56% (650 of 1160 index files)
2021-03-28T11:34:35+02:00: marked 57% (662 of 1160 index files)
2021-03-28T19:10:24+02:00: marked 58% (673 of 1160 index files)
2021-03-29T09:12:09+02:00: marked 59% (685 of 1160 index files)
2021-03-29T15:33:53+02:00: marked 60% (696 of 1160 index files)
2021-03-29T17:13:12+02:00: marked 61% (708 of 1160 index files)
2021-03-29T17:55:15+02:00: marked 62% (720 of 1160 index files)
2021-03-29T18:24:31+02:00: marked 63% (731 of 1160 index files)
2021-03-29T19:01:35+02:00: marked 64% (743 of 1160 index files)
2021-03-29T19:39:07+02:00: marked 65% (754 of 1160 index files)

Because of this, our datastores will be full in a few days, since we are not able to remove the pruned data.

Why is this happening?
What can we do to speed up the garbage collection process?

Thank you very much for your help.
 
Hi!

Right now we have a garbage collection job that has been running on one of the datastores for more than 7 days and has not finished yet
What raw data amount are we talking about here?
Also, what are the Hardware specs like of the PBS server and the NFS box?

Why is this happening?
Probably because metadata access on that NFS export is (relatively) slow when it's done for hundreds of thousands of chunks.

What can we do to speed up the garbage collection process?
Effectively, metadata access needs to be faster (lower latency and probably higher bandwidth too).
A bigger page cache (more available memory) can help a bit, but in the end nothing beats local SSDs: with lots of metadata access you get lots of small requests at different, mostly random, places. While bandwidth comparisons between spinners and SSDs do not look that bad, random IO is what matters here, and there SSDs can be faster by factors of hundreds to even thousands and more.
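
To get a rough feeling for the random-read IOPS that storage can actually deliver, a quick fio run against the datastore path can help. This is only an illustrative sketch (the path is a placeholder, and it creates a 4 GiB test file), not an exact reproduction of the GC access pattern:
Bash:
# rough random-read IOPS test on the datastore mount (creates a 4 GiB test file)
fio --name=randread-test --directory=/path/to/datastore \
    --rw=randread --bs=4k --direct=1 --ioengine=libaio \
    --iodepth=32 --size=4G --runtime=60 --time_based
A spinner-backed NFS export will typically land in the low hundreds of IOPS here, while even a cheap local SSD reaches tens of thousands.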
 
What raw data amount are we talking about here?
Also, what are the Hardware specs like of the PBS server and the NFS box?
Hi,
We are talking about 10TB of data for PBS.
It's a physical server with Proxmox installed on it, and PBS is running as a virtual machine.
It has 31TB of space with several disks in RAID with ZFS.

If we install PBS directly on the Proxmox server instead of in a virtual machine, avoiding the NFS overhead, I think we will get an important performance improvement, won't we?


Probably because metadata access on that NFS export is (relatively) slow when it's done for hundreds of thousands of chunks.

Effectively, metadata access needs to be faster (lower latency and probably higher bandwidth too).
A bigger page cache (more available memory) can help a bit, but in the end nothing beats local SSDs: with lots of metadata access you get lots of small requests at different, mostly random, places. While bandwidth comparisons between spinners and SSDs do not look that bad, random IO is what matters here, and there SSDs can be faster by factors of hundreds to even thousands and more.
Yes, we agree: using SSDs will increase performance, but having 31TB of backup space on SSD disks is very expensive.

Nevertheless, we don't understand why the copy process works fine while the delete process takes so much time.
Is it possible to make the delete process more efficient?
What is the reason it takes so much time?

Thanks!
 
It has 31TB of space with several disks in RAID with ZFS.
The RAID configuration would also be interesting. Any fast special device mirror or log device setup?

How much memory can be used for the ZFS ARC?

If we install PBS directly on the Proxmox server instead of in a virtual machine, avoiding the NFS overhead, I think we will get an important performance improvement, won't we?
It removes a layer of indirection, so it definitely should not get slower. But I would not get my hopes up too much for a huge performance improvement.

Yes, we agree: using SSDs will increase performance, but having 31TB of backup space on SSD disks is very expensive.
But they also save a lot of costs and are in general more reliable (no moving parts). You do not need a high-performance SSD; basically any SSD will beat a spinner in random IOPS and still deliver at least the same bandwidth.

For example, I could get a Micron 5210 ION 7.68 TB here for ~700 € (excluding VAT). For a 30 TB system I'd need four drives; as some redundancy is a must, I'd go with either 6 drives and RAID-Z2 or 8 drives and RAID10. The latter is much better performance-wise and during resilver, and also easier to extend, but bumps up the cost too; so it depends on the workload and on how long this setup is planned to operate.

With 6 drives and RAID-Z2 I'm investing 4200 € in drives, parts of which I may even amortize when doing taxes, and that for a setup lifetime of at least 7 years; so effectively an investment of 600 €/year in storage costs. If I subtract the amortized value and the person-hours spent working on/debugging/tuning a slow system, I may even come out with a gain compared to the much cheaper spinner system.
With 8 drives and RAID10 it's 5600 €, so 800 €/year, without factoring in taxes and the time saved.

Restore times will also dramatically improve with SSDs, so if you have a failure and need to restore a critical VM/CT/..., then needing 30 minutes instead of 3-5 hours can be quite the life-saver (just an example, but the order of magnitude should be about right).

Nevertheless, we don't understand why the copy process works fine while the delete process takes so much time.
Is it possible to make the delete process more efficient?
What is the reason it takes so much time?
By copy, you mean backups, right?
A backup just needs to write the chunks that do not already exist; it does not need to care about the global datastore state or other writers.
For example, if there are 1000 4 MiB chunks in a VM backup, of which 900 stayed the same, then it just needs to do a simple file-exists check for those 900 and write the 100 new ones - in sum, not much work and pretty fast to do.
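
Purely as an illustration of that check (not the actual PBS code; it assumes the usual .chunks/<4-hex-prefix>/<digest> layout, and write_new_chunk is a hypothetical helper):
Bash:
# illustrative sketch only - not the actual PBS implementation
# DIGEST is the SHA-256 hex digest of a chunk, STORE is the datastore path
chunk="$STORE/.chunks/${DIGEST:0:4}/$DIGEST"
if [ -e "$chunk" ]; then
    : # chunk already exists, nothing to write (deduplicated)
else
    write_new_chunk "$DIGEST"   # hypothetical helper that stores the new chunk data
fi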

But GC isn't such a local process; it affects the whole datastore, and since we just cannot block all backup processes for X hours, we need a scheme where backup writers can continue in parallel. That means PBS needs to iterate over all backup indexes to get the chunk lists they use, then iterate over all those still-used chunks and touch them (mtime update) to mark them as still in use - that's phase 1. In phase 2 we can go over all chunks in the datastore, check whether their mtime is new enough, and otherwise delete them.
Any optimization that would actually gain something here would mean multiple GiB (depending on storage) of additional, permanently allocated memory usage. And for those setups that actually already have lots of memory available, the performance should already be improved now thanks to the bigger page cache.

10 TiB of data is not little, but also not overly huge; that would be 2.6 - 3 million chunks, so requiring over 7 days for GC points to rather slow (in terms of IOPS) storage.
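
Just to illustrate the two-phase scheme described above in shell terms (a very rough sketch, not the actual PBS implementation, which is Rust code; list_chunks_of_index is a hypothetical helper and the index search/cutoff handling is simplified):
Bash:
# phase 1: mark - touch every chunk referenced by any backup index
cutoff=$(date -d '-1 day' +%Y-%m-%dT%H:%M:%S)   # simplified grace period before GC start
find "$STORE" -name '*.fidx' -o -name '*.didx' | while read -r index; do
    list_chunks_of_index "$index" | while read -r digest; do
        touch "$STORE/.chunks/${digest:0:4}/$digest"   # mtime update marks the chunk as in use
    done
done

# phase 2: sweep - remove chunks whose mtime is older than the cutoff
find "$STORE/.chunks" -type f ! -newermt "$cutoff" -delete
Every referenced chunk costs at least one metadata write in phase 1 and every chunk on disk one stat in phase 2, which is why slow random IO hurts so much here.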
 
Hi,
Ok, thank you very much for the supplied information.
We are going to run some tests and try to improve the GC performance.
 
Hi,
We have installed the PBS software directly on the Proxmox server and removed the PBS virtual machine, avoiding this way the use of the NFS server. We have also modified the ZFS ARC option zfs_arc_max to use half of the server's memory (16GB).
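
In case it helps someone else, setting it boils down to something like this (16GB expressed in bytes; adjust the value to your own memory):
Bash:
# /etc/modprobe.d/zfs.conf - limit the ZFS ARC to 16 GiB (16 * 1024^3 bytes)
options zfs zfs_arc_max=17179869184

# apply at runtime without a reboot (value in bytes)
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max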

Now the garbage collection procedure completes in hours instead of days. Great!!! :)
The longest garbage collection task used to take more than 7 days to complete; now it completes in 4-5 hours.

We are thinking about the possibility of using an L2ARC cache on SSD disks; maybe it will increase the I/O performance even more.

Thank you very much for the great advice and help.
 
>We are thinking about the possibility of using an L2ARC cache on SSD disks; maybe it will increase the I/O performance even more.

i think it does.

i have l2arc on a 120gb / 20 euro consumer grade ssd and have set secondarycache=metadata. after the l2arc is filled/warm, i get constant >90% metadata hits on datastore verify operations.

root@pbs01:/root# arcstat -f mread,mhit,mmis,mh% 10
mread mhit mmis mh%
0 0 0 100
639 621 17 97
604 584 20 96
570 542 28 95
596 568 28 95
575 535 39 93
617 583 34 94
490 460 29 93
497 470 27 94
582 562 19 96
549 529 19 96
581 557 24 95
623 597 25 95
548 518 30 94
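
roughly how that is set up, in case someone wants to try it (just a sketch; replace the pool and device names with your own):
Bash:
# add the ssd as l2arc (cache) device to the pool - pool/device names are just examples
zpool add tank cache /dev/disk/by-id/ata-EXAMPLE_SSD
# only cache metadata in the l2arc for this dataset (and its children, via inheritance)
zfs set secondarycache=metadata tank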
 
If you want to use L2ARC so that an SSD caches your pool metadata, why not just use SSDs as special devices?
 
because you need mirrored ssds for that, and i would not trust cheap ssds to expand your backup pool this way. you'd have your backup data distributed across ssd+hdd, and a problem with those ssds affects your whole backup.

for a metadata l2arc, a consumer grade el-cheapo ssd should be sufficient. it's just a caching device that you can remove at any time and which won't kill your data if it fails.
 
But GC isn't such a local process; it affects the whole datastore, and since we just cannot block all backup processes for X hours, we need a scheme where backup writers can continue in parallel. That means PBS needs to iterate over all backup indexes to get the chunk lists they use, then iterate over all those still-used chunks and touch them (mtime update) to mark them as still in use - that's phase 1. In phase 2 we can go over all chunks in the datastore, check whether their mtime is new enough, and otherwise delete them.

Maybe. One possible optimisation could be that GC still works as-is, but instead of doing `touch` on a file, it records the mark in an in-memory lookup table. Then the `.chunks` traversal in phase 2 should be a much quicker operation: only get the `mtime` of files that are not present in the lookup table, and do `touch` only on files that are present in it.

Assuming that `.chunks` holds 30TB and the average chunk size is on the lower end, around 2.5MB, this gives over 12M chunks; and assuming we can somehow keep the cost per lookup entry at a maximum of 64 bytes (the identifier itself seems to be 32 bytes (sha256?)), it would use up to about 800MB, for the duration of the GC only, with most of the time a single I/O operation per chunk in phase 2 (except for deletes, which would require two, but deletes are rather infrequent). This would be especially beneficial when many backups reuse the same chunks, as their chunks would not be touched over and over again.

I base the above on looking at the GC code about a year ago, so some optimisation might have been done since then.
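
A very rough shell-level sketch of the idea (illustrative only; the real GC is Rust code and would use a proper in-memory set rather than a shell array, list_chunks_of_index is a hypothetical helper, and the mtime grace period is omitted):
Bash:
# phase 1: collect the digests of all referenced chunks in memory instead of touching files
declare -A in_use
while read -r index; do
    while read -r digest; do
        in_use["$digest"]=1
    done < <(list_chunks_of_index "$index")   # hypothetical helper printing chunk digests
done < <(find "$STORE" -name '*.fidx' -o -name '*.didx')

# phase 2: single pass over .chunks - touch known chunks, delete unknown ones
while read -r path; do
    digest=$(basename "$path")
    if [ -n "${in_use[$digest]:-}" ]; then
        touch "$path"   # still referenced: one metadata write
    else
        rm -f "$path"   # not referenced (a real implementation would still respect a grace period)
    fi
done < <(find "$STORE/.chunks" -type f)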
 
PBS does a lot of random reads/writes, but the data is usually 1-4M and not 4K, so the internal write amplification of the SSD won't be that bad. What really is a lot of small random IO is the metadata, and you could buy some smaller but faster and more durable SSDs and use them as a ZFS special device to store only that metadata.
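
For reference, adding such a special device to an existing pool looks roughly like this (sketch; pool and device names are examples, the special vdev should be mirrored because losing it means losing the pool, and only newly written metadata lands on it):
Bash:
# add a mirrored special vdev that will hold the pool's metadata (device names are examples)
zpool add tank special mirror /dev/disk/by-id/nvme-SSD1 /dev/disk/by-id/nvme-SSD2
# special_small_blocks=0 keeps data blocks off the special vdev, so it stores metadata only
zfs set special_small_blocks=0 tank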
 
PBS does a lot of random reads/writes, but the data is usually 1-4M and not 4K, so the internal write amplification of the SSD won't be that bad. What really is a lot of small random IO is the metadata, and you could buy some smaller but faster and more durable SSDs and use them as a ZFS special device to store only that metadata.
Thank you so much for this insight and great idea!! Can you also suggest what would be better for this kind of QLC "enterprise" SSD, like the Micron ION above, in terms of ZFS pool configuration?

Let's say I have 12 drives. Is it better to run them as:
a) a RAID0-style stripe, giving effectively 12 drives
b) 3 vdevs of 4 drives each in raidz1, giving effectively 9 drives
c) 6 vdevs of 2 drives each in mirrors, giving effectively 6 drives

Considering I can afford to run them in any configuration, what would be the best in terms of PBS performance?
 
You always have to balance reliability, performance and capacity. Raid0 of course would be great for performance and best for capacity, but horrible for reliability.
With HDDs, I wouldn't use anything other than a striped mirror to make the most of the IOPS performance. With SSDs, a (striped) raidz1 or raidz2 might be fine, depending on the disks and your requirements.
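
For completeness, the three layouts from above would be created roughly like this (sketch; the pool name and the device names d1..d12 are placeholders):
Bash:
# a) 12-drive stripe (raid0) - maximum capacity and performance, no redundancy
zpool create tank d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12

# b) 3 raidz1 vdevs of 4 drives each - 9 drives of usable capacity
zpool create tank raidz1 d1 d2 d3 d4 raidz1 d5 d6 d7 d8 raidz1 d9 d10 d11 d12

# c) striped mirror - 6 mirror vdevs of 2 drives, best IOPS, 6 drives of usable capacity
zpool create tank mirror d1 d2 mirror d3 d4 mirror d5 d6 mirror d7 d8 mirror d9 d10 mirror d11 d12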
 
Have you read the Micron 5210 ION specs? Do you really recommend a QLC drive with 0.05-0.16 DWPD endurance for backups that are so random-write heavy?
As Dunuin wrote, PBS normally writes bigger chunks, and thanks to deduplication only the new ones, so there might not be as many write OPs as one may expect. Besides that, spinning HDDs have a lot more parts that can fail, so a QLC might not be worse off here.
And the model was just an example; nowadays I'd probably use a U.2 or U.3 connected drive (modern HW supports them out of the box, otherwise there are PCIe adapters). E.g., the U.2 Samsung PM9A3, the U.2 Solidigm (former Intel) D7-P5510, or the U.3 Micron 7450 PRO, all TLC and costing around €390 to €440 (excl. VAT). There are also 15.36 TB (~€850 excl. VAT) and even some 30.72 TB models available from those vendors.

3x PM9A3 15.36 TB disks for a RAID-Z1, or 4x for a RAID10, would give one 30 TB of usable space like in the original example, for ~€2550 or ~€3400, respectively – here the extra disk is only for uptime; a (bigger, but slower and cheaper) offsite PBS mirror or a tape backup would be needed in any case to actually secure the data against a local failure. That's about €1650 cheaper than two years ago (comparing the RAID-Z1 with the RAID-Z2 back then), and IMO a further argument for using SSDs nowadays.

To be sure: while I did use a U.2 disk via a PCIe adapter personally, don't take the above as a failsafe recommendation; test things on your actual HW. And yes, taking a good look at the datasheets and comparing durability with the IO load one (roughly) expects from their use case is definitely a good idea.
 
As Dunuin wrote, PBS normally writes bigger chunks, and thanks to deduplication only the new ones, so there might not be as many write OPs as one may expect. Besides that, spinning HDDs have a lot more parts that can fail, so a QLC might not be worse off here.
And the model was just an example; nowadays I'd probably use a U.2 or U.3 connected drive (modern HW supports them out of the box, otherwise there are PCIe adapters). E.g., the U.2 Samsung PM9A3, the U.2 Solidigm (former Intel) D7-P5510, or the U.3 Micron 7450 PRO, all TLC and costing around €390 to €440 (excl. VAT). There are also 15.36 TB (~€850 excl. VAT) and even some 30.72 TB models available from those vendors.

3x PM9A3 15.36 TB disks for a RAID-Z1, or 4x for a RAID10, would give one 30 TB of usable space like in the original example, for ~€2550 or ~€3400, respectively – here the extra disk is only for uptime; a (bigger, but slower and cheaper) offsite PBS mirror or a tape backup would be needed in any case to actually secure the data against a local failure. That's about €1650 cheaper than two years ago (comparing the RAID-Z1 with the RAID-Z2 back then), and IMO a further argument for using SSDs nowadays.

To be sure: while I did use a U.2 disk via a PCIe adapter personally, don't take the above as a failsafe recommendation; test things on your actual HW. And yes, taking a good look at the datasheets and comparing durability with the IO load one (roughly) expects from their use case is definitely a good idea.
Thank you so much for such a detailed answer. Yes, PCIe U.2/U.3 SSDs with good durability are so much cheaper than SATA SSDs these days. Too bad I have a limited number of PCIe lanes. 3 x 15.36TB U.2/U.3 SSDs sound like the best option in terms of cost; I just somehow need to find those extra 4 + 4 + 4 PCIe lanes on my motherboard.
 
