Strange Garbage Collection times

markc · Sep 30, 2024

I'm running a Proxmox Backup Server on my Terramaster F2-423 with 32GB of RAM. I've got two NVMe drives in the system - one for the Proxmox VE host and the other for Ceph storage. My main storage consists of two 8TB Seagate CRM drives in a ZFS mirror configuration. I've passed these through to a PBS installation running in an LXC container, and this storage pool isn't shared with any other VMs or containers.

Right now, I'm using about 56% of my allocated Datastore storage (2.10 TB out of 3.76 TB). My main issue is with the Garbage Collection process in PBS. It often takes many hours to complete and seems to get "stuck" at 98% to 99%. I've seen this behavior consistently, with that final few percent taking forever to finish. To help troubleshoot, I recently ran a zpool scrub, which took about 5 hours and 10 minutes and after that, I started another Garbage Collection run. It did complete faster than previous runs, but still had a noticeable delay at the end.

I'm thinking this could be due to a few things. Maybe it's how ZFS and PBS interact, or write amplification from ZFS's copy-on-write combined with PBS's deduplication. It could also be resource limitations on my Terramaster. I'm also wondering if fragmentation might be an issue as my storage usage increases. One limitation I have is that I can't use my NVMe drives for caching since they're already allocated, so I'm relying entirely on the spinning disks for performance.

Can anyone explain how this run is mostly okay, but then a very long time to GC the last few index files? Has anyone encountered similar issues or have suggestions for improving this situation?

Code:

[...]
2024-09-30T17:08:41+10:00: marked 90% (234 of 260 index files)
2024-09-30T17:08:41+10:00: marked 91% (237 of 260 index files) <- 3 index files in 1 sec
2024-09-30T17:08:42+10:00: marked 92% (240 of 260 index files) <- 3 index files in 1 sec
2024-09-30T17:08:47+10:00: marked 93% (242 of 260 index files) <- 2 index files in 5 secs
2024-09-30T17:12:05+10:00: marked 94% (245 of 260 index files) <- 3 index files in 3:18 mins
2024-09-30T17:12:05+10:00: marked 95% (247 of 260 index files) <- 2 index files in 0 secs
2024-09-30T17:12:47+10:00: marked 96% (250 of 260 index files) <- 3 index files in 42 secs
2024-09-30T17:13:47+10:00: marked 97% (253 of 260 index files) <- 3 index files in 1 min
2024-09-30T17:37:25+10:00: marked 98% (255 of 260 index files) <- 2 index files in ~21 mins
2024-09-30T17:51:52+10:00: marked 99% (258 of 260 index files) <- 3 index files in ~13 mins
2024-09-30T18:00:22+10:00: marked 100% (260 of 260 index files) <- 2 index files in ~9 mins
2024-09-30T18:00:22+10:00: Start GC phase2 (sweep unused chunks)
2024-09-30T18:00:30+10:00: processed 1% (8813 chunks)
2024-09-30T18:00:37+10:00: processed 2% (17629 chunks)
2024-09-30T18:00:43+10:00: processed 3% (26463 chunks)
[...]
024-09-30T18:20:16+10:00: processed 98% (855508 chunks)
2024-09-30T18:20:38+10:00: processed 99% (864290 chunks)
2024-09-30T18:20:59+10:00: Removed garbage: 1.397 GiB
2024-09-30T18:20:59+10:00: Removed chunks: 1002
2024-09-30T18:20:59+10:00: Original data usage: 14.544 TiB
2024-09-30T18:20:59+10:00: On-Disk usage: 1.899 TiB (13.05%)
2024-09-30T18:20:59+10:00: On-Disk chunks: 871911
2024-09-30T18:20:59+10:00: Deduplication factor: 7.66
2024-09-30T18:20:59+10:00: Average chunk size: 2.283 MiB
2024-09-30T18:20:59+10:00: TASK OK

Total time: 3h 15m 9.3s

Chris · Sep 30, 2024

Hi,
this probably is related to the snapshots index files. If these index a lot of chunks, because they contain large disks or filesystems, than the progress will be slower for these, as all the indexed chunks have to be touched in order to update the atime of the chunk. I do not see any issue.

markc · Sep 30, 2024

This particular example was not too bad, only ~3 hours for the entire GC run, whereas previous runs would take 5 to 7 hours and always with many hours spent on the last dozen index files. I just did a GC on a similar sized Datastore on a 4TB NVMe in an MS-01, and it took less than 5 minutes! I know, a spinning disk can't compete with a NVMe drive, but 5 minutes compared to up to 5+ hours for a similar sized Datastore seems like there is something seriously wrong with my TM F2-423 w/ 2x8TB setup.

JensF · Sep 30, 2024

This is quite usual on spinning rust and the main reason SSDs are recommended.
You could try different filesystems for your HDDs. BTRFS handles pretty great the small chunks.

VictorSTS · Sep 30, 2024

markc said:
I know, a spinning disk can't compete with a NVMe drive, but 5 minutes compared to up to 5+ hours for a similar sized Datastore seems like there is something seriously wrong with my TM F2-423 w/ 2x8TB setup.

GC is a very hard workload for an HDD. GC implies touching millions and millions of files just to update the atime, which for an HDD implies moving the heads at least 2 times for each file. Given that the typical search times for an HDD are around ~8ms vs the ~0.05ms seek time of a SATA SSD (NVME are usually in the micro seconds values) perfectly explains the time difference.

That's why using ZFS + Special Device is recommended (and even needed if the datastore is big enough) [1].

[1] https://pbs.proxmox.com/docs/sysadmin.html#zfs-special-device

markc · Sep 30, 2024

Thanks to Chris and VictorSTS for the explanation of how GC works. I just did another manual GC run after a remote sync plus a prune. This time was indeed 7h 23m, with some index files taking up to an hour, while most were a few seconds. This run got to 97% of phase1 in under 5 mins, then really "stalled" for the next 7 hours. Such a staggering difference that the power consumption (and heat) of a pair of spinning drives going flat out for many hours compared to an NVMe drive doing the same amount of work in 5 minutes is something to consider when scaling up. If I get a chance to compare this same host with a Special Device (not likely) then that would be interesting, but otherwise consider me an "all-flash" convert.

Code:

[...]
2024-09-30T22:45:16+10:00: marked 96% (279 of 290 index files)
2024-09-30T22:45:17+10:00: marked 97% (282 of 290 index files) <- 3 index files in 1 sec
2024-09-30T23:40:05+10:00: marked 98% (285 of 290 index files) <- 3 index files in ~5 min
2024-09-30T23:51:00+10:00: marked 99% (288 of 290 index files) <- 3 index files in 1h 11m
2024-09-30T23:57:00+10:00: marked 100% (290 of 290 index files) <- 2 index files in ~6 min
2024-09-30T23:57:00+10:00: Start GC phase2 (sweep unused chunks)
2024-09-30T23:57:08+10:00: processed 1% (9598 chunks)
2024-09-30T23:57:14+10:00: processed 2% (19192 chunks)
2024-09-30T23:57:22+10:00: processed 3% (28837 chunks)
[...]

Search

Search

Strange Garbage Collection times

markc

Active Member

Chris

Proxmox Staff Member

markc

Active Member

JensF

Well-Known Member

VictorSTS

Famous Member

markc

Active Member