I'm running a Proxmox Backup Server on my Terramaster F2-423 with 32GB of RAM. I've got two NVMe drives in the system - one for the Proxmox VE host and the other for Ceph storage. My main storage consists of two 8TB Seagate CMR drives in a ZFS mirror configuration. I've passed these through to a PBS installation running in an LXC container, and this storage pool isn't shared with any other VMs or containers.
Right now, I'm using about 56% of my allocated Datastore storage (2.10 TB out of 3.76 TB). My main issue is with the Garbage Collection process in PBS. It often takes many hours to complete and seems to get "stuck" at 98% to 99%. I've seen this behavior consistently, with that final few percent taking forever to finish. To help troubleshoot, I recently ran a zpool scrub, which took about 5 hours and 10 minutes. After that, I started another Garbage Collection run. It completed faster than previous runs, but still had a noticeable delay at the end.
I'm thinking this could be due to a few things. Maybe it's how ZFS and PBS interact, or write amplification from ZFS's copy-on-write combined with PBS's deduplication. It could also be resource limitations on my Terramaster. I'm also wondering if fragmentation might be an issue as my storage usage increases. One limitation I have is that I can't use my NVMe drives for caching since they're already allocated, so I'm relying entirely on the spinning disks for performance.
Can anyone explain why most of this run goes quickly, but the last few index files take so long to mark during GC? Has anyone encountered similar issues, or does anyone have suggestions for improving this situation?
Code:
[...]
2024-09-30T17:08:41+10:00: marked 90% (234 of 260 index files)
2024-09-30T17:08:41+10:00: marked 91% (237 of 260 index files) <- 3 index files in 0 secs
2024-09-30T17:08:42+10:00: marked 92% (240 of 260 index files) <- 3 index files in 1 sec
2024-09-30T17:08:47+10:00: marked 93% (242 of 260 index files) <- 2 index files in 5 secs
2024-09-30T17:12:05+10:00: marked 94% (245 of 260 index files) <- 3 index files in 3:18 mins
2024-09-30T17:12:05+10:00: marked 95% (247 of 260 index files) <- 2 index files in 0 secs
2024-09-30T17:12:47+10:00: marked 96% (250 of 260 index files) <- 3 index files in 42 secs
2024-09-30T17:13:47+10:00: marked 97% (253 of 260 index files) <- 3 index files in 1 min
2024-09-30T17:37:25+10:00: marked 98% (255 of 260 index files) <- 2 index files in ~24 mins
2024-09-30T17:51:52+10:00: marked 99% (258 of 260 index files) <- 3 index files in ~14 mins
2024-09-30T18:00:22+10:00: marked 100% (260 of 260 index files) <- 2 index files in ~9 mins
2024-09-30T18:00:22+10:00: Start GC phase2 (sweep unused chunks)
2024-09-30T18:00:30+10:00: processed 1% (8813 chunks)
2024-09-30T18:00:37+10:00: processed 2% (17629 chunks)
2024-09-30T18:00:43+10:00: processed 3% (26463 chunks)
[...]
2024-09-30T18:20:16+10:00: processed 98% (855508 chunks)
2024-09-30T18:20:38+10:00: processed 99% (864290 chunks)
2024-09-30T18:20:59+10:00: Removed garbage: 1.397 GiB
2024-09-30T18:20:59+10:00: Removed chunks: 1002
2024-09-30T18:20:59+10:00: Original data usage: 14.544 TiB
2024-09-30T18:20:59+10:00: On-Disk usage: 1.899 TiB (13.05%)
2024-09-30T18:20:59+10:00: On-Disk chunks: 871911
2024-09-30T18:20:59+10:00: Deduplication factor: 7.66
2024-09-30T18:20:59+10:00: Average chunk size: 2.283 MiB
2024-09-30T18:20:59+10:00: TASK OK
Total time: 3h 15m 9.3s
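
To double-check the per-step timings annotated in the log above, here's a minimal Python sketch that parses the "marked N% (X of Y index files)" lines and reports the gap between consecutive entries. The function name `mark_deltas` and the `SAMPLE` excerpt are just for illustration, and it assumes the task log format shown above (ISO 8601 timestamp, then a colon, then the message).

```python
import re
from datetime import datetime

# Matches phase 1 progress lines such as:
# 2024-09-30T17:37:25+10:00: marked 98% (255 of 260 index files)
LINE_RE = re.compile(
    r"^(?P<ts>\S+): marked (?P<pct>\d+)% "
    r"\((?P<done>\d+) of (?P<total>\d+) index files\)"
)


def mark_deltas(log_text):
    """Return (percent, files_in_step, seconds_elapsed) for each
    consecutive pair of 'marked' lines in the log text."""
    prev = None
    rows = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip phase 2 lines, summaries, "[...]" etc.
        ts = datetime.fromisoformat(m["ts"])
        done = int(m["done"])
        if prev is not None:
            prev_ts, prev_done = prev
            elapsed = (ts - prev_ts).total_seconds()
            rows.append((int(m["pct"]), done - prev_done, elapsed))
        prev = (ts, done)
    return rows


# Excerpt taken from the log above (trailing annotations are ignored).
SAMPLE = """\
2024-09-30T17:13:47+10:00: marked 97% (253 of 260 index files)
2024-09-30T17:37:25+10:00: marked 98% (255 of 260 index files)
2024-09-30T17:51:52+10:00: marked 99% (258 of 260 index files)
2024-09-30T18:00:22+10:00: marked 100% (260 of 260 index files)
"""

for pct, files, secs in mark_deltas(SAMPLE):
    print(f"{pct}%: {files} index files in {secs:.0f}s")
```

Running this over the full task log makes it easy to see which index files account for most of phase 1, which may help narrow down whether a few large backups are responsible for the slow tail.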