Slow garbage collection and failed backups

aeop99

New Member
Sep 22, 2025
Hey everyone,

We're experiencing issues with garbage collection being very slow. In some cases this is causing backups to fail, and backups are also taking longer than expected to complete.

Our setup is not ideal, as we're using network-attached storage rather than SSDs, but it worked fine for 1-2 months, so we're unsure what is causing the issue. As mentioned, the datastore is on network-attached storage (/mnt), which points to a NAS with Synology HAT5300-16T drives connected locally to our dedicated Proxmox Backup Server. The server itself is dedicated hardware, but has no storage other than the two drives used for the OS.

On the datastore we are currently using ~60TB of storage (out of ~112TB), and GC is taking upwards of 30 days to complete. The first phase is quite quick, but the second phase seems to process ~3% a day. While GC is running, our backups take roughly 4x as long to complete, with one particular VM going from 30 minutes to 2 hours per backup. For some VMs, particularly the large ones (several TB in size), this also seems to cause them to go offline for 1-2 minutes when a backup starts while GC is running.

I do not believe this is an issue with the dedicated server itself, as it has sufficient (overkill) hardware and we're not seeing it come anywhere near using all of its resources. My suspicion is that the network-attached storage and slow(ish) HDDs are our problem. However, as this was previously working fine, I wasn't expecting any issues. Could the number of chunks be causing issues now? The last GC reported 42198447 chunks.

We previously ran into this exact issue and were able to create a new datastore on a separate NAS, with very similar specs, and started storing our backups there. Doing this resolved the issue briefly, but after 1-2 months we started seeing exactly the same issue again, so it seems we are able to reproduce it. Again, my suspicion is that the number of stored chunks is causing us issues and that network-attached storage with HDDs isn't helping. Does anyone have any suggestions on how we can resolve this without having to purchase SSDs?

We are also noticing some backups fail with the task log showing:
Code:
2025-09-22T06:02:28+10:00: backup ended and finish failed: backup ended but finished flag is not set.
2025-09-22T06:02:28+10:00: removing unfinished backup
2025-09-22T06:02:28+10:00: removing backup snapshot "/mnt/nas/ns/6Hour/vm/297/2025-09-21T20:00:01Z"
2025-09-22T06:02:28+10:00: TASK ERROR: removing backup snapshot "/mnt/nas/ns/6Hour/vm/297/2025-09-21T20:00:01Z" failed - Directory not empty (os error 39)

PBS information -
Version: 3.4.2
CPU: Intel Xeon E5-2680 (2 sockets)
RAM: 338GB

Thanks in advance :)
 
This has been discussed dozens of times already. That setup with those backup sizes won't ever perform well. You are using the two things that kill PBS GC performance: network shares and an HDD-only datastore. A datastore of that size needs a proper deployment.

Every GC run has to touch every single chunk to update its timestamp, and remove it if it is older than 24h05m (i.e. no longer referenced). Roughly speaking, for each chunk, add the network round-trip latency of the access/modify/delete operation, then the HDD I/O latency and commit times for very small block sizes (GC is essentially metadata churn), then the NAS's own processing time. In the meantime you still want to run backups and maybe verifies, which also need I/O. Your storage infrastructure simply can't keep up with that workload.
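To put rough numbers on that (a back-of-envelope sketch only; the per-chunk latency below is an assumption, not something measured on your system), the ~42 million chunks your last GC reported only need a few tens of milliseconds of effective latency each to add up to weeks:

Code:
# Back-of-envelope only: 42,198,447 chunks at an assumed effective
# latency of ~60 ms per chunk (NAS round trip + HDD seek + contention)
echo "$(( 42198447 * 60 / 1000 / 86400 )) days"    # prints: 29 days

That lines up with the ~30 days you are seeing, which is why the fix is to cut the per-chunk latency rather than the chunk count.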

If you are serious about your backups, ditch the NAS and get a server where you can put your HDDs and some SSDs. Use ZFS RAID10 plus a special device and enjoy a good balance between performance, capacity and price. RAIDZ is an option too, but it's slower. Another option, if your NAS allows it, is to install PBS as a VM on the NAS: it will be faster because you remove the network latency to reach the HDDs.
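As a rough sketch of that ZFS layout (pool name and device names are placeholders, adjust for your hardware): striped HDD mirrors for capacity, plus a mirrored pair of enterprise SSDs as the special vdev so metadata lives on flash:

Code:
# Hypothetical devices: four data HDDs plus two enterprise SSDs
zpool create backup \
    mirror /dev/sda /dev/sdb \
    mirror /dev/sdc /dev/sdd \
    special mirror /dev/nvme0n1 /dev/nvme1n1
# Optionally push small blocks (not just metadata) onto the SSDs as well
zfs set special_small_blocks=4K backup

With the special device holding the metadata, the timestamp updates GC performs mostly hit the SSDs instead of the spinning disks.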

Two things you can do now:
  • Try increasing gc-cache-capacity: proxmox-backup-manager datastore update <DATASTORE> --tuning 'gc-cache-capacity=8388608'. That may cut a few days off your GC time.
  • Use a fleecing device on your PVE backup tasks (see the sketch after this list). It acts as a local buffer, so even if your PBS is overloaded, the VMs' I/O won't be affected.
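For the fleecing part, assuming PVE 8.2 or later where vzdump has the fleecing option, something along these lines (the fleecing storage here is just an example, pick any fast local storage; it can also be enabled per backup job in the GUI under Advanced):

Code:
# One-off backup of the VM from your log, with a local fleecing image
# <PBS-STORAGE> is a placeholder for your PBS storage ID in PVE
vzdump 297 --storage <PBS-STORAGE> --fleecing enabled=1,storage=local-lvm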
 
Thank you. I assumed this would be the case, but thought I'd ask to confirm. We actually tried running PBS as a VM on the NAS but ran into similar issues. We're already using fleecing, but I'll try increasing the cache capacity in the meantime and see if it makes a difference.

In any case, it looks like we'll have no choice but to get some enterprise SSDs for PBS and install them directly in the server.