Extremely slow garbage collection

slowcar

New Member
Apr 19, 2025
I've got 2 machines, both running PBS as a VM, both allocated 10 cores and 8 GB RAM.
Machine A has an AMD Ryzen 7 PRO 5750G, with NFS from a Synology VM on the same physical machine mapped as the datastore; garbage collection takes under 1 minute.
Machine B is a Dell R740 with a Xeon Gold 6138, with an HDD passed through as the datastore; garbage collection takes about 1 hour without much new data.
I've heard NFS datastores are typically slower, but not in my case. Any idea why it's taking so long, considering the amount of data for both jobs is quite similar?

One difference I noticed is RAM usage while verifying and garbage collecting: the AMD (faster) machine uses about 95% of its RAM, while the R740 (slower) machine only uses about 6-8%.

Code:
# R740 garbage collection, 59 minutes. Local HDD datastore.

2025-06-10T11:40:00+10:00: starting garbage collection on store backups
2025-06-10T11:40:00+10:00: task triggered by schedule 'tue 11:40'
2025-06-10T11:40:01+10:00: Access time update check successful, proceeding with GC.
2025-06-10T11:40:01+10:00: Using access time cutoff 1d 5m, minimum access time is 2025-06-09T01:35:00Z
2025-06-10T11:40:01+10:00: Start GC phase1 (mark used chunks)
2025-06-10T11:42:06+10:00: marked 2% (1 of 37 index files)
2025-06-10T11:42:06+10:00: marked 5% (2 of 37 index files)
2025-06-10T11:42:12+10:00: marked 8% (3 of 37 index files)
...
2025-06-10T12:09:33+10:00: marked 94% (35 of 37 index files)
2025-06-10T12:09:37+10:00: marked 97% (36 of 37 index files)
2025-06-10T12:09:37+10:00: marked 100% (37 of 37 index files)
2025-06-10T12:09:37+10:00: Start GC phase2 (sweep unused chunks)
2025-06-10T12:09:56+10:00: processed 1% (522 chunks)
2025-06-10T12:10:14+10:00: processed 2% (1046 chunks)
2025-06-10T12:10:33+10:00: processed 3% (1599 chunks)
2025-06-10T12:10:52+10:00: processed 4% (2118 chunks)
2025-06-10T12:11:10+10:00: processed 5% (2614 chunks)
2025-06-10T12:11:28+10:00: processed 6% (3107 chunks)
...
2025-06-10T12:39:38+10:00: processed 99% (50406 chunks)
2025-06-10T12:39:56+10:00: Removed garbage: 0 B
2025-06-10T12:39:56+10:00: Removed chunks: 0
2025-06-10T12:39:56+10:00: Original data usage: 757.442 GiB
2025-06-10T12:39:56+10:00: On-Disk usage: 127.482 GiB (16.83%)
2025-06-10T12:39:56+10:00: On-Disk chunks: 50897
2025-06-10T12:39:56+10:00: Deduplication factor: 5.94
2025-06-10T12:39:56+10:00: Average chunk size: 2.565 MiB
2025-06-10T12:39:56+10:00: TASK OK

Code:
#AMD Pro 5750G garbage collection, 50 seconds. NFS datastore.

2025-06-10T11:51:00+10:00: starting garbage collection on store backups-dsm
2025-06-10T11:51:00+10:00: task triggered by schedule '11:51'
2025-06-10T11:51:01+10:00: Access time update check successful, proceeding with GC.
2025-06-10T11:51:01+10:00: Using access time cutoff 1d 5m, minimum access time is 2025-06-09T01:46:00Z
2025-06-10T11:51:01+10:00: Start GC phase1 (mark used chunks)
2025-06-10T11:51:01+10:00: marked 3% (1 of 32 index files)
2025-06-10T11:51:13+10:00: marked 6% (2 of 32 index files)
2025-06-10T11:51:13+10:00: marked 9% (3 of 32 index files)
2025-06-10T11:51:13+10:00: marked 12% (4 of 32 index files)
...
2025-06-10T11:51:24+10:00: marked 96% (31 of 32 index files)
2025-06-10T11:51:24+10:00: marked 100% (32 of 32 index files)
2025-06-10T11:51:24+10:00: Start GC phase2 (sweep unused chunks)
2025-06-10T11:51:24+10:00: processed 1% (383 chunks)
2025-06-10T11:51:24+10:00: processed 2% (772 chunks)
2025-06-10T11:51:25+10:00: processed 3% (1167 chunks)
2025-06-10T11:51:25+10:00: processed 4% (1541 chunks)
...
2025-06-10T11:51:49+10:00: processed 97% (35361 chunks)
2025-06-10T11:51:49+10:00: processed 98% (35739 chunks)
2025-06-10T11:51:49+10:00: processed 99% (36150 chunks)
2025-06-10T11:51:50+10:00: Removed garbage: 881.239 KiB
2025-06-10T11:51:50+10:00: Removed chunks: 1
2025-06-10T11:51:50+10:00: Pending removals: 157.063 MiB (in 533 chunks)
2025-06-10T11:51:50+10:00: Original data usage: 645.442 GiB
2025-06-10T11:51:50+10:00: On-Disk usage: 85.091 GiB (13.18%)
2025-06-10T11:51:50+10:00: On-Disk chunks: 35979
2025-06-10T11:51:50+10:00: Deduplication factor: 7.59
2025-06-10T11:51:50+10:00: Average chunk size: 2.422 MiB
2025-06-10T11:51:50+10:00: queued notification (id=2e2bde95-ae8e-48e6-864d-b5349c8323a7)
2025-06-10T11:51:50+10:00: TASK OK
 
Are both systems running the same version? What kind of disks are backing each?
 
Same version of PVE and PBS. Both run off a mirrored m.2 SATA SSD for the OS. Both datastores are on 7200 RPM HDDs: the faster one via NFS, the slower one passed through.
Not that it should matter, but both are on 10 GbE NICs connected through a Unifi Aggregation Pro.
I also compared both PBSs' "Options" and "Hardware"; everything seems identical apart from 20 cores and 32 GB RAM on the slower machine vs 10 cores and 8 GB RAM on the faster one.
 
Please provide the full version, VM configuration, and actual storage details and disk models. Thanks!
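For example, output from something like the following would cover it (the VM ID below is a placeholder):

Code:
# on each PVE host
pveversion -v
qm config <vmid>

# inside each PBS VM
proxmox-backup-manager versions --verbose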
 
Please tell us the drive configuration, RAID setup, and file system.
What is the model name of the 7200 RPM drive(s)?
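For example, something like this inside the PBS VM (or on whichever host owns the disks) would show it - smartctl needs the smartmontools package, and /dev/sdX is a placeholder for the datastore disk:

Code:
lsblk -o NAME,MODEL,SIZE,ROTA,FSTYPE,MOUNTPOINT
zpool status
smartctl -i /dev/sdX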
 
AMD machine (faster):
CPU(s): 16 x AMD Ryzen 7 PRO 5750G with Radeon Graphics (1 Socket)
Kernel Version: Linux 6.8.12-11-pve (2025-05-22T09:39Z)
Manager Version: pve-manager/8.4.1/2a5fa54a8503f96d


Code:
# AMD PBS config
# Boot drive: 2 x m.2 SATA Micron 5100 MAX 400GB, ZFS mirror
# PBS Storage: 2 x Seagate Exos X18 16TB passed through to Synology (RAID1) VM, mapped to PBS via NFS
# VM image storage: 1 x SATA Micron 5300 MAX, LVM

boot: order=scsi0;ide2;net0
cores: 8
cpu: x86-64-v2-AES
ide2: dsm5750ghdd:iso/proxmox-backup-server_3.4-1.iso,media=cdrom,size=1275816K
memory: 8192
meta: creation-qemu=9.2.0,ctime=1749357903
name: PBS-5750g
net0: virtio=BC:24:11:0E:0E:71,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: Data:vm-104-disk-0,iothread=1,size=32G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=d957d3bf-88bc-4d50-a9d8-b17294a56cef
sockets: 1
vmgenid: 080cf939-93f3-4568-affb-a7e50e5a6f92

Dell R740 machine (slower):
CPU(s): 40 x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz (1 Socket)
Kernel Version: Linux 6.8.12-10-pve (2025-04-18T07:39Z)
Manager Version: pve-manager/8.4.1/2a5fa54a8503f96d

Code:
# Dell PBS config
# Boot drive: 2 x m.2 SATA Micron 5100 240GB on Dell BOSS-S1 RAID1 PCIe card, ext4
# Storage: 1 x WD Red Plus 10TB, ZFS in PBS
# VM image storage: 2 x Dell Toshiba PX05SMB080Y SAS SSD, ZFS mirror

boot: order=scsi0;net0
cores: 20
cpu: x86-64-v2-AES
memory: 32768
meta: creation-qemu=9.2.0,ctime=1747282558
name: PBS-r740
net0: virtio=BC:24:11:A4:7F:14,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: Data:104/vm-104-disk-0.qcow2,iothread=1,size=32G
scsi1: /dev/disk/by-id/wwn-0x5000cca641e347e4,backup=0,size=488386584K
scsihw: virtio-scsi-single
smbios1: uuid=abf0d4db-ea4a-40eb-a66f-8ed0913c6420
sockets: 1
vmgenid: 49c755a2-2331-4bfa-9f68-8fe5315d4cc7
 
The interesting part is the PBS version. Anyhow, since your slow PBS is using ZFS, this might be another instance of https://forum.proxmox.com/threads/gc-very-slow-after-3-4-update.166214/#post-772461 - could you try to gather the data requested there?
Both PBS are on 3.4.1:
slow machine Kernel Version: Linux 6.8.12-10-pve (2025-04-18T07:39Z)
fast machine Kernel Version: Linux 6.8.12-11-pve (2025-05-22T09:39Z)

I'll report back with results when I have time tomorrow to monitor a full GC run.
Another interesting find: a similar speed difference (roughly 60x slower) when doing a "verify" job for a similar amount of work on these two PBS instances.
 
Another interesting find: a similar speed difference (roughly 60x slower) when doing a "verify" job for a similar amount of work on these two PBS instances.
So just to double-check: you are saying your ZFS-backed datastore is 60 times slower than the NFS-based datastore for verification? If so, can you check the raw storage performance with an independent random read IO benchmark, e.g. via fio?
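Something like this would give a rough 4k random read baseline for each datastore (a sketch only - the --directory path is a placeholder for the datastore mount point, and fio will create a test file of the given size there):

Code:
# run on both PBS VMs, pointing --directory at the respective datastore mount
fio --name=randread4k --directory=/path/to/datastore \
    --ioengine=libaio --direct=1 --rw=randread --bs=4k \
    --size=4G --numjobs=4 --iodepth=32 \
    --runtime=60 --time_based --group_reporting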
 
In my opinion you set up the wrong comparison: a single drive with ZFS, which has very low 4k random R/W IOPS, against a NAS with more drives and probably a cache is not a fair comparison.

If you want ZFS, do it the way I do: use only SSDs (SATA3), with vdev0 as a 2x SSD mirror and vdev1 as a second 2x SSD mirror, striped together. That gives you roughly 2x the 4k random write IOPS and roughly 4x the 4k random read IOPS of a single SSD.

If you want ZFS with HDDs, swap the SSDs for HDDs in the setup above and add a ZFS special device as vdev2, a 2x SSD mirror. All metadata reads and writes then go to the very fast special device on the mirrored SSDs, and the HDDs no longer see that traffic ==> speed up.
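As a sketch (pool and disk names are placeholders, and a special vdev cannot easily be removed again, so always mirror it):

Code:
# add a mirrored SSD special device for metadata (pool name "backups" and by-id paths are placeholders)
zpool add backups special mirror /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B
# optionally also store small blocks (e.g. <= 4K) on the special device
zfs set special_small_blocks=4K backups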
 
As described above, the NAS volume is a RAID1 HDD setup with no cache; considering the inferior performance people normally see with NFS vs local storage, this result is puzzling. I'll try a RAID1 SSD volume on the R740 machine's Synology VM (the slower machine) as the datastore tomorrow and report back.
 
I've done the test on the R740 with NFS, and performance is much faster, just like on the AMD server. So it's definitely something to do with the original datastore's config, filesystem, or hardware.
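Next I'll compare the ZFS dataset properties of the slow datastore, in case something like atime handling or record size plays a role (the pool/dataset name below is just a placeholder):

Code:
# replace <pool>/<dataset> with the actual datastore dataset
zfs get atime,relatime,recordsize,special_small_blocks,compression <pool>/<dataset>
zpool status -v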