Extremely slow garbage collection

slowcar

New Member
Apr 19, 2025
I've got 2 machines, both running PBS as a VM, both allocated 10 cores and 8 GB RAM.
Machine A has an AMD Ryzen 7 PRO 5750G, with NFS from a Synology VM on the same physical machine mapped as the datastore; garbage collection takes under 1 minute.
Machine B is a Dell R740 with a Xeon Gold 6138, with an HDD passed through as the datastore; garbage collection takes about 1 hour without much new data.
I've heard NFS datastores are typically slower, but not in my case. Any idea why it's taking so long, considering the amount of data for both jobs is quite similar?

One difference I noticed is RAM usage while verifying and garbage collecting: the AMD (faster) machine uses about 95% of its RAM, while the R740 (slower) machine only uses about 6-8%.

Code:
# R740 garbage collection, 59 minutes. Local HDD datastore.

2025-06-10T11:40:00+10:00: starting garbage collection on store backups
2025-06-10T11:40:00+10:00: task triggered by schedule 'tue 11:40'
2025-06-10T11:40:01+10:00: Access time update check successful, proceeding with GC.
2025-06-10T11:40:01+10:00: Using access time cutoff 1d 5m, minimum access time is 2025-06-09T01:35:00Z
2025-06-10T11:40:01+10:00: Start GC phase1 (mark used chunks)
2025-06-10T11:42:06+10:00: marked 2% (1 of 37 index files)
2025-06-10T11:42:06+10:00: marked 5% (2 of 37 index files)
2025-06-10T11:42:12+10:00: marked 8% (3 of 37 index files)
...
2025-06-10T12:09:33+10:00: marked 94% (35 of 37 index files)
2025-06-10T12:09:37+10:00: marked 97% (36 of 37 index files)
2025-06-10T12:09:37+10:00: marked 100% (37 of 37 index files)
2025-06-10T12:09:37+10:00: Start GC phase2 (sweep unused chunks)
2025-06-10T12:09:56+10:00: processed 1% (522 chunks)
2025-06-10T12:10:14+10:00: processed 2% (1046 chunks)
2025-06-10T12:10:33+10:00: processed 3% (1599 chunks)
2025-06-10T12:10:52+10:00: processed 4% (2118 chunks)
2025-06-10T12:11:10+10:00: processed 5% (2614 chunks)
2025-06-10T12:11:28+10:00: processed 6% (3107 chunks)
...
2025-06-10T12:39:38+10:00: processed 99% (50406 chunks)
2025-06-10T12:39:56+10:00: Removed garbage: 0 B
2025-06-10T12:39:56+10:00: Removed chunks: 0
2025-06-10T12:39:56+10:00: Original data usage: 757.442 GiB
2025-06-10T12:39:56+10:00: On-Disk usage: 127.482 GiB (16.83%)
2025-06-10T12:39:56+10:00: On-Disk chunks: 50897
2025-06-10T12:39:56+10:00: Deduplication factor: 5.94
2025-06-10T12:39:56+10:00: Average chunk size: 2.565 MiB
2025-06-10T12:39:56+10:00: TASK OK

Code:
#AMD Pro 5750G garbage collection, 50 seconds. NFS datastore.

2025-06-10T11:51:00+10:00: starting garbage collection on store backups-dsm
2025-06-10T11:51:00+10:00: task triggered by schedule '11:51'
2025-06-10T11:51:01+10:00: Access time update check successful, proceeding with GC.
2025-06-10T11:51:01+10:00: Using access time cutoff 1d 5m, minimum access time is 2025-06-09T01:46:00Z
2025-06-10T11:51:01+10:00: Start GC phase1 (mark used chunks)
2025-06-10T11:51:01+10:00: marked 3% (1 of 32 index files)
2025-06-10T11:51:13+10:00: marked 6% (2 of 32 index files)
2025-06-10T11:51:13+10:00: marked 9% (3 of 32 index files)
2025-06-10T11:51:13+10:00: marked 12% (4 of 32 index files)
...
2025-06-10T11:51:24+10:00: marked 96% (31 of 32 index files)
2025-06-10T11:51:24+10:00: marked 100% (32 of 32 index files)
2025-06-10T11:51:24+10:00: Start GC phase2 (sweep unused chunks)
2025-06-10T11:51:24+10:00: processed 1% (383 chunks)
2025-06-10T11:51:24+10:00: processed 2% (772 chunks)
2025-06-10T11:51:25+10:00: processed 3% (1167 chunks)
2025-06-10T11:51:25+10:00: processed 4% (1541 chunks)
...
2025-06-10T11:51:49+10:00: processed 97% (35361 chunks)
2025-06-10T11:51:49+10:00: processed 98% (35739 chunks)
2025-06-10T11:51:49+10:00: processed 99% (36150 chunks)
2025-06-10T11:51:50+10:00: Removed garbage: 881.239 KiB
2025-06-10T11:51:50+10:00: Removed chunks: 1
2025-06-10T11:51:50+10:00: Pending removals: 157.063 MiB (in 533 chunks)
2025-06-10T11:51:50+10:00: Original data usage: 645.442 GiB
2025-06-10T11:51:50+10:00: On-Disk usage: 85.091 GiB (13.18%)
2025-06-10T11:51:50+10:00: On-Disk chunks: 35979
2025-06-10T11:51:50+10:00: Deduplication factor: 7.59
2025-06-10T11:51:50+10:00: Average chunk size: 2.422 MiB
2025-06-10T11:51:50+10:00: queued notification (id=2e2bde95-ae8e-48e6-864d-b5349c8323a7)
2025-06-10T11:51:50+10:00: TASK OK
 
Are both systems running the same version? What kind of disks are backing each?
 
Same version of PVE and PBS. Both run off a mirrored m.2 SATA SSD for the OS. Both datastores are on 7200 RPM HDDs: the faster one via NFS, the slower one passed through.
Not that it should matter, but both are on 10 GbE NICs connected through a Unifi Aggregation Pro.
I also compared both PBSs' "Options" and "Hardware"; everything seems identical apart from 20 cores and 32 GB RAM on the slower machine vs 10 cores and 8 GB RAM on the faster one.
 
Please provide the full version, VM configuration, and actual storage details and disk models. Thanks!
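For example, output from something like the following would cover it (the VM ID below is a placeholder):

Code:
# on each PVE host
pveversion -v
qm config <vmid>

# inside each PBS VM
proxmox-backup-manager versions --verbose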
 
Please tell us the drive configuration, RAID setup, and file system.
What is the model name of the 7200 RPM drive(s)?
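For example, something like this inside the PBS VM (or on whichever host owns the disks) would show it - smartctl needs the smartmontools package, and /dev/sdX is a placeholder for the datastore disk:

Code:
lsblk -o NAME,MODEL,SIZE,ROTA,FSTYPE,MOUNTPOINT
zpool status
smartctl -i /dev/sdX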
 
AMD machine (faster):
CPU(s): 16 x AMD Ryzen 7 PRO 5750G with Radeon Graphics (1 Socket)
Kernel Version: Linux 6.8.12-11-pve (2025-05-22T09:39Z)
Manager Version: pve-manager/8.4.1/2a5fa54a8503f96d


Code:
# AMD PBS config
# Boot drive: 2 x m.2 SATA Micron 5100 MAX 400GB, ZFS mirror
# PBS Storage: 2 x Seagate Exos X18 16TB passed through to Synology (RAID1) VM, mapped to PBS via NFS
# VM image storage: 1 x SATA Micron 5300 MAX, LVM

boot: order=scsi0;ide2;net0
cores: 8
cpu: x86-64-v2-AES
ide2: dsm5750ghdd:iso/proxmox-backup-server_3.4-1.iso,media=cdrom,size=1275816K
memory: 8192
meta: creation-qemu=9.2.0,ctime=1749357903
name: PBS-5750g
net0: virtio=BC:24:11:0E:0E:71,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: Data:vm-104-disk-0,iothread=1,size=32G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=d957d3bf-88bc-4d50-a9d8-b17294a56cef
sockets: 1
vmgenid: 080cf939-93f3-4568-affb-a7e50e5a6f92

Dell R740 machine (slower):
CPU(s): 40 x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz (1 Socket)
Kernel Version: Linux 6.8.12-10-pve (2025-04-18T07:39Z)
Manager Version: pve-manager/8.4.1/2a5fa54a8503f96d

Code:
# Dell PBS config
# Boot drive: 2 x m.2 SATA Micron 5100 240GB on Dell BOSS-S1 RAID1 PCIe card, ext4
# Storage: 1 x WD Red Plus 10TB, ZFS in PBS
# VM image storage: 2 x Dell Toshiba PX05SMB080Y SAS SSD, ZFS mirror

boot: order=scsi0;net0
cores: 20
cpu: x86-64-v2-AES
memory: 32768
meta: creation-qemu=9.2.0,ctime=1747282558
name: PBS-r740
net0: virtio=BC:24:11:A4:7F:14,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: Data:104/vm-104-disk-0.qcow2,iothread=1,size=32G
scsi1: /dev/disk/by-id/wwn-0x5000cca641e347e4,backup=0,size=488386584K
scsihw: virtio-scsi-single
smbios1: uuid=abf0d4db-ea4a-40eb-a66f-8ed0913c6420
sockets: 1
vmgenid: 49c755a2-2331-4bfa-9f68-8fe5315d4cc7
 
The interesting part is the PBS version. Anyhow, since your slow PBS is using ZFS, this might be another instance of https://forum.proxmox.com/threads/gc-very-slow-after-3-4-update.166214/#post-772461 - could you try to gather the data requested there?
Both PBS are on 3.4.1:
slow machine Kernel Version: Linux 6.8.12-10-pve (2025-04-18T07:39Z)
fast machine Kernel Version: Linux 6.8.12-11-pve (2025-05-22T09:39Z)

I'll report back with results when I have time tomorrow to monitor a full GC run.
Another interesting find: a similar speed difference (roughly 60x slower) when doing a "verify" job for a similar amount of work on these two PBS instances.
 
Another interesting find: a similar speed difference (roughly 60x slower) when doing a "verify" job for a similar amount of work on these two PBS instances.
So just to double-check: you are saying your ZFS-backed datastore is 60 times slower than the NFS-based datastore for verification? If so, can you check the raw storage performance with an independent random read IO benchmark, e.g. via fio?
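Something like this would give a rough 4k random read baseline for each datastore (a sketch only - the --directory path is a placeholder for the datastore mount point, and fio will create a test file of the given size there):

Code:
# run on both PBS VMs, pointing --directory at the respective datastore mount
fio --name=randread4k --directory=/path/to/datastore \
    --ioengine=libaio --direct=1 --rw=randread --bs=4k \
    --size=4G --numjobs=4 --iodepth=32 \
    --runtime=60 --time_based --group_reporting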
 
In my opinion you set up the wrong comparison: a single drive with ZFS, which has very low 4k random R/W IOPS, against a NAS with more drives and probably a cache is not a fair comparison.

If you want ZFS, do it the way I do: use only SSDs (SATA3), with vdev0 as a 2x SSD mirror and vdev1 as a second 2x SSD mirror, striped together. That gives you roughly 2x the 4k random write IOPS and roughly 4x the 4k random read IOPS of a single SSD.

If you want ZFS with HDDs, swap the SSDs for HDDs in the setup above and add a ZFS special device as vdev2, a 2x SSD mirror. All metadata reads and writes then go to the very fast special device on the mirrored SSDs, and the HDDs no longer see that traffic ==> speed up.
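As a sketch (pool and disk names are placeholders, and a special vdev cannot easily be removed again, so always mirror it):

Code:
# add a mirrored SSD special device for metadata (pool name "backups" and by-id paths are placeholders)
zpool add backups special mirror /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B
# optionally also store small blocks (e.g. <= 4K) on the special device
zfs set special_small_blocks=4K backups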
 
As described above, the NAS volume is a RAID1 HDD setup with no cache; considering the inferior performance people normally see with NFS vs local storage, this result is puzzling. I'll try a RAID1 SSD volume on the R740 machine's Synology VM (the slower machine) as the datastore tomorrow and report back.
 
I've done the test on the R740 with NFS, and performance is much faster, just like on the AMD server. So it's definitely something to do with the original datastore's config, filesystem, or hardware.
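Next I'll compare the ZFS dataset properties of the slow datastore, in case something like atime handling or record size plays a role (the pool/dataset name below is just a placeholder):

Code:
# replace <pool>/<dataset> with the actual datastore dataset
zfs get atime,relatime,recordsize,special_small_blocks,compression <pool>/<dataset>
zpool status -v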