Garbage collection skipped first phase

v1da

Hello friends!

We are running Proxmox Backup Server 2.4-1.

We recently moved from one datastore to another, ZFS-based one.
We moved the chunks folder, adjusted the configuration, and connected the datastore. Everything seemed fine: the tasks were preserved and the namespaces were available. All backups/verifies/prunes run properly, but when Garbage Collection starts, phase 1 ("marked") is skipped/ignored(?) and it jumps straight to phase 2 ("processed"). This did not happen before; the GC used to start with the first phase. The GC got through 16% before we aborted it, and on a datastore holding 37 TB of data about 6 TB was deleted, which is visible on the dashboard. It looks like all data is being deleted from the datastore: 16% of 37 TB ≈ 6 TB.
After a prune runs, the logs show which backups are marked to keep.
What do I need to do to restore correct GC operation?
And what happens to the deleted data? Can we say goodbye to it?

2023-11-04T09:44:30+05:00: starting garbage collection on store ds01
2023-11-04T09:44:30+05:00: Start GC phase1 (mark used chunks)
2023-11-04T09:44:30+05:00: Start GC phase2 (sweep unused chunks)
2023-11-04T10:01:59+05:00: processed 1% (191156 chunks)
2023-11-04T10:19:55+05:00: processed 2% (383124 chunks)
2023-11-04T10:39:00+05:00: processed 3% (575310 chunks)
2023-11-04T10:59:57+05:00: processed 4% (766738 chunks)
2023-11-04T11:23:52+05:00: processed 5% (957354 chunks)
2023-11-04T11:47:54+05:00: processed 6% (1146817 chunks)
2023-11-04T12:11:03+05:00: processed 7% (1335878 chunks)
2023-11-04T12:34:18+05:00: processed 8% (1524693 chunks)
2023-11-04T12:57:42+05:00: processed 9% (1713397 chunks)
2023-11-04T13:21:08+05:00: processed 10% (1903154 chunks)
2023-11-04T13:44:45+05:00: processed 11% (2092344 chunks)
2023-11-04T14:08:25+05:00: processed 12% (2281452 chunks)
2023-11-04T14:32:23+05:00: processed 13% (2469877 chunks)
2023-11-04T14:55:54+05:00: processed 14% (2659058 chunks)
2023-11-04T15:19:36+05:00: processed 15% (2848080 chunks)
2023-11-04T15:42:27+05:00: processed 16% (3036123 chunks)
2023-11-04T15:48:00+05:00: received abort request ...
2023-11-04T15:48:00+05:00: TASK ERROR: abort requested - aborting task

Here is an example of a prune log:
2023-11-03T18:00:00+05:00: prune job 's-acf028c4-eac7'
2023-11-03T18:00:00+05:00: task triggered by schedule '18:00'
2023-11-03T18:00:00+05:00: Starting datastore prune on datastore 'ds01', namespace 'AD', down to full depth
2023-11-03T18:00:00+05:00: retention options: --ns AD --keep-last 1 --keep-daily 5
2023-11-03T18:00:00+05:00: Pruning group AD:"vm/511"
2023-11-03T18:00:00+05:00: remove vm/511/2023-10-02T23:30:02Z
2023-11-03T18:00:00+05:00: keep vm/511/2023-10-03T23:30:00Z
2023-11-03T18:00:00+05:00: keep vm/511/2023-10-04T23:30:02Z
2023-11-03T18:00:00+05:00: keep vm/511/2023-10-31T03:14:04Z
2023-11-03T18:00:00+05:00: keep vm/511/2023-10-31T23:30:01Z
2023-11-03T18:00:00+05:00: keep vm/511/2023-11-01T23:30:04Z
2023-11-03T18:00:00+05:00: keep vm/511/2023-11-02T23:30:07Z
2023-11-03T18:00:00+05:00: TASK OK

Thanks!
 
GC phase 1 updates the atime of the chunk files. Does your new storage actually support storing/updating atime? Not all storage does this by default.
But if there are problems in the first phase, the second phase should abort rather than delete too many chunks.
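If you want to check, something along these lines should show how the dataset handles atime (the pool/dataset name and paths below are just examples, adjust them to your setup):

Bash:
# show the relevant ZFS properties and the actual mount options
zfs get atime,relatime tank/datastore
findmnt -o TARGET,OPTIONS /mnt/datastore/ds01

# GC phase 1 refreshes the atime of chunk files, so test whether
# an explicit atime update is actually persisted on one chunk
f=$(find /mnt/datastore/ds01/.chunks -type f | head -n 1)
stat -c 'atime before: %x' "$f"
touch -a "$f"
stat -c 'atime after:  %x' "$f"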
 
Hello! The storage is mounted with the relatime option.
 
How did you move the backup snapshots to the new place?

One thing that comes to mind is that GC does not resolve symlinks, so if your indices are behind one, GC won't find them (and thus won't mark any chunks as needed).
 
Hi!
Yes, we used symlinks, but only on the old datastore: when the array ran out of space, the chunk directories c000 through ffff were redirected to the new ZFS array via symlinks. After creating the ZFS pool we disabled backups and tasks and used rsync to transfer all chunks to the new datastore (the chunks that had been reached via symlinks stayed where they were). The symlinks themselves remained, but only on the old datastore. I'm not sure how a symlink can affect creating a backup, or how it affects GC later on, but perhaps this is exactly our case. Could you describe this in more detail?
Thanks!
 
I don't completely understand what you did, but manually moving folders/creating symlinks/etc. is not really a supported setup.

Anyway, what I meant was that during garbage collection we iterate through the datastore looking for the indices (the 'snapshots') which reference the chunks.
During this iteration we do not resolve or follow symlinks, so if the indices are behind one, we don't mark their chunks as in use.

I don't know what side effects symlinking the chunk folders has, but I would not guarantee that it works as expected.
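As a quick sanity check, listing any symlinks below the datastore root makes this visible immediately (the path is just an example):

Code:
# ideally this prints nothing on a plain datastore
find /mnt/datastore/ds02 -type l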
 
OK, I'll try to explain (sorry, English is not my native language):
We were running out of space on the datastore; call it ds01. We are limited by the number of disks we can attach to the system, so it was decided to buy new, larger disks. On these disks we built a ZFS RAID-Z2 pool; call it ds02. Then, because space was running out, we created symlinks on the old datastore ds01 for the chunk directories c000 through ffff, pointing to ds02.

As an example:
Bash:
ln -s /ds02/c000..ffff /ds01/c000..ffff

At that time ds02 was not yet a full-fledged datastore, just an array attached to the system; it was not yet added to PBS. Some of the new backups were written through these symlinks onto the new storage. Next we stopped all jobs and backups and used rsync to transfer all chunks from ds01 to ds02. All symlinks stayed on ds01, and that datastore is currently disabled. We then connected ds02 as the main datastore. There are no symlinks on ds02; all the chunks we transferred with rsync are located there.
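Just for illustration (these are not the exact paths or flags we used), the transfer was an rsync along these lines:

Bash:
# copy the chunk store while preserving permissions, ownership and hard links;
# no backup or GC jobs were running while this was in progress
rsync -aH --info=progress2 /mnt/datastore/ds01/.chunks/ /mnt/datastore/ds02/.chunks/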
No GC ran at all during any of these steps.
Do I understand correctly that symlinks must not be used even during backups, and that maybe because of this the chain of our snapshots was broken somewhere?
Thanks!
 
No, that's not what I meant. Symlinking the chunk dirs *should* not break anything (although no guarantees there), but symlinking the snapshots would cause breakage during garbage collection (e.g. the vm/ct/host folders).

So if the snapshot dirs (vm/ct/ns/etc.) are not symlinks, is there anything special going on? ACLs/permissions?

What is the output of
Code:
ls -lhR /path/to/datastore
?
 
OK, I understand you; perhaps our case fits this as well.

Code:
ls -lhR /mnt/datastore/data-zfs/
/mnt/datastore/data-zfs/:
total 512
lrwxrwxrwx 1 root root 29 Nov  1 19:13 ns -> /mnt/ns_datastore/data-zfs/ns

We did this because ZFS is initialized later than the ns folder is mounted, and the directory itself is located on a fast SSD.
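For what it's worth, a symlink-free way to get the same ordering would presumably be a bind mount that is ordered after ZFS, roughly along these lines (after removing the symlink; paths as above, and the systemd ordering option is an assumption that would need verifying):

Code:
# /etc/fstab -- bind-mount the ns directory from the SSD into the datastore,
# but only once the ZFS pools have been mounted
/mnt/ns_datastore/data-zfs/ns  /mnt/datastore/data-zfs/ns  none  bind,x-systemd.requires=zfs-mount.service  0  0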
 
