Understanding PBS garbage notification

n8ur

I am trying to understand the message I get daily from PBS following garbage cleanup:

Datastore: proxmox-backup
Task ID: UPID:fluffles-pbs:000002A6:000009B8:0000004D:674E9050:garbage_collection:proxmox\x2dbackup:root@pam:
Index file count: 896
Removed garbage: 732.203 MiB
Removed chunks: 693
Removed bad chunks: 0
Leftover bad chunks: 0
Pending removals: 33.209 GiB (in 24965 chunks)
Original Data usage: 1.918 TiB
On-Disk usage: 0 B (0.00%)
On-Disk chunks: 0
Deduplication Factor: 1.00
Garbage collection successful.

In particular, what should I make of the original data usage of 1.918 TiB but on-disk usage of 0 B? I run twice-daily backups of my three nodes, those always report success for all VMs/containers, and I've done a "verify" in PBS of all the backups, which passes. Should I be concerned about the 0 B on-disk usage?

Thanks,
John
 
Hi,
what storage is used to back the datastore? Do you have symlinks or the like in the datastore's path?
 
Hi,
what storage is used to back the datastore? Do you have symlinks or the like in the datastore's path?
Thanks for the quick reply! PBS is running in a virtual machine on a Synology NAS and using the local btrfs volume for storage. As far as I know, there are no symlinks involved. I just looked directly in the storage directory, and there are about 35GB used, which is about what I would expect for the containers I am backing up.
 
Pending removals: 33.209 GiB (in 24965 chunks)
and there are about 35GB used
So I would guess that garbage collection does mark all of the chunks as pending for removal. Is there an active backup writer job? Chunks with an atime smaller than the start time of the oldest still-active backup writer instance are marked as pending, which in your case seems to be all of them. Do you have other processes that might have touched the files in between, updating their atime (e.g. an external tool reading the chunks)?

Further, check the atime setting of your filesystem, as garbage collection marks chunks as in use by updating their chunk file's atime. See also https://pbs.proxmox.com/docs/maintenance.html#garbage-collection
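
For example, to see which atime-related mount options are in effect for the filesystem backing the datastore (the path below is a placeholder for your datastore path):

# show mount point, filesystem type and mount options for the filesystem
# containing the datastore; look for atime/relatime/noatime under OPTIONS
findmnt -T /path/to/your/datastore -o TARGET,FSTYPE,OPTIONS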
 
So I would guess that garbage collection does mark all of the chunks as pending for removal. Is there an active backup writer job? Chunks with an atime smaller than the start time of the oldest still-active backup writer instance are marked as pending, which in your case seems to be all of them. Do you have other processes that might have touched the files in between, updating their atime (e.g. an external tool reading the chunks)?

Further, check the atime setting of your filesystem, as garbage collection marks chunks as in use by updating their chunk file's atime. See also https://pbs.proxmox.com/docs/maintenance.html#garbage-collection
Thanks, Chris! So, a couple of interesting things from this -- I am doing a nightly backup of the PBS storage pool via rsnapshot from another machine (a backup of the backup!), so that might be touching the data. As an experiment, I have just disabled that backup to see if it makes a difference. I will report back on the results of that tomorrow.

The filesystem is btrfs, and the options for "recording access time" are daily, monthly, or never. It's currently set to "never". I don't know if changing it to daily would make a worthwhile difference, but I can try that as well.
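
As a rough check (the datastore path below is a placeholder), I could note the access time of one sample chunk before and after the nightly rsnapshot run and see whether it changes:

# pick one chunk file as a sample
CHUNK=$(find /path/to/datastore/.chunks -type f | head -n 1)
# record its current access time
stat -c 'before: %x  %n' "$CHUNK"
# ... let the nightly rsnapshot run, then check the same chunk again ...
stat -c 'after:  %x  %n' "$CHUNK"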
 
The filesystem is btrfs, and the options for "recording access time" are daily, monthly, or never. It's currently set to "never". I don't know if changing it to daily would make a worthwhile difference, but I can try that as well.
This has to be set to daily, otherwise you risk losing data chunks that are still in use by some of your backup snapshots! This is a requirement for a filesystem backing a PBS datastore.

I am doing a nightly backup of the PBS storage pool via rsnapshot from another machine (a backup of the backup!), so that might be touching the data. As an experiment, I have just disabled that backup to see if it makes a difference. I will report back on the results of that tomorrow.
That will indeed interfere! This is also dangerous in the sense that it might sync inconsistent states to your remote if there are concurrent operations ongoing on the datastore. PBS sync jobs are the recommended way for offsite backups, see https://pbs.proxmox.com/docs/managing-remotes.html#sync-jobs.

Nevertheless, this is somewhat unexpected if atime is deactivated on your storage altogether. So there might still be something else at play.
 
I"ve set the recording access time back to daily so we'll see how that changes things. Thanks so much for your help!
 
I've now been running for several days with the rsnapshot backup disabled (so nothing touching the PBS filestore) and access time recording set to daily in the Synology storage manager. Results don't seem to have changed, though:

Datastore: proxmox-backup
Task ID: UPID:fluffles-pbs:0000027C:000003F5:0000008B:675527D0:garbage_collection:proxmox\x2dbackup:root@pam:
Index file count: 962
Removed garbage: 719.948 MiB
Removed chunks: 652
Removed bad chunks: 0
Leftover bad chunks: 0
Pending removals: 36.062 GiB (in 27709 chunks)
Original Data usage: 2.088 TiB
On-Disk usage: 0 B (0.00%)
On-Disk chunks: 0
Deduplication Factor: 1.00
Garbage collection successful.

Still no bad chunks removed and still 0 B of "On-Disk usage". Any further suggestions?

Thanks,
John
 
Can you verify that the atime of the chunks is updated by the garbage collection? Running find /<your-datastore-path>/.chunks -type f -amin -1440 | wc -l will give you the number of chunks with an atime less than a day old, and find /<your-datastore-path>/.chunks -type f -amin +1440 | wc -l the number with an atime older than a day.

While looking at the code more closely: the chunks being exclusively accounted as pending would indicate that there is an oldest writer instance, which reduces the minimum atime for chunks to be considered removable. Could you check whether a backup job is running at the same time as garbage collection?
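
For a quick check from the command line (output and options may vary slightly between versions), you can list the tasks the server knows about and look at the per-datastore runtime state:

# list tasks on the backup server; look for backup writers that are still
# running when the garbage collection starts
proxmox-backup-manager task list
# active readers/writers per datastore are tracked in this runtime file
cat /run/proxmox-backup/active-operations/<your-datastore-name>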

Edit: Fixed typo.
 
I am running into the same problems. My setup is:

* A Linux server has a ZFS pool with a filesystem
* The filesystem is exported over NFS
* PBS is installed on a PVE server that mounts the NFS filesystem using the interface on the PVE server
* PBS has a datastore on that NFS filesystem

I have created a fresh datastore and backed up one container onto it. If I run garbage collection, all the contents are deleted.

If I recreate the backup and then run the first find command, both on the NFS server and on the PBS server, I get 0 results. If I run the second command, I get the same count as running find filtering only for files, i.e. all the chunks.

If I stat one of the chunk files, the access time is the Unix epoch in 1970.

For me this problem appears to have started when I upgraded the NFS server from Ubuntu 22 to Ubuntu 24. I have checked, and ZFS for the pool has atime=on and relatime=off.


Regardless of my problem: for reliability, I would strongly suggest that the backup or GC procedure, or at least the datastore creation process, involve a step that tests whether atime actually works, so that garbage collection is not a footgun waiting to go off. Also, perhaps avoid deleting chunks with an atime in 1970, as that clearly means something is off.
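
A minimal manual test along those lines might look like this (throwaway file, placeholder mount path):

# create a small test file on the datastore mount and force an old access time
echo test > /mnt/pbs-datastore/atime-test
touch -a -d '2001-01-01 00:00:00' /mnt/pbs-datastore/atime-test
# read it, which should bump the atime even under relatime
cat /mnt/pbs-datastore/atime-test > /dev/null
# if this still shows 2001, atime updates are not working on this storage
stat -c 'atime: %x' /mnt/pbs-datastore/atime-test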
 
For me this problem appears to have started when I upgraded the NFS server from ubuntu 22 to ubuntu 24. I have checked and zfs for the pool has atime=on and relatime=off.
What about the dataset that is shared via NFS: does it maybe have atime and relatime set individually?

Regardless of my problem, for reliability I would strongly suggest the backup or gc procedure or at least the storage creation process involve a step that tests if atime actually works, so that the garbage collection step is not a footgun waiting to go off. Also, perhaps avoid deleting chunks created in 1970 as clearly that means something is off.
Thanks for the suggestion, I opened an issue for this here.
 
As far as I can tell, I did check the dataset:

zfs get all largedata | grep time
largedata atime on default
largedata relatime off local

I'm pretty sure the problem is related to the NFS server because when I read the chunk file on the NFS client (the PBS backup server) the atime does not update, but when I read it on the NFS server it does update.

I also found that the kernel NFS server, which apparently is used on this version of Ubuntu, doesn't really support atime because, according to them, it is impossible due to caching; similarly, on the client side all the atime options are ignored.
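
For reference, the comparison boils down to something like this (paths and the chunk name are placeholders from my setup):

# on the NFS client (the PBS host): read one chunk, then check its atime
cat /mnt/nfs-datastore/.chunks/0000/<some-chunk> > /dev/null
stat -c 'client view: %x' /mnt/nfs-datastore/.chunks/0000/<some-chunk>
# on the NFS server: check the same file directly on the ZFS dataset
stat -c 'server view: %x' /largedata/<datastore-dir>/.chunks/0000/<some-chunk>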
 
Can you verify that the atime of the chunks is updated by the garbage collection? Running find /<your-datastore-path>/.chunks -type f -amin -1440 | wc -l will give you the number of chunks with an atime less than a day old, and find /<your-datastore-path>/.chunks -type f -amin +1440 | wc -l the number with an atime older than a day.

While looking at the code more closely: the chunks being exclusively accounted as pending would indicate that there is an oldest writer instance, which reduces the minimum atime for chunks to be considered removable. Could you check whether a backup job is running at the same time as garbage collection?

Edit: Fixed typo.
Output of the first command: 29587
Output of the second command: 0

The backup job runs from PVE twice daily: "2,14:30" and garbage collection on PBS runs at midnight.

EDIT: Given the input from Tinus P, I should note that the PBS datastore is an NFS share on the host. So if it's an NFS thing, we could be seeing the same issue. (Also a suggestion -- looking around in the PBS GUI, I wasn't able to find anywhere that told me the datastore was in fact NFS. Should that be apparent somewhere?)
 
I also found that the kernel NFS server, which apparently is used on this version of Ubuntu, doesn't really support atime because, according to them, it is impossible due to caching; similarly, on the client side all the atime options are ignored.
Could you share a reference/link for this? But yes, if the atime is not honored, then that storage is disqualified from being used for a datastore.
 
Output of the first command: 29587
Output of the second command: 0
That output would indicate, however, that all of your chunks have an atime less than the cutoff time, so either all of the chunks are still referenced by backup snapshots or some other process is accessing the chunks.

Given the input from Tinus P, I should note that the PBS datastore is an NFS share on the host. So if it's an NFS thing, we could be seeing the same issue. (Also a suggestion -- looking around in the PBS gui, I wasn't able to find anywhere that told me the datastore was in fact NFS. Should that be apparent somewhere?)
Given that you do get an atime update, this is very likely not the same issue. You see the exact opposite behavior: all chunks being held back, in contrast to none of the chunks being held back.
Also, the datastore is agnostic to the storage it resides on, and setting up the NFS share, if it is to be used as a datastore, is the responsibility of the administrator, so adding such information would only ever make sense if the NFS setup itself were a supported feature in the WebUI. Fast local storage is still the recommended setup for a datastore.

Could you try and see what cat /run/proxmox-backup/active-operations/<your-datastore-name> gives? Also, please try to restart the proxmox backup related services by running systemctl restart proxmox-backup-proxy.service proxmox-backup.service.
 
That output would indicate, however, that all of your chunks have an atime less than the cutoff time, so either all of the chunks are still referenced by backup snapshots or some other process is accessing the chunks.

One way to test this would be to set up another datastore (on NFS or, preferably, local) and create a sync job to the new datastore. Afterwards, run a prune job on the new datastore and remove some other snapshots in it by hand (the original data is still safe on the old datastore). In that case I would expect that, after the next two garbage collection runs, chunks in the new datastore will be removed.
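
A rough sketch of that on the command line (names, paths, schedule, and the password are placeholders; the exact parameters should be double-checked against the sync job documentation linked earlier, and the job can just as well be created in the GUI):

# create a second, local datastore to sync into
proxmox-backup-manager datastore create testsync /backup/testsync
# add the PBS instance itself as a remote so a sync job can pull from it
# (a --fingerprint option may also be needed for self-signed certificates)
proxmox-backup-manager remote create local-pbs --host localhost --auth-id root@pam --password 'SECRET'
# pull the existing "proxmox-backup" datastore into "testsync" once a day
proxmox-backup-manager sync-job create test-sync --remote local-pbs --remote-store proxmox-backup --store testsync --schedule daily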
 
That output would indicate, however, that all of your chunks have an atime less than the cutoff time, so either all of the chunks are still referenced by backup snapshots or some other process is accessing the chunks.


Given that you do get an atime update, this is very likely not the same issue. You see the exact opposite behavior: all chunks being held back, in contrast to none of the chunks being held back.
Also, the datastore is agnostic to the storage it resides on, and setting up the NFS share, if it is to be used as a datastore, is the responsibility of the administrator, so adding such information would only ever make sense if the NFS setup itself were a supported feature in the WebUI. Fast local storage is still the recommended setup for a datastore.

Could you try and see what cat /run/proxmox-backup/active-operations/<your-datastore-name> gives? Also, please try to restart the proxmox backup related services by running systemctl restart proxmox-backup-proxy.service proxmox-backup.service.
Looking in /run/proxmox-backup/active-operations/proxmox-backup I see:
[{"pid":636,"starttime":1013,"active_operations":{"read":0,"write":0}}]

I've restarted the service and now it's:
[{"pid":6809,"starttime":51318952,"active_operations":{"read":0,"write":0}}]

As I mentioned earlier, I'm running PBS in a VM on a Synology NAS. If using NFS to access the main datastore is not appropriate, I guess I can attach another disk to the VM to use as the datastore.
 
Could you share a reference/link for this? But yes, if the atime is not honored, then that storage is disqualified from being used for a datastore.
The client side is here:

https://www.man7.org/linux/man-pages/man5/nfs.5.html

which, the way I read it, is really unfortunate because, apart from testing, the client has no way of knowing whether the server is going to do atime updates.

I can no longer find where I read that the server doesn't do atime.


If an NFS mount is not suitable as storage, what is the recommended option if you want to store the chunk files on a remote Linux server that is not the PBS server? I would really hesitate to do FUSE mounts or something like that, and I doubt the performance would be acceptable with the .chunks directory of 65536 subdirectories.
 
which, the way I read it, is really unfortunate because, apart from testing, the client has no way of knowing whether the server is going to do atime updates.
This only says that the atime is not persisted to the server side instantaneously, but is cached by the client for improved performance, in contrast to mtime and ctime. It does not state that the atime is never persisted on the server side.

If an NFS mount is not suitable as storage, what is the recommended option if you want to store the chunk files on a remote Linux server that is not the PBS server?
While not recommended, it is perfectly fine to use NFS to back your datastore, as long as you can live with the known possible limitations, especially with respect to performance. AFAIK there are a lot of setups working just fine with NFS within the given constraints.
 
As I mentioned earlier, I'm running PBS in a VM on a Synology NAS. If using NFS to access the main datastore is not appropriate, I guess I can attach another disk to the VM to use as the datastore.
Yes, please try using a local storage, that could help narrow down what the problem is.
 
