[SOLVED] Garbage collection phase1 does not mark (m)any chunks

pallingerpeter

Currently, garbage collection tries to delete most of my chunks. I have about 4M chunks, at least half of which should still be in use, and the GC log goes like this:

Code:
2024-09-27T17:01:19+02:00: starting garbage collection on store backup-cephfs
2024-09-27T17:01:19+02:00: Start GC phase1 (mark used chunks)
2024-09-27T17:09:29+02:00: Start GC phase2 (sweep unused chunks)
2024-09-27T17:10:08+02:00: processed 1% (46618 chunks)
At this point I stopped it manually, then restored the chunks from the synchronised server. 40k * 100 ≈ 4M, so it was going to delete most of my chunks. It had already done so twice before, and I had to manually rsync the chunks back from the replica server.

As far as I understand, GC phase1 should update the access times for referenced chunks so phase2 can delete the chunks with old (>24h+5min) access times.
The problem is that no file access times are updated.
As far as I can tell, the number of chunks with fresh access times does not change during phase1.
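
One way to check this (a rough sketch; the path is my datastore mount and the 60-minute window is arbitrary, so adjust both):

Code:
# count chunks whose atime was updated in the last 60 minutes, limited to a
# few prefix directories so it does not scan all 4M chunks; run once before
# starting GC and again during/after phase1, then compare the counts
find /mnt/backup-cephfs/.chunks/00* -type f -amin -60 | wc -l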

The chunks are on a cephfs mount. The *.fidx files are on a different (ext4-on-ceph-rbd) mount. I understand that cephfs has an outstanding feature request (https://tracker.ceph.com/issues/43337) for correctly handling atime, and I confirmed that simply reading a file does not update its atime. However, manually updating a chunk's atime still works:

Code:
# F=.chunks/0000/0000......... ; ls -lu $F ; cat $F >/dev/null; ls -lu $F ; touch -a $F ; ls -lu $F
-rw-r--r-- 2 backup backup 2605720 Sep 11 09:10 .chunks/0000/0000....
-rw-r--r-- 2 backup backup 2605720 Sep 11 09:10 .chunks/0000/0000....
-rw-r--r-- 2 backup backup 2605720 Sep 28 13:52 .chunks/0000/0000....

I looked into the code and it seems to use the utimensat libc call, so the cephfs atime bug should not be a concern (however, I do not really know Rust, so maybe I interpreted something wrong).
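
In case someone wants to double-check that the manual test above exercises the same syscall, strace should show it (assuming GNU coreutils touch, which as far as I know also ends up in utimensat):

Code:
# trace only the utimensat syscall while bumping a chunk's atime; a line like
# utimensat(...) = 0 would confirm that the explicit-atime path works on
# cephfs, independent of the atime-on-read behaviour
F=.chunks/0000/0000.........
strace -e trace=utimensat touch -a "$F"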

My main question is: how can I effectively debug the GC procedure to find out what the problem is?
  • Maybe it does not scan most images (the phase1 time is certainly short, but it should be possible to read the ~2GB of image files in roughly 9 minutes, so speed alone does not prove it)
    • Could you make it print which image's chunks are being scanned?
  • Maybe it cannot or does not try to update the atimes
    • Could you make it print which chunks' atimes are being set? (A possible strace-based workaround for both questions is sketched below.)
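
A possible workaround without patching anything might be to strace the running GC worker and watch which index files it opens and which chunks it touches. I assume the GC task runs inside the proxmox-backup-proxy process; the PID below is a placeholder:

Code:
# attach to the worker running the GC task and log which index files are
# opened and which chunks get an explicit atime update (Ctrl-C to detach)
strace -f -p <worker PID> -e trace=openat,utimensat -o /tmp/gc-trace.log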

Thanks in advance!
 
Update: I tried moving the .fidx files back to the cephfs mount, and running a GC.

It runs differently:
Code:
2024-09-30T09:22:15+02:00: starting garbage collection on store backup-cephfs
2024-09-30T09:22:15+02:00: Start GC phase1 (mark used chunks)
2024-09-30T10:08:09+02:00: WARN: warning: unable to access non-existent chunk 4b38e5765f6615a2c33615ab7a264d4bc12986f358315a3f1bf13085ce7fa88f, required by "/mnt/backup-cephfs/ns/backup/vm/405/2024-02-21T15:40:53Z/drive-virtio0.img.fidx"
...
2024-09-30T10:08:42+02:00: WARN: warning: unable to access non-existent chunk 4d5dbc9c991b1648a6fa3cccf8e24b7c752e97a2f754d6520979ad116fb5436a, required by "/mnt/backup-cephfs/ns/backup/vm/405/2024-02-21T15:40:53Z/drive-virtio0.img.fidx"
2024-09-30T10:11:41+02:00: marked 1% (23 of 2273 index files)
....

It takes a lot more time, and the number of files with an old atime seems to decrease slowly (I measure it with find /mnt/backup-cephfs/.chunks/00* -atime +1 -type f | wc, since I obviously do not want to scan all chunks periodically). It is still running and may take up to a whole day.

The preliminary conclusion seems to be: garbage collection does not work properly if the fidx files are not on the same filesystem as the chunks. I will write again when the GC process finishes.
 
Hi there. I recently experienced a very similar issue, with GC effectively skipping phase 1 and marking everything as unused. The reason this was happening was that my "ns" directory was behind a symlink. As soon as I stopped using symlinks for the files and directories representing snapshots (pretty much everything alongside the ".chunks" directory), the GC started to behave.
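
If you want to rule that out quickly, something like this should list any symlinks in or just below the datastore root (the path is only a placeholder, use your own datastore directory):

Code:
# list symlinks directly in the datastore root and one level below it
find /path/to/datastore -maxdepth 2 -type l -ls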

Check if this helps you: https://forum.proxmox.com/threads/garbage-collection-skipped-first-phase.135931/post-602983
And here the longer version of my experience: https://forum.proxmox.com/threads/im-losing-backup-chunks.155575/

Hope this helps
 