Currently, garbage collection tries to delete most of my chunks. I have about 4M chunks, at least half of which should still be in use, and the GC log goes like this:

Code:
2024-09-27T17:01:19+02:00: starting garbage collection on store backup-cephfs
2024-09-27T17:01:19+02:00: Start GC phase1 (mark used chunks)
2024-09-27T17:09:29+02:00: Start GC phase2 (sweep unused chunks)
2024-09-27T17:10:08+02:00: processed 1% (46618 chunks)

At this point I stopped it manually, then restored the chunks from the synchronised server. 40k*100 ~ 4M, so it was going to delete most of my chunks. It had already done so twice before, and I had to manually rsync the chunks back from the replica server.
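(The restore was essentially a plain rsync of the chunk store from the replica; the host name and datastore path below are placeholders, roughly:)

Code:
# pull chunks back from the replica; skip chunks that are already present locally
rsync -a --ignore-existing replica:/mnt/datastore/backup-cephfs/.chunks/ /mnt/datastore/backup-cephfs/.chunks/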
As far as I understand, GC phase1 should update the access times of all referenced chunks, so that phase2 can then delete the chunks whose access times are old (>24h+5min).
The problem is that no file access times are updated: as far as I can tell, the number of chunks with a fresh access time does not change during phase1.
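(A quick way to watch for fresh atimes is something like this; the datastore path is a placeholder for my actual mount point, and the count should keep growing while phase1 runs:)

Code:
# count chunk files accessed in the last hour
find /mnt/datastore/backup-cephfs/.chunks -type f -amin -60 | wc -l
# or, relative to a marker file touched just before starting GC
touch /tmp/gc-start
find /mnt/datastore/backup-cephfs/.chunks -type f -anewer /tmp/gc-start | wc -l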
The chunks are on a cephfs mount; the *.fidx files are on a different (ext4-on-ceph-rbd) mount. I understand that cephfs has an outstanding feature request (https://tracker.ceph.com/issues/43337) for correctly handling atime, and I confirmed that reading a file does not update its atime. However, manually updating chunk times still works:
Code:
# F=.chunks/0000/0000......... ; ls -lu $F ; cat $F >/dev/null; ls -lu $F ; touch -a $F ; ls -lu $F
-rw-r--r-- 2 backup backup 2605720 Sep 11 09:10 .chunks/0000/0000....
-rw-r--r-- 2 backup backup 2605720 Sep 11 09:10 .chunks/0000/0000....
-rw-r--r-- 2 backup backup 2605720 Sep 28 13:52 .chunks/0000/0000....
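(One more thing that seems worth ruling out is a noatime/lazytime mount option on the chunk store; something like this shows what is in effect, with the mount point again being a placeholder:)

Code:
# show filesystem type and mount options for the chunk store mount
findmnt -no FSTYPE,OPTIONS /mnt/datastore/backup-cephfs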
I looked into the code and it seems it uses the utimensat libc call, so the cephfs atime bug should not be a concern (however, I do not really know Rust, so maybe I interpreted something wrong).

My main question is: how can I effectively debug the GC procedure to find out what the problem is?
- Maybe it does not scan most of the images (phase1 is certainly fast, but reading the ~2GB of image files in roughly 9 minutes seems possible)
- Can you make it print which image's chunks are being scanned?
- Maybe it cannot or does not try to update the atimes
- Can you make it print which chunks' atimes are being set? (As a workaround, see the strace sketch after this list.)
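In the meantime, I suppose the atime updates could be observed from the outside with strace while GC is running. This assumes the GC task runs inside the proxmox-backup-proxy daemon, which is my guess rather than something I verified:

Code:
# attach to the running daemon and log only the atime-update syscalls during phase1
strace -f -tt -e trace=utimensat -p "$(pidof proxmox-backup-proxy)" -o /tmp/gc-utimensat.log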
Thanks in advance!