Missing chunks again

Oct 9, 2025
For the second time in a couple of weeks, when we try to view our file restore options, we find some VMs are unable to load the file restore view. We look at the logs and see lots of messages to the effect of "error during PBS read: command error: reading file [chunk path here] failed: No such file or directory (os error 2)"

When I check the path to our datastore, those chunks are indeed missing. Our datastore is mounted via CIFS to a share on an EMC Data Domain, and as far as we know the Data Domain is not experiencing any other issues.

Can someone provide some guidance on how we could resolve this issue? Thanks!
 
For the second time in a couple of weeks, when we try to view our file restore options, we find some VMs are unable to load the file restore view. We look at the logs and see lots of messages to the effect of "error during PBS read: command error: reading file [chunk path here] failed: No such file or directory (os error 2)"
Was that snapshot you are trying to restore from ever verified successfully after your previous report [0]? What steps were taken after your initial issues? Further, do you have the gc-atime-safety-check enabled [1]? Also share the output of the last garbage collection task log.

[0] https://forum.proxmox.com/threads/i...-when-trying-to-do-file-level-restore.173573/
[1] https://pbs.proxmox.com/docs/storage.html#tuning
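
For reference, the current tuning options (including gc-atime-safety-check) can be checked and adjusted on the datastore itself. A rough sketch, assuming a hypothetical datastore named dd04 (adjust the name, and double-check the exact option syntax against the tuning docs linked above):

Code:
# show the current datastore configuration, including any tuning options set
proxmox-backup-manager datastore show dd04

# explicitly enable the access time safety check via the tuning property string
proxmox-backup-manager datastore update dd04 --tuning 'gc-atime-safety-check=true'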
 
Was that snapshot you are trying to restore from ever verified successfully after your previous report [0]? What steps were taken after your initial issues? Further, do you have the gc-atime-safety-check enabled [1]? Also share the output of the last garbage collection task log.

[0] https://forum.proxmox.com/threads/i...-when-trying-to-do-file-level-restore.173573/
[1] https://pbs.proxmox.com/docs/storage.html#tuning
Regarding that post I made before [0], I deleted those backups entirely, ran new full backups, and then verified them. They verified successfully and I was able to view file level restores at the time. Now I'm looking at backups after they've run for a week or so and running into a slightly different issue (not getting a connection error pop-up, but the list of files fails to load)

Here is what I see regarding the GC atime safety check:
(screenshot attached: 1761140778629.png)

I've attached the most recent garbage collection log here. Thanks!
 


Now I'm looking at backups after they've run for a week or so and running into a slightly different issue (not getting a connection error pop-up, but the list of files fails to load)
Hi. As Chris asked:
"Did that snapshot you are trying to restore from was ever verified successfully"?

I'm not asking about the first new backup (which you verified OK).
I'm asking about the subsequent snapshots (backups) - the ones from which you are trying to view file level restores.

Sorry, @Chris , for interfering ;-). I'll ask because I'm learning myself while trying to help others:

I find the garbage collection log posted by mhentrich strange at its end. This excerpt:

Code:
2025-10-18T20:18:47-05:00: Removed garbage: 739.837 GiB
2025-10-18T20:18:47-05:00: Removed chunks: 517741
2025-10-18T20:18:47-05:00: Pending removals: 50.072 GiB (in 27386 chunks)
2025-10-18T20:18:47-05:00: Original data usage: 11.299 TiB
2025-10-18T20:18:47-05:00: On-Disk usage: 0 B (0.00%)
2025-10-18T20:18:47-05:00: On-Disk chunks: 0
2025-10-18T20:18:47-05:00: Deduplication factor: 1.00

Isn't this On-Disk usage: 0 B (0.00%) and On-Disk chunks: 0 strange?
Does it mean that the occupied space is actually empty, i.e. that there is in fact no backup data here?
Or am I misunderstanding the GC task log?
 
Sorry, @Chris , for interfering ;-).
;) no worries

I've attached the most recent garbage collection log here. Thanks!
Thanks for the logs, there is indeed something off during garbage collection. Although the access time check passes and the chunks are being touched, as can be seen from the cache misses, most of the chunks are being removed for some reason. This in turn leads to your restores failing.

Please share the output of mount from the PBS host; for SMB mounts you should set the cache=strict parameter, see also https://forum.proxmox.com/threads/n...ls-at-99-time-safety-check.165764/post-768710

What is a bit strange in your case is that the atime check, which was introduced to protect against such unexpected behavior by the storage system, does pass.

Please execute the following commands from within your datastore location
Code:
stat .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8
strace -e t=utimensat -- touch -a .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8
stat .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8

Please post the full output of the commands.

If that particular chunk is not present, you can do the same on any other one found by find .chunks/ -type f -print.
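
If it is easier, that before/after comparison can also be scripted against an arbitrary chunk. A minimal sketch, assuming it is run from the datastore root and that GNU coreutils stat is available:

Code:
# pick any one chunk file and test whether touch -a actually advances its atime
CHUNK=$(find .chunks/ -type f | head -n 1)
BEFORE=$(stat -c %X "$CHUNK")   # atime in seconds since the epoch
sleep 1
touch -a "$CHUNK"
AFTER=$(stat -c %X "$CHUNK")
echo "atime before: $BEFORE  after: $AFTER"
if [ "$AFTER" -gt "$BEFORE" ]; then echo "atime update works"; else echo "atime NOT updated"; fi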
 
;) no worries


Thanks for the logs, there is indeed something off during garbage collection. Although the access time check passes and the chunks are being touched, as can be seen from the cache misses, most of the chunks are being removed for some reason. This in turn leads to your restores failing.

Please share the output of mount from the PBS host; for SMB mounts you should set the cache=strict parameter, see also https://forum.proxmox.com/threads/n...ls-at-99-time-safety-check.165764/post-768710

What is a bit strange in your case is that the atime check, which was introduced to protect against such unexpected behavior by the storage system, does pass.

Please execute the following commands from within your datastore location
Code:
stat .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8
strace -e t=utimensat -- touch -a .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8
stat .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8

Please post the full output of the commands.

If that particular chunk is not present, you can do the same on any other one found by find .chunks/ -type f -print.

The cache setting appears correct:

Code:
root@pbs:/mnt/dd04# cat /proc/mounts | grep cifs
//10.X.X.X/t3315-pbs /mnt/dd04 cifs rw,relatime,vers=2.1,cache=strict,upcall_target=app,username=XXXX,uid=34,forceuid,gid=34,forcegid,addr=10.240.0.40,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,reparse=nfs,nativesocket,symlink=native,rsize=1048576,wsize=1048576,bsize=1048576,retrans=1,echo_interval=60,actimeo=1,closetimeo=1 0 0

Here is the output of those commands:

Code:
root@pbs:/mnt/dd04# stat .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8
  File: .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8
  Size: 159             Blocks: 1          IO Block: 1048576 regular file
Device: 0,36    Inode: 2205775     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (   34/  backup)   Gid: (   34/  backup)
Access: 2025-10-19 01:00:04.613078000 -0500
Modify: 2025-10-19 01:00:04.613078000 -0500
Change: 2025-10-24 03:34:14.356938000 -0500
 Birth: 2025-10-19 01:00:04.544030000 -0500

Code:
root@pbs:/mnt/dd04# strace -e t=utimensat -- touch -a .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8

utimensat(0, NULL, [UTIME_NOW, UTIME_OMIT], 0) = 0
+++ exited with 0 +++

Code:
root@pbs:/mnt/dd04# stat .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8

  File: .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8
  Size: 159             Blocks: 1          IO Block: 1048576 regular file
Device: 0,36    Inode: 2205775     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (   34/  backup)   Gid: (   34/  backup)
Access: 2025-10-19 01:00:04.613078000 -0500
Modify: 2025-10-19 01:00:04.613078000 -0500
Change: 2025-10-24 12:19:34.226857000 -0500
 Birth: 2025-10-19 01:00:04.544030000 -0500

Let me know what else you'd like me to check!
 
So it seems that your storage does not update the access time on utimensat calls, which is crucial for garbage collection to work [0]. What is unexpected is that the access time update safety check, introduced exactly to detect such storages and refuse to run garbage collection in that case, does seem to pass on your storage. Did your GC ever fail for this reason?

Check if you have some tuning knobs available on the EMC Data Domain to explicitly enable/disable access time updates. Further, instead of actimeo=1 try disabling attribute caching by setting actimeo=0.

[0] https://pbs.proxmox.com/docs/maintenance.html#gc-background
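
As a sketch of what that remount could look like, based on the mount options from your earlier output (the share, mount point and credentials file are placeholders; adjust to your environment and make sure no backup, verify or GC job is running while remounting):

Code:
# unmount the datastore, then remount the CIFS share with attribute caching disabled
umount /mnt/dd04
mount -t cifs //10.X.X.X/t3315-pbs /mnt/dd04 \
  -o vers=2.1,cache=strict,actimeo=0,uid=34,gid=34,file_mode=0755,dir_mode=0755,credentials=/etc/pbs-dd04.cred
# /etc/pbs-dd04.cred is a placeholder for a file holding the share's username/password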
 
FWIW, a couple of weeks ago I tried to use a Dell DD3300, quite similar to the OP's EMC Data Domain storage, and it did not update access time via either NFS or CIFS. In my case, PBS 4 did show an error during datastore creation and refused to create the datastore. When testing with stat + touch -a, there were no errors at all, but the access time was not updated regardless of mount options. Internet searches essentially say that those Data Domain storages are not fully POSIX compliant, hence the behavior I saw. I read some of the official documentation and did not find a statement on whether or not access time updates via NFS or CIFS are supported. I've asked Dell about this, but no reply yet.
 
@VictorSTS and @Chris - what happens if we disable garbage collection? If I'm understanding the problem correctly, because the timestamps are failing to update, garbage collection is deleting chunks that the backup chain still needs. If we disabled GC, what would be the long-term effect?
 
what would be the long-term effect?
You won't be able to recover space from expired backups and your datastore will eventually become full. GC must work for PBS to behave as it is designed.

@Chris , would it be possible to implement an alternate GC that uses modify or creation timestamp instead of access timestamp? In my tests with the DD3300, modify or creation timestamp did update correctly and would also update access time.
 
You won't be able to recover space from expired backups and your datastore will eventually become full. GC must work for PBS to behave as it is designed.

@Chris , would it be possible to implement an alternate GC that uses modify or creation timestamp instead of access timestamp? In my tests with the DD3300, modify or creation timestamp did update correctly and would also update access time.
Gotcha - I must be misunderstanding; I thought prune jobs deleted expired backups?
 
Gotcha - I must be misunderstanding; I thought prune jobs deleted expired backups?
No :).

https://pbs.proxmox.com/docs/maintenance.html :

"Prune lets you specify which backup snapshots you want to keep, removing others.When pruning a snapshot, only the snapshot metadata (manifest, indices, blobs,log and notes) is removed. The chunks containing the actual backup data and previously referenced by the pruned snapshot, have to be removed by a garbage collection run. [...]"
 
Extending on the previous reply, what GC does is:
  • Phase 1: read the index of each backup snapshot, which contains the list of chunks used by that snapshot, and update the access time on each of those chunks. It does so for every backup snapshot in the datastore (there is only one GC job per datastore).
  • Phase 2: check each chunk's access time and, if it is older than 24 hours and 5 minutes, delete it -> if a chunk's access time was not updated, it means that Phase 1 did not touch it because the chunk is no longer referenced by any backup snapshot in the datastore, so it can be safely removed.
If the storage doesn't properly update access time, GC will wrongly remove still-in-use chunks.
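
To make phase 2 concrete, here is a read-only sketch from the shell (run from the datastore root): it only lists chunks whose access time is older than the cutoff, i.e. roughly what GC would treat as unreferenced, and deletes nothing:

Code:
# 24 hours + 5 minutes = 1445 minutes; these are the sweep candidates for GC
find .chunks/ -type f -amin +1445 -print | head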