Missing chunks again

Oct 9, 2025
For the second time in a couple of weeks, when we try to view our file restore options, we find some VMs are unable to load the file restore view. We look at the logs and see lots of messages to the effect of "error during PBS read: command error: reading file [chunk path here] failed: No such file or directory (os error 2)"

When I check the path to our datastore, those chunks are indeed missing. Our datastore is mounted via CIFS to a share on an EMC Data Domain, and as far as we know the Data Domain is not experiencing any other issues.
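For example, I spot-check a digest from the log directly on the mount with something like the following (datastore path and digest below are placeholders; the real path comes straight from the error message):

Code:
ls -l /path/to/datastore/.chunks/<first 4 hex chars of digest>/<full 64-char digest>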

Can someone provide some guidance on how we could resolve this issue? Thanks!
 
For the second time in a couple of weeks, when we try to view our file restore options, we find some VMs are unable to load the file restore view. We look at the logs and see lots of messages to the effect of "error during PBS read: command error: reading file [chunk path here] failed: No such file or directory (os error 2)"
Was the snapshot you are trying to restore from ever verified successfully after your previous report [0]? What steps were taken after your initial issues? Further, do you have the gc-atime-safety-check enabled [1]? Also, please share the log of your last garbage collection task.

[0] https://forum.proxmox.com/threads/i...-when-trying-to-do-file-level-restore.173573/
[1] https://pbs.proxmox.com/docs/storage.html#tuning
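If you prefer the CLI, the datastore tuning options can be inspected and set roughly like this (where <datastore> is a placeholder for your datastore name and the exact tuning option names are documented in [1]):

Code:
proxmox-backup-manager datastore show <datastore>
proxmox-backup-manager datastore update <datastore> --tuning 'gc-atime-safety-check=true'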
 
Was the snapshot you are trying to restore from ever verified successfully after your previous report [0]? What steps were taken after your initial issues? Further, do you have the gc-atime-safety-check enabled [1]? Also, please share the log of your last garbage collection task.

[0] https://forum.proxmox.com/threads/i...-when-trying-to-do-file-level-restore.173573/
[1] https://pbs.proxmox.com/docs/storage.html#tuning
Regarding that post I made before [0], I deleted those backups entirely, ran new full backups, and then verified them. They verified successfully and I was able to view file level restores at the time. Now I'm looking at backups after they've run for a week or so and running into a slightly different issue (I'm not getting a connection error pop-up; instead, the list of files fails to load).

Here is what I see regarding the GC atime safety check:
[screenshot: datastore tuning options showing the GC access-time safety check setting]

I've attached the most recent garbage collection log here. Thanks!
 


Now I'm looking at backups after they've run for a week or so and running into a slightly different issue (I'm not getting a connection error pop-up; instead, the list of files fails to load).
Hi. As Chris asked:
"Was the snapshot you are trying to restore from ever verified successfully?"

I'm not asking about the first new backup (which you verified OK).
I'm asking about the subsequent snapshots (backups), the ones from which you are now trying to view file level restores.

Sorry, @Chris, for interfering ;-). I'm asking because I'm learning myself while trying to help others:

The end of the garbage collection log posted by mhentrich looks strange to me.
This excerpt:

2025-10-18T20:18:47-05:00: Removed garbage: 739.837 GiB
2025-10-18T20:18:47-05:00: Removed chunks: 517741
2025-10-18T20:18:47-05:00: Pending removals: 50.072 GiB (in 27386 chunks)
2025-10-18T20:18:47-05:00: Original data usage: 11.299 TiB
2025-10-18T20:18:47-05:00: On-Disk usage: 0 B (0.00%)
2025-10-18T20:18:47-05:00: On-Disk chunks: 0
2025-10-18T20:18:47-05:00: Deduplication factor: 1.00

Isn't this On-Disk usage: 0 B (0.00%) and On-Disk chunks: 0 strange?
Does it mean that the occupied space is empty, i.e. that there is in fact no backup here?
Or am I misunderstanding the GC task log?
 
Sorry, @Chris, for interfering ;-).
;) no worries

I've attached the most recent garbage collection log here. Thanks!
Thanks for the logs, there is indeed something off during garbage collection. Although the access time check passes and the chunks are being touched, as can be seen from the cache misses, most of the chunks are being removed for some reason. This in turn leads to your restores failing.

Please share the output of mount from the PBS host. For SMB mounts you should set the cache=strict parameter, see also https://forum.proxmox.com/threads/n...ls-at-99-time-safety-check.165764/post-768710
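For example, either of these will show the effective mount options (adjust the pattern to your mount point if needed):

Code:
mount | grep cifs
grep cifs /proc/mounts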

What is a bit strange in your case is that the atime check, which was introduced to protect against such unexpected behavior by the storage system, does pass.

Please execute the following commands from within your datastore location:
Code:
stat .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8
strace -e t=utimensat -- touch -a .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8
stat .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8

Please post the full output of the commands.

If that particular chunk is not present, you can do the same on any other one found by find .chunks/ -type f -print.
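A quick way to run that same check against an arbitrary existing chunk would be something along these lines (it simply takes whatever chunk find returns first):

Code:
CHUNK=$(find .chunks/ -type f | head -n 1)   # pick any existing chunk file
stat --format='before: atime=%x' "$CHUNK"
touch -a "$CHUNK"
stat --format='after:  atime=%x' "$CHUNK"

If the two printed atime values are identical, the storage is not honoring the access time update.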
 
;) no worries


Thanks for the logs, there is indeed something off during garbage collection. Although the access time check passes and the chunks are being touched, as can be seen from the cache misses, most of the chunks are being removed for some reason. This in turn leads to your restores failing.

Please share the output of mount from the PBS host. For SMB mounts you should set the cache=strict parameter, see also https://forum.proxmox.com/threads/n...ls-at-99-time-safety-check.165764/post-768710

What is a bit strange in your case is that the atime check, which was introduced to protect against such unexpected behavior by the storage system, does pass.

Please execute the following commands from within your datastore location:
Code:
stat .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8
strace -e t=utimensat -- touch -a .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8
stat .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8

Please post the full output of the commands.

If that particular chunk is not present, you can do the same on any other one found by find .chunks/ -type f -print.

The cache setting appears correct:

Code:
root@pbs:/mnt/dd04# cat /proc/mounts | grep cifs
//10.X.X.X/t3315-pbs /mnt/dd04 cifs rw,relatime,vers=2.1,cache=strict,upcall_target=app,username=XXXX,uid=34,forceuid,gid=34,forcegid,addr=10.240.0.40,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,reparse=nfs,nativesocket,symlink=native,rsize=1048576,wsize=1048576,bsize=1048576,retrans=1,echo_interval=60,actimeo=1,closetimeo=1 0 0

Here is the output of those commands:

Code:
root@pbs:/mnt/dd04# stat .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8
  File: .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8
  Size: 159             Blocks: 1          IO Block: 1048576 regular file
Device: 0,36    Inode: 2205775     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (   34/  backup)   Gid: (   34/  backup)
Access: 2025-10-19 01:00:04.613078000 -0500
Modify: 2025-10-19 01:00:04.613078000 -0500
Change: 2025-10-24 03:34:14.356938000 -0500
 Birth: 2025-10-19 01:00:04.544030000 -0500

Code:
root@pbs:/mnt/dd04# strace -e t=utimensat -- touch -a .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8

utimensat(0, NULL, [UTIME_NOW, UTIME_OMIT], 0) = 0
+++ exited with 0 +++

Code:
root@pbs:/mnt/dd04# stat .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8

  File: .chunks/bb9f/bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8
  Size: 159             Blocks: 1          IO Block: 1048576 regular file
Device: 0,36    Inode: 2205775     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (   34/  backup)   Gid: (   34/  backup)
Access: 2025-10-19 01:00:04.613078000 -0500
Modify: 2025-10-19 01:00:04.613078000 -0500
Change: 2025-10-24 12:19:34.226857000 -0500
 Birth: 2025-10-19 01:00:04.544030000 -0500

Let me know what else you'd like me to check!
 
So it seems that your storage does not update the access time on utimensat calls, which is crucial for garbage collection to work [0]. What is unexpected, however, is that the access time update safety check, introduced exactly to detect such storages and refuse to run garbage collection in that case, does seem to pass on your storage. Did your GC ever fail for this reason?

Check whether the EMC Data Domain has any tuning knobs available to explicitly enable/disable access time updates. Further, instead of actimeo=1, try disabling attribute caching by setting actimeo=0.
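If the share is mounted via /etc/fstab, that would mean adjusting the options there and remounting, roughly like this (server, share, mount point and the credentials file are placeholders based on your earlier mount output; trim the options to what you actually need):

Code:
# /etc/fstab - one line; actimeo=0 disables attribute caching
//10.X.X.X/t3315-pbs /mnt/dd04 cifs vers=2.1,cache=strict,actimeo=0,closetimeo=1,uid=34,gid=34,credentials=/etc/pbs-dd04.cred 0 0

Then unmount and remount the share (umount /mnt/dd04 && mount /mnt/dd04) while no backup or GC task is running.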

[0] https://pbs.proxmox.com/docs/maintenance.html#gc-background
 