[SOLVED] Missing chunks, how to fix - update atime failed for chunk - ENOENT: No such file or directory

grin

Renowned Member
Dec 8, 2008
172
21
83
Hungary
grin.hu
Code:
2020-08-25T10:21:57+02:00: WARN: warning: unable to access chunk xx, required by "/mnt/datastore/pub/ct/666/2020-08-21T02:33:47Z/root.pxar.didx" - update atime failed for chunk "/mnt/datastore/pub/.chunks/fa39/xxx" - ENOENT: No such file or directory

This has been killing GC (which is fixed in 0.8.11-1, https://forum.proxmox.com/threads/gc-failed-with-task-error.73573/#post-333661), but even now that GC is able to run, it doesn't (and possibly cannot) clean this mess up.

First: there is no known underlying problem. The files were stored on a local filesystem (LVM+XFS), and there is no known issue with the hardware or the filesystem. The files are simply not there, and I see no reason for that. (Most of the backups are fine.)

The files all seem to be related to one specific CT, which is pretty big: its backup takes more than 20 hours and therefore (especially in the first few cases) may have been interrupted (through the UI), but it was never force-killed or otherwise mistreated.

There are more than 140,000 chunks missing. I do not really *need* those backups *now*, but let us suppose that I did; at the very least I would like to
  1. salvage all the data which is still alive,
  2. get rid of the error messages (and the related 150MB+ logfiles every day),
  3. make future backups work without errors and missing chunks, since it seems that *every future backup is broken as well*.
I am also open to finding out where the files have gone, but I have no starting point (beyond the rough count sketched below).
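
For reference, one rough way to get a handle on the damage is to pull the distinct missing-chunk digests out of the task logs. This is only a sketch: it assumes the default task log location (/var/log/proxmox-backup/tasks/) and the warning format shown above.
Bash:
# collect the distinct chunk digests mentioned in "unable to access chunk"
# warnings (assumes the default task log directory)
grep -rhoE 'unable to access chunk [0-9a-f]{64}' /var/log/proxmox-backup/tasks/ \
    | awk '{print $NF}' | sort -u > /tmp/missing-chunks.txt
wc -l /tmp/missing-chunks.txt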
 
Hi,

can you please provide some additional information? We'd really like to solve any such issue, but we cannot yet reproduce it here.

* what disks are used for the LVM? are they healthy?
* how big is the CT (rough estimate)
* what big files are in there, or are there just very many small ones?
* Can you check whether the following command prints any output:
Bash:
find /PATH/TO/DATASTORE -type d -not -user backup
(it should not output anything)
* Further, do you frequently run GC or verifies?
* Were the missing chunks already mentioned in the first GC or verify run log, or did they only start to appear later on (e.g., after an update)?

And most importantly: do new backups also immediately show issues if you trigger a verification on that single snapshot?

The sole place where we actively remove chunks is garbage collection, so it would have to be an issue with that.
But as it is kept rather simple, it could also be an issue with the underlying XFS+LVM and its atime behavior (just guessing).

I'll see if I can create a roughly similar setup and reproduce such issues with it.
 
* what disks are used for the LVM? are they healthy?​

Dell PERC 740p hardware RAID-6 array, 2 PVs, 1 VG/LV, 18TB in size. Everyone's healthy and happy.

* how big is the CT (rough estimate)​

1.81 TB, give or take a few gigs.

* what big files are in there, or are there just very many small ones?

It is mainly a mailserver with Maildir++ folders: a few dozen million files, ranging in size from a kilobyte up to dozens of megabytes.

* Can you check whether the following command prints any output:
Bash:
find /PATH/TO/DATASTORE -type d -not -user backup
(it should not output anything)​

All files are owned by backup.

* Further, do you frequently run GC or verifies?​

Daily, but this is a fairly new install (a week or so old), so I believe GC was never able to run (due to the bug fixed in 0.8.11-1); since it became able to (a few days ago), it runs daily and reports the errors. Verifies were only run manually and very occasionally.

* Were the missing chunks already mentioned in the first GC or verify run log, or did they only start to appear later on (e.g., after an update)?

This is a tough question, since the logs seem to be scattered across many directories and files... I'll try to trace it back, unless the logs have been deleted.

And most importantly: do new backups also immediately show issues if you trigger a verification on that single snapshot?

Yes, both verify and GC start complaining about missing chunks almost immediately.

The sole place where we actively remove chunks is garbage collection, so it would have to be an issue with that.

The system was started about a week ago, so no particularly ancient version has ever run on it. In the first tests I started GC and verify manually a few times, but I do not remember any errors. The large backup was interrupted a few times though; whether incremental or the base backup I cannot remember, but I'll try to scrape the logs.

But as it is kept rather simple, it could also be an issue with the underlying XFS+LVM and its atime behavior (just guessing).

But you still suspect that GC removed the files based on a wrong atime? It is mounted as:

Code:
/dev/mapper/pbs-datastore on /mnt/datastore type xfs (rw,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=512,swidth=2048,noquota)

and relatime should work.
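
For what it's worth, whether explicit atime updates go through on this mount can be sanity-checked by hand. This is only a sketch, using a throwaway test file on the datastore mount shown above:
Bash:
# an explicit atime update (which is what the GC error above is about)
# should succeed even with relatime; the test file is just a throwaway
f=/mnt/datastore/atime-test
echo test > "$f"
stat -c 'atime before: %x' "$f"
sleep 2
touch -a "$f"        # set the access time explicitly
stat -c 'atime after:  %x' "$f"
rm -f "$f"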
 
Dell PERC 740p hardware RAID-6 array, 2 PVs, 1 VG/LV, 18TB in size. Everyone's healthy and happy.

Hmm, I have experienced some HW RAID weirdness before, but data going missing immediately would be super weird/bad even for them.

Yes, both verify and GC start complaining about missing chunks almost immediately.

OK, that's definitely weird!

But you still suspect that GC removed the files based on a wrong atime? It is mounted as:

Code:
/dev/mapper/pbs-datastore on /mnt/datastore type xfs (rw,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=512,swidth=2048,noquota)

and relatime should work.

Yeah, we can cope with relatime.

So, as you mentioned that chunks seem to get lost immediately, I am now less unsure about GC being the culprit...

Do you have any other test storage you can use for a single backup of that CT plus an immediate verify? The simpler the storage, the better (maybe a single 2 TB disk or so). That could really help to rule some things out.
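
Something along these lines would do for such a test; just a sketch, with the repository, user, and source path as placeholders (none of them taken from this thread), and the verification then triggered on that single snapshot from the web UI:
Bash:
# one manual backup of the CT's root filesystem to a scratch datastore
# (repository/user/path are placeholders), then verify that one snapshot
proxmox-backup-client backup root.pxar:/path/to/ct/rootfs \
    --repository backupuser@pbs@test-host:scratch-store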
 
So, there was the successful GC on 08-19 00:02.

One minute before that there was a
Code:
2020-08-19T00:00:00+02:00: starting garbage collection on store pub
2020-08-19T00:00:00+02:00: task triggered by schedule 'daily'
2020-08-19T00:00:00+02:00: Start GC phase1 (mark used chunks)
2020-08-19T00:01:37+02:00: TASK ERROR: Unable to open dynamic index "/mnt/datastore/pub/ct/200/2020-08-18T05:21:42Z/root.pxar.didx" - No such file or directory (os error 2)
but that did not seem to bother the next one.

Then 08-19 went on and there were piles of good backups and some bad ones, bad meaning:
Code:
2020-08-19T00:26:13+02:00: created new dynamic index 1 ("ct/10107/2020-08-18T22:26:11Z/root.pxar.didx")
2020-08-19T00:26:13+02:00: created new dynamic index 2 ("ct/10107/2020-08-18T22:26:11Z/catalog.pcat1.didx")
2020-08-19T00:40:08+02:00: backup ended and finish failed: backup ended but finished flag is not set.
2020-08-19T00:40:08+02:00: removing unfinished backup
2020-08-19T00:40:08+02:00: TASK ERROR: backup ended but finished flag is not set.
or similar aborted incremental ones (mainly due to not having permission to --prune 1 on the Proxmox side).

Nothing that looks particularly nasty.
Then there were the daily prunes on 08-20 at 00:00.
Then another successful GC on 08-20 at 00:04, removing about 51,300 chunks.
Then backups the next day, then prune.

Then 08-21T00:03:
Code:
2020-08-21T00:00:00+02:00: starting garbage collection on store pub
2020-08-21T00:00:00+02:00: task triggered by schedule 'daily'
2020-08-21T00:00:00+02:00: Start GC phase1 (mark used chunks)
2020-08-21T00:03:13+02:00: TASK ERROR: unable to access chunk fa39eafd5ae27f5fec4caf87bd18b37fdd0825e39a10557d5dbe5ef874ed66f5, required by "/mnt/datastore/pub/ct/900/2020-08-19T02:31:19Z/root.pxar.didx" - update atime failed for chunk "/mnt/datastore/pub/.chunks/fa39/fa39eafd5ae27f5fec4caf87bd18b37fdd0825e39a10557d5dbe5ef874ed66f5" - ENOENT: No such file or directory

The latest backup of ct/900 was started on 08-19T04:31 (02:31 UTC) and finished successfully on 08-20T16:07.
 
Hmm, I have experienced some HW RAID weirdness before, but data going missing immediately would be super weird/bad even for them.

OK, that's definitely weird!
When you create a new, incremental backup and there is a chunk with hash H1 which is already known to the system, do you actually verify its existence on the disk, or just reference it (based on some database of known chunks)?

I suspect that H1 was in the old backup and its file is missing, and the new backup also references H1 without checking whether it actually exists on the storage or not.
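
For what it's worth, whether a given chunk really exists on the storage can be checked by hand; the digest and datastore layout below are the ones from the GC error earlier in this thread:
Bash:
# does the chunk file for this digest exist on disk?
# (digest and datastore path taken from the GC error above)
digest=fa39eafd5ae27f5fec4caf87bd18b37fdd0825e39a10557d5dbe5ef874ed66f5
ls -l "/mnt/datastore/pub/.chunks/${digest:0:4}/${digest}"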
 
When you create a new, incremental backup and there is a chunk with hash H1 which is already known to the system, do you actually verify its existence on the disk, or just reference it (based on some database of known chunks)?

Not yet; in a next version we'll include the verify result of the last backup snapshot used for the incremental backup, and avoid reusing it if that result is not OK.

I suspect that H1 was in the old backup and its file is missing, and the new backup also references H1 without checking whether it actually exists on the storage or not.

Can you try pruning all backup snapshots of that group, and then retry a new backup and verify that one?
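
If it helps, removing the snapshots of a group can be done from the web UI or with the client; a rough sketch (the repository is a placeholder, the snapshot name is the one from the first post, and this needs repeating for each remaining snapshot of the group):
Bash:
# drop one snapshot of the affected group (repository is a placeholder,
# snapshot name taken from the first post); repeat for the rest of ct/666,
# then run a fresh backup and verify it
proxmox-backup-client forget ct/666/2020-08-21T02:33:47Z \
    --repository backupuser@pbs@pbs-host:pub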
 
Can you try pruning all backup snapshots of that group, and then retry a new backup and verify that one?
Sure, and I'm pretty sure it will be fixed then.
That's the only way right now to fix it anyway....

Hmm, I am not sure this GC looks very promising though:
Code:
2020-09-01T21:32:46+02:00: Removed bytes: 8824296342
2020-09-01T21:32:46+02:00: Removed chunks: 4177
2020-09-01T21:32:46+02:00: Pending removals: 1023347984441 bytes (375680 chunks)
2020-09-01T21:32:46+02:00: Original data bytes: 23281273825085
2020-09-01T21:32:46+02:00: Disk bytes: 1209227755943 (5 %)
2020-09-01T21:32:46+02:00: Disk chunks: 468212
2020-09-01T21:32:46+02:00: Average chunk size: 2582650

So if I understand correctly, these have an atime that is too recent, and they are...? Are they removed from the list of valid chunks? Will they only be removed by the next GC (which cannot remove them, because the new backup will use them unless they have already been invalidated)?
 
So if I understand correctly, these have an atime that is too recent, and they are...? Are they removed from the list of valid chunks? Will they only be removed by the next GC (which cannot remove them, because the new backup will use them unless they have already been invalidated)?

Taken from the reference docs (https://pbs.proxmox.com/docs/administration-guide.html#id2):
The garbage collection works in two phases. In the first phase, all data blocks that are still in use are marked. In the second phase, unused data blocks are removed.

but that was not the issue here:
when doing a backup, the client gets the index of the last successful backup, which contains a list of chunks the server already has (which was not actually true in your case),
and those chunks are never sent to the server again (as an optimization).

So if you prune all snapshots (regardless of GC or not), a new backup has no 'known chunks' list and sends all chunks again.

There are some patches on the list/in the works to improve the handling of corrupted/missing chunks
(e.g., during a verify, snapshots with missing/corrupted chunks get marked and the client does not reuse their chunk lists).
 
So, I have pruned all the backups of the specific container, and a whole new backup was created.
I started a verification. It ran for a few hours, then started to find errors again, though far fewer in number.
I really wonder what deletes these chunks, and when, since something seems to consistently destroy them.
It seems to be unreliable for this CT (the backup takes 24+ hours now, but it looks successful in the logs).
 
The 24+ hour duration could be a cause of problems here, I mean, at least if a GC job also runs during it?

Yes, and it usually fails first:
2020-09-07T00:00:32+02:00: TASK ERROR: Unable to open dynamic index "/mnt/datastore/pub/ct/106/2020-08-20T22:22:02Z/root.pxar.didx" - No such file or directory (os error 2)

Then it comes again a minute later and reports a lot of missing chunks. Probably killing them off like an angry zombie army.

Stefan is working on improving the situation regarding GCs and some other cases where verification finds corrupt chunks: https://lists.proxmox.com/pipermail/pbs-devel/2020-September/000569.html
Nice. Is there some release announcement thread where people don't talk too much? ;-)
 
Nice. Is there some release announcement thread where people don't talk too much? ;-)

No, at least not yet. The simplest thing you can do is a daily dose of:
Bash:
apt update
apt full-upgrade
(or the web interface equivalent of this) and then read the changelog with apt changelog proxmox-backup-server :)
 
Is pruning the only way to force new chunks to be sent?
I normally back up to PBS using snapshot mode. I tried with the VM stopped, but it still resulted in errors...
I am consistently getting a missing chunk error:

Code:
can't verify chunk, load failed - store 'store3', unable to load chunk '78bf415f553f2c2704c65302a156e6a993710bc0be719d678243056d2e3c1003' - blob too small (0 bytes).
2020-09-07T15:01:21-04:00:   verified 35549.01/101728.00 Mib in 714.20 seconds, speed 49.77/142.44 Mib/s (1 errors)
2020-09-07T15:01:21-04:00: verify store3:vm/107/2020-09-07T14:45:31Z/drive-scsi0.img.fidx failed: chunks could not be verified
2020-09-07T15:01:21-04:00:   check drive-scsi1.img.fidx
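
The "blob too small (0 bytes)" part points at a zero-length chunk file; a quick way to check whether there are more of them (the datastore path is a placeholder for wherever 'store3' lives):
Bash:
# list any zero-byte chunk files in the datastore
# (replace the path with the actual location of 'store3')
find /path/to/store3/.chunks -type f -size 0 -print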
 
No idea what just happened. I just ran another backup in Stop mode, after several Snapshot-mode attempts throughout the day. However, this time the verification succeeded and there were no errors...

Code:
2020-09-07T16:53:57-04:00: verify store3:vm/107/2020-09-07T20:15:14Z
2020-09-07T16:53:57-04:00:   check qemu-server.conf.blob
2020-09-07T16:53:57-04:00:   check drive-efidisk0.img.fidx
2020-09-07T16:53:57-04:00:   verified 0.00/0.12 Mib in 0.03 seconds, speed 0.06/3.77 Mib/s (0 errors)
2020-09-07T16:53:57-04:00:   check drive-scsi0.img.fidx
2020-09-07T17:19:53-04:00:   verified 35543.93/101732.00 Mib in 1555.08 seconds, speed 22.86/65.42 Mib/s (0 errors)
2020-09-07T17:19:53-04:00:   check drive-scsi1.img.fidx
2020-09-07T17:34:02-04:00:   verified 46971.07/108196.00 Mib in 848.87 seconds, speed 55.33/127.46 Mib/s (0 errors)
2020-09-07T17:34:02-04:00: TASK OK

I'm not sure what occurred, but I suppose it's possible to have a successful backup after a failed one.
I restored it to a different node off-site and it works.
 
So, I have pruned all the backups of the specific container, and a whole new backup was created.
I started a verification. It ran for a few hours, then started to find errors again, though far fewer in number.
I really wonder what deletes these chunks, and when, since something seems to consistently destroy them.
It seems to be unreliable for this CT (the backup takes 24+ hours now, but it looks successful in the logs).

Did you restart any backup server daemons, or install an update which did so, during that time?
 
Hi guys, I have just run into the same problem with my datastore. I am receiving the message below:
Code:
Bad Request (400)
unable to open chunk store 'Backup_Drive' at "/media/usb-backupdrive/.chunks" - No such file or directory (os error 2)
I have confirmed the drive is mounted, but I still keep getting this message. Any assistance would be appreciated.
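
A couple of quick checks for that error, using the path from the error message: confirm that something is really mounted at that path and that the .chunks directory actually exists there:
Bash:
# is the backup drive actually mounted, and is the chunk store directory
# present? (path taken from the error message above)
findmnt /media/usb-backupdrive
ls -ld /media/usb-backupdrive/.chunks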
 
