Failing to fetch chunks on **every** backup. 4.0.18 S3 datastore

dw-cj
I am unable to restore any backup on my S3 datastore on 4.0.18 without 'missing chunk' errors. We are using a Backblaze B2 S3 datastore with a local SSD cache folder.

As a controlled test, I created a brand new Debian VM and took a backup of it. The backup completed successfully with no errors. Immediately afterwards, I tried to restore that backup, and it failed with a missing chunk error. I get these errors on ANY backup I try to restore.

I have attached the logs from PBS of the backup job, and the restore job. It appears to be like this every time:
Code:
2025-10-30T15:39:25-04:00: GET /chunk: 400 Bad Request: could not fetch object with key 513f8cc92061882c224f679ed74bd0237b99f2472b4cf93c398035ead58b00a9

I hope a developer can take a look at this and provide some insight, we haven't been able to restore any backups.
 


I've analyzed the logs and found a clear discrepancy between the backup and restore jobs.

The backup log shows it's skipping a chunk because it believes the chunk already exists on the datastore:
Code:
2025-10-30T15:35:35-04:00: Skip upload of already encountered chunk 513f8cc92061882c224f679ed74bd0237b99f2472b4cf93c398035ead58b00a9

However, the restore log from just minutes later shows the job failing because it cannot find that exact same chunk:
Code:
2025-10-30T15:39:25-04:00: GET /chunk: 400 Bad Request: could not fetch object with key 513f8cc92061882c224f679ed74bd0237b99f2472b4cf93c398035ead58b00a9

This suggests an inconsistency where the backup process incorrectly skips uploading chunks that are actually missing, causing all my restores to fail. Hope this helps narrow down the issue.
 
This is a known issue; the bugfix is already being worked on: https://bugzilla.proxmox.com/show_bug.cgi?id=6961

Further, this was already mentioned in response to your previous post, see https://forum.proxmox.com/threads/s3-buckets-constantly-failing-verification-jobs.169875/post-811334
I assumed that this was a different issue, because the backup was taken on the latest build (4.0.18), where the chunk renaming issue was already patched.

Is there any current workaround? I tried restarting proxmox-backup-proxy and then restoring, but the issue is still the same. Pardon my lack of knowledge; I don't exactly understand the mechanics behind this bug.
 
I assumed that this was a different issue, because the backup was taken on the latest build (4.0.18), where the chunk renaming issue was already patched.
The issue here is slightly different from the one already fixed: the seemingly corrupt chunk was not evicted from the in-memory cache, which is what avoids re-uploading already seen chunks.
Is there any current workaround? I tried restarting proxmox-backup-proxy and then restoring, but the issue is still the same. Pardon my lack of knowledge; I don't exactly understand the mechanics behind this bug.
The workaround is the following: restart the proxmox-backup-proxy (which clears the in-memory cache), create a new backup (you should no longer see chunks being skipped as already known in the backup task log), and you should then be able to restore that new snapshot. Older snapshots might get healed as well if the same referenced chunk data is re-uploaded by the new backup, but there is no guarantee for that. E.g. if data changed in between the backups, different chunks might now be created and uploaded than the ones previously marked as corrupt.
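For reference, a minimal sketch of that workaround as a shell snippet run on the PBS host (the task log path is a placeholder; the backup itself is assumed to be triggered from the PVE side):
Code:
#!/usr/bin/env bash
# Sketch of the suggested workaround (paths are placeholders).
set -euo pipefail

# 1. Clear the in-memory "known chunk" cache by restarting the proxy.
systemctl restart proxmox-backup-proxy

# 2. Trigger a new backup from the PVE side, then inspect its task log:
#    it should contain no "Skip upload of already encountered chunk" lines
#    if the cache was really cleared.
TASK_LOG=/path/to/backup-task.log   # placeholder, e.g. saved from the task viewer
if grep -q "Skip upload of already encountered chunk" "$TASK_LOG"; then
    echo "WARNING: chunks are still being skipped as already known"
else
    echo "OK: all chunks were (re-)uploaded"
fi

# 3. Restore the new snapshot to confirm it is readable end to end.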
 
The issue here is slightly different from the one already fixed: the seemingly corrupt chunk was not evicted from the in-memory cache, which is what avoids re-uploading already seen chunks.

The workaround is the following: restart the proxmox-backup-proxy (which clears the in-memory cache), create a new backup (you should no longer see chunks being skipped as already known in the backup task log), and you should then be able to restore that new snapshot. Older snapshots might get healed as well if the same referenced chunk data is re-uploaded by the new backup, but there is no guarantee for that. E.g. if data changed in between the backups, different chunks might now be created and uploaded than the ones previously marked as corrupt.
I followed this procedure to test, and still have the same issue as I initially described.

- Created a new VM.
- Created a 5 GB file within the VM.
- Ran service proxmox-backup-proxy restart on the PBS server.
- Started a backup.
- Once the backup completed, tried to restore it.

It fails in the same way as the original test from my first post. The backup task log also still contains plenty of 'Skip upload of already encountered chunk' messages.


Code:
root@nypbs:~# proxmox-backup-manager version --verbose
proxmox-backup                      4.0.0         running kernel: 6.14.11-3-pve
proxmox-backup-server               4.0.18-1      running version: 4.0.18
 
Then also try to clear the local datastore cache before restarting the proxy and taking the backup, by running find <path-to-cache>/.chunks -type f -delete. Be careful, as this will delete all files within the given folder, so you might want to check the output of find <path-to-cache>/.chunks -type f -print first.
Also, make sure to stop the VM first or let the previous backup snapshot be verified as invalid. Otherwise the dirty bitmap tracking optimization will also come into play and avoid re-uploads.
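A cautious sketch of those two cleanup steps as a script (the cache path is a placeholder; review the output of the -print pass before running the -delete pass):
Code:
#!/usr/bin/env bash
# Sketch: clear the local S3 datastore cache before restarting the proxy.
# CACHE_PATH is a placeholder for the datastore's local cache directory.
set -euo pipefail

CACHE_PATH=/path/to/local-cache     # placeholder

# Dry run: review which cached chunk files would be removed.
find "$CACHE_PATH/.chunks" -type f -print

# After reviewing the list above, actually delete the cached chunk files.
find "$CACHE_PATH/.chunks" -type f -delete

# Then clear the in-memory cache as well and take a fresh backup
# (with the VM stopped, so the dirty bitmap optimization cannot skip data).
systemctl restart proxmox-backup-proxy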
 
Then also try to clear the local datastore cache before restarting the proxy and taking the backup, by running find <path-to-cache>/.chunks -type f -delete. Be careful, as this will delete all files within the given folder, so you might want to check the output of find <path-to-cache>/.chunks -type f -print first.
Also, make sure to stop the VM first or let the previous backup snapshot be verified as invalid. Otherwise the dirty bitmap tracking optimization will also come into play and avoid re-uploads.
Instead of deleting the chunks, I just removed the datastore entirely and created a new one with a fresh local cache directory (using the 'Reuse existing datastore' and 'Overwrite in-use Marker' checkboxes). I am now able to successfully create, and then immediately restore, a backup with no errors.

What does this mean for the backups I've taken over the past few weeks? Should I expect them to be mostly corrupt and not have uploaded chunks properly, or is the issue strictly related to the local cache? Will backups taken from here on out upload properly?
 
Instead of deleting the chunks, I just removed the datastore entirely and created a new one with a fresh local cache directory (using the 'Reuse existing datastore' and 'Overwrite in-use Marker' checkboxes). I am now able to successfully create, and then immediately restore, a backup with no errors.
That is a valid workaround as well, yes, although at the additional cost of re-downloading the metadata files.

What does this mean for the backups I've taken over the past few weeks? Should I expect them to be mostly corrupt and not have uploaded chunks properly, or is the issue strictly related to the local cache?
That depends; most likely they are corrupt, unless the new successful backup re-uploaded some of the chunks and thereby healed them. What you can do, however, is check for objects ending in .0.bad in your bucket and rename them to drop that extension, provided no object without the extension is present. This should at least recover the snapshots whose chunks were incorrectly flagged as bad by the rename bug. Once that is done, a verification job over larger batches (e.g. a full namespace or even the full datastore) will tell you which snapshots are still corrupt. Please be aware, however, that verification fetches the chunk data, so it causes requests and potentially egress fees, depending on your provider. Monitoring is recommended.
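As an illustration only, a sketch of that .0.bad cleanup using the generic AWS CLI against the B2 S3 endpoint (bucket name, endpoint and prefix are placeholders; listing and copying objects also counts against your provider's transaction quotas):
Code:
#!/usr/bin/env bash
# Sketch: rename chunk objects ending in ".0.bad" back to their original key,
# but only when no object with the original key exists.
# Bucket, endpoint and prefix are placeholders; adapt them to your bucket layout.
set -euo pipefail

BUCKET=my-pbs-bucket                              # placeholder
ENDPOINT=https://s3.us-west-000.backblazeb2.com   # placeholder B2 S3 endpoint
PREFIX=".chunks/"                                 # placeholder chunk prefix

aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "$PREFIX" \
    --endpoint-url "$ENDPOINT" --query 'Contents[].Key' --output text \
  | tr '\t' '\n' | grep '\.0\.bad$' | while read -r key; do
      orig="${key%.0.bad}"
      # Skip if an object with the original key is already present.
      if aws s3api head-object --bucket "$BUCKET" --key "$orig" \
             --endpoint-url "$ENDPOINT" >/dev/null 2>&1; then
          echo "keep $key (original already present)"
      else
          echo "renaming $key -> $orig"
          aws s3 mv "s3://$BUCKET/$key" "s3://$BUCKET/$orig" \
              --endpoint-url "$ENDPOINT"
      fi
  done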

Will backups taken from here on out upload properly?
Yes, unless your verification job runs into a bad chunk again (this time not caused by transient errors). If that is the case, and the chunk is still or again cached, then uploading it will again run into this cache inconsistency bug. Please subscribe to the issue to get status notifications on the progress of the bugfix patches and on when the fix is packaged.