S3 Buckets constantly failing verification jobs

But you are getting an error because the chunk is actually missing on the S3 object store, so this is not a bug but rather that snapshot being corrupt. Could you check whether that chunk was incorrectly flagged as corrupt by looking for a corresponding object with a .0.bad extension on your S3 object store, located in <datastore-name>/.chunks/3cef/? If such an object exists, try renaming it by dropping the extension and then re-verify that snapshot.
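
For illustration, a minimal sketch of such a check with boto3 against the S3-compatible API; bucket name, endpoint URL and datastore name are placeholders you would have to adapt to your setup:

Code:
# Sketch only: list the chunk sub-directory on the object store and report any
# objects carrying the .0.bad marker. Bucket, endpoint and datastore name are
# placeholders; note that list_objects_v2 returns at most 1000 keys per call.
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.<region>.backblazeb2.com")

resp = s3.list_objects_v2(
    Bucket="<your-bucket>",
    Prefix="<datastore-name>/.chunks/3cef/",
)
for obj in resp.get("Contents", []):
    if obj["Key"].endswith(".0.bad"):
        print("flagged as bad:", obj["Key"])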
Is this fix included in 4.0.16? I'm still having the same issue on this version with Backblaze, but I haven't deleted any '.bad' chunks yet or looked into that.
 
The fix unfortunately did not make it into the package as expected, see https://bugzilla.proxmox.com/show_bug.cgi?id=6665#c4

But independent of that, what exact errors do you get during verification? Please post the full verify task log.

Further, I suggest renaming the chunks currently marked as bad by dropping the .0.bad extension via external tooling, and then re-verifying the snapshots currently marked as corrupt. If the chunks are truly bad, they will of course be renamed again.
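
For the S3 side, note that object stores have no rename operation, so "renaming" means copying the object to the key without the extension and then deleting the old key. A minimal sketch with boto3; bucket name, endpoint URL and the chunk key are placeholders:

Code:
# Sketch only: "rename" one .0.bad object on the bucket by copying it to the
# key without the marker and deleting the old key (S3 has no native rename).
# Bucket name, endpoint URL and the chunk key are placeholders;
# <prefix> is the first four hex digits of the chunk digest.
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.<region>.backblazeb2.com")

BUCKET = "<your-bucket>"
bad_key = "<datastore-name>/.chunks/<prefix>/<chunk-digest>.0.bad"
good_key = bad_key[: -len(".0.bad")]

s3.copy_object(Bucket=BUCKET, Key=good_key,
               CopySource={"Bucket": BUCKET, "Key": bad_key})
s3.delete_object(Bucket=BUCKET, Key=bad_key)
print("restored", good_key)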
 
Is there any way to reset the state of verification jobs without having to manually delete the .bad chunks? My B2 bucket is so large that it's taking forever to list all of the chunk files.
 
I switched to 4.0.15-1 from pbs-test and tried verifying a few random backups that have never had verification attempted before. It's still erroring almost immediately on all of them; see the attached log. This is a Backblaze B2 bucket.
 


But this is expected if the chunk is no longer present, as indicated by your verify task log, e.g. because it has been incorrectly flagged as bad. Please note that chunks can be shared across consecutive backups: the backup client will reuse already known chunks from the last snapshot in the group if that snapshot was verified or its verify state is unknown. This fast incremental mode only uploads new data chunks, but it is not checked whether a reused chunk is actually present, as that would defeat the purpose and speedup of the fast incremental method.

As an attempt to recover from your current situation, I suggest you first set the datastore into the offline maintenance mode and then rename all the .0.bad chunks you currently have. You must do this on both the local datastore cache and the S3 bucket, dropping the .0.bad filename extension. For example, your log output shows a currently missing chunk 70b99ddaaca175fedde2c67c57be08f454ae4736969272697e05db44421aa033. You should find a file/object at <your-base-path>/.chunks/70b9/70b99ddaaca175fedde2c67c57be08f454ae4736969272697e05db44421aa033.0.bad in both your bucket and your local cache. Repeat that for all bad chunks.
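
For the local datastore cache side, a minimal sketch of such a bulk rename; the cache base path is a placeholder, run it only while the datastore is in the offline maintenance mode, and handle the bucket side analogously with a copy + delete per object as sketched earlier:

Code:
# Sketch only: restore all chunks in the local datastore cache that carry the
# .0.bad marker. Adjust CACHE_BASE to <your-cache-base-path>; run this only
# while the datastore is set to the offline maintenance mode.
from pathlib import Path

CACHE_BASE = Path("/mnt/datastore/cache")  # placeholder, adapt to your setup

for bad in CACHE_BASE.glob(".chunks/*/*.0.bad"):
    target = bad.with_name(bad.name[: -len(".0.bad")])
    if target.exists():
        # an intact copy is already present, keep the .0.bad file for inspection
        continue
    bad.rename(target)
    print("restored", target)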

This will make sure that chunks which were previously renamed incorrectly because of the bug are present again. Once the renaming is done, clear the maintenance mode and run a verification job for at least the last snapshot of each backup group. That way, each snapshot will either be verified or fail verification. If it failed verification, subsequent backup jobs will not reuse its known chunks, but re-upload them. This could also heal some of the already present snapshots, if they reference the same chunks.

If you use sync jobs, you can also run a pull sync job with the re-sync corrupt option enabled, which will also re-sync the snapshots currently marked as corrupt.

Last but not least, you might want to hold off until the packaged fix reaches your system, as any transient networking issue can otherwise lead to chunks being incorrectly renamed again.
 

Thanks for the info. I understand that previously verified backups (verified on the bad version) will need to be fixed like this. But I'm trying to verify random backups, ones that have **never** had a verify job run on them before, and it still fails with these same missing chunk errors every time. I'm on 4.0.15-1 from pbs-test. Shouldn't new, never-before-verified backups work?
 
Newly created snapshots might reuse pre-existing chunks from the previous backup snapshot in that group. So if that snapshot had not yet been verified at the time the new backup snapshot was created, the new snapshot can reference chunks in its index which have meanwhile been moved aside by another, unrelated verify job. You have to break the chain by verifying the last snapshot of a group. If the next backup sees this snapshot as verify-failed, it will not reuse known chunks, but rather re-upload them.
 
I understand this, but I'm trying backups from 'groups' that have never been verified. I haven't had a single verification job pass on PBS 4.0, ever. I don't run verify jobs on any schedule or cron; 99% of the VM 'groups' have never had verification attempted. I've attempted to verify at least 5 manually, all on groups that have never had a verify job run on any of their backups, and all still fail with these same errors (4.0.15-1).
 
Did you check whether the chunks reported as missing are present on the S3 object store, as suggested? And whether they are present in the local datastore cache? For example, to check the local datastore cache please run

Code:
stat <your-cache-base-path>/.chunks/70b9/70b99ddaaca175fedde2c67c57be08f454ae4736969272697e05db44421aa033.0.bad
stat <your-cache-base-path>/.chunks/70b9/70b99ddaaca175fedde2c67c57be08f454ae4736969272697e05db44421aa033

And check the bucket contents of <your-datastore-name>/.chunks/70b9/, e.g. in the Backblaze web interface.
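
If the web interface is cumbersome for that, the same check can also be scripted against the bucket with boto3, mirroring the two stat calls above; bucket name, endpoint URL and datastore name are placeholders:

Code:
# Sketch only: check whether the chunk exists on the bucket under its normal
# key, under the .0.bad key, or not at all. Bucket name, endpoint URL and
# datastore name are placeholders.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", endpoint_url="https://s3.<region>.backblazeb2.com")

DIGEST = "70b99ddaaca175fedde2c67c57be08f454ae4736969272697e05db44421aa033"
BASE = f"<your-datastore-name>/.chunks/{DIGEST[:4]}/{DIGEST}"

for key in (BASE, BASE + ".0.bad"):
    try:
        s3.head_object(Bucket="<your-bucket>", Key=key)
        print("present:", key)
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            print("missing:", key)
        else:
            raise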
 
Here's one instance of what I believe to be an issue with the current S3 verification: it seems to interpret transient S3 errors as permanent, marking the affected chunk as corrupted instead of simply retrying. I think this behavior could be changed to treat only certain status codes as permanent, e.g. 404 (a sketch of what I mean follows after the log below).

Code:
2025-10-10T22:23:13-05:00: verify s3:vm/105/2025-10-10T10:01:16Z
2025-10-10T22:23:13-05:00:   check qemu-server.conf.blob
2025-10-10T22:23:13-05:00:   check drive-scsi0.img.fidx
2025-10-10T22:26:36-05:00: <html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>cloudflare</center>
</body>
</html>

2025-10-10T22:26:36-05:00: "can't verify chunk, load failed - unexpected status code 502 Bad Gateway"
2025-10-10T22:26:37-05:00: corrupted chunk renamed to "/mnt/datastore/cache/.chunks/2dd4/2dd46b430eadce32946ee48f7ee9a174a32910f1456de1fa73116b876b93fccb.0.bad"
2025-10-10T22:33:05-05:00:   verified 3550.77/12160.00 MiB in 591.65 seconds, speed 6.00/20.55 MiB/s (1 errors)
2025-10-10T22:33:05-05:00: verify s3:vm/105/2025-10-10T10:01:16Z/drive-scsi0.img.fidx failed: chunks could not be verified
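
To make the suggestion concrete, here is a hypothetical sketch of the classification I have in mind. This is not the actual PBS implementation (which is Rust); the fetch callback and names are made up:

Code:
# Hypothetical sketch of the proposed classification, not PBS code: only a
# definitive 404 flags the chunk as missing/bad, while 5xx responses (such as
# the 502 above) are treated as transient and retried with backoff.
import time

class ChunkMissing(Exception):
    """The object store definitively reported 404 for this chunk."""

def load_chunk(fetch, digest, max_retries=3):
    """fetch(digest) -> (http_status, body); made-up callback signature."""
    for attempt in range(max_retries + 1):
        status, body = fetch(digest)
        if status == 200:
            return body
        if status == 404:
            raise ChunkMissing(digest)      # permanent: may be renamed to .0.bad
        if 500 <= status < 600 and attempt < max_retries:
            time.sleep(2 ** attempt)        # transient: back off and retry
            continue
        # retries exhausted or unexpected status: fail the verify task,
        # but do not rename the chunk
        raise RuntimeError(f"chunk load failed - unexpected status code {status}")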
 
Oh, I actually see that this has been addressed in a recent commit. Great!

https://git.proxmox.com/?p=proxmox-backup.git;a=commit;h=3c350f358e2c2c5513bd87d6ec7bc698677cc7f1

Edit:

I just want to make sure this case is covered: I read in this thread that the expectation after this error is that a failed verification should invalidate the snapshot, and a re-backup of the group should then replace the missing/corrupted chunk. I do not believe this is the case. Here is what I'm observing:

1. Verification fails (twice) on the chunk which was originally marked as 'bad' due to the transient error.

Code:
2025-10-14T14:46:24-05:00: verify s3:vm/105/2025-10-10T10:01:16Z
2025-10-14T14:46:24-05:00:   check qemu-server.conf.blob
2025-10-14T14:46:24-05:00:   check drive-scsi0.img.fidx
2025-10-14T14:50:31-05:00: "can't verify missing chunk with digest 2dd46b430eadce32946ee48f7ee9a174a32910f1456de1fa73116b876b93fccb"
2025-10-14T14:50:32-05:00: failed to copy corrupt chunk on s3 backend: 2dd46b430eadce32946ee48f7ee9a174a32910f1456de1fa73116b876b93fccb
2025-10-14T15:01:45-05:00:   verified 3554.89/12172.00 MiB in 921.37 seconds, speed 3.86/13.21 MiB/s (1 errors)
2025-10-14T15:01:45-05:00: verify s3:vm/105/2025-10-10T10:01:16Z/drive-scsi0.img.fidx failed: chunks could not be verified
2025-10-14T15:01:45-05:00:   check drive-efidisk0.img.fidx
2025-10-14T15:01:45-05:00:   verified 0.02/0.52 MiB in 0.11 seconds, speed 0.17/4.81 MiB/s (0 errors)
2025-10-14T15:01:54-05:00: Failed to verify the following snapshots/groups:
2025-10-14T15:01:54-05:00:     vm/105/2025-10-10T10:01:16Z
2025-10-14T15:01:54-05:00: TASK ERROR: verification failed - please check the log for details

2. A new backup of the group is performed, and the chunk is skipped.

Code:
...snip
  2025-10-15T10:27:57-05:00: Skip upload of already encountered chunk 23ae12bf8df1d48fc528eafa0a93591c4eb01d9dec7e948b2fcd744305d38281
  2025-10-15T10:27:57-05:00: Skip upload of already encountered chunk 1514f17b2af92095967869bdfc55b64f2d630e1ab02b59d876ac42c885378a85
* 2025-10-15T10:27:57-05:00: Skip upload of already encountered chunk 2dd46b430eadce32946ee48f7ee9a174a32910f1456de1fa73116b876b93fccb
  2025-10-15T10:27:57-05:00: Skip upload of already encountered chunk 359d405055b419505863eec5a8093f9313fba07e5199dadd14b2afc311280014
  2025-10-15T10:27:57-05:00: Upload of new chunk 4fbc1e0f04e34cbb8b4c3dbc2ef6374976e0da618a96548e255d82bc31eeda31
  2025-10-15T10:27:57-05:00: Upload of new chunk 9eb1b408489118f61dbaefa4f0024a72507e1cebbeceafc1b7686151792bec3d
...snip

3. Verification of the new backup also fails.

Code:
2025-10-15T10:28:32-05:00: verify s3:vm/105/2025-10-15T15:27:48Z
2025-10-15T10:28:32-05:00:   check qemu-server.conf.blob
2025-10-15T10:28:32-05:00:   check drive-scsi0.img.fidx
2025-10-15T10:30:33-05:00: "can't verify missing chunk with digest 2dd46b430eadce32946ee48f7ee9a174a32910f1456de1fa73116b876b93fccb"
2025-10-15T10:30:33-05:00: failed to copy corrupt chunk on s3 backend: 2dd46b430eadce32946ee48f7ee9a174a32910f1456de1fa73116b876b93fccb
2025-10-15T10:37:14-05:00:   verified 3709.21/12196.00 MiB in 521.81 seconds, speed 7.11/23.37 MiB/s (1 errors)
2025-10-15T10:37:14-05:00: verify s3:vm/105/2025-10-15T15:27:48Z/drive-scsi0.img.fidx failed: chunks could not be verified
2025-10-15T10:37:14-05:00:   check drive-efidisk0.img.fidx
2025-10-15T10:37:14-05:00:   verified 0.02/0.52 MiB in 0.14 seconds, speed 0.13/3.81 MiB/s (0 errors)
2025-10-15T10:37:15-05:00: Failed to verify the following snapshots/groups:
2025-10-15T10:37:15-05:00:     vm/105/2025-10-15T15:27:48Z
2025-10-15T10:37:15-05:00: TASK ERROR: verification failed - please check the log for details
 