S3 Buckets constantly failing verification jobs

TrustyHippo

I have configured Backblaze S3, and backups seem to run just fine. However, every day I get failed verifications. I have two local NFS/NAS datastores, and I've never seen a failed verification there. I know S3 support is not GA, but this is a bit worrisome.


[Attached screenshot: 2025-08-16_11-13-13.jpg]
 
Can you please provide the verification task log from the failed jobs as well as the systemd journal from the timespan of the verification job? Also, check if this is related to the issue reported here: https://bugzilla.proxmox.com/show_bug.cgi?id=6665
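
For reference, one way to pull both together is sketched below (a rough outline only; the proxmox-backup-manager task subcommands and the on-disk task log location are assumed from a default PBS install and may differ by version):

Code:
# list recent tasks and note the UPID of the failed verification job
proxmox-backup-manager task list

# print the full task log for that UPID (task logs are also kept on disk,
# typically under /var/log/proxmox-backup/tasks/ on a default install)
proxmox-backup-manager task log 'UPID:...'

# systemd journal restricted to the timespan of the verification job
journalctl --since "YYYY-MM-DD HH:MM" --until "YYYY-MM-DD HH:MM"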
Thanks. Here is a verification job which failed earlier. I manually re-ran it after the failure and it worked again.

Using a MinIO S3 object store which is local on my network. Backups are working fine - no issues from multiple nodes. Verification jobs sometimes fail and then pass after re-runs. I was experiencing the same error when using an AWS S3 bucket, so I tried something internal to test whether it could be network related, but I see the same issue.
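
(For anyone wanting to reproduce this against a local S3 target, a throwaway MinIO instance can be spun up roughly like this - a sketch only; the ports, credentials and data path below are placeholder values for a test setup:)

Code:
# throwaway local MinIO instance to use as an S3 endpoint for testing
docker run -d --name minio-test \
  -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=testadmin \
  -e MINIO_ROOT_PASSWORD=testsecret123 \
  -v /srv/minio-data:/data \
  quay.io/minio/minio server /data --console-address ":9001"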

From the task log -

Code:
2025-08-19T08:21:27+12:00: verify datastore pbs-s3
2025-08-19T08:21:27+12:00: found 2 groups
2025-08-19T08:21:27+12:00: verify group pbs-s3:vm/101 (1 snapshots)
2025-08-19T08:21:27+12:00: verify pbs-s3:vm/101/2025-08-18T13:30:02Z
2025-08-19T08:21:27+12:00:   check qemu-server.conf.blob
2025-08-19T08:21:27+12:00:   check drive-scsi0.img.fidx
2025-08-19T08:44:45+12:00: verify pbs-s3:vm/101/2025-08-18T13:30:02Z/drive-scsi0.img.fidx failed: error reading a body from connection
2025-08-19T08:44:45+12:00:   check drive-efidisk0.img.fidx
2025-08-19T08:44:46+12:00:   verified 0.02/0.52 MiB in 0.17 seconds, speed 0.12/3.00 MiB/s (0 errors)
2025-08-19T08:44:46+12:00: percentage done: 50.00% (1/2 groups)
2025-08-19T08:44:46+12:00: verify group pbs-s3:vm/105 (2 snapshots)
2025-08-19T08:44:46+12:00: verify pbs-s3:vm/105/2025-08-18T13:56:17Z
2025-08-19T08:44:46+12:00:   check qemu-server.conf.blob
2025-08-19T08:44:46+12:00:   check drive-scsi0.img.fidx
2025-08-19T08:53:53+12:00:   verified 3999.12/11292.00 MiB in 546.80 seconds, speed 7.31/20.65 MiB/s (0 errors)
2025-08-19T08:53:53+12:00: percentage done: 75.00% (1/2 groups, 1/2 snapshots in group #2)
2025-08-19T08:53:53+12:00: verify pbs-s3:vm/105/2025-08-18T10:36:24Z
2025-08-19T08:53:53+12:00:   check qemu-server.conf.blob
2025-08-19T08:53:53+12:00:   check drive-scsi0.img.fidx
2025-08-19T08:54:14+12:00:   verified 137.71/832.00 MiB in 21.11 seconds, speed 6.52/39.41 MiB/s (0 errors)
2025-08-19T08:54:15+12:00: percentage done: 100.00% (2/2 groups)
2025-08-19T08:54:15+12:00: Failed to verify the following snapshots/groups:
2025-08-19T08:54:15+12:00:     vm/101/2025-08-18T13:30:02Z
2025-08-19T08:54:15+12:00: TASK ERROR: verification failed - please check the log for details

journalctl doesn't show much else apart from -

Code:
Aug 19 08:38:16 pbs proxmox-backup-proxy[745]: rrd journal successfully committed (36 files in 0.016 seconds)
Aug 19 08:54:15 pbs proxmox-backup-proxy[745]: TASK ERROR: verification failed - please check the log for details
Aug 19 09:08:16 pbs proxmox-backup-proxy[745]: rrd journal successfully committed (36 files in 0.016 seconds)
 
2025-08-19T08:44:45+12:00: verify pbs-s3:vm/101/2025-08-18T13:30:02Z/drive-scsi0.img.fidx failed: error reading a body from connection
But this seems to be network related...? After all, the error states that the client was not able to read the response. Does this always occur for the same snapshot, and how reproducible is this?
 
But this seems to be network related...? After all, the error states that the client was not able to read the response. Does this always occur for the same snapshot, and how reproducible is this?
I've tried AWS S3 and Cloudflare R2 with the closest regions, so ~12 ms latency. Backups are perfectly fine and fast with no problems.
I've tried local S3 via MinIO - on the network, so < 1 ms latency.

Failures are random. Sometimes a retry works, other times it doesn't and I have to retry again.
I have 2 backups in Cloudflare R2 at the moment which just won't verify. Tried 4 times. Others are working fine.

It's certainly reproducible, and I think it's related to larger images. I say this because verification against my local MinIO S3 seems to work for smaller images, while larger ones seem to stall/time out and eventually fail. A re-run sometimes works. The other day I had to run verification 3 times to get a backup verified against my local S3 instance.

Locally there are no network issues. Those servers are under constant monitoring, so I would know if there were network drops.
 
I have 1 node which is failing every verification against Cloudflare R2 backups, and retries don't work either.

Backups sync quickly and it's on a high-speed fibre connection. All other operations work fine - prune, GC, etc.

The only difference is that the cache disk is mounted on an external HDD. I wonder if the performance of this could be the cause. Not much info to go on apart from what I shared above, since there's not much in the logs.

Will try a new cache disk - ssd and see if that makes any difference.

Thanks
 
Please share the exact error messages you get when verification fails, as that will help to identify issues! Just stating that the verification failed does not give enough context on why it failed in the particular setup and configuration you tested.

Locally there are no network issues. Those servers are under constant monitoring, so I would know if there were network drops.
I rather suspect that the data being transferred maxes out your bandwidth, leading to partial transfers of chunk contents and/or DNS lookup failures (at least that is what I observed when investigating an unrelated issue, which is however also most likely caused by network congestion: https://forum.proxmox.com/threads/pbs-4-0-wasabi-post-fixed_chunk-400-bad-request.169620/).

I'm therefore currently evaluating how to best handle this, probably by implementing a shared rate limit for the s3 client implementation in PBS.
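
(Until such a limiter is available in PBS itself, a crude host-level stopgap could be to cap egress on the PBS uplink with Linux traffic control - a sketch only; the interface name and rate are assumptions, and this throttles all outgoing traffic on that interface, not just the S3 traffic:)

Code:
# cap all egress on the PBS uplink with a token bucket filter
# (replace eth0 and the values with something that fits your link)
tc qdisc add dev eth0 root tbf rate 500mbit burst 1mbit latency 50ms

# remove the limit again
tc qdisc del dev eth0 root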

Failures are random. Retries will work and at other times not. Have to retry again.
This also points towards the actual chunk data being valid, the failures instead being caused by network congestion combined with the retry logic for failed chunk fetches not being implemented yet.

Will try a new cache disk - ssd and see if that makes any difference.
The cache is bypassed for verification jobs; only on a failed verification is the locally cached chunk invalidated as well. So this will not change the verification issues as reported.

What you could try instead of verification jobs is to sync your snapshots from the S3-backed datastore to a local datastore. This also checks chunk presence and validity. Maybe that can be used as a workaround, but you might run into the same issues again.
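
(A manual pull roughly along these lines might serve as that workaround - a sketch only; the remote name local-pbs and the datastore name local-store are placeholders, and the exact proxmox-backup-manager options may differ by version, so check the man page first:)

Code:
# one-off pull of everything from the S3-backed datastore into a local one,
# assuming a remote "local-pbs" pointing back at this host already exists
proxmox-backup-manager pull local-pbs pbs-s3 local-store

# or set it up as a scheduled sync job instead
proxmox-backup-manager sync-job create s3-to-local \
  --remote local-pbs --remote-store pbs-s3 --store local-store \
  --schedule daily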
 
I'm seeing:

Code:
2025-08-21T00:01:15-07:00: SKIPPED: verify Backblaze:ct/111/2025-08-18T14:29:19Z (recently verified)
2025-08-21T00:01:15-07:00: percentage done: 56.25% (9/16 groups)
2025-08-21T00:01:15-07:00: verify group Backblaze:ct/113 (5 snapshots)
2025-08-21T00:01:15-07:00: verify Backblaze:ct/113/2025-08-21T05:07:14Z
2025-08-21T00:01:15-07:00:   check pct.conf.blob
2025-08-21T00:01:15-07:00:   check root.pxar.didx
2025-08-21T00:04:14-07:00: "can't verify chunk, load failed - client error (SendRequest)"
2025-08-21T00:04:14-07:00: corrupted chunk renamed to "/mnt/backblaze/.chunks/f3ae/f3ae24280a491318b53cd0b43df10a57082a4bccf50d97e5bdfd9adc30d02046.0.bad"
2025-08-21T00:04:40-07:00:   verified 1768.08/3116.49 MiB in 204.99 seconds, speed 8.63/15.20 MiB/s (1 errors)
2025-08-21T00:04:40-07:00: verify Backblaze:ct/113/2025-08-21T05:07:14Z/root.pxar.didx failed: chunks could not be verified
2025-08-21T00:04:40-07:00:   check catalog.pcat1.didx
2025-08-21T00:04:41-07:00:   verified 0.24/0.68 MiB in 0.85 seconds, speed 0.29/0.80 MiB/s (0 errors)
2025-08-21T00:04:41-07:00: percentage done: 57.50% (9/16 groups, 1/5 snapshots in group #10)
2025-08-21T00:04:41-07:00: SKIPPED: verify Backblaze:ct/113/2025-08-20T05:06:56Z (recently verified)
2025-08-21T00:04:41-07:00: percentage done: 58.75% (9/16 groups, 2/5 snapshots in group #10)
2025-08-21T00:04:41-07:00: SKIPPED: verify Backblaze:ct/113/2025-08-19T13:56:33Z (recently verified)
2025-08-21T00:04:41-07:00: percentage done: 60.00% (9/16 groups, 3/5 snapshots in group #10)
2025-08-21T00:04:41-07:00: SKIPPED: verify Backblaze:ct/113/2025-08-19T11:06:19Z (recently verified)
2025-08-21T00:04:41-07:00: percentage done: 61.25% (9/16 groups, 4/5 snapshots in group #10)
2025-08-21T00:04:41-07:00: SKIPPED: verify Backblaze:ct/113/2025-08-18T14:45:01Z (recently verified)
2025-08-21T00:04:41-07:00: percentage done: 62.50% (10/16 groups)
2025-08-21T00:04:41-07:00: verify group Backblaze:ct/115 (5 snapshots)
2025-08-21T00:04:41-07:00: verify Backblaze:ct/115/2025-08-21T05:09:04Z
2025-08-21T00:04:41-07:00:   check pct.conf.blob
2025-08-21T00:04:42-07:00:   check root.pxar.didx
2025-08-21T00:04:57-07:00: "can't verify chunk, load failed - client error (SendRequest)"
2025-08-21T00:04:57-07:00: corrupted chunk renamed to "/mnt/backblaze/.chunks/1581/158149be41b8488bee40e202dc4e599b80333d76016b3c4484e47fdc5bd63e2b.0.bad"
2025-08-21T00:12:58-07:00:   verified 3859.46/8820.43 MiB in 496.54 seconds, speed 7.77/17.76 MiB/s (1 errors)
2025-08-21T00:12:58-07:00: verify Backblaze:ct/115/2025-08-21T05:09:04Z/root.pxar.didx failed: chunks could not be verified
 
Code:
Aug 21 00:00:39 pbs systemd[1]: Starting dpkg-db-backup.service - Daily dpkg database backup service...
Aug 21 00:00:39 pbs systemd[1]: dpkg-db-backup.service: Deactivated successfully.
Aug 21 00:00:39 pbs systemd[1]: Finished dpkg-db-backup.service - Daily dpkg database backup service.
Aug 21 00:02:48 pbs systemd[1]: Starting logrotate.service - Rotate log files...
Aug 21 00:02:48 pbs systemd[1]: logrotate.service: Deactivated successfully.
Aug 21 00:02:48 pbs systemd[1]: Finished logrotate.service - Rotate log files.
Aug 21 00:03:30 pbs chronyd[747]: Selected source 216.240.36.24 (2.debian.pool.ntp.org)
Aug 21 00:04:14 pbs proxmox-backup-proxy[780]: can't verify chunk, load failed - client error (SendRequest)
Aug 21 00:04:57 pbs proxmox-backup-proxy[780]: can't verify chunk, load failed - client error (SendRequest)
 
Thanks, could you also provide the (estimated) bandwidth you have available for up and download via the s3 endpoint? Would help in order to better reproduce the issue.
 
Thanks, could you also provide the (estimated) bandwidth you have available for up and download via the s3 endpoint? Would help in order to better reproduce the issue.
1 Gbps symmetrical. During the backup job, PBS can literally push a sustained 940 Mbps for the entire 110 GB backup.
 
More Backblaze failures (basically daily):

Code:
2025-08-28T06:59:41-07:00: percentage done: 68.75% (11/16 groups)
2025-08-28T06:59:41-07:00: verify group Backblaze:vm/101 (12 snapshots)
2025-08-28T06:59:41-07:00: verify Backblaze:vm/101/2025-08-28T12:00:07Z
2025-08-28T06:59:41-07:00:   check qemu-server.conf.blob
2025-08-28T06:59:41-07:00:   check drive-tpmstate0-backup.img.fidx
2025-08-28T06:59:41-07:00:   verified 0.01/4.00 MiB in 0.16 seconds, speed 0.04/24.57 MiB/s (0 errors)
2025-08-28T06:59:41-07:00:   check drive-scsi0.img.fidx
2025-08-28T07:08:05-07:00: "can't verify missing chunk with digest 3cefbf7ffcce046be3506b67593d87d2c95f84c4f309d18e63cc798db0d6d5bb"
2025-08-28T07:08:05-07:00: <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Error>
<Code>InvalidRequest</Code>
<Message>no such file: 4_z4b1ba7b2820bbbd09f890a1a_f404bd9644e5df75e_d20250827_m140349_c004_v7007000_t0000_u01756303429853</Message>
</Error>

2025-08-28T07:08:05-07:00: failed to copy corrupt chunk on s3 backend: 3cefbf7ffcce046be3506b67593d87d2c95f84c4f309d18e63cc798db0d6d5bb
2025-08-28T07:30:23-07:00:   verified 30545.96/31988.00 MiB in 1842.16 seconds, speed 16.58/17.36 MiB/s (1 errors)
2025-08-28T07:30:23-07:00: verify Backblaze:vm/101/2025-08-28T12:00:07Z/drive-scsi0.img.fidx failed: chunks could not be verified
2025-08-28T07:30:23-07:00:   check drive-efidisk0.img.fidx
 
2025-08-28T07:08:05-07:00: <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Error>
<Code>InvalidRequest</Code>
<Message>no such file: 4_z4b1ba7b2820bbbd09f890a1a_f404bd9644e5df75e_d20250827_m140349_c004_v7007000_t0000_u01756303429853</Message>
</Error>
Thanks for the report; this does however look more like an issue on Backblaze's side, or like the request got mangled in transit.

Nevertheless, there are now patches available on the developer mailing list to fix most of the verification related issues, see https://lore.proxmox.com/pbs-devel/20250825131007.626777-1-c.ebner@proxmox.com/T/
 
proxmox-backup-server version 4.0.15-1 is available in the pbs-test repository. You might want to test if the included bugfixes solve your issue, thanks. To activate the test repo, please see https://pbs.proxmox.com/docs/installation.html#proxmox-backup-test-repository
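
(On a Debian 13 based PBS 4 installation, activating the test repository would look roughly like the following - a sketch only; please double-check the exact repository line against the linked documentation for your release:)

Code:
# add the pbs-test repository (verify the line against the linked docs;
# "trixie" and "pbstest" here are assumptions for a PBS 4 / Debian 13 setup)
echo "deb http://download.proxmox.com/debian/pbs trixie pbstest" \
  > /etc/apt/sources.list.d/pbstest.list

apt update
apt install proxmox-backup-server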

A traffic rate limiter implementation for the s3 client is also in the works, but not yet included in this bugfix release: https://lore.proxmox.com/pbs-devel/20250916124147.513342-1-c.ebner@proxmox.com/T/#t
Thanks for the work on this. Running this version and it seems to be working well so far - against AWS S3 buckets.