S3 Buckets constantly failing verification jobs

TrustyHippo

Member
Apr 9, 2023
I have configured Backblaze S3, and backups seem to run just fine. However, every day I get failed verifications. I also have two local NFS/NAS datastores and have never seen a failed verification on those. I know S3 support is not GA yet, but this is a bit worrisome.


 
Can you please provide the verification task log from the failed jobs as well as the systemd journal from the timespan of the verification job? Also, check if this is related to the issue reported here: https://bugzilla.proxmox.com/show_bug.cgi?id=6665
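Something like the following should give both (just a sketch; pick the UPID of the failed verification task and fill in the timestamps of its timespan):

Code:
# task log of the failed verification (UPID as listed by proxmox-backup-manager task list)
proxmox-backup-manager task log 'UPID:...'
# systemd journal covering the verification timespan (timestamps are placeholders)
journalctl -u proxmox-backup-proxy.service --since "YYYY-MM-DD HH:MM" --until "YYYY-MM-DD HH:MM"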
Thanks. Here is a verification job that failed earlier. I manually re-ran it after the failure and it worked again.

I'm using a MinIO S3 object store which is local on my network. Backups are working fine, with no issues from multiple nodes. Verification jobs sometimes fail and then pass after re-runs. I was experiencing the same error when using an AWS S3 bucket, so I tried something internal to test whether it could be network related, but I see the same issue.

From the task log -

Code:
2025-08-19T08:21:27+12:00: verify datastore pbs-s3
2025-08-19T08:21:27+12:00: found 2 groups
2025-08-19T08:21:27+12:00: verify group pbs-s3:vm/101 (1 snapshots)
2025-08-19T08:21:27+12:00: verify pbs-s3:vm/101/2025-08-18T13:30:02Z
2025-08-19T08:21:27+12:00:   check qemu-server.conf.blob
2025-08-19T08:21:27+12:00:   check drive-scsi0.img.fidx
2025-08-19T08:44:45+12:00: verify pbs-s3:vm/101/2025-08-18T13:30:02Z/drive-scsi0.img.fidx failed: error reading a body from connection
2025-08-19T08:44:45+12:00:   check drive-efidisk0.img.fidx
2025-08-19T08:44:46+12:00:   verified 0.02/0.52 MiB in 0.17 seconds, speed 0.12/3.00 MiB/s (0 errors)
2025-08-19T08:44:46+12:00: percentage done: 50.00% (1/2 groups)
2025-08-19T08:44:46+12:00: verify group pbs-s3:vm/105 (2 snapshots)
2025-08-19T08:44:46+12:00: verify pbs-s3:vm/105/2025-08-18T13:56:17Z
2025-08-19T08:44:46+12:00:   check qemu-server.conf.blob
2025-08-19T08:44:46+12:00:   check drive-scsi0.img.fidx
2025-08-19T08:53:53+12:00:   verified 3999.12/11292.00 MiB in 546.80 seconds, speed 7.31/20.65 MiB/s (0 errors)
2025-08-19T08:53:53+12:00: percentage done: 75.00% (1/2 groups, 1/2 snapshots in group #2)
2025-08-19T08:53:53+12:00: verify pbs-s3:vm/105/2025-08-18T10:36:24Z
2025-08-19T08:53:53+12:00:   check qemu-server.conf.blob
2025-08-19T08:53:53+12:00:   check drive-scsi0.img.fidx
2025-08-19T08:54:14+12:00:   verified 137.71/832.00 MiB in 21.11 seconds, speed 6.52/39.41 MiB/s (0 errors)
2025-08-19T08:54:15+12:00: percentage done: 100.00% (2/2 groups)
2025-08-19T08:54:15+12:00: Failed to verify the following snapshots/groups:
2025-08-19T08:54:15+12:00:     vm/101/2025-08-18T13:30:02Z
2025-08-19T08:54:15+12:00: TASK ERROR: verification failed - please check the log for details

journalctl doesn't show much else apart from -

Code:
Aug 19 08:38:16 pbs proxmox-backup-proxy[745]: rrd journal successfully committed (36 files in 0.016 seconds)
Aug 19 08:54:15 pbs proxmox-backup-proxy[745]: TASK ERROR: verification failed - please check the log for details
Aug 19 09:08:16 pbs proxmox-backup-proxy[745]: rrd journal successfully committed (36 files in 0.016 seconds)
 
2025-08-19T08:44:45+12:00: verify pbs-s3:vm/101/2025-08-18T13:30:02Z/drive-scsi0.img.fidx failed: error reading a body from connection
But this seems to be network related...? After all, the error states that the client was not able to read the response. Does this always occur for the same snapshot, and how reproducible is it?
 
But this seems to be network related...? After all, the error states that the client was not able to read the response. Does this always occur for the same snapshot, and how reproducible is it?
I've tried AWS S3 and Cloudflare R2 with the closest regions, so ~12 ms latency. Backups are perfectly fine and fast with no problems.
I've also tried local S3 via MinIO, which is on my network, so <1 ms latency.

Failures are random. Sometimes a retry works, and at other times it doesn't and I have to retry again.
I have 2 backups in Cloudflare R2 at the moment which just won't verify - I've tried 4 times. Other backups are verifying fine.

It's certainly reproducible, and it seems to be related to larger images. I say this because, with local S3 via MinIO, verification of smaller images seems to work, while larger images seem to stall/time out and eventually fail. A re-run sometimes works; the other day I had to run verification 3 times to get a backup verified against my local S3 instance.

Locally there are no network issues. Those servers are under constant monitoring, so I would know if there were network drops.
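For reference, a manual re-run can be kicked off as a one-off verification task on the whole datastore, roughly like this (just a sketch assuming the stock CLI; pbs-s3 is the datastore name from the logs above):

Code:
# start a one-off verification of the S3-backed datastore
proxmox-backup-manager verify pbs-s3
# follow the resulting task in the journal while it runs
journalctl -f -u proxmox-backup-proxy.service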
 
I have 1 node which is failing every verification against Cloudflare R2 backups, and retries don't work either.

Backups sync quickly, and it's on a high-speed fibre connection. All other operations work fine - prune, GC, etc.

The only difference is that the cache disk is mounted on an external HDD. I wonder if the performance of this could be the cause. There's not much info to go on apart from what I shared above, since there's not much in the logs.

I will try a new cache disk - an SSD - and see if that makes any difference.
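Before swapping it, a rough read benchmark of the current cache disk might already show whether it is the bottleneck. A sketch with fio, using 4M blocks to roughly match PBS chunk sizes; /mnt/s3-cache is a placeholder for wherever the cache is actually mounted:

Code:
# random read check of the external HDD used as the local S3 cache
# /mnt/s3-cache is a placeholder - point it at the actual cache mountpoint
fio --name=cache-read --directory=/mnt/s3-cache --rw=randread --bs=4M \
    --size=2G --direct=1 --ioengine=libaio --runtime=60 --time_based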

Thanks
 