Inconsistent info about validation status

rahman

Renowned Member
Nov 1, 2010
Hi,

I have a Ceph-backed S3 datastore that is synced from a local datastore, and I am running weekly verify jobs on it. The problem is that the last job complains about validation errors, but when I look at the content of the datastore in the PBS GUI, it does not match:

Code:
Job ID:    v-35b45b04-d2ac
Datastore: ulakfkm

Verification failed on these snapshots/groups:

    ns/default/vm/163/2025-07-05T05:25:58Z
    ns/default/vm/164/2026-03-29T22:00:44Z
    ns/default/vm/183/2025-01-07T06:00:36Z
    ns/default/vm/206/2026-03-13T18:01:18Z
    ns/default/vm/206/2026-03-03T18:01:12Z


Please visit the web interface for further details:

[Screenshot: GUI verify state of the vm/163 snapshots]

As you can see in the GUI, the vm/163/2025-07-05T05:25:58Z snapshot shows as validated.


[Screenshot: GUI verify state of the vm/164 snapshots]

The vm/164/2026-03-29T22:00:44Z snapshot is not shown as failed.


[Screenshot: GUI verify state of the vm/206 snapshots]

For vm/206/, only one of the snapshots is shown as failed; the other is shown as "ok".

So what could be the cause of this?

I also run daily GC jobs, daily sync jobs (pull from local), and a prune job at a 4-hour interval. The weekly verify job takes about 2-3 days. Is it safe for the GC, sync, and prune jobs to run while the verify job is still running?
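
For reference, this is how I check the job definitions and schedules from the shell (a quick sketch; ulakfkm is my datastore name from above):

Code:
# list the configured jobs and their schedules
proxmox-backup-manager verify-job list
proxmox-backup-manager sync-job list
proxmox-backup-manager prune-job list
# show the status of the last garbage collection run for the datastore
proxmox-backup-manager garbage-collection status ulakfkm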

Regards,
Rahman
 


Hi,
please provide the output of proxmox-backup-manager version --verbose.

My first guess is that the verification failed to download a chunk due to a transient network issue. But the verify state should be updated in that case as well. Could you also check the systemd journal for any S3 client related error messages around the time the verification runs, and check whether the snapshots are okay when verified manually?
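
For the journal check, something along these lines should do (the timestamps below are placeholders, adjust them to your verify task's runtime):

Code:
# show backup proxy messages around the verify run
# (replace the timestamps with your verify task's start/end time)
journalctl -u proxmox-backup-proxy --since "2026-03-30 12:00" --until "2026-03-30 14:00"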
 
Code:
root@pbs1:~# proxmox-backup-manager version --verbose
proxmox-backup                      4.0.0         running kernel: 6.17.13-2-pve
proxmox-backup-server               4.1.5-2       running version: 4.1.4
proxmox-kernel-helper               9.0.4
proxmox-kernel-6.17                 6.17.13-2
proxmox-kernel-6.17.13-2-pve-signed 6.17.13-2
proxmox-kernel-6.17.4-2-pve-signed  6.17.4-2
proxmox-kernel-6.17.2-1-pve-signed  6.17.2-1
ifupdown2                           3.3.0-1+pmx12
libjs-extjs                         7.0.0-5
proxmox-backup-docs                 4.1.5-1
proxmox-backup-client               4.1.5-1
proxmox-mail-forward                1.0.2
proxmox-mini-journalreader          1.6
proxmox-offline-mirror-helper       0.7.3
proxmox-widget-toolkit              5.1.8
pve-xtermjs                         5.5.0-3
smartmontools                       7.4-pve1
zfsutils-linux                      2.4.1-pve1

Here are the journal logs for this snapshot:

Code:
2026-03-30T12:23:48+03:00: verify ulakfkm:vm/183/2025-01-07T06:00:36Z
2026-03-30T12:23:48+03:00:   check qemu-server.conf.blob
2026-03-30T12:23:48+03:00:   check drive-scsi1.img.fidx
2026-03-30T13:39:36+03:00:   verified 271680.08/413196.00 MiB in 4548.00 seconds, speed 59.74/90.85 MiB/s (6 errors)
2026-03-30T13:39:36+03:00: verify ulakfkm:vm/183/2025-01-07T06:00:36Z/drive-scsi1.img.fidx failed: chunks could not be verified

Code:
root@pbs1:~# journalctl --since "2026-03-30T12:23:48+03:00" --until "2026-03-30T13:39:36+03:00"
Mar 30 12:41:06 pbs1 smartd[1891]: Device: /dev/sdg [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
Mar 30 12:41:06 pbs1 smartd[1891]: Device: /dev/sdg [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 33 to 32
Mar 30 12:43:38 pbs1 proxmox-backup-proxy[2299]: rrd journal successfully committed (33 files in 0.034 seconds)
Mar 30 13:11:06 pbs1 smartd[1891]: Device: /dev/sdg [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 67
Mar 30 13:11:06 pbs1 smartd[1891]: Device: /dev/sdg [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 32 to 33
Mar 30 13:13:38 pbs1 proxmox-backup-proxy[2299]: rrd journal successfully committed (33 files in 0.032 seconds)
Mar 30 13:17:01 pbs1 CRON[3147724]: pam_unix(cron:session): session opened for user root(uid=0) by root(uid=0)
Mar 30 13:17:01 pbs1 CRON[3147726]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Mar 30 13:17:01 pbs1 CRON[3147724]: pam_unix(cron:session): session closed for user root
Mar 30 13:17:18 pbs1 proxmox-backup-proxy[2299]: can't verify chunk, load failed - error reading a body from connection
Mar 30 13:17:18 pbs1 proxmox-backup-proxy[2299]: can't verify chunk, load failed - error reading a body from connection
Mar 30 13:17:18 pbs1 proxmox-backup-proxy[2299]: can't verify chunk, load failed - error reading a body from connection
Mar 30 13:17:18 pbs1 proxmox-backup-proxy[2299]: can't verify chunk, load failed - error reading a body from connection
Mar 30 13:17:18 pbs1 proxmox-backup-proxy[2299]: can't verify chunk, load failed - error reading a body from connection
Mar 30 13:17:18 pbs1 proxmox-backup-proxy[2299]: can't verify chunk, load failed - error reading a body from connection

I will run a verify job for the vm/183 snapshot as a test.
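
(I trigger the manual verify from the GUI. For reference, a single snapshot can presumably also be verified through the datastore verify API endpoint, roughly like this; the API token below is a placeholder:)

Code:
# trigger verification of a single snapshot via the API (sketch)
# the PBSAPIToken value is a placeholder; use a token with the Datastore.Verify privilege
curl -ks -X POST 'https://localhost:8007/api2/json/admin/datastore/ulakfkm/verify' \
  -H 'Authorization: PBSAPIToken=root@pam!mytoken:SECRET' \
  --data 'ns=default' --data 'backup-type=vm' --data 'backup-id=183' \
  --data "backup-time=$(date -d '2025-01-07T06:00:36Z' +%s)"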

Also, any feedback on this question:
I also run daily GC jobs, daily sync jobs (pull from local), and a prune job at a 4-hour interval. The weekly verify job takes about 2-3 days. Is it safe for the GC, sync, and prune jobs to run while the verify job is still running?
 
Mar 30 13:17:18 pbs1 proxmox-backup-proxy[2299]: can't verify chunk, load failed - error reading a body from connection
Okay, so as expected there is a transient network error; the verification task fails because of it, but it is not possible to determine from this whether the snapshot is okay or corrupt. I will check if we can improve the verify state handling here.

I also run daily GC jobs, daily sync jobs (pull from local), and a prune job at a 4-hour interval. The weekly verify job takes about 2-3 days. Is it safe for the GC, sync, and prune jobs to run while the verify job is still running?
Yes, although I would not recommend too many operations at the same time. The more tasks run in parallel, the more requests are sent to the S3 endpoint API at the same time. It could be that you are seeing the above connection errors because of this.
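
If the parallel load turns out to be the cause, staggering the job schedules so fewer tasks overlap can help; for example (the <job-id> values are placeholders, see the respective list commands for yours):

Code:
# spread jobs out so fewer tasks hit the S3 endpoint at the same time
# (<job-id> is a placeholder; see e.g. `proxmox-backup-manager verify-job list`)
proxmox-backup-manager verify-job update <job-id> --schedule 'sat 01:00'
proxmox-backup-manager sync-job update <job-id> --schedule '03:00'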
 
Yes, although I would not recommend too many operations at the same time. The more tasks run in parallel, the more requests are sent to the S3 endpoint API at the same time. It could be that you are seeing the above connection errors because of this.

As verify takes so long, the GC and sync jobs do run once a day while verify is also running; I can't tell whether that causes this. But I checked the sync and GC jobs for the window "2026-03-30T12:23:48+03:00" to "2026-03-30T13:39:36+03:00", in which the vm/183 verify errored, and no GC or sync jobs were running between these times. The GC job is scheduled to run every midnight, and the sync job every night at 03:00, both in the Turkey timezone (+03:00).
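
(I checked this via the task history; from the shell the same can be done roughly like this, where <UPID> is a placeholder for a concrete task ID:)

Code:
# list recent tasks with start time and status to check for overlaps
proxmox-backup-manager task list
# show the full log of one task
proxmox-backup-manager task log <UPID>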
 
Code:
2026-04-01T16:11:38+03:00: verify ulakfkm:vm/206/2026-03-03T18:01:12Z
2026-04-01T16:11:38+03:00:   check qemu-server.conf.blob
2026-04-01T16:11:38+03:00:   check drive-scsi0.img.fidx
2026-04-01T16:32:48+03:00:   verified 23359.43/50996.00 MiB in 1270.29 seconds, speed 18.39/40.15 MiB/s (0 errors)
2026-04-01T16:32:49+03:00: TASK OK

Code:
2026-04-01T16:43:07+03:00: verify ulakfkm:vm/183/2025-01-07T06:00:36Z
2026-04-01T16:43:07+03:00:   check qemu-server.conf.blob
2026-04-01T16:43:07+03:00:   check drive-scsi1.img.fidx
2026-04-01T21:18:19+03:00:   verified 271704.08/413232.00 MiB in 16511.23 seconds, speed 16.46/25.03 MiB/s (0 errors)
2026-04-01T21:18:19+03:00:   check drive-scsi0.img.fidx
2026-04-02T02:28:29+03:00:   verified 166003.01/1013796.00 MiB in 18610.11 seconds, speed 8.92/54.48 MiB/s (0 errors)
2026-04-02T02:28:29+03:00: TASK OK

As a test, I manually verified two of the failed snapshots, and they validated without any problem.
 