Push sync partially failed, but sender remains clueless and happy

wrobelda

So I configured my PBS to push sync to a remote location. This worked, and 2/3 VM backups pushed fine. However, one of them failed at pct.conf.blob, as seen on the receiver end (screenshot attached).
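
For reference, the snapshot directory on the receiver can be inspected directly; this is a rough sketch using the path of my target datastore and the failing snapshot (the exact path depends on your datastore layout):

Code:
# list which blobs/indexes actually landed in the affected snapshot directory on the target
ls -l /mnt/pbs_storage/ct/110/2025-03-29T01:50:12Z/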

The verification job I run also failed, as expected:

Code:
2025-04-02T00:00:00+02:00: Failed to verify the following snapshots/groups:
2025-04-02T00:00:00+02:00:     ct/110/2025-03-29T01:50:12Z


However, the sender seems to be happy as it appears to be skipping the snapshot that failed to upload:

Code:
2025-04-02T21:00:00+02:00: Found 4 groups to sync (out of 24 total)
2025-04-02T21:00:00+02:00: skipped: 4 snapshot(s) (2025-03-29T00:30:03Z .. 2025-04-01T23:30:02Z) - older than the newest snapshot present on sync target
2025-04-02T21:00:00+02:00: skipped: 4 snapshot(s) (2025-03-29T01:50:12Z .. 2025-04-01T23:50:03Z) - older than the newest snapshot present on sync target
2025-04-02T21:00:00+02:00: skipped: 2 snapshot(s) (2025-03-31T20:24:22Z .. 2025-04-01T23:58:59Z) - older than the newest snapshot present on sync target
2025-04-02T21:00:00+02:00: skipped: 4 snapshot(s) (2025-03-29T01:48:36Z .. 2025-04-01T23:48:47Z) - older than the newest snapshot present on sync target
2025-04-02T21:00:00+02:00: Finished syncing root namespace, current progress: 3 groups, 0 snapshots
2025-04-02T21:00:00+02:00: Summary: sync job found no new data to push
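
For completeness: the output above is the push job's task log on the sending side; I grabbed it roughly like this, with the UPID taken from the task list:

Code:
# on the sending PBS: list recent tasks, then dump the sync task's log
proxmox-backup-manager task list
proxmox-backup-manager task log <UPID-of-the-push-sync-task>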

This is quite worrying, to be honest, since an upload can fail at any point, rendering a snapshot useless without the pushing side ever knowing:

(screenshot attached)

Is this a bug? I would imagine you do check whether snapshot data was uploaded, but maybe don't check for the snapshot's own metadata (pct.conf.blob, index.json.blob)?
 
Hi,
So I configured my PBS to push sync to a remote location. This worked, and 2/3 VM backups pushed fine. However, one of them failed at pct.conf.blob, as seen on the receiver end (screenshot attached).

The verification job I run also failed, expectedly so:

Code:
2025-04-02T00:00:00+02:00: Failed to verify the following snapshots/groups:
2025-04-02T00:00:00+02:00:     ct/110/2025-03-29T01:50:12Z
Can you check what the contents are on the source side? Does the source side verify? It seems that on the target there is no manifest (written as index.json.blob). That is however needed for a snapshot to be valid. So either the snapshot was already invalid on the source and the job pushed only the config blob, or the manifest was not synced correctly for some reason. Let's find out by checking the task log for the corresponding snapshot on the sync target. Please identify the corresponding task (push creates a backup task for each snapshot it syncs on the target) and share the logs. That should tell us more.
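
Something along these lines should do; adapt the source datastore path to your setup, and note that grepping the task log directory is just one quick way to find the right UPID:

Code:
# on the source: check what is actually present in the snapshot directory
ls -l /path/to/source-datastore/ct/110/2025-03-29T01:50:12Z/

# on the target: locate the backup task the push job created for this snapshot and show its log
grep -rl 'ct/110/2025-03-29T01:50:12Z' /var/log/proxmox-backup/tasks/
proxmox-backup-manager task log <UPID-found-above>
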
However, the sender seems to be happy as it appears to be skipping the snapshot that failed to upload:

Code:
2025-04-02T21:00:00+02:00: Found 4 groups to sync (out of 24 total)
2025-04-02T21:00:00+02:00: skipped: 4 snapshot(s) (2025-03-29T00:30:03Z .. 2025-04-01T23:30:02Z) - older than the newest snapshot present on sync target
2025-04-02T21:00:00+02:00: skipped: 4 snapshot(s) (2025-03-29T01:50:12Z .. 2025-04-01T23:50:03Z) - older than the newest snapshot present on sync target
2025-04-02T21:00:00+02:00: skipped: 2 snapshot(s) (2025-03-31T20:24:22Z .. 2025-04-01T23:58:59Z) - older than the newest snapshot present on sync target
2025-04-02T21:00:00+02:00: skipped: 4 snapshot(s) (2025-03-29T01:48:36Z .. 2025-04-01T23:48:47Z) - older than the newest snapshot present on sync target
2025-04-02T21:00:00+02:00: Finished syncing root namespace, current progress: 3 groups, 0 snapshots
2025-04-02T21:00:00+02:00: Summary: sync job found no new data to push
The sync job does not consider any older snapshots; a resync is currently possible only with pull jobs.

This is quite worrying, to be honest, since an upload can fail at any point, rendering a snapshot useless without the pushing side ever knowing:

(screenshot attached)

Is this a bug? I would imagine you do check whether snapshot data was uploaded, but maybe don't check for the snapshot's own metadata (pct.conf.blob, index.json.blob)?
Well, let's find out what happened; and yes, there are consistency checks in place. However, I would not jump to conclusions just yet.
 
Hi Chris, and apologies for my late response.

Hi,

Can you check what the contents are on the source side? Does the source side verify? It seems that on the target there is no manifest (written as index.json.blob). That is however needed for a snapshot to be valid. So either the snapshot was already invalid on the source and the job pushed only the config blob, or the manifest was not synced correctly for some reason.
I checked the source and it verifies fine. I believe the problem was related to some instabilities with the source storage, which would cause the pool to go offline intermittently. I did restart the system(s) while the push sync job was hung, which would effectively kill it. So with that in mind, I am guessing that stopping a sync job halfway through could result in such a broken snapshot on the target?

Let's find out by checking the task log for the corresponding snapshot on the sync target. Please identify the corresponding task (push creates a backup task for each snapshot it syncs on the target) and share the logs. That should tell us more.

Code:
===== ./2B/UPID:nova:000002C0:0000092B:00000003:67EBC7BD:backup:nova-pbs-hdd:ct-110:pbs_home@pbs: =====
2025-04-01T13:02:21+02:00: add blob "/mnt/pbs_storage/ct/110/2025-03-29T01:50:12Z/pct.conf.blob" (261 bytes, comp: 261)
2025-04-01T13:02:21+02:00: created new dynamic index 1 ("ct/110/2025-03-29T01:50:12Z/root.pxar.didx")

===== ./4D/UPID:nova:000002BB:0000094D:00000006:67EC0FE7:verificationjob:nova-pbs-hdd:v-a66e6731-ed66:root@pam: =====
2025-04-01T18:10:15+02:00: Starting datastore verify job 'nova-pbs-hdd:v-a66e6731-ed66'
2025-04-01T18:10:15+02:00: verify datastore nova-pbs-hdd
2025-04-01T18:10:15+02:00: found 4 groups
2025-04-01T18:10:15+02:00: verify group nova-pbs-hdd:ct/100 (3 snapshots)
2025-04-01T18:10:15+02:00: SKIPPED: verify nova-pbs-hdd:ct/100/2025-03-31T19:45:21Z (recently verified)
2025-04-01T18:10:15+02:00: percentage done: 8.33% (0/4 groups, 1/3 snapshots in group #1)
2025-04-01T18:10:15+02:00: SKIPPED: verify nova-pbs-hdd:ct/100/2025-03-30T23:30:00Z (recently verified)
2025-04-01T18:10:15+02:00: percentage done: 16.67% (0/4 groups, 2/3 snapshots in group #1)
2025-04-01T18:10:15+02:00: SKIPPED: verify nova-pbs-hdd:ct/100/2025-03-29T00:30:03Z (recently verified)
2025-04-01T18:10:15+02:00: percentage done: 25.00% (1/4 groups)
2025-04-01T18:10:15+02:00: verify group nova-pbs-hdd:ct/110 (3 snapshots)
2025-04-01T18:10:15+02:00: verify nova-pbs-hdd:ct/110/2025-03-31T20:17:43Z
2025-04-01T18:10:15+02:00:   check pct.conf.blob
2025-04-01T18:10:15+02:00:   check root.pxar.didx
2025-04-01T18:20:33+02:00:   verified 120284.27/120405.87 MiB in 618.67 seconds, speed 194.42/194.62 MiB/s (0 errors)
2025-04-01T18:20:33+02:00:   check catalog.pcat1.didx
2025-04-01T18:20:33+02:00:   verified 1.01/2.52 MiB in 0.01 seconds, speed 199.00/495.43 MiB/s (0 errors)
2025-04-01T18:20:33+02:00: percentage done: 33.33% (1/4 groups, 1/3 snapshots in group #2)
2025-04-01T18:20:33+02:00: SKIPPED: verify nova-pbs-hdd:ct/110/2025-03-31T00:48:04Z (recently verified)
2025-04-01T18:20:33+02:00: percentage done: 41.67% (1/4 groups, 2/3 snapshots in group #2)
2025-04-01T18:20:33+02:00: verify nova-pbs-hdd:ct/110/2025-03-29T01:50:12Z - manifest load error: unable to load blob '"/mnt/pbs_storage/ct/110/2025-03-29T01:50:12Z/index.json.blob"' - No such file or directory (os error 2)
2025-04-01T18:20:33+02:00: percentage done: 50.00% (2/4 groups)
2025-04-01T18:20:33+02:00: verify group nova-pbs-hdd:ct/113 (1 snapshots)
2025-04-01T18:20:33+02:00: SKIPPED: verify nova-pbs-hdd:ct/113/2025-03-31T20:24:22Z (recently verified)
2025-04-01T18:20:33+02:00: percentage done: 75.00% (3/4 groups)
2025-04-01T18:20:33+02:00: verify group nova-pbs-hdd:vm/109 (3 snapshots)
2025-04-01T18:20:33+02:00: SKIPPED: verify nova-pbs-hdd:vm/109/2025-03-31T20:17:30Z (recently verified)
2025-04-01T18:20:33+02:00: percentage done: 83.33% (3/4 groups, 1/3 snapshots in group #4)
2025-04-01T18:20:33+02:00: SKIPPED: verify nova-pbs-hdd:vm/109/2025-03-31T00:46:52Z (recently verified)
2025-04-01T18:20:33+02:00: percentage done: 91.67% (3/4 groups, 2/3 snapshots in group #4)
2025-04-01T18:20:33+02:00: SKIPPED: verify nova-pbs-hdd:vm/109/2025-03-29T01:48:36Z (recently verified)
2025-04-01T18:20:33+02:00: percentage done: 100.00% (4/4 groups)
2025-04-01T18:20:33+02:00: Failed to verify the following snapshots/groups:
2025-04-01T18:20:33+02:00:     ct/110/2025-03-29T01:50:12Z
2025-04-01T18:20:33+02:00: queued notification (id=67e42b0e-c181-4e97-876c-3b5732e751c8)
2025-04-01T18:20:33+02:00: TASK ERROR: verification failed - please check the log for details
 
I did restart the system(s) while the push sync job was hung, which would effectively kill it. So with that in mind, I am guessing that stopping a sync job halfway through could result in such a broken snapshot on the target?
Under normal conditions, when a sync job is unable to sync a snapshot, it cleans up after itself. If the process got forcefully interrupted, then yes, that might leave remnants of the snapshot on the target. In such cases, probably the best way is to run a verification job immediately afterwards and clean up the unverified snapshots. For pull sync jobs you can also run with the re-sync flag enabled; that will check all snapshots again and resync those that fail to verify. This is however currently not available for sync jobs in push direction because of implementation differences.
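
As a rough sketch for the cleanup on the sync target, with the names taken from your logs (the repository spec for the client call is an assumption and may need adjusting to your setup):

Code:
# on the sync target: re-run verification, then forget the snapshot that failed to verify
proxmox-backup-manager verify nova-pbs-hdd
proxmox-backup-client snapshot forget ct/110/2025-03-29T01:50:12Z --repository nova-pbs-hdd

Removing the broken snapshot via the web UI works just as well.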