[SOLVED] Remote push sync job loops forever on "Found X groups to sync (out of XX total)"

Matrix9688

After completely pruning a group on my onsite PBS that I want to re-sync from my offsite PBS (see my previous post), the job now seems to loop forever on the line "Found X groups to sync (out of XX total)".

The logs show:
2025-01-16T10:51:52+01:00: Starting datastore sync job '$REMOTE_CONFIGURATION:$REMOTE_DATASTORE:$SOURCE_DATASTORE:$SOURCE_NAMESPACE:s-00ee9145-853c'
2025-01-16T10:51:52+01:00: sync datastore '$SOURCE_DATASTORE' to '$REMOTE_CONFIGURATION/$REMOTE_DATASTORE'
2025-01-16T10:51:52+01:00: ----
2025-01-16T10:51:52+01:00: Syncing datastore '$SOURCE_DATASTORE', namespace '$SOURCE_NAMESPACE' into datastore '$REMOTE_DATASTORE', namespace '$REMOTE_NAMESPACE'
2025-01-16T10:51:52+01:00: Found 1 groups to sync (out of 54 total)
2025-01-16T12:22:46+01:00: queued notification (id=f19f4445-327a-4e1f-81a3-93bdacca0119)
2025-01-16T12:22:46+01:00: TASK ERROR: sync aborted

2025-01-16T10:51:53+01:00: starting new backup on datastore '$REMOTE_DATASTORE' from :$IP_ADDRESS: "ns/$REMOTE_NAMESPACE/host/archiv/2024-08-22T12:14:01Z"
2025-01-16T10:51:53+01:00: created new dynamic index 1 ("ns/$REMOTE_NAMESPACE/host/archiv/2024-08-22T12:14:01Z/HF-SFL-FS1.pxar.didx")
2025-01-16T12:22:46+01:00: backup failed: connection error: connection reset
2025-01-16T12:22:46+01:00: removing failed backup
2025-01-16T12:22:46+01:00: TASK ERROR: connection error: connection reset
2025-01-16T12:22:46+01:00: POST /dynamic_chunk: 400 Bad Request: error reading a body from connection: connection reset
2025-01-16T12:22:46+01:00: POST /dynamic_chunk: 400 Bad Request: error reading a body from connection: connection reset
2025-01-16T12:22:46+01:00: POST /dynamic_chunk: 400 Bad Request: error reading a body from connection: connection reset
[last line repeats multiple times]

I manually stopped the job after 1.5 hours of no change.

The weird thing is that it got further along than this before I pruned the whole group on the target datastore:

2025-01-16T10:00:33+01:00: Starting datastore sync job '$REMOTE_CONFIGURATION:$TARGET_DATASTORE:$SOURCE_DATASTORE:$SOURCE_NAMESPACE:s-00ee9145-853c'
2025-01-16T10:00:33+01:00: sync datastore '$SOURCE_DATASTORE' to '$REMOTE_CONFIGURATION/$TARGET_DATASTORE'
2025-01-16T10:00:33+01:00: ----
2025-01-16T10:00:33+01:00: Syncing datastore '$SOURCE_DATASTORE', namespace '$SOURCE_NAMESPACE' into datastore '$TARGET_DATASTORE', namespace '$TARGET_NAMESPACE'
2025-01-16T10:00:33+01:00: Found 1 groups to sync (out of 54 total)
2025-01-16T10:00:33+01:00: skipped: 2 snapshot(s) (2024-08-22T12:14:01Z .. 2024-08-22T13:51:18Z) - older than the newest snapshot present on sync target
2025-01-16T10:00:33+01:00: Finished syncing namespace $SOURCE_NAMESPACE, current progress: 0 groups, 0 snapshots
2025-01-16T10:00:33+01:00: Summary: sync job found no new data to push
2025-01-16T10:00:33+01:00: sync job '$REMOTE_CONFIGURATION:$TARGET_DATASTORE:$SOURCE_DATASTORE:$SOURCE_NAMESPACE:s-00ee9145-853c' end
2025-01-16T10:00:33+01:00: queued notification (id=a131c2c1-53d1-465a-9459-40eeaead9c7e)
2025-01-16T10:00:33+01:00: TASK OK

Afterwards, I noticed in the Administration > Server Status dashboards that there was actually some kind of data transfer happening.

So, I guess my question is: is this intended behaviour? Did everything work as designed and I simply wasn't patient enough?
 
Hi again,
So just to understand your task logs correctly: you obfuscated the logs here, right? Or are these the plain log files as you get them?
I suppose the connection reset error is what you get after aborting the push job? That is expected, as the connection to the remote target is closed at that point.

How is your datastore organized? Do you have multiple namespaces?
So, I guess my question is: is this intended behaviour? Did everything work as designed and I simply wasn't patient enough?
I guess you would just need to wait a bit longer. Maybe monitor the storage space usage and the task logs on the target. For each of the pushed snapshots you should see an individual backup task in the task log on the target server.

Edit: And regarding logging for sync jobs (especially push), this needs further improvement, yes. Feel free to add further comments to the issue here https://bugzilla.proxmox.com/show_bug.cgi?id=5956, if needed.
 
So just to understand your task logs correctly: you obfuscated the logs here, right? Or are these the plain log files as you get them?
Yes, I obfuscated them - but it seems I didn't do it properly, as I forgot to update the first two logs when I switched out "$REMOTE_" for "$TARGET_". Sorry about that.

How is your datastore organized? Do you have multiple namespaces?
I have some namespace hierarchy, yes:
  • on the source PBS (the offsite PBS instance):
    • source-datastore/target-PBS/pve-host
  • on the target PBS (the onsite PBS instance):
    • target-datastore/pve-host
All this obfuscation makes it seem very complicated, but it really isn't. I don't think it's a namespace issue since the previous run was correctly able to see the snapshots of the group before I deleted them.
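
Roughly, the corresponding push sync job entry on the source PBS would look something like this (illustrative sketch only - the sync.cfg property names here are written from memory and may not be exact, only the datastore/namespace mapping is what I described above):

sync: s-00ee9145-853c
	store source-datastore
	ns target-PBS/pve-host
	remote $REMOTE_CONFIGURATION
	remote-store target-datastore
	remote-ns pve-host
	sync-direction push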

For each of the pushed snapshots you should see an individual backup task in the task log on the target server.
I guess I see that, yes - but it just says "created new dynamic index", not something along the lines of "receiving data for file $DATASET.pxar.didx". Without that, it just seems as if it's stuck creating that dynamic index.

I guess you would just need to wait a bit longer.
If that's the case, I would suggest some kind of additional log message, especially on the source side, so that it's a bit clearer that it's working correctly. Pull sync jobs are a bit more verbose:

2025-01-15T00:13:13+01:00: sync snapshot vm/146/2025-01-14T17:25:09Z
2025-01-15T00:13:13+01:00: sync archive qemu-server.conf.blob
2025-01-15T00:13:13+01:00: sync archive drive-scsi0.img.fidx
2025-01-15T00:27:37+01:00: downloaded 9.189 GiB (10.892 MiB/s)
 
I guess I see that, yes - but it just says "created new dynamic index", not something along the lines of "receiving data for file $DATASET.pxar.didx". Without that, it just seems as if it's stuck creating that dynamic index.
Yes, for the first snapshot of the group all chunks will have to be sent over the wire, so that one will take some time... For subsequent ones only the chunks not present in the corresponding previously synced snapshot have to be sent; those will be faster (if not much data changed between these snapshots).
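
To illustrate the idea with a simplified sketch (plain Python, not the actual PBS implementation - push_snapshot, known_on_target and upload_chunk are made-up stand-ins for what the sync does internally):

def push_snapshot(snapshot_chunks, known_on_target, upload_chunk):
    """Upload only the chunks the target does not already have.

    snapshot_chunks: dict mapping chunk digest -> chunk data for this snapshot
    known_on_target: set of chunk digests already present on the target
    upload_chunk:    callable that sends one chunk over the wire
    """
    uploaded = 0
    for digest, data in snapshot_chunks.items():
        if digest in known_on_target:
            continue  # deduplicated: the target already has this chunk
        upload_chunk(digest, data)
        known_on_target.add(digest)
        uploaded += 1
    return uploaded

# First snapshot of a freshly pruned group: nothing is known on the target,
# so every chunk goes over the wire - this is the long, quiet phase.
push_snapshot({"aaa": b"...", "bbb": b"..."}, set(), lambda d, c: None)           # sends 2 chunks
# A subsequent snapshot: only the chunks that changed in between are sent.
push_snapshot({"aaa": b"...", "ccc": b"..."}, {"aaa", "bbb"}, lambda d, c: None)  # sends 1 chunk

So the lack of further log output during that first snapshot does not mean the job is stuck; the whole group's data is being transferred in that one step.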

If that's the case, I would suggest some kind of additional log message, especially on the source side, so that it's a bit clearer that it's working correctly. Pull sync jobs are a bit more verbose:
Yes, see the issue linked above.
 
Can confirm: the first snapshot has finally finished transferring this morning, and the next snapshot got a new job log on the target PBS.
 
