[SOLVED] Large backups over unreliable network

isaacntk · Sep 26, 2023

PVE Version: 7.4-15
PBS Version: 2.4-3

I've had a weekly automatic backup job for a container that is currently 2.7TB. The connection speed is about 100 Mbps or slower.

Not sure if relevant but the backup config:
- Compression: ZSTD
- Mode: Snapshot
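
As a rough sanity check on the timescale, the ideal transfer time for the numbers above can be estimated. This is only a sketch: it assumes decimal units (1 TB = 10^12 bytes), a link that sustains the full 100 Mbps, no protocol overhead, and that all 2.7TB actually has to cross the wire. `transfer_hours` is a hypothetical helper, not part of any Proxmox tool.

```python
def transfer_hours(size_tb: float, link_mbps: float) -> float:
    """Ideal transfer time in hours (assumes decimal TB and full line rate)."""
    size_bits = size_tb * 1e12 * 8           # TB -> bytes -> bits
    seconds = size_bits / (link_mbps * 1e6)  # Mbps -> bits per second
    return seconds / 3600


hours = transfer_hours(2.7, 100)
print(f"{hours:.0f} h (~{hours / 24:.1f} days)")
```

Under those assumptions a full upload needs about 60 hours, so one dropped connection anywhere in a window of two to three days is enough to lose the run.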

It hasn't succeeded since June because it will always fail with the following logs after about two days:

PVE Logs:

Code:

INFO: starting new backup job: vzdump 206 --mode snapshot --mailnotification failure --node beta --storage mars-nextcloud --all 0 --notes-template '{{guestname}} routine remote backup' --mailto xxx
INFO: Starting Backup of VM 206 (lxc)
INFO: Backup started at 2023-09-24 19:13:47
INFO: status = running
INFO: CT Name: nextcloud
INFO: including mount point rootfs ('/') in backup
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
INFO: creating Proxmox Backup Server archive 'ct/206/2023-09-24T11:13:47Z'
INFO: run: /usr/bin/proxmox-backup-client backup --crypt-mode=none pct.conf:/var/tmp/vzdumptmp3954655_206/etc/vzdump/pct.conf fw.conf:/var/tmp/vzdumptmp3954655_206/etc/vzdump/pct.fw root.pxar:/mnt/vzsnap0 --include-dev /mnt/vzsnap0/./ --skip-lost-and-found --exclude=/tmp/?* --exclude=/var/tmp/?* --exclude=/var/run/?*.pid --backup-type ct --backup-id 206 --backup-time 1695554027 --repository proxmox@pbs@172.22.2.1:beta-backup --ns nextcloud
INFO: Starting backup: [nextcloud]:ct/206/2023-09-24T11:13:47Z
INFO: Client name: beta
INFO: Starting backup protocol: Sun Sep 24 19:13:48 2023
INFO: Downloading previous manifest (Tue Jun 20 13:00:02 2023)
INFO: Upload config file '/var/tmp/vzdumptmp3954655_206/etc/vzdump/pct.conf' to 'proxmox@pbs@172.22.2.1:8007:beta-backup' as pct.conf.blob
INFO: Upload config file '/var/tmp/vzdumptmp3954655_206/etc/vzdump/pct.fw' to 'proxmox@pbs@172.22.2.1:8007:beta-backup' as fw.conf.blob
INFO: Upload directory '/mnt/vzsnap0' to 'proxmox@pbs@172.22.2.1:8007:beta-backup' as root.pxar.didx
INFO: HTTP/2.0 connection failed
INFO: catalog upload error - channel closed
INFO: Error: stream closed because of a broken pipe
INFO: cleanup temporary 'vzdump' snapshot
ERROR: Backup of VM 206 failed - command '/usr/bin/proxmox-backup-client backup '--crypt-mode=none' pct.conf:/var/tmp/vzdumptmp3954655_206/etc/vzdump/pct.conf fw.conf:/var/tmp/vzdumptmp3954655_206/etc/vzdump/pct.fw root.pxar:/mnt/vzsnap0 --include-dev /mnt/vzsnap0/./ --skip-lost-and-found '--exclude=/tmp/?*' '--exclude=/var/tmp/?*' '--exclude=/var/run/?*.pid' --backup-type ct --backup-id 206 --backup-time 1695554027 --repository proxmox@pbs@172.22.2.1:beta-backup --ns nextcloud' failed: exit code 255
INFO: Failed at 2023-09-27 01:05:08
INFO: Backup job finished with errors
TASK ERROR: job errors

PBS Logs:

Code:

2023-09-24T19:13:49+08:00: starting new backup on datastore 'beta-backup': "ns/nextcloud/ct/206/2023-09-24T11:13:47Z"
2023-09-24T19:13:49+08:00: protocol upgrade done
2023-09-24T19:13:49+08:00: GET /previous_backup_time
2023-09-24T19:13:49+08:00: GET /previous
2023-09-24T19:13:49+08:00: download 'index.json.blob' from previous backup.
2023-09-24T19:13:49+08:00: POST /blob
2023-09-24T19:13:49+08:00: add blob "/mnt/datastore/beta-backup/ns/nextcloud/ct/206/2023-09-24T11:13:47Z/pct.conf.blob" (319 bytes, comp: 319)
2023-09-24T19:13:50+08:00: POST /blob
2023-09-24T19:13:50+08:00: add blob "/mnt/datastore/beta-backup/ns/nextcloud/ct/206/2023-09-24T11:13:47Z/fw.conf.blob" (64 bytes, comp: 64)
2023-09-24T19:13:50+08:00: POST /dynamic_index
2023-09-24T19:13:50+08:00: created new dynamic index 1 ("ns/nextcloud/ct/206/2023-09-24T11:13:47Z/catalog.pcat1.didx")
2023-09-24T19:13:50+08:00: GET /previous
2023-09-24T19:13:50+08:00: register chunks in 'root.pxar.didx' from previous backup.
2023-09-24T19:13:50+08:00: download 'root.pxar.didx' from previous backup.
2023-09-24T19:14:01+08:00: POST /dynamic_index
2023-09-24T19:14:01+08:00: created new dynamic index 2 ("ns/nextcloud/ct/206/2023-09-24T11:13:47Z/root.pxar.didx")
2023-09-24T19:14:02+08:00: POST /dynamic_chunk
... omitted 40mb of POST /dynamic_chunk
2023-09-27T00:48:49+08:00: upload_chunk done: 12341968 bytes, b8819371a5e3ec265b90535429c6a312e75fd33caeacaf35fb2bfe485f14571a
2023-09-27T00:48:50+08:00: POST /dynamic_chunk
2023-09-27T00:48:50+08:00: upload_chunk done: 5399224 bytes, 1d413b091f539b6d1eb94ee9049de9077d20a53fe5116b914ac61f9ee3d1a035
2023-09-27T01:05:00+08:00: backup failed: connection error: connection reset
2023-09-27T01:05:00+08:00: removing failed backup
2023-09-27T01:05:00+08:00: POST /dynamic_chunk: 400 Bad Request: error reading a body from connection: connection reset
2023-09-27T01:05:00+08:00: TASK ERROR: connection error: connection reset

I'm certain this is just because the network might drop once in a while whether it is the VPN I use or the actual internet connection at either end. Does PBS have a way to handle unreliable network connections and resume after a broken connection?

Also is there any reason Proxmox is transferring the entirety of the container rather than only what has changed?

Do I have better options for achieving a more reliable backup over a potentially spotty internet connection?

Side note: I'm not sure if this is fixed in newer versions of PBS, but the log viewer will crash Chrome trying to render the entirety of a 40MB logfile if you don't click download and close the window fast enough.

Dunuin · Sep 27, 2023

isaacntk said:
Also is there any reason Proxmox is transferring the entirety of the container rather than only what has changed?

It usually reads everything, but then only sends chunks that don't exist on the PBS yet. LXC can't make use of dirty bitmaps, so for backing up large amounts of data that don't change much, a VM would be a better choice.

isaacntk said:
Do I have better options for achieving a more reliable backup over a potentially spotty internet connection?

Not sure if that really helps, but usually you back up to an onsite PBS and then have the offsite PBS pull the newest backup snapshots from the onsite PBS via a sync job. This also gives you ransomware protection when privileges are set up properly, backups finish faster (so a shorter slowdown of your PVE host), and if you ever need to restore, you can do a fast local restore first, with the slow offsite PBS as a last resort.

isaacntk · Sep 27, 2023

Dunuin said:
It usually reads everything, but then only sends chunks that don't exist on the PBS yet. LXC can't make use of dirty bitmaps, so for backing up large amounts of data that don't change much, a VM would be a better choice.

Not sure if that really helps, but usually you back up to an onsite PBS and then have the offsite PBS pull the newest backup snapshots from the onsite PBS via a sync job. This also gives you ransomware protection when privileges are set up properly, backups finish faster (so a shorter slowdown of your PVE host), and if you ever need to restore, you can do a fast local restore first, with the slow offsite PBS as a last resort.

Ah so it was my choice to use an LXC that forces a full send? I assume the underlying storage type being ZFS doesn't help then?

I think I've seen this suggestion before, to run an onsite PBS and an offsite one, but that would require additional space onsite. I don't really have enough at the moment, but I could consider it.

I also saw a reddit thread saying that for LXCs the full image has to be read by PBS, but it would only write the changed contents to the datastore, so I could have an offsite SMB mount or something, but that sounds super janky, and might still break anyway https://www.reddit.com/r/Proxmox/comments/11wcqoh/proxmox_backup_server_in_a_slow_network/

I'm still more surprised that PVE/PBS doesn't have some pause/resume/retry system and just drops a whole backup job if the network fails at any point. I don't mind the slow speed, but it would be nice if I didn't have to restart a 3 day job because the connection went off for a few seconds. Is there a good place to send in a feature request for this?

Dunuin · Sep 27, 2023

isaacntk said:
Ah so it was my choice to use an LXC that forces a full send?

It's not a full send, just a full read + partial send, at least on subsequent backups. The first backup is of course always a full send.

isaacntk said:
I assume the underlying storage type being ZFS doesn't help then?

No, storage doesn't matter.

isaacntk said:
I also saw a reddit thread saying that for LXCs the full image has to be read by PBS, but it would only write the changed contents to the datastore, so I could have an offsite SMB mount or something, but that sounds super janky, and might still break anyway

PBS needs IOPS performance. An offsite SMB storage would perform terribly with all the additional latency.

isaacntk said:
I'm still more surprised that PVE/PBS doesn't have some pause/resume/retry system and just drops a whole backup job if the network fails at any point. I don't mind the slow speed, but it would be nice if I didn't have to restart a 3 day job because the connection went off for a few seconds. Is there a good place to send in a feature request for this?

As far as I understand, uploaded but unreferenced chunks will only be deleted by the next GC. So unless 24 hours have passed and a GC has run, the chunks should still be there and shouldn't need to be uploaded again when you run the backup again.

fabian · Sep 27, 2023

isaacntk said:
I'm still more surprised that PVE/PBS doesn't have some pause/resume/retry system and just drops a whole backup job if the network fails at any point. I don't mind the slow speed, but it would be nice if I didn't have to restart a 3 day job because the connection went off for a few seconds. Is there a good place to send in a feature request for this?

we've discussed some sort of "checkpoint" feature in the past (although I couldn't find the discussion right now, so not sure whether that was in a bug report, here on the forum, on the list or somewhere internal..) that would allow resuming directly using the intermediate partial state and the full snapshot before that..

Dunuin said:
As far as I understand, uploaded but unreferenced chunks will only be deleted by the next GC. So unless 24 hours have passed and a GC has run, the chunks should still be there and shouldn't need to be uploaded again when you run the backup again.

because right now, this is only half true. if you have this sequence of events:

1. successful backup
2. partial backup that got interrupted
3. new backup attempt

then 3 will directly (client-side) deduplicate with the chunks from 1 and skip uploading those chunks, but any new chunks that were uploaded by 2 before the interruption which are not part of the chunk list of snapshot 1 will only be deduplicated server-side (the client does not know about them, so it will upload them again, but the server will then see that they are already there in the chunk store, and discard them to save the write OP).
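
The sequence above can be illustrated with a toy model. This is a sketch of the behaviour as described in this post, not PBS code: chunks are stood in for by short strings instead of SHA-256 digests, and `run_backup`, `index_1` and `chunk_store` are hypothetical names. The key assumption is that the client only knows the chunk list of the last successful snapshot, while the server's chunk store also holds whatever an interrupted run managed to upload.

```python
def run_backup(snapshot_chunks, previous_index, chunk_store):
    """Toy model of one backup run.

    Returns (reused, server_dups, written):
      reused      - chunks the client skipped because they are in the
                    previous successful snapshot's index
      server_dups - chunks uploaded anyway, but already in the chunk store
      written     - chunks uploaded and newly written to the chunk store
    """
    reused = server_dups = written = 0
    for digest in snapshot_chunks:
        if digest in previous_index:
            reused += 1               # client-side dedup, no upload
        elif digest in chunk_store:
            server_dups += 1          # uploaded, then discarded by server
        else:
            chunk_store.add(digest)   # uploaded and stored
            written += 1
    return reused, server_dups, written


# 1. successful backup: its index is what the client can dedup against
index_1 = {"a", "b", "c"}
chunk_store = set(index_1)

# 2. interrupted backup: "d" and "e" reach the store, but no index is kept
interrupted = run_backup(["a", "b", "c", "d", "e"], index_1, chunk_store)

# 3. new attempt: still deduplicates client-side against index_1 only
retry = run_backup(["a", "b", "c", "d", "e", "f"], index_1, chunk_store)

print(interrupted)  # (3, 0, 2)
print(retry)        # (3, 2, 1)
```

In the retry, "d" and "e" cross the network a second time even though the server already has them, which is the cost a checkpoint feature would avoid.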

isaacntk · Sep 27, 2023

Dunuin said:
PBS needs IOPS performance. An offsite SMB storage would perform terribly with all the additional latency.

Ah, that's that idea out the window then.

fabian said:
then 3 will directly (client-side) deduplicate with the chunks from 1 and skip uploading those chunks, but any new chunks that were uploaded by 2 before the interruption which are not part of the chunk list of snapshot 1 will only be deduplicated server-side (the client does not know about them, so it will upload them again, but the server will then see that they are already there in the chunk store, and discard them to save the write OP).

I do have existing backups, though they are very old, but most of the data shouldn't have changed.

If I encounter a backup failure due to another network issue, and retry another backup within 24 hours, is there a specific line in the logs I should be looking for to verify the client side chunks are being deduplicated?

Dunuin · Sep 27, 2023

isaacntk said:
I do have existing backups, though they are very old, but most of the data shouldn't have changed.

If I encounter a backup failure due to another network issue, and retry another backup within 24 hours, is there a specific line in the logs I should be looking for to verify the client side chunks are being deduplicated?

As far as I understand fabian, when a backup fails and you try again, the client has no record of what the failed attempt uploaded, so you need to upload everything (since the last successful backup) again. The chunks are still there on the PBS, but the client doesn't know that, so they get reuploaded anyway.

So some checkpointing would indeed be a very useful feature.

fabian · Sep 28, 2023

you can check the deduplication stats in the server-side task log. it will contain the information for each index file, looking like this:

Code:

2023-03-15T09:00:16+01:00: POST /dynamic_close
2023-03-15T09:00:16+01:00: Upload statistics for 'root.pxar.didx'
2023-03-15T09:00:16+01:00: UUID: a514c0f6f63a45bfb68dd239531cef01
2023-03-15T09:00:16+01:00: Checksum: d71ad9d5bf65a1e2745095ecbcbaea3ec366dd6bab52ead6999fed37c0dd0aa3
2023-03-15T09:00:16+01:00: Size: 922924239
2023-03-15T09:00:16+01:00: Chunk count: 248
2023-03-15T09:00:16+01:00: Upload size: 59196521 (6%)
2023-03-15T09:00:16+01:00: Duplicates: 240+0 (96%)
2023-03-15T09:00:16+01:00: Compression: 12%
2023-03-15T09:00:16+01:00: successfully closed dynamic index 2

Note the line with "Duplicates" - the first number is the number of chunks that the client re-used (did not upload, but just told the server "this chunk is part of this snapshot as well"), the second (0 in this case, as it usually is) is the number of chunks the server detected as "uploaded, but already exists in the chunk store". "Upload size" vs "Size" is also a good indicator of whether the client was able to skip data or not.
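
For long task logs, those lines can be pulled out with a small script. The following is a minimal sketch that assumes the log lines look exactly like the excerpt above; `parse_upload_stats` and the dictionary keys are hypothetical names, not part of any PBS tooling.

```python
import re


def parse_upload_stats(log_text):
    """Return {index_name: stats} parsed from a server-side task log."""
    stats = {}
    current = None
    for line in log_text.splitlines():
        m = re.search(r"Upload statistics for '([^']+)'", line)
        if m:
            current = stats.setdefault(m.group(1), {})
            continue
        if current is None:
            continue  # ignore everything before the first statistics block
        if m := re.search(r": Size: (\d+)", line):
            current["size"] = int(m.group(1))
        elif m := re.search(r": Upload size: (\d+)", line):
            current["uploaded"] = int(m.group(1))
        elif m := re.search(r": Duplicates: (\d+)\+(\d+)", line):
            current["client_reused"] = int(m.group(1))      # skipped by client
            current["server_duplicates"] = int(m.group(2))  # re-sent, discarded
    return stats


sample = """\
2023-03-15T09:00:16+01:00: Upload statistics for 'root.pxar.didx'
2023-03-15T09:00:16+01:00: Size: 922924239
2023-03-15T09:00:16+01:00: Chunk count: 248
2023-03-15T09:00:16+01:00: Upload size: 59196521 (6%)
2023-03-15T09:00:16+01:00: Duplicates: 240+0 (96%)
"""

result = parse_upload_stats(sample)
print(result)
```

A large `server_duplicates` value after a retry would indicate chunks that were sent again although an earlier, interrupted run had already stored them.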

isaacntk · Sep 30, 2023

Ah alright, so the full backup would still be sent over, that's unfortunate, I'll explore moving it to a VM or the local PBS option then. Thanks!

Also if there is a way to leave a feature request for resuming backups or if there already is a ticket I can track please let me know. Will mark this thread as solved

Dunuin · Sep 30, 2023

isaacntk said:
Also if there is a way to leave a feature request for resuming backups or if there already is a ticket I can track please let me know.

bugzilla.proxmox.com is the place where you can open a feature request (but requires new registration first).
