Backups failing despite apparently good connection.

m_w

New Member
Apr 4, 2025
I'm encountering an issue with backups failing that I can't seem to find a solution to (a week of fiddling and searching...).
Hoping someone might have some insight or can point me in a better direction.

Versions: PBS 3.3.3, PVE 8.3.3

PBS: VM on TrueNAS; datastore is an iSCSI share on the same box (>30 TB available).
Remotes: Bare metal in remote datacenters.
VPN between them using Cloudflare WARP->WARP. MTUs set to 1280.

The problem is consistent. The connection between the clients and PBS seems to be solid.
It consistently writes a couple of files and then fails when it tries to move the image data.
Keeps trying for ~10 minutes, then gives up and deletes what it had sent.

Exactly the same thing happens for VMs and LXCs, and from multiple remotes.
The fact that it writes some files would seem to indicate there isn't a permissions issue (or a connection issue...).

I'm stumped.

Client Log:
INFO: starting new backup job: vzdump 631 --mode snapshot --notification-mode auto --notes-template '{{guestname}}' --node [[REDACT]] --storage pxmx-bkup-srv.[[REDACT]] --remove 0
INFO: Starting Backup of VM 631 (qemu)
INFO: Backup started at 2025-04-04 13:29:11
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: [[REDACT]]
INFO: include disk 'scsi0' 'local:631/vm-631-disk-0.qcow2' 500G
INFO: snapshots found (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/631/2025-04-04T13:29:11Z'
INFO: starting kvm to execute backup task
INFO: started backup task '0ef1245e-5009-47de-b741-bb0ad37a2a8f'
INFO: scsi0: dirty-bitmap status: created new
INFO: 0% (400.0 MiB of 500.0 GiB) in 3s, read: 133.3 MiB/s, write: 104.0 MiB/s
INFO: 0% (428.0 MiB of 500.0 GiB) in 15m 53s, read: 30.2 KiB/s, write: 30.2 KiB/s
ERROR: backup write data failed: command error: write_data upload error: pipelined request failed: timed out
INFO: aborting backup job
INFO: stopping kvm after backup task
ERROR: Backup of VM 631 failed - backup write data failed: command error: write_data upload error: pipelined request failed: timed out
INFO: Failed at 2025-04-04 13:45:07
INFO: Backup job finished with errors
TASK ERROR: job errors

PBS Log:
Apr 04 07:45:07 pxmx-bkup-srv proxmox-backup-proxy[694]: removing failed backup
Apr 04 07:45:07 pxmx-bkup-srv proxmox-backup-proxy[694]: backup failed: connection error: bytes remaining on stream
Apr 04 07:29:45 pxmx-bkup-srv proxmox-backup-proxy[694]: error during snapshot file listing: 'unable to load blob '"/mnt/datastore/pxmx-bkup/ns/gth/vm/631/2025-04-04T13:29:11Z/index.json.blob"' - No such file or directory (os error 2)'
Apr 04 07:29:13 pxmx-bkup-srv proxmox-backup-proxy[694]: add blob "/mnt/datastore/pxmx-bkup/ns/gth/vm/631/2025-04-04T13:29:11Z/qemu-server.conf.blob" (372 bytes, comp: 372)
Apr 04 07:29:13 pxmx-bkup-srv proxmox-backup-proxy[694]: created new fixed index 1 ("ns/gth/vm/631/2025-04-04T13:29:11Z/drive-scsi0.img.fidx")
Apr 04 07:29:13 pxmx-bkup-srv proxmox-backup-proxy[694]: starting new backup on datastore 'truenas-pxmx-bkup' from ::ffff:10.63.1.100: "ns/gth/vm/631/2025-04-04T13:29:11Z"

During backup (before it fails and deletes things):
root@pxmx-bkup-srv:~# ls -al /mnt/datastore/pxmx-bkup/ns/gth/vm/631/2025-04-04T14\:04\:28Z/
total 8
drwxr-xr-x 2 backup backup 79 Apr 4 08:04 .
drwxr-xr-x 3 backup backup 59 Apr 4 08:04 ..
-rw-r--r-- 1 backup backup 4100096 Apr 4 08:04 drive-scsi0.img.tmp_fidx
-rw-r--r-- 1 backup backup 372 Apr 4 08:04 qemu-server.conf.blob
 
This smells like an MTU problem.

The reasons I say that:
  1. You have a custom MTU - always a strong indicator :) (a quick way to test this is sketched just below the list)
  2. Small packets work fine - like statuses, establishing the connection, and creating the 372-byte config file
  3. During the first large (~4 MB) file write it only allocates the space for the .tmp_fidx, but never finishes writing it
  4. "Pipeline" - I think this is referring to a network pipe, not a software task list, though I could be wrong
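
A quick way to test the path-MTU theory directly from one of the remote nodes (the host name below is just a placeholder for whatever address the clients reach PBS on):

# a 1280-byte tunnel MTU leaves room for 1252 bytes of ICMP payload (1280 - 20 IP - 8 ICMP)
# -M do forbids fragmentation, so oversized probes fail loudly instead of being silently split
ping -M do -s 1252 -c 5 pbs.example.internal   # should get replies if 1280 fits end to end
ping -M do -s 1400 -c 5 pbs.example.internal   # should error or get no replies if 1280 really is the limit
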
You've got a lot of tech in that stack. I would sanity check each component:
- Can you write to the storage volume as root logged into the PBS?
- Can you do a local backup on the same network as the PBS? (create a Proxmox VM on that TrueNAS)
- Can you use rsync or scp to copy over a large (e.g. 4 MB) file? (through each of the network layers, independently and combined - see the sketch below)
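
For that last check, something along these lines is enough to exercise a single large transfer through the tunnel on its own (host and paths are placeholders):

# create a ~4 MB test file, the same order of magnitude as one PBS chunk
dd if=/dev/urandom of=/tmp/chunk-test bs=1M count=4
# copy it through the tunnel to the PBS host; repeat over each network layer separately
scp /tmp/chunk-test root@pbs.example.internal:/tmp/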

If you work through smaller sanity checks in each layer, you'll eventually find the layer where it fails, and that will help immensely in getting help. And the best help might be in a TrueNAS forum or a Cloudflare forum. Or it might be here.
(I say that because plenty of people here will aggressively chase Proxmox-related issues, but when the problem doesn't look Proxmox-related the thread can go quiet, since people here aren't necessarily experts in those other things, or in that particular layering.)

Also, did it ever work?
 
It definitely smells like an MTU/fragmentation issue... but I started with defaults (when the problem first occurred), set everything I could find to 1280 (problem still there), and have now set it all back to defaults. No real improvement or change with any of those configs.
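
One thing I still need to rule out is a PMTUD blackhole: if the ICMP "fragmentation needed" messages are being dropped somewhere along the path, path-MTU discovery breaks silently no matter what the interface MTUs are set to. Roughly this, run on the PVE node while a backup attempt is going (IPv4 only, since the PBS log shows an IPv4-mapped address):

# watch for ICMP "destination unreachable / fragmentation needed" (type 3, code 4) during a backup;
# if the large writes stall and none of these ever show up, PMTU discovery is likely being blackholed
tcpdump -ni any 'icmp[icmptype] = icmp-unreach and icmp[icmpcode] = 4'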

The most infuriating part is that this WAS working over a similar CF WARP tunnel in the past with basically default settings. I rebuilt some things with what I thought was an identical setup, and now it's being stubborn. The goal here is to backhaul several sites for offsite backup, and all of them are having the same issue.

Short term, I've started putting local PBS installs at the remote sites. I tried setting them up to then feed the central location via sync, and a similar issue appears (though with much less verbose logging for the sync process). So I don't think we're dealing with a local storage issue, as the local machines happily back up to local PBS, and the local-PBS-to-remote-PBS sync should have no need for additional temp storage (I believe?).

It has to either be the tunnels or the central PBS install. I may just rebuild the whole central setup from scratch... this is as good an opportunity as any.

Now, while this does look like MTU on the surface, I'm actually starting to think it might be a split route at the central location. Some traceroutes and tracepaths are getting odd results. Ultimately, I can't be the only person using PBS over CF WARP (which I understand to be just WireGuard under the hood), so I find it odd that I'm not seeing a bunch of forum posts from others with similar issues if the tunnel MTU restriction were actually the problem. But how does one end up with a split route that only impacts large packets?
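
(For reference, the sort of comparison behind those "odd results" - 203.0.113.10 below stands in for the central PBS address:)

# which route/interface the kernel actually picks for the PBS address
ip route get 203.0.113.10
# tracepath reports the discovered path MTU hop by hop, so a drop below 1280 shows up here
tracepath -n 203.0.113.10
# traceroute's --mtu mode probes with non-fragmentable packets and prints where the MTU shrinks
traceroute -n --mtu 203.0.113.10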

Anyway - thanks for the couple of clues/thoughts above. It helps having a second voice whose take generally aligns with the areas I've been troubleshooting. I need to dig into this more when I have time. For now the offsite copy is a 'nice to have' (these VMs are pretty easy to rebuild in a pinch).

I know it CAN work, since it did before.
 