Backup to PBS fails (snapshot/stop/suspend) – connection drops mid-transfer

emir1984

New Member
Mar 11, 2024
Hey there,

I’m experiencing consistent backup failures when using Proxmox Backup Server (PBS), regardless of backup mode (snapshot, stop, or suspend). The VM runs fine otherwise, but as soon as a backup starts, the transfer begins normally and then suddenly stalls mid-process.
On the PBS side, no specific error appears beyond connection resets. MTU tests indicate fragmentation issues with anything above ~1300 bytes, even though both ends are configured with MTU 1500 and standard networking appears stable.

The same VM setup on another identical Proxmox node (hardware, config, PBS target) works without issue.

For example, a backup gets stuck at 38%. The point varies: sometimes it stalls after a couple of seconds, sometimes it runs to around 50%. From the moment it gets stuck, there is no traffic at all.
I even reinstalled the server and I just can’t figure it out.


Please advise how to further debug or resolve this. Let me know if logs or further details are required.
 
On the PBS side, no specific error appears beyond connection resets. MTU tests indicate fragmentation issues with anything above ~1300 bytes, even though both ends are configured with MTU 1500 and standard networking appears stable.
What NICs are used for the routes the backup traffic takes? There were similar reports where TCP segmentation offloading to the NIC firmware was at fault, see https://forum.proxmox.com/threads/pbs-sync-failed-each-time.113921/#post-573939. In those cases, though, the backup failed with errors rather than stalling as you describe.

Please advise how to further debug or resolve this. Let me know if logs or further details are required.
First off, please post the output of pveversion -v and proxmox-backup-manager version --verbose. Further, check the system journal on both sides for errors around the time the backup stalls. While the backup is stalled, can you ping the PBS from the PVE host?
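To make the offloading and journal checks concrete, here is a minimal sketch. The interface name eno1 and the 15-minute window are placeholders, not values from this thread, and the run wrapper is a hypothetical helper that only prints the commands unless DRY_RUN=0 is set. Disabling TSO this way does not persist across reboots.

```shell
#!/bin/sh
# Hypothetical interface name; replace it with the NIC that carries the backup traffic.
IFACE="${IFACE:-eno1}"

# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Show the current offload settings, then switch TCP segmentation offload off.
check_offload() {
    run ethtool -k "$1"
    run ethtool -K "$1" tso off
}

# List journal entries of priority warning or higher from the last 15 minutes.
recent_errors() {
    run journalctl --since "15 minutes ago" -p warning --no-pager
}

# Usage (as root, on both PVE and PBS):
#   DRY_RUN=0 check_offload "$IFACE"
#   DRY_RUN=0 recent_errors
```

If the backup runs through with TSO off, that would point at the NIC firmware. If nothing changes, re-enable it with ethtool -K and the same interface, using tso on.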
 
Hi,

the NIC used on the PVE node is a BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controller by Broadcom.

Unfortunately I can't tell which one is used on the PBS side, as it is running inside a KVM guest.

But - I can tell that creating backups of VMs works with the same PBS with other PVE nodes, even with the same hardware specs.

A couple of days ago I already had this problem on another PVE node. I have now moved to a completely new server with a freshly installed PVE instance. The only thing I did was migrate the VMs from the old node to the new one. I migrated the VM disks as well as the config files manually.

I've already tried so many things, nothing works.

I checked the journal on both the PVE and PBS nodes during the stalled backup – no obvious errors on either side, except for occasional connection resets on the PBS side.
While the backup is stalled, the ping to the PBS from the PVE host does not work when using packet sizes above ~1200 bytes with -M do. It shows message too long errors, even though MTU is set to 1500 on both ends. Smaller packets go through.
Regular traffic (SSH, web UI, etc.) works fine. Only during the backup does the connection appear to drop or degrade significantly.
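For reference, the ping test can be turned into a search for the largest payload that still gets through. This is only a sketch: pbs.example.com is a placeholder host, the functions probe and largest_payload are names made up here, and the binary search assumes that every size below the limit passes, which random packet loss can violate. The 1472 upper bound is 1500 minus 20 bytes of IPv4 header and 8 bytes of ICMP header.

```shell
#!/bin/sh
# probe SIZE HOST: succeeds if one ICMP echo with SIZE payload bytes and the
# "don't fragment" flag set (-M do) gets a reply within 2 seconds.
probe() {
    ping -M do -c 1 -W 2 -s "$1" "$2" >/dev/null 2>&1
}

# largest_payload HOST [LOW] [HIGH]: binary search for the largest payload
# size that still passes. Prints -1 if no size in the range works.
largest_payload() {
    host=$1
    lo=${2:-0}
    hi=${3:-1472}
    best=-1
    while [ "$lo" -le "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        if probe "$mid" "$host"; then
            best=$mid
            lo=$(( mid + 1 ))
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo "$best"
}

# Usage (placeholder host name):
#   payload=$(largest_payload pbs.example.com)
#   echo "usable path MTU is roughly $(( payload + 28 )) bytes"
```

Since ping reports "message too long" locally, the kernel may already hold a lowered path MTU for this destination. Running ip route get with the PBS address may show it as a cached mtu value.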

This is pveversion -v:

Code:
root@root428:~# pveversion -v
proxmox-ve: 8.4.0 (running kernel: 6.8.12-11-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8.12-11-pve-signed: 6.8.12-11
proxmox-kernel-6.8: 6.8.12-11
ceph-fuse: 16.2.15+ds-0+deb12u1
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx11
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.1-1
proxmox-backup-file-restore: 3.4.1-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.11
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: not correctly installed
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.1
pve-firmware: 3.15-4
pve-ha-manager: 4.0.7
pve-i18n: 3.4.4
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2

And this is proxmox-backup-manager version --verbose:

Code:
root@pbs45:~# proxmox-backup-manager version --verbose
proxmox-backup                      3.4.0         running kernel: 6.8.12-11-pve
proxmox-backup-server               3.4.1-1       running version: 3.4.1       
proxmox-kernel-helper               8.1.1                                     
proxmox-kernel-6.8.12-11-pve-signed 6.8.12-11                                 
proxmox-kernel-6.8                  6.8.12-11                                 
proxmox-kernel-6.8.12-9-pve-signed  6.8.12-9                                   
ifupdown2                           3.2.0-1+pmx11                             
libjs-extjs                         7.0.0-5                                   
proxmox-backup-docs                 3.4.1-1                                   
proxmox-backup-client               3.4.1-1                                   
proxmox-mail-forward                0.3.2                                     
proxmox-mini-journalreader          1.4.0                                     
proxmox-offline-mirror-helper       0.6.7                                     
proxmox-widget-toolkit              4.3.11                                     
pve-xtermjs                         5.5.0-2                                   
smartmontools                       7.3-pve1                                   
zfsutils-linux                      2.2.7-pve2


Thanks in advance
 
While the backup is stalled, the ping to the PBS from the PVE host does not work when using packet sizes above ~1200 bytes with -M do. It shows message too long errors, even though MTU is set to 1500 on both ends. Smaller packets go through.
Regular traffic (SSH, web UI, etc.) works fine. Only during the backup does the connection appear to drop or degrade significantly.
What does the network between the PVE and the PBS look like? Are they on the same subnet? Connected to the same switch? Are there proxies or network tunnels involved?
 
They are not on the same subnet, they are in different networks with different ASNs in different datacenters. No proxies or tunnels.
 
You could try to see which route the network packets take and which intermediate hop drops them, for example via mtr, which also has the -s option to set a packet size. You can install it via apt install mtr-tiny on your PVE host. You could also lower the MTU on the relevant interfaces.
 
As I said, I don't expect it to be a network-related issue, as it works without any problems from another PVE node at the same hosting provider (same ASN) to the same PBS.

On the other hand, even if I create a fresh VM and try to back it up, I get the same result ...

It just doesn't make sense at all.

Any more ideas?
 
Hi,

I have a new update: Suddenly backups work for some VMs, but not for all.
 
How big are the disks of these VMs compared to the ones where the backup fails? I wouldn't exclude networking issues just yet.

As said, I don't expect it to be a network related issue as it works from another PVE node from the same hosting provider (same ASN) to the same PBS without any issues.
So the PBS behaves as expected for that PVE client, which makes me suspect a (probably intermittent) networking issue as the cause of the problems for the other one. As said, I would continuously run mtr on the PVE host pointing at the PBS to catch possible network hiccups.