PBS getting stuck during backup

maccmiles

Over the past few months I've noticed that when running a backup from Proxmox VE to a remote PBS instance, the process seems to get stuck every time.


On the PBS side, the task never gets past creating the dynamic indexes:

Code:
2022-10-27T15:35:51-04:00: starting new backup on datastore 'DCB': "ct/155/2022-10-27T19:35:50Z"
2022-10-27T15:35:51-04:00: add blob "/mnt/datastore/DataStore/ct/155/2022-10-27T19:35:50Z/pct.conf.blob" (252 bytes, comp: 252)
2022-10-27T15:35:51-04:00: created new dynamic index 1 ("ct/155/2022-10-27T19:35:50Z/catalog.pcat1.didx")
2022-10-27T15:35:51-04:00: created new dynamic index 2 ("ct/155/2022-10-27T19:35:50Z/root.pxar.didx")

On the Proxmox VE side (redacted), it reaches the upload step but the upload never seems to start:

Code:
INFO: starting new backup job: vzdump 155 --storage pbs-test --remove 0 --mode stop --node host2
INFO: Starting Backup of VM 155 (lxc)
INFO: Backup started at 2022-10-27 15:35:50
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: CT Name: bakuptest
INFO: including mount point rootfs ('/') in backup
INFO: including mount point mp0 ('/dataDrive') in backup
INFO: creating Proxmox Backup Server archive 'ct/155/2022-10-27T19:35:50Z'
INFO: run: lxc-usernsexec -m u:0:100000:65536 -m g:0:100000:65536 -- /usr/bin/proxmox-backup-client backup --crypt-mode=none pct.conf:/var/tmp/vzdumptmp2868424_155/etc/vzdump/pct.conf root.pxar:/mnt/vzsnap0 --include-dev /mnt/vzsnap0/./ --include-dev /mnt/vzsnap0/./dataDrive --skip-lost-and-found --exclude=/tmp/?* --exclude=/var/tmp/?* --exclude=/var/run/?*.pid --backup-type ct --backup-id 155 --backup-time 1666899350 --repository backupuser@pam@pbs.example.com:DataStore
INFO: Starting backup: ct/155/2022-10-27T19:35:50Z
INFO: Client name: host2
INFO: Starting backup protocol: Thu Oct 27 15:35:51 2022
INFO: No previous manifest available.
INFO: Upload config file '/var/tmp/vzdumptmp2868424_155/etc/vzdump/pct.conf' to 'backupuser@pam@pbs.example.com:DataStore' as pct.conf.blob
INFO: Upload directory '/mnt/vzsnap0' to 'backupuser@pam@pbs.example.com:DataStore' as root.pxar.didx

The task never fails; left to its own devices it will run for a week until I cancel it.

I'm not sure where to dig on this one. I've tried updates, reboots, and removing / re-adding the PBS storage on the cluster, but nothing gets these backups moving along. I've seen this with both CTs and VMs. Any insight would be appreciated.
 
how long did you wait? did you check the network for traffic? if it's slow to upload, there is not much progress to show
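
for example, something like this on the pve node would show whether packets are actually flowing (interface name is just an example, replace it with your uplink):

Code:
# watch the rx/tx counters of the uplink interface
watch -n 2 'ip -s link show eno24'
# or, if installed, a live per-connection view
iftop -i eno24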
 
Hi @dcsapak,

The longest one I have logged right now ran for 3d 18h. I've tried CTs and VMs of various sizes and they all exhibit the same behaviour.

Previously the transfer would show progress with a percentage on the PVE side; the fact that it no longer does is mainly what raised my concern.

Right now I'm trying a CT with a 1 TB volume; it's been going for about 20h. No progress shows in the job output, and running slurm on the PBS shows it trickling along at 1004.28 KB/s, which to me is very abnormal.

I'd expect to be pulling 25-50 Mbps at minimum while my boxes are handling other requests; if they're idle I'd expect it to saturate the 100 Mbit service to the DC.
 
what's the source and target storage? note that a container backup must (ofc) read all files (and 1 TB is not small)
also what's your network connection between the pve and pbs like? (bandwidth/latency/general network layout?)

i'd check e.g. with 'atop' on both sides where a possible bottleneck could be
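
e.g. something like this on both the pve node and the pbs (the interval is arbitrary):

Code:
# refresh every 2 seconds; watch the CPU, DSK and NET lines at the top
atop 2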
 
The source for this particular run is a CT with a 1 TB disk, running off local LVM (I work in big data, so relative to the numbers I usually see this is chump change). I initially started this thread because I hit this issue with CTs and VMs of seemingly all sizes; I created some very small CTs for testing and they didn't complete either.

The target is a PBS VM instance reached over the open internet, backed by a local logical volume.

SRC atop status: ~2% CPU; NET | eno24 | pcki 6217 | pcko 12601
DST atop status: ~22% CPU; NET | enp2s0 | pcki 8589 | pcko 3784

The service to the PVE colo is 100 Mbit symmetric.
The service to the PBS site is 400 Mbit down / 30 Mbit up.

Even at half saturation I would expect the transfer to be done in ~44 hours, and backups like this usually completed within 20 hours. The current run is on hour 100.
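
(For reference, the rough arithmetic behind that figure: 1 TB is about 8,000,000 Mbit, and at 50 Mbit/s that's 160,000 seconds, i.e. roughly 44 hours, assuming the full volume actually has to cross the wire.)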

Meanwhile, the PBS instance is peaking at 1500 KB/s and I'm watching the datastore grow by 0.01 GB every few minutes, so it does seem to be taking in some bytes. At the current rate, though, I'd anticipate it taking 1400+ hours to complete. I'm also interested in why no progress is shown on the running tasks themselves; historically I believe it gave percentages, though now that it's on day 4 perhaps that's just an artifact of the slowness.

I've made no network or topology changes recently. The remote PBS site returns speedtests in line with what's expected, and a site-to-site connection test with external tools hits the 100 Mbit mark for download from the PVE network to the PBS network. Is there an update or patch I missed, or some configuration I should be poking at to explain why this is suddenly so slow? I can always give the PBS more resources; free memory doesn't look great, though I'm not overflowing into swap. I wouldn't expect that to cause this significant a performance drop, but the PBS VM is sporting the minimum requirements, so I'll give it more as an experiment.
 
mhmm ok, could you try to run a benchmark against the datastore:

Code:
proxmox-backup-client benchmark

(see https://pbs.proxmox.com/docs/backup-client.html for how to give the repository and user)
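
e.g. with the repository string from your task log above it would look something like this (adjust user/host/datastore to your setup):

Code:
proxmox-backup-client benchmark --repository backupuser@pam@pbs.example.com:DataStore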

if there is some data flowing to the pbs, it means it isn't completely stalling, so something must be responsible for the slowness
how high is the latency between the sites? any proxy in between that could be interfering with the traffic?
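
for a quick latency number, a plain ping from the pve node to the pbs host is enough, e.g. (hostname taken from your log, adjust as needed):

Code:
ping -c 10 pbs.example.com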
 
Code:
root@pbs:~# proxmox-backup-client benchmark
SHA256 speed: 183.58 MB/s
Compression speed: 194.84 MB/s
Decompress speed: 240.48 MB/s
AES256/GCM speed: 494.24 MB/s
Verify speed: 102.94 MB/s
┌───────────────────────────────────┬───────────────────┐
│ Name │ Value │
╞═══════════════════════════════════╪═══════════════════╡
│ TLS (maximal backup upload speed) │ not tested │
├───────────────────────────────────┼───────────────────┤
│ SHA256 checksum computation speed │ 183.58 MB/s (9%) │
├───────────────────────────────────┼───────────────────┤
│ ZStd level 1 compression speed │ 194.84 MB/s (26%) │
├───────────────────────────────────┼───────────────────┤
│ ZStd level 1 decompression speed │ 240.48 MB/s (20%) │
├───────────────────────────────────┼───────────────────┤
│ Chunk verification speed │ 102.94 MB/s (14%) │
├───────────────────────────────────┼───────────────────┤
│ AES256 GCM encryption speed │ 494.24 MB/s (14%) │
└───────────────────────────────────┴───────────────────┘
 
you have to run the benchmark against a repository (see the docs link i sent), otherwise there is no data leaving the client and it's basically a cpu/memory benchmark
 
Sorry, really long week, was preoccupied bricking one of my hypervisors.

Code:
root@pvehost:~# proxmox-backup-client benchmark --repository root@pam@pbs.example.com:DatastoreTest
Password for "root@pam": ****************************************
Uploaded 10 chunks in 26 seconds.
Time per request: 2612036 microseconds.
TLS speed: 1.61 MB/s
SHA256 speed: 240.78 MB/s
Compression speed: 276.09 MB/s
Decompress speed: 400.37 MB/s
AES256/GCM speed: 554.36 MB/s
Verify speed: 149.48 MB/s
┌───────────────────────────────────┬───────────────────┐
│ Name │ Value │
╞═══════════════════════════════════╪═══════════════════╡
│ TLS (maximal backup upload speed) │ 1.61 MB/s (0%) │
├───────────────────────────────────┼───────────────────┤
│ SHA256 checksum computation speed │ 240.78 MB/s (12%) │
├───────────────────────────────────┼───────────────────┤
│ ZStd level 1 compression speed │ 276.09 MB/s (37%) │
├───────────────────────────────────┼───────────────────┤
│ ZStd level 1 decompression speed │ 400.37 MB/s (33%) │
├───────────────────────────────────┼───────────────────┤
│ Chunk verification speed │ 149.48 MB/s (20%) │
├───────────────────────────────────┼───────────────────┤
│ AES256 GCM encryption speed │ 554.36 MB/s (15%) │
└───────────────────────────────────┴───────────────────┘
 
TLS speed: 1.61 MB/s
ok, so it seems there is actually some problem in the network

what's the latency / bandwidth between the sites? anything in between that might cause trouble? (reverse proxy/firewall/etc)

also, could you check the network with iperf/iperf3 between your pve and pbs?
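
a minimal run would be something like this (plain TCP on the default port 5201, so make sure any firewall in between allows it; hostname is just an example):

Code:
# on the pbs side
iperf3 -s
# on the pve side; add -R to test the reverse direction as well
iperf3 -c pbs.example.com
iperf3 -c pbs.example.com -R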
 
