Backup jobs hang when running in parallel

wailer

Nov 5, 2024
Hi there,

We are seeing some strange behaviour when backing up to PBS. We have one job that backs up all nodes (2 nodes) every day at 21:00. This job hangs on both nodes a few minutes after starting, without any error.

The funny thing is, when we select only one node, the backup process does not hang. This started happening 2 or 3 days ago. This is my infrastructure:

2 x PVE nodes

~# pveversion
pve-manager/8.3.1/fb48e850ef9dde27 (running kernel: 6.8.12-5-pve)

2 x PBS hosts (1 onsite - 1 offsite)

The problem happens on all nodes and with both PBS hosts. I have tried playing with the advanced options, lowering the bandwidth limit and the number of IO workers, but nothing works.
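For reference, these are the knobs I mean, expressed as vzdump settings (just a sketch with illustrative values; the 204800 KiB/s limit is the one you can see in the task log further down):

~# cat /etc/vzdump.conf
# node-wide vzdump defaults (values are examples, not a recommendation)
# bandwidth cap in KiB/s
bwlimit: 204800
# best-effort I/O priority, 7 = lowest
ionice: 7
# number of parallel I/O workers per backup task
performance: max-workers=8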

[screenshot attached]

Jobs get stuck like this:

[screenshot: stuck backup jobs]

Any idea?
 
Anything visible in the PBS-side task logs? Are those tasks also still running when you run into the issue?
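Those can also be pulled on the PBS host via the CLI, e.g. (replace the UPID with the one of the stuck task):

~# proxmox-backup-manager task list
~# proxmox-backup-manager task log 'UPID:...'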
 
Hi there,

On the PBS side, the first thing I notice when this happens is that there is no CPU load or IO delay. When I check the log of a stuck job, I see a normal log like this:

[screenshot: PBS task log]
 
Please post the full task logs of both sides.
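On the PVE side you can dump a task's full log on the node itself, e.g. (substitute the UPID of the stuck vzdump task):

~# pvenode task list
~# pvenode task log 'UPID:...'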
 
Hello,

Sorry for the delay. I ran a full backup cycle, backing up host by host with only one concurrent task running, and it worked well.
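(In case it is useful: "host by host" here means one backup job per node with staggered schedules, roughly along these lines in /etc/pve/jobs.cfg; the job IDs, node names, storage name and times are only illustrative, not my exact config.)

vzdump: backup-node1
	schedule 21:00
	node pve1
	all 1
	storage pbs-onsite
	mode snapshot

vzdump: backup-node2
	schedule 23:00
	node pve2
	all 1
	storage pbs-onsite
	mode snapshot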

A few minutes ago I started the normal backup job, which backs up both nodes, and the jobs started failing, this time with a different behaviour.

Here are the logs for one VM.

PVE SIDE

INFO: Starting Backup of VM 104 (qemu)
INFO: Backup started at 2024-12-11 21:17:25
INFO: status = running
INFO: VM Name: ISP-Tacacs
INFO: include disk 'scsi0' 'SSD1-HA:vm-104-disk-2' 20G
INFO: backup mode: snapshot
INFO: bandwidth limit: 204800 KiB/s
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/104/2024-12-11T20:17:25Z'
INFO: enabling encryption
enable RBD image features this kernel RBD drivers supports: fast-diff,object-map,exclusive-lock
/dev/rbd12
INFO: drive-scsi0: attaching fleecing image SSD1-HA:vm-104-fleece-0 to QEMU
INFO: started backup task '1531310c-3269-4a31-b917-530ada8959dc'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (1.4 GiB of 20.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 1.4 GiB dirty of 20.0 GiB total
INFO: 27% (408.0 MiB of 1.4 GiB) in 3s, read: 136.0 MiB/s, write: 132.0 MiB/s
INFO: 43% (644.0 MiB of 1.4 GiB) in 6s, read: 78.7 MiB/s, write: 78.7 MiB/s
INFO: 50% (740.0 MiB of 1.4 GiB) in 9s, read: 32.0 MiB/s, write: 32.0 MiB/s
INFO: 50% (740.0 MiB of 1.4 GiB) in 16m 39s, read: 0 B/s, write: 0 B/s
ERROR: backup write data failed: command error: protocol canceled
INFO: aborting backup job
INFO: resuming VM again
trying to acquire cfs lock 'storage-SSD1-HA' ...
trying to acquire cfs lock 'storage-SSD1-HA' ...
Removing image: 1% complete...
Removing image: 2% complete...
[... progress lines trimmed ...]
Removing image: 99% complete...
Removing image: 100% complete...done.
ERROR: Backup of VM 104 failed - backup write data failed: command error: protocol canceled
INFO: Failed at 2024-12-11 21:34:08

PBS SIDE

2024-12-11T21:17:25+01:00: download 'index.json.blob' from previous backup.
2024-12-11T21:17:26+01:00: register chunks in 'drive-scsi0.img.fidx' from previous backup.
2024-12-11T21:17:26+01:00: download 'drive-scsi0.img.fidx' from previous backup.
2024-12-11T21:17:26+01:00: created new fixed index 1 ("vm/104/2024-12-11T20:17:25Z/drive-scsi0.img.fidx")
2024-12-11T21:17:26+01:00: add blob "/mnt/datastore/ZFS1/vm/104/2024-12-11T20:17:25Z/qemu-server.conf.blob" (389 bytes, comp: 389)
2024-12-11T21:34:13+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection: host unreachable
2024-12-11T21:34:13+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection: host unreachable
2024-12-11T21:34:13+01:00: backup failed: connection error: host unreachable
2024-12-11T21:34:13+01:00: removing failed backup
2024-12-11T21:34:13+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection: host unreachable
2024-12-11T21:34:13+01:00: removing backup snapshot "/mnt/datastore/ZFS1/vm/104/2024-12-11T20:17:25Z"
2024-12-11T21:34:13+01:00: TASK ERROR: connection error: host unreachable
 
Also, right at the time when the two jobs started running in parallel, I started seeing a flood of these connection errors to the PBS datastore:

Dec 11 21:00:36 hostx pvestatd[2392]: pbs2: error fetching datastores - 500 Can't connect to x.x.x.x:8007 (Connection timed out)
Dec 11 21:00:46 hostx pvestatd[2392]: pbs2: error fetching datastores - 500 Can't connect to x.x.x.x:8007 (Connection timed out)
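A quick way to check by hand whether the PBS port is reachable from a PVE node while a job is stuck (x.x.x.x being the PBS address, as in the log above):

~# nc -zv -w 5 x.x.x.x 8007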
 
Yeah, either your network or your PBS seems overloaded and drops the connection?
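To rule out the path itself, a raw throughput test between a PVE node and the PBS host would help, e.g. with iperf3 (not installed by default on either side):

# on the PBS host:
~# iperf3 -s
# on a PVE node:
~# iperf3 -c <pbs-address> -t 30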
 
Just letting you know that it was a network problem: a bandwidth license limitation on a device in the path. After bypassing it, everything started working smoothly.

Thanks!
 