Backup jobs hang when running in parallel

wailer

New Member
Nov 5, 2024
Hi there,

We are seeing strange behaviour when backing up to PBS. We have one job that backs up all nodes (2 nodes) every day at 21:00. This job hangs on both nodes within a few minutes of starting, without any error.

The funny thing is, when we select only one node, the backup does not hang. This started happening 2 or 3 days ago. This is my infrastructure:

2 x PVE nodes

~# pveversion
pve-manager/8.3.1/fb48e850ef9dde27 (running kernel: 6.8.12-5-pve)

2 x PBS hosts (1 onsite - 1 offsite)

The problem happens on all nodes and with both PBS hosts. I have tried playing with the advanced options, lowering the bandwidth limit and the number of IO workers, but nothing works.
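For reference, the knobs mentioned above can also be set node-wide in `/etc/vzdump.conf` on each PVE node (per-job values configured in the GUI override them). A minimal sketch; the values below are illustrative examples, not recommendations:

```shell
# /etc/vzdump.conf -- node-wide vzdump defaults (example values only)
bwlimit: 204800            # I/O bandwidth limit in KiB/s (here: 200 MiB/s)
ionice: 7                  # ionice priority for the backup process (0-8)
performance: max-workers=8 # number of parallel I/O workers per backup
```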

[screenshot attachment: 1733823196223.png]

Jobs get stuck like this:

[screenshot attachment: 1733823331000.png]

Any idea?
 
anything visible in the PBS-side task logs? are those also still running when you run into the issue?
 
Hi there,

On the PBS side, the first thing I notice when this happens is that there is no CPU load or IO delay. When I check the log for a stuck job, I see a normal log like this:

[screenshot attachment: 1733828707686.png]
 
please post the full task log of both sides..
 
Hello,

Sorry for the delay, but I did a full backup cycle that worked well, backing up host by host with only one concurrent task running.

A few minutes ago I started the normal backup job, which backs up both nodes, and the jobs started failing, this time with a different behaviour.

Here are the logs for one VM:

PVE SIDE

INFO: Starting Backup of VM 104 (qemu)
INFO: Backup started at 2024-12-11 21:17:25
INFO: status = running
INFO: VM Name: ISP-Tacacs
INFO: include disk 'scsi0' 'SSD1-HA:vm-104-disk-2' 20G
INFO: backup mode: snapshot
INFO: bandwidth limit: 204800 KiB/s
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/104/2024-12-11T20:17:25Z'
INFO: enabling encryption
enable RBD image features this kernel RBD drivers supports: fast-diff,object-map,exclusive-lock
/dev/rbd12
INFO: drive-scsi0: attaching fleecing image SSD1-HA:vm-104-fleece-0 to QEMU
INFO: started backup task '1531310c-3269-4a31-b917-530ada8959dc'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (1.4 GiB of 20.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 1.4 GiB dirty of 20.0 GiB total
INFO: 27% (408.0 MiB of 1.4 GiB) in 3s, read: 136.0 MiB/s, write: 132.0 MiB/s
INFO: 43% (644.0 MiB of 1.4 GiB) in 6s, read: 78.7 MiB/s, write: 78.7 MiB/s
INFO: 50% (740.0 MiB of 1.4 GiB) in 9s, read: 32.0 MiB/s, write: 32.0 MiB/s
INFO: 50% (740.0 MiB of 1.4 GiB) in 16m 39s, read: 0 B/s, write: 0 B/s
ERROR: backup write data failed: command error: protocol canceled
INFO: aborting backup job
INFO: resuming VM again
trying to acquire cfs lock 'storage-SSD1-HA' ...
trying to acquire cfs lock 'storage-SSD1-HA' ...
Removing image: 1% complete...
[... "Removing image: N% complete..." repeated for 2% through 99% ...]
Removing image: 100% complete...done.
ERROR: Backup of VM 104 failed - backup write data failed: command error: protocol canceled
INFO: Failed at 2024-12-11 21:34:08

PBS SIDE

2024-12-11T21:17:25+01:00: download 'index.json.blob' from previous backup.
2024-12-11T21:17:26+01:00: register chunks in 'drive-scsi0.img.fidx' from previous backup.
2024-12-11T21:17:26+01:00: download 'drive-scsi0.img.fidx' from previous backup.
2024-12-11T21:17:26+01:00: created new fixed index 1 ("vm/104/2024-12-11T20:17:25Z/drive-scsi0.img.fidx")
2024-12-11T21:17:26+01:00: add blob "/mnt/datastore/ZFS1/vm/104/2024-12-11T20:17:25Z/qemu-server.conf.blob" (389 bytes, comp: 389)
2024-12-11T21:34:13+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection: host unreachable
2024-12-11T21:34:13+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection: host unreachable
2024-12-11T21:34:13+01:00: backup failed: connection error: host unreachable
2024-12-11T21:34:13+01:00: removing failed backup
2024-12-11T21:34:13+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection: host unreachable
2024-12-11T21:34:13+01:00: removing backup snapshot "/mnt/datastore/ZFS1/vm/104/2024-12-11T20:17:25Z"
2024-12-11T21:34:13+01:00: TASK ERROR: connection error: host unreachable
 
Also, right when the two jobs started running in parallel, I started seeing a flood of these connection errors to the PBS datastore:

Dec 11 21:00:36 hostx pvestatd[2392]: pbs2 error fetching datastores - 500 Can't connect to x.x.x.x:8007 (Connection timed out)
Dec 11 21:00:46 hostx pvestatd[2392]: pbs2: error fetching datastores - 500 Can't connect to x.x.x.x:8007 (Connection timed out)
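When pvestatd reports timeouts like this, a quick way to separate a PBS problem from a network problem is to probe port 8007 from the PVE node while a backup is running. A hedged sketch (replace `x.x.x.x` with your PBS address, as in the log above; `ZFS1` is the datastore name taken from the PBS log):

```shell
# Is the PBS API port reachable at all right now?
nc -vz -w 5 x.x.x.x 8007

# Does the TLS endpoint answer? (-k skips certificate verification)
curl -k --max-time 10 https://x.x.x.x:8007

# Ask PBS for datastore status over the same path PVE uses
# (repository format: user@realm@host:datastore)
proxmox-backup-client status --repository root@pam@x.x.x.x:ZFS1
```

If the port probe succeeds when idle but times out as soon as two backups run, that points at something in the network path rather than at PBS itself.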
 
yeah, either your network or your PBS seems overloaded and drops the connection?
 
Just letting you know that it was a network problem: a bandwidth license limitation on a device that was in the path. After bypassing it, everything started working smoothly.

Thanks!
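For anyone hitting the same symptom: a throttling device in the path is easy to confirm with an iperf3 run between a PVE node and the PBS host. A shaper typically shows full speed for a few seconds and then collapses, which matches the backups stalling shortly after they start. A sketch, assuming iperf3 is installed on both ends and `x.x.x.x` is the PBS address:

```shell
# On the PBS host: start a listener
iperf3 -s

# On the PVE node: push traffic for 60 seconds toward the PBS host
# (-t 60 is long enough to trip a traffic shaper that allows short bursts)
iperf3 -c x.x.x.x -t 60
```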
 