Backup jobs hang when running in parallel

wailer

New Member
Nov 5, 2024
Hi there,

We are seeing strange behaviour when backing up to PBS. We have one job that backs up all nodes (2 nodes) every day at 21:00. This job hangs on both nodes within a few minutes of starting, without any error.

The funny thing is, when we select only one node, the backup does not hang. This started happening 2 or 3 days ago. This is my infrastructure:

2 x PVE nodes

~# pveversion
pve-manager/8.3.1/fb48e850ef9dde27 (running kernel: 6.8.12-5-pve)

2 x PBS hosts (1 onsite - 1 offsite)

The problem happens on all nodes and with both PBS hosts. I have tried playing with the advanced options, lowering the bandwidth limit and the number of IO workers, but nothing works.
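For reference, the knobs mentioned above can also be set node-wide in `/etc/vzdump.conf` on each PVE node (per-job values configured in the GUI override them). A minimal sketch; the values below are illustrative examples, not recommendations:

```shell
# /etc/vzdump.conf -- node-wide vzdump defaults (example values only)
bwlimit: 204800            # I/O bandwidth limit in KiB/s (here: 200 MiB/s)
ionice: 7                  # ionice priority for the backup process (0-8)
performance: max-workers=8 # number of parallel I/O workers per backup
```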

[screenshot attachment: 1733823196223.png]

Jobs get stuck like this:

[screenshot attachment: 1733823331000.png]

Any idea?
 
anything visible in the PBS-side task logs? are those also still running when you run into the issue?
 
Hi there,

On the PBS side, the first thing I notice when this happens is that there is no CPU load or IO delay. When I check the log for a stuck job, I see a normal log like this:

[screenshot attachment: 1733828707686.png]
 
please post the full task log of both sides..
 
Hello,

Sorry for the delay, but I did a full backup cycle that worked well, backing up host by host with only one concurrent task running.

A few minutes ago I started the normal backup job, which backs up both nodes, and the jobs started failing, this time with a different behaviour.

Here are the logs for one VM:

PVE SIDE

INFO: Starting Backup of VM 104 (qemu)
INFO: Backup started at 2024-12-11 21:17:25
INFO: status = running
INFO: VM Name: ISP-Tacacs
INFO: include disk 'scsi0' 'SSD1-HA:vm-104-disk-2' 20G
INFO: backup mode: snapshot
INFO: bandwidth limit: 204800 KiB/s
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/104/2024-12-11T20:17:25Z'
INFO: enabling encryption
enable RBD image features this kernel RBD drivers supports: fast-diff,object-map,exclusive-lock
/dev/rbd12
INFO: drive-scsi0: attaching fleecing image SSD1-HA:vm-104-fleece-0 to QEMU
INFO: started backup task '1531310c-3269-4a31-b917-530ada8959dc'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (1.4 GiB of 20.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 1.4 GiB dirty of 20.0 GiB total
INFO: 27% (408.0 MiB of 1.4 GiB) in 3s, read: 136.0 MiB/s, write: 132.0 MiB/s
INFO: 43% (644.0 MiB of 1.4 GiB) in 6s, read: 78.7 MiB/s, write: 78.7 MiB/s
INFO: 50% (740.0 MiB of 1.4 GiB) in 9s, read: 32.0 MiB/s, write: 32.0 MiB/s
INFO: 50% (740.0 MiB of 1.4 GiB) in 16m 39s, read: 0 B/s, write: 0 B/s
ERROR: backup write data failed: command error: protocol canceled
INFO: aborting backup job
INFO: resuming VM again
trying to acquire cfs lock 'storage-SSD1-HA' ...
trying to acquire cfs lock 'storage-SSD1-HA' ...
Removing image: 1% complete...
[... "Removing image: N% complete..." repeated for 2% through 99% ...]
Removing image: 100% complete...done.
ERROR: Backup of VM 104 failed - backup write data failed: command error: protocol canceled
INFO: Failed at 2024-12-11 21:34:08

PBS SIDE

2024-12-11T21:17:25+01:00: download 'index.json.blob' from previous backup.
2024-12-11T21:17:26+01:00: register chunks in 'drive-scsi0.img.fidx' from previous backup.
2024-12-11T21:17:26+01:00: download 'drive-scsi0.img.fidx' from previous backup.
2024-12-11T21:17:26+01:00: created new fixed index 1 ("vm/104/2024-12-11T20:17:25Z/drive-scsi0.img.fidx")
2024-12-11T21:17:26+01:00: add blob "/mnt/datastore/ZFS1/vm/104/2024-12-11T20:17:25Z/qemu-server.conf.blob" (389 bytes, comp: 389)
2024-12-11T21:34:13+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection: host unreachable
2024-12-11T21:34:13+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection: host unreachable
2024-12-11T21:34:13+01:00: backup failed: connection error: host unreachable
2024-12-11T21:34:13+01:00: removing failed backup
2024-12-11T21:34:13+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection: host unreachable
2024-12-11T21:34:13+01:00: removing backup snapshot "/mnt/datastore/ZFS1/vm/104/2024-12-11T20:17:25Z"
2024-12-11T21:34:13+01:00: TASK ERROR: connection error: host unreachable
 
Also, right when the two jobs started running in parallel, I started seeing a flood of these connection errors to the PBS datastore:

Dec 11 21:00:36 hostx pvestatd[2392]: pbs2 error fetching datastores - 500 Can't connect to x.x.x.x:8007 (Connection timed out)
Dec 11 21:00:46 hostx pvestatd[2392]: pbs2: error fetching datastores - 500 Can't connect to x.x.x.x:8007 (Connection timed out)
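When pvestatd reports timeouts like this, a quick way to separate a PBS problem from a network problem is to probe port 8007 from the PVE node while a backup is running. A hedged sketch (replace `x.x.x.x` with your PBS address, as in the log above; `ZFS1` is the datastore name taken from the PBS log):

```shell
# Is the PBS API port reachable at all right now?
nc -vz -w 5 x.x.x.x 8007

# Does the TLS endpoint answer? (-k skips certificate verification)
curl -k --max-time 10 https://x.x.x.x:8007

# Ask PBS for datastore status over the same path PVE uses
# (repository format: user@realm@host:datastore)
proxmox-backup-client status --repository root@pam@x.x.x.x:ZFS1
```

If the port probe succeeds when idle but times out as soon as two backups run, that points at something in the network path rather than at PBS itself.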
 
yeah, either your network or your PBS seems overloaded and drops the connection?
 
Just letting you know that it was a network problem: a bandwidth license limitation on a device that was in the path. After bypassing it, everything started working smoothly.

Thanks!
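For anyone hitting the same symptom: a throttling device in the path is easy to confirm with an iperf3 run between a PVE node and the PBS host. A shaper typically shows full speed for a few seconds and then collapses, which matches the backups stalling shortly after they start. A sketch, assuming iperf3 is installed on both ends and `x.x.x.x` is the PBS address:

```shell
# On the PBS host: start a listener
iperf3 -s

# On the PVE node: push traffic for 60 seconds toward the PBS host
# (-t 60 is long enough to trip a traffic shaper that allows short bursts)
iperf3 -c x.x.x.x -t 60
```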
 