Unexpected backup failures

vaschthestampede · Jun 4, 2026

Chris said:
Do you also have another, unrelated datastore backed by S3 running on this instance?

No, I do not.

Chris said:
I'm asking since your backtrace includes proxmox-s3-client code paths and there was a recent bugfix for hanging proxies [1]

I also keep PBS up to date with the no-subscription repository.

Code:

proxmox-backup: 4.2.0 (running kernel: 7.0.6-2-pve)
proxmox-backup-server: 4.2.1-1 (running version: 4.2.1)
proxmox-kernel-helper: 9.2.0
proxmox-kernel-7.0: 7.0.6-2
proxmox-kernel-7.0.6-2-pve-signed: 7.0.6-2
proxmox-kernel-7.0.2-6-pve-signed: 7.0.2-6
proxmox-kernel-6.17: 6.17.13-13
proxmox-kernel-6.17.13-13-pve-signed: 6.17.13-13
proxmox-kernel-6.17.2-1-pve-signed: 6.17.2-1
ifupdown2: 3.3.0-1+pmx12
libjs-extjs: 7.0.0-5
proxmox-backup-docs: 4.2.1-1
proxmox-backup-client: 4.2.1-1
proxmox-mail-forward: 1.0.3
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.4
proxmox-widget-toolkit: 5.2.3
pve-xtermjs: 6.0.0-1
smartmontools: 7.5-pve2
zfsutils-linux: 2.4.2-pve1

vaschthestampede · Jun 19, 2026

After days of relative calm (few, if any, backup failures), yesterday there were 10.

Any news on the proxy.backtrace file analysis?

vaschthestampede · Monday at 08:38

Please help, failed backups are continuing.

Chris · Monday at 09:37

vaschthestampede said:
Any news on the proxy.backtrace file analysis?

As stated previously, the proxy backtrace points towards lock contention on chunk insert being the issue here. Chunks being inserted into the datastore are synchronized using a chunk store lock. Please open an issue at bugzilla.proxmox.com for this; we should have a look at how to reduce lock contention.

For the time being, I'm afraid you will have to work around this by rescheduling some of your backups so that they run without issues.
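
To illustrate why a single chunk store lock can limit many simultaneous backups, here is a toy model in Python. It is not how PBS is implemented; the function name, the client counts, the chunk count, and the lock hold time are all assumptions chosen only for illustration.

```python
def serialized_insert_seconds(clients: int, chunks_per_client: int, lock_hold_ms: int) -> float:
    """Lower bound on wall-clock time when every chunk insert holds one shared lock.

    Inserts cannot overlap while the lock is held, so the hold times add up
    across all clients, no matter how many cores or disks are idle.
    """
    if min(clients, chunks_per_client, lock_hold_ms) < 0:
        raise ValueError("arguments must be non-negative")
    return clients * chunks_per_client * lock_hold_ms / 1000


if __name__ == "__main__":
    # Hypothetical numbers: 1000 new chunks per client, 2 ms under the lock per insert.
    for clients in (1, 6, 18):
        seconds = serialized_insert_seconds(clients, 1000, 2)
        print(f"{clients:2d} clients: at least {seconds:.0f} s of serialized inserts")
```

In this model the serialized portion grows linearly with the number of clients writing at once, which is one way to see why spreading backups over time may help even when CPU and disks look idle.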

Robert Obkircher · 2026-06-30T12:44:30+0200

Chris said:
Do you also have another, unrelated datastore backed by S3 running on this instance? I'm asking since your backtrace includes proxmox-s3-client code paths and there was a recent bugfix for hanging proxies [1], so this might be related (not packaged yet at the time of writing).

[1] https://git.proxmox.com/?p=proxmox-backup.git;a=commit;h=23400016322c7a6981f111558e8d22666e32ee8c

This patch is available in version 4.2.2-1, so I would suggest upgrading. It won't reduce lock contention, but it should make the server more responsive.

Would you mind upgrading and sending a new version of the backtrace? As far as I can tell the one in the Bugzilla issue is still from an older version.

vaschthestampede · 2026-06-30T14:14:18+0200

Robert Obkircher said:
This patch is available in version 4.2.2-1

I'm already on that version...

Code:

proxmox-backup: 4.2.0 (running kernel: 7.0.12-1-pve)
proxmox-backup-server: 4.2.2-1 (running version: 4.2.2)
proxmox-kernel-helper: 9.2.0
proxmox-kernel-7.0: 7.0.12-1
proxmox-kernel-7.0.12-1-pve-signed: 7.0.12-1
proxmox-kernel-7.0.6-2-pve-signed: 7.0.6-2
proxmox-kernel-6.17: 6.17.13-13
proxmox-kernel-6.17.13-13-pve-signed: 6.17.13-13
proxmox-kernel-6.17.2-1-pve-signed: 6.17.2-1
ifupdown2: 3.3.0-1+pmx12
libjs-extjs: 7.0.0-5
proxmox-backup-docs: 4.2.2-1
proxmox-backup-client: 4.2.2-1
proxmox-mail-forward: 1.0.3
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.4
proxmox-widget-toolkit: 5.2.5
pve-xtermjs: 6.0.0-1
smartmontools: 7.5-pve2
zfsutils-linux: 2.4.2-pve1

Yesterday 19 backups failed.

fiona · 2026-07-01T10:16:24+0200

If your setup can't handle all backups at the same time, don't do it...

fiona said:
It's also highly recommended to not start backups from all nodes at the very same time, since PBS might get overloaded with handling the initial setup for each at the same time.
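
As a rough sketch of what staggering the start times could look like, the following Python snippet spaces the jobs a fixed number of minutes apart. The node names, the first start time, and the 10-minute gap are hypothetical; the resulting times would still have to be entered into each backup job's schedule by hand.

```python
from datetime import datetime, timedelta


def staggered_starts(nodes, first_start, gap_minutes):
    """Map each node to its own start time, spaced gap_minutes apart."""
    if gap_minutes < 0:
        raise ValueError("gap_minutes must be non-negative")
    return {
        node: first_start + timedelta(minutes=i * gap_minutes)
        for i, node in enumerate(nodes)
    }


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3"]   # hypothetical node names
    first = datetime(2026, 7, 2, 1, 0)    # hypothetical first start time
    for node, start in staggered_starts(nodes, first, 10).items():
        print(f"{node}: {start:%H:%M}")
```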

vaschthestampede · 2026-07-01T10:49:43+0200

What is it about my setup that prevents it from handling these backups?

The CPU is not stressed, the disks can handle greater loads, and there is as much RAM as it needs.
There are days when I have no failed backups and others when I have more than 20.

I set it up this way because I need it, and I don't see why a machine of this class can't handle it.
In fact, the problem is these locks, and it's not clear what causes them.

Robert Obkircher · 2026-07-01T11:44:13+0200

vaschthestampede said:
I'm already on that version...

So what about the new gdb backtrace?

vaschthestampede said:
the disks can handle greater loads

Just to be sure, can you benchmark the sustained write speed of the storage and collect disk and network utilization metrics during those backups?

Please also provide some estimates of how much data is backed up each time. It would be good to know how far off we are from the optimal network and disk bandwidth.

vaschthestampede said:
There are days when I have no failed backups and others when I have more than 20.

Backups are done incrementally so there could just be a lot more changes on some days.

vaschthestampede · 2026-07-01T11:58:08+0200

Robert Obkircher said:
so what about the new gdb backtrace?

I can do it again, but I doubt the instructions are different.
If I can, I'll do it this evening.

Robert Obkircher said:
Just to be sure, can you benchmark the sustained write speed of the storage and collect disk and network utilization metrics during those backups?

I base my reasoning on the fact that there are times when performance, both in terms of speed and IO, is greater than when the problems occur.
If these were disk problems, I would expect iowait to increase dramatically, but this doesn't happen.

Robert Obkircher said:
Backups are done incrementally so there could just be a lot more changes on some days.

It could be, it's a very complex aspect to investigate.

Robert Obkircher · 2026-07-01T14:43:56+0200

vaschthestampede said:
if they were disk problems, I would expect the IOwaits to increase dramatically

If only one core is submitting and waiting for IO, wouldn't that result in a maximum average iowait of 1/core_count? I.e., at most about 3% if the other 31 cores are idle.

One core can still easily saturate the entire disk bandwidth, so it would be better to look at the actual transfer rates.

Or am I misunderstanding something?
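
The arithmetic behind that bound can be sketched in Python as follows. This is only an illustration of the 1/core_count reasoning; the function name is made up, and the 32-core figure is inferred from "31 cores are idle".

```python
def max_average_iowait(total_cores: int, blocked_cores: int = 1) -> float:
    """Upper bound on the system-wide average iowait fraction.

    If only blocked_cores are waiting for IO and the rest are idle, the
    average across all cores cannot exceed blocked_cores / total_cores.
    """
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    if not 0 <= blocked_cores <= total_cores:
        raise ValueError("blocked_cores must be between 0 and total_cores")
    return blocked_cores / total_cores


if __name__ == "__main__":
    # 32 cores is inferred from the thread; the other counts are for comparison.
    for cores in (4, 16, 32):
        bound = max_average_iowait(cores)
        print(f"{cores} cores: average iowait bounded by {bound * 100:.3f} %")
```

A low average iowait on a many-core machine is therefore compatible with one thread being fully blocked on storage, which is why per-device transfer rates are the more telling metric.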

vaschthestampede said:
It could be, it's a very complex aspect to investigate.

Have you done a back-of-the-envelope calculation? E.g., if you assume each server sends 10 GB, then the hardware needs to be able to transmit and store those 180 GB within a reasonable time.
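
Such an estimate could be sketched in Python like this. The 18 servers follow from 180 GB / 10 GB; the throughput figures are assumptions (theoretical line rates of 1 Gbit/s and 10 Gbit/s links, ignoring protocol overhead) and should be replaced with measured values.

```python
def transfer_seconds(total_gb: float, throughput_mb_per_s: float) -> float:
    """Time to move total_gb (decimal gigabytes) at a sustained throughput in MB/s."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return total_gb * 1000 / throughput_mb_per_s


if __name__ == "__main__":
    servers = 18          # implied by 180 GB / 10 GB per server
    gb_per_server = 10    # assumed in the post above, not measured
    total_gb = servers * gb_per_server

    # Assumed sustained rates in MB/s; measure your own network and disks.
    for label, rate in (("1 Gbit/s link", 125), ("10 Gbit/s link", 1250)):
        minutes = transfer_seconds(total_gb, rate) / 60
        print(f"{label}: {total_gb} GB takes about {minutes:.1f} min")
```

Whichever of network or disk has the lower sustained rate sets the floor for how long the combined backup window has to be.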

vaschthestampede · 2026-07-01T15:42:31+0200

Robert Obkircher said:
Or am I misunderstanding something?

I hope that in that case the core will be changed.

Robert Obkircher said:
if you assume each server sends 10GB then the hardware needs to be able to transmit and store those 180 GB within a reasonable time.

It doesn't seem to be a bandwidth issue, because there are times when performance, both in terms of speed and IO, is greater than when the problems occur.
