Unexpected backup failures

Do you also have another, unrelated datastore backed by S3 running on this instance?
No, I do not.

I'm asking since your backtrace includes proxmox-s3-client code paths and there was a recent bugfix for a hanging proxy [1]
I also keep PBS up to date with the no-subscription repository.

Code:
proxmox-backup: 4.2.0 (running kernel: 7.0.6-2-pve)
proxmox-backup-server: 4.2.1-1 (running version: 4.2.1)
proxmox-kernel-helper: 9.2.0
proxmox-kernel-7.0: 7.0.6-2
proxmox-kernel-7.0.6-2-pve-signed: 7.0.6-2
proxmox-kernel-7.0.2-6-pve-signed: 7.0.2-6
proxmox-kernel-6.17: 6.17.13-13
proxmox-kernel-6.17.13-13-pve-signed: 6.17.13-13
proxmox-kernel-6.17.2-1-pve-signed: 6.17.2-1
ifupdown2: 3.3.0-1+pmx12
libjs-extjs: 7.0.0-5
proxmox-backup-docs: 4.2.1-1
proxmox-backup-client: 4.2.1-1
proxmox-mail-forward: 1.0.3
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.4
proxmox-widget-toolkit: 5.2.3
pve-xtermjs: 6.0.0-1
smartmontools: 7.5-pve2
zfsutils-linux: 2.4.2-pve1
 
After days of relative calm (few, if any, backup failures), yesterday there were 10.

Any news on the proxy.backtrace file analysis?
 
Any news on the proxy.backtrace file analysis?
As stated previously, the proxy backtrace points to lock contention on chunk insert as the issue here. Chunks being inserted into the datastore are synchronized via a chunk store lock. Please open an issue at bugzilla.proxmox.com for this; we should look at how to reduce lock contention.

For the time being, I'm afraid you will have to work around this by rescheduling some of your backups so that they run without issues.
 
Do you also have another, unrelated datastore backed by S3 running on this instance? I'm asking since your backtrace includes proxmox-s3-client code paths and there was a recent bugfix for a hanging proxy [1], so this might be related (not packaged yet at the time of writing).

[1] https://git.proxmox.com/?p=proxmox-backup.git;a=commit;h=23400016322c7a6981f111558e8d22666e32ee8c
This patch is available in version 4.2.2-1, so I would suggest upgrading. It won't reduce lock contention, but it should make the server more responsive.

Would you mind upgrading and sending a new version of the backtrace? As far as I can tell the one in the Bugzilla issue is still from an older version.
 
This patch is available in version 4.2.2-1
I'm already on that version...

Code:
proxmox-backup: 4.2.0 (running kernel: 7.0.12-1-pve)
proxmox-backup-server: 4.2.2-1 (running version: 4.2.2)
proxmox-kernel-helper: 9.2.0
proxmox-kernel-7.0: 7.0.12-1
proxmox-kernel-7.0.12-1-pve-signed: 7.0.12-1
proxmox-kernel-7.0.6-2-pve-signed: 7.0.6-2
proxmox-kernel-6.17: 6.17.13-13
proxmox-kernel-6.17.13-13-pve-signed: 6.17.13-13
proxmox-kernel-6.17.2-1-pve-signed: 6.17.2-1
ifupdown2: 3.3.0-1+pmx12
libjs-extjs: 7.0.0-5
proxmox-backup-docs: 4.2.2-1
proxmox-backup-client: 4.2.2-1
proxmox-mail-forward: 1.0.3
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.4
proxmox-widget-toolkit: 5.2.5
pve-xtermjs: 6.0.0-1
smartmontools: 7.5-pve2
zfsutils-linux: 2.4.2-pve1

Yesterday 19 backups failed.
 
If your setup can't handle all backups at the same time, don't do it...
It's also highly recommended to not start backups from all nodes at the very same time, since PBS might get overloaded with handling the initial setup for each at the same time.
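To illustrate the recommendation above, here is a minimal sketch that spreads backup start times across nodes instead of starting them all at once. The node names, the first start time, and the 15-minute gap are hypothetical values chosen for illustration; the resulting times would still have to be entered into each backup job's schedule by hand.

```python
from datetime import datetime, timedelta


def staggered_starts(nodes, first_start="01:00", gap_minutes=15):
    """Return a {node: "HH:MM"} mapping with start times spaced gap_minutes apart."""
    base = datetime.strptime(first_start, "%H:%M")
    return {
        node: (base + timedelta(minutes=i * gap_minutes)).strftime("%H:%M")
        for i, node in enumerate(nodes)
    }


# Hypothetical node names, for illustration only.
for node, start in staggered_starts(["node1", "node2", "node3"]).items():
    print(node, start)
```

The gap should be long enough for each node to get through its initial setup phase before the next one starts; the right value depends on the setup.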
 
What is wrong with my setup that it cannot handle these backups?

The CPU is not stressed, the disks can handle greater loads, and there is as much RAM as needed.
There are days when I have no failed backups and others when I have more than 20.

If I set it up this way, it's because I need it, and I don't see why a machine of this class can't do it.
In fact, the problems are these locks, and it's not clear what they're caused by.
 
I'm already on that version...
so what about the new gdb backtrace?
the disks can handle greater loads
Just to be sure, can you benchmark the sustained write speed of the storage and collect disk and network utilization metrics during those backups?

Please also provide some estimates of how much data is backed up each time. It would be good to know how far off we are from the optimal network and disk bandwidth.
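As a rough way to get a first number for sustained write speed, something like the following sketch could be used. It is only an illustration, not a replacement for a proper benchmarking tool: the function name and default sizes are my own choices, and the directory passed in is assumed to live on the datastore's filesystem. It writes incompressible data and includes the final fsync in the timing so that the page cache does not inflate the result.

```python
import os
import tempfile
import time


def sequential_write_mib_per_s(directory, total_mib=1024, block_mib=4):
    """Write about total_mib of random data to a temp file in `directory` and return MiB/s."""
    block = os.urandom(block_mib * 1024 * 1024)  # random data, so compression cannot help
    blocks = total_mib // block_mib
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            start = time.perf_counter()
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # count the flush to stable storage as part of the write
            elapsed = time.perf_counter() - start
    finally:
        os.unlink(path)  # always remove the test file
    return (blocks * block_mib) / elapsed
```

For a sustained figure, the total size should be well above what RAM and any write cache can absorb, and the run should ideally happen while the backups are active, alongside the disk and network utilization metrics.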

There are days when I have no failed backups and others when I have more than 20.
Backups are done incrementally so there could just be a lot more changes on some days.
 
so what about the new gdb backtrace?
I can do it again, but I doubt the instructions are different.
If I can, I'll do it this evening.

Just to be sure, can you benchmark the sustained write speed of the storage and collect disk and network utilization metrics during those backups?
I base my reasoning on the fact that there are times when performance, both in terms of speed and IO, is greater than when the problems occur.
If they were disk problems, I would expect the IOwaits to increase dramatically, but this doesn't happen.

Backups are done incrementally so there could just be a lot more changes on some days.
It could be, it's a very complex aspect to investigate.
 
if they were disk problems, I would expect the IOwaits to increase dramatically
If only one core is submitting and waiting for IO, wouldn't that result in a maximum average iowait of 1/core_count? That is, at most about 3% if the other 31 cores are idle.

One core can still easily saturate the entire disk bandwidth, so it would be better to look at the actual transfer rates.
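The arithmetic behind that bound, as a small sketch. It assumes 32 cores in total (one blocked on IO plus the 31 idle ones mentioned above) and that the reported iowait is simply averaged over all cores; the function name is made up for illustration.

```python
def max_average_iowait_percent(blocked_cores, total_cores):
    """Upper bound on the all-core average iowait when only some cores are blocked on IO."""
    return 100.0 * blocked_cores / total_cores


# One core stuck waiting on IO, 32 cores in total (assumed).
print(max_average_iowait_percent(1, 32))  # 3.125
```

So a low average iowait does not by itself rule out a single IO-bound writer.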

Or am I misunderstanding something?

It could be, it's a very complex aspect to investigate.
Have you done a back-of-the-envelope calculation? E.g., if you assume each server sends 10 GB, then the hardware needs to be able to transmit and store those 180 GB within a reasonable time.
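Such an estimate could look like the sketch below. The 18 servers follow from 180 GB at 10 GB each; the throughput figures (125 MB/s for roughly a 1 Gbit/s link, 1250 MB/s for roughly 10 Gbit/s) are assumptions for illustration and should be replaced with the measured network and disk rates.

```python
def transfer_seconds(servers, gb_per_server, throughput_mb_per_s):
    """Time needed to move servers * gb_per_server (decimal GB) at a given MB/s rate."""
    total_mb = servers * gb_per_server * 1000
    return total_mb / throughput_mb_per_s


# Assumed sustained rates: ~1 Gbit/s and ~10 Gbit/s links.
for rate in (125, 1250):
    minutes = transfer_seconds(18, 10, rate) / 60
    print(f"{rate} MB/s -> {minutes:.1f} min")
```

Whichever of network or disk is slower sets the effective rate, so the calculation should be done with the lower of the two.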
 
Or am I misunderstanding something?
I hope that in that case the core will be changed.

if you assume each server sends 10 GB, then the hardware needs to be able to transmit and store those 180 GB within a reasonable time.
It doesn't seem to be a bandwidth issue, because there are times when performance, both in terms of speed and IO, is greater than when the problems occur.