Unexpected backup failures

How reproducible is the issue?
It's been happening every day lately.

You could try to check if using a local datastore not backed by iSCSI also produces the timeout errors.
I can't; the locally attached disks will be used for other backups.
I'll be able to check if those fail as well.

What I can tell you is that this isn't the first time something like this has happened, but previously it was on much less powerful hardware, so I blamed the hardware.
I wasn't using iSCSI there either; the drives were local.
 
Hi @vaschthestampede,
do you have IO thread enabled for your VM disks? If not, it's highly recommended to do so. It's also highly recommended to not start backups from all nodes at the very same time, since PBS might get overloaded with handling the initial setup for each at the same time. Is the network for the storage separate from the network for PBS? Keeping them separate is also recommended.
 
It's also highly recommended to not start backups from all nodes at the very same time, since PBS might get overloaded with handling the initial setup for each at the same time.

My requirement is to have all the backups at that time.
In fact, the goal is to have all the backups of all the VMs at the same time!
 
It's been happening every day lately.
Since it is reproducible, please install gdb and the debug symbols on the PBS host via apt install gdb proxmox-backup-sever-dbgsym and, the next time the hang appears, run gdb --batch --ex 't a a bt' -p $(pidof proxmox-backup-proxy) > proxy.backtrace, ideally before the timeout on the PVE side. Then attach the backtrace here. That can tell us more about what is going on.
 
please install gdb and the debug symbols on the PBS host via apt install gdb proxmox-backup-sever-dbgsym
root@Elefante:~# apt install gdb proxmox-backup-sever-dbgsym
Error: Unable to locate package proxmox-backup-sever-dbgsym

Do I need to add any repositories?
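
The "Unable to locate package" error above most likely comes from the spelling of the package name in the suggested command ("sever" rather than "server"), not from a missing repository. A minimal shell sketch of the check and the capture step follows. It assumes the debug symbols are in a package named proxmox-backup-server-dbgsym, available from the Proxmox repository already configured on the PBS host. The helper names pkg_candidate and capture_backtrace are made up for illustration.

```shell
#!/bin/sh
# Sketch only. The package name is an ASSUMPTION: the thread's spelling
# ("sever") looks like a typo for "server".
PKG="proxmox-backup-server-dbgsym"

# Print the version apt would install for a package.
# Prints nothing if apt does not know the package or has no candidate.
pkg_candidate() {
    apt-cache policy "$1" 2>/dev/null |
        awk '/Candidate:/ && $2 != "(none)" { print $2 }'
}

# Dump backtraces of all threads of proxmox-backup-proxy into a file
# (default: proxy.backtrace), using the gdb invocation from the thread.
capture_backtrace() {
    pid=$(pidof proxmox-backup-proxy) || {
        echo "proxmox-backup-proxy is not running" >&2
        return 1
    }
    gdb --batch --ex 't a a bt' -p "$pid" > "${1:-proxy.backtrace}"
}
```

On the PBS host, pkg_candidate "$PKG" can be run first. An empty result suggests that either the name is still wrong or the repository providing it is not configured. If a version is printed, apt install gdb "$PKG" followed by capture_backtrace during the hang should produce the file to attach.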
 
My requirement is to have all the backups at that time.
Does the issue occur if you stagger the backups, i.e. start the jobs a few minutes delayed between the nodes? Having all backups start at the very same time might be the cause of your issue.
 
Does the issue occur if you stagger the backups, i.e. start the jobs a few minutes delayed between the nodes?
Can't do that.
My requirement is to have all the backups at that time.
In fact, the goal is to have all the backups of all the VMs at the same time.
The data can be written later, but the state captured for each VM should be its state at 6:30 pm.

Having all backups start at the very same time might be the cause of your issue.
But what's the problem since the monitoring doesn't report anything under stress?
If there is no hardware limit (as it seems), then it is a software limit, and that is a big problem. Am I wrong?
 
But what's the problem since the monitoring doesn't report anything under stress?
Did you also check your iSCSI target? Regardless, I would be interested in a backtrace while there are no metric updates.
 
Did you also check your iSCSI target?
Yes, that one also shows no problems and the iSCSI connection is stable.

would be interested in a backtrace while there are no metric updates.
Very interesting.
One piece of information that might be useful is that PBS receives backup data (and sends metrics) through a bond of two interfaces, so I have also done everything I could to avoid saturation issues.
The iSCSI has a dedicated port and the connection is direct; there's no switch in between.
 