Failing backup job

mir · Oct 26, 2014

Now and then a thread pops up where someone has problems with a backup job in which backups of some of the VMs fail. I have investigated this problem and I might have found the cause.

The problem boils down to two specific issues:

Limited IO in either the backup storage or the network
The configured backup job contains VMs from more than one Proxmox node

Why are the above issues a problem?
The backup job only serializes backups within a single Proxmox node. If your backup job contains VMs from more than one node, each node starts working through its part of the backup list at the same time, so the backups run in parallel against the same storage. Expressed mathematically: f(x) = n * b(p), where n = number of Proxmox nodes and b(p) = a backup running on node p.

How can I overcome this problem?
To prevent running parallel backups, create a backup job for each Proxmox node and ensure each backup job only contains VMs which run on that same Proxmox node.
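
A minimal sketch of that split, assuming a hypothetical inventory that maps each VM ID to the node it currently runs on (the node names and IDs below are made up, and how you collect the inventory is up to you). It groups the VM IDs per node so that each group can become its own backup job; the number of groups is also the number of backups a single mixed job would start in parallel.

```python
from collections import defaultdict


def vms_per_node(inventory):
    """Group VM IDs by the node they run on.

    `inventory` is a hypothetical mapping {vmid: node_name}; it is not
    read from Proxmox here, it has to be supplied by the caller.
    """
    groups = defaultdict(list)
    for vmid, node in inventory.items():
        groups[node].append(vmid)
    # Sort nodes and IDs so the output is stable and easy to compare.
    return {node: sorted(ids) for node, ids in sorted(groups.items())}


# Made-up example data: five VMs spread over three nodes.
inventory = {100: "pve1", 101: "pve1", 102: "pve2", 103: "pve3", 104: "pve2"}
jobs = vms_per_node(inventory)

for node, ids in jobs.items():
    print(node, ids)
print("parallel backups if kept in one job:", len(jobs))
```

Each printed group is the VM list for one per-node backup job.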

Is this bullet proof?
Yes and no.
Yes: If you always back up every VM then it is bullet proof, provided you use selection mode: all
No: If you only want to back up running VMs then you will have a problem. Over time the list of running VMs on a given Proxmox node will change: VMs get migrated, added, removed, or stopped, so this list needs to be reviewed on a daily basis.

How can this limitation be solved?
Add a new selection mode called running which will select all VMs on the configured node with status running.
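
A rough sketch of what such a mode would have to do, not how Proxmox implements VM selection. The record layout (vmid, node, status) and the example data are assumptions made for illustration.

```python
def select_running(vms, node):
    """Return the sorted IDs of all VMs on `node` with status 'running'.

    `vms` is a hypothetical list of dicts with the keys
    'vmid', 'node' and 'status'.
    """
    return sorted(
        vm["vmid"]
        for vm in vms
        if vm["node"] == node and vm["status"] == "running"
    )


# Made-up example data.
vms = [
    {"vmid": 100, "node": "pve1", "status": "running"},
    {"vmid": 101, "node": "pve1", "status": "stopped"},
    {"vmid": 102, "node": "pve2", "status": "running"},
    {"vmid": 103, "node": "pve1", "status": "running"},
]

print(select_running(vms, "pve1"))  # [100, 103]
```

Because the selection is evaluated when the job starts, migrated or stopped VMs would drop out of the list without anyone editing the job.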

jleg · Oct 27, 2014

mir said:
How can I overcome this problem?
To prevent running parallel backups, create a backup job for each Proxmox node and ensure each backup job only contains VMs which run on that same Proxmox node.

We also tried something in this direction; however, there's a problem: since one cannot configure a backup job as the "successor" of another one, the time schedule has to be "guessed" somehow. This does not work reliably, at least for us. Sometimes jobs take longer than expected, or longer than "measured" before, or backup times vary substantially by design (differential vs. full backup).
In these cases, it can happen that the waiting job "times out" - AFAIR after 3 hours of waiting it gives up and produces a "backup error".
So yes, we're also still looking for better solutions...
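
To illustrate why a guessed schedule is fragile, here is a small sketch under stated assumptions: jobs wait for the previous one to finish, and a job gives up once it has waited longer than a limit. The 3-hour limit is the figure recalled above, not a verified value, and the job names, start times and durations are made up. Times are in hours, counted from midnight of the first day (so 24.0 is the following midnight).

```python
def waiting_times(jobs, max_wait_h=3.0):
    """Simulate a chain of backup jobs that wait for one another.

    jobs: list of (name, scheduled_start_h, duration_h) in schedule order.
    Each job starts at its scheduled time or when the previous job ends,
    whichever is later. A job that would wait longer than max_wait_h is
    assumed to give up without running.
    Returns a list of (name, wait_h, timed_out).
    """
    results = []
    busy_until = 0.0
    for name, start, duration in jobs:
        wait = max(0.0, busy_until - start)
        timed_out = wait > max_wait_h
        if not timed_out:
            busy_until = start + wait + duration
        results.append((name, wait, timed_out))
    return results


# Schedule guessed from earlier runs: each job needs about 2 hours.
planned = [("node1", 22.0, 2.0), ("node2", 24.0, 2.0), ("node3", 26.0, 2.0)]
# Same schedule, but node1 runs a full backup that takes 6.5 hours.
slow = [("node1", 22.0, 6.5), ("node2", 24.0, 2.0), ("node3", 26.0, 2.0)]

for label, schedule in (("planned", planned), ("slow", slow)):
    for name, wait, timed_out in waiting_times(schedule):
        status = "TIMED OUT" if timed_out else "ok"
        print(f"{label}: {name} waited {wait:.1f} h -> {status}")
```

In the slow case the second job would wait 4.5 hours and give up, while the third job still completes after a 2.5 hour wait.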

mir · Oct 27, 2014

You could configure backup jobs on each node to run on different days?

jleg · Oct 27, 2014

mir said:
You could configure backup jobs on each node to run on different days?

...sure, but obviously only if you have the leeway to skip days for a backup job; in our case it's a matter of "backup needed daily", with the jobs "spread" over the night...

dietmar · Oct 27, 2014

jleg said:
So yes, we're also still looking for better solutions...

The problem is that we still cannot reproduce this bug. Any ideas how to reliably reproduce it?

mir · Oct 27, 2014

dietmar said:
The problem is that we still cannot reproduce this bug. Any ideas how to reliably reproduce it?

Just an idea:
1) Create a 3 node cluster
2) Install a total of 25 VMs and CTs with mixed OSes
3) VM disk sizes should vary from 25 to 250 GB
4) Add NFS storage on a slow NAS (ARM processor) with 2 - 4 consumer-grade SATA disks; "green" drives are especially slow. Set the disks up as RAID 1 or 5 (software RAID)
5) Create a backup job containing all VMs and CTs and schedule a weekly backup to this NAS.
