Failing backup job

mir

Famous Member
Apr 14, 2012
Copenhagen, Denmark
Now and then a thread pops up where someone has problems with a backup job in which the backups of some VMs fail. I have investigated this and I might have found the cause.

The problem boils down to two specific issues:
  1. Limited IO in either the backup storage or the network
  2. The configured backup job contains VMs from more than one Proxmox node

Why are the above issues a problem?
The backup job only serializes backups within a single Proxmox node. If your backup job contains VMs from more than one node, each node starts working through its part of the backup list in parallel. Expressed mathematically: f(x) = n * b(p), where n is the number of Proxmox nodes and b(p) is a backup from node p. With limited IO on the backup storage or the network, these parallel backups compete with each other and some of them can fail.
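To make the effect concrete, here is a minimal sketch of how the throughput available to each backup shrinks as nodes are added to one job. It assumes the backup storage bandwidth is shared evenly between concurrent streams, which is a simplification. The function names and the 100 MB/s figure are illustrative, not measured.

```python
def concurrent_backups(node_count: int) -> int:
    """One backup runs at a time per node, so a job spanning n nodes runs n backups in parallel."""
    if node_count < 0:
        raise ValueError("node_count must not be negative")
    return node_count


def per_backup_throughput(storage_mb_per_s: float, node_count: int) -> float:
    """Throughput left for each backup, assuming an even split (a simplification)."""
    streams = concurrent_backups(node_count)
    if streams == 0:
        raise ValueError("at least one node is required")
    return storage_mb_per_s / streams


# Illustrative storage bandwidth of 100 MB/s (assumed, not measured).
for nodes in (1, 2, 3):
    print(nodes, "node(s):", per_backup_throughput(100.0, nodes), "MB/s per backup")
```

On a slow NAS the real split is unlikely to be this even, but the trend is the same: every additional node in the job adds one more concurrent stream.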

How can I overcome this problem?
To prevent parallel backups, create one backup job for each Proxmox node and ensure that each job only contains VMs that run on that Proxmox node.

Is this bullet proof?
Yes and no.
Yes: If you always back up every VM, then it is bullet proof as long as you use selection mode: all
No: If you only want to back up running VMs, then you will have a problem. Over time the list of running VMs on a given Proxmox node will change: VMs get migrated, new VMs are added, VMs are removed, VMs are stopped. So this list needs to be reviewed on a daily basis.

How can this limitation be solved?
Add a new selection mode called running which will select all VMs on the configured node with status running.
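Until such a mode exists, the selection could be approximated outside the backup job. The following is a minimal sketch, not an existing Proxmox feature: it groups running guests by node from a list of guest records. The field names vmid, node and status are assumed to match what the cluster resource listing returns, and the sample data and node names are made up.

```python
from collections import defaultdict


def running_guests_by_node(resources):
    """Return {node: sorted list of VMIDs} for guests whose status is 'running'."""
    selected = defaultdict(list)
    for guest in resources:
        if guest.get("status") == "running":
            selected[guest["node"]].append(guest["vmid"])
    return {node: sorted(vmids) for node, vmids in selected.items()}


# Hypothetical sample data; real records would come from the cluster itself.
sample = [
    {"vmid": 100, "node": "pve1", "status": "running"},
    {"vmid": 101, "node": "pve1", "status": "stopped"},
    {"vmid": 102, "node": "pve2", "status": "running"},
    {"vmid": 103, "node": "pve1", "status": "running"},
]

print(running_guests_by_node(sample))
```

Each per-node list could then be used as the VM selection of that node's backup job, which is exactly the daily review the "running" mode would make unnecessary.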
 
How can I overcome this problem?
To prevent parallel backups, create one backup job for each Proxmox node and ensure that each job only contains VMs that run on that Proxmox node.

We also tried something in this direction. However, there's a problem: since one cannot configure a backup job as the "successor" of another one, the time schedule has to be "guessed" somehow. This does not work reliably, at least for us: sometimes jobs take longer than expected or longer than previously "measured", or backup times inherently vary substantially (differential vs. full backup).
In these cases, the waiting job can "time out": as far as I recall, it gives up after 3 hours of waiting and produces a "backup error".
So yes, we're also still looking for better solutions... :)
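A small simulation shows why "guessing" the schedule is fragile. This is only a sketch under assumptions: jobs are taken to queue behind one another as described above, the 3 hour wait limit is the figure recalled from memory, and all start times and durations are invented.

```python
WAIT_TIMEOUT_H = 3.0  # recalled from memory in the post above; treat as an assumption


def simulate(jobs, wait_timeout_h=WAIT_TIMEOUT_H):
    """Simulate staggered jobs that queue behind one another.

    jobs: iterable of (name, scheduled_start_h, duration_h).
    Returns {name: "ok" | "timeout"}.
    """
    busy_until = 0.0
    results = {}
    for name, start, duration in sorted(jobs, key=lambda job: job[1]):
        wait = max(0.0, busy_until - start)
        if wait > wait_timeout_h:
            # The job gives up before the previous one has finished.
            results[name] = "timeout"
            continue
        busy_until = start + wait + duration
        results[name] = "ok"
    return results


# Planned night: every job takes the 2 hours that were "measured".
planned = [("node1", 0, 2), ("node2", 2, 2), ("node3", 4, 2)]
# Bad night: two jobs run full backups and take 4 hours instead.
bad_night = [("node1", 0, 4), ("node2", 2, 4), ("node3", 4, 2)]

print(simulate(planned))
print(simulate(bad_night))
```

With the planned durations all three jobs succeed. On the bad night node2 still starts after a 2 hour wait, but node3 would have to wait 4 hours and gives up.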
 
You could configure backup jobs on each node to run on different days?

...sure, but obviously only in case you have the needed scope to skip days for a backup job; in our case it's a matter of "backup needed daily", with "spreading" the jobs over the night...
 
The problem is that we still cannot reproduce this bug. Any ideas how to reliably reproduce it?
Just an idea:
1) Create a 3 node cluster
2) Install a total of 25 VMs and CTs with mixed OSes
3) VM disk sizes should vary from 25 to 250 GB
4) Add a storage on a slow NAS (ARM processor) over NFS, with 2 to 4 consumer-grade SATA disks ("green" line drives are especially slow), set up as RAID 1 or 5 (software RAID)
5) Create a backup job containing all VMs and CTs and schedule it for a weekly backup to this NAS.
 
