Backup to Proxmox Backup server error

hac3ru · Jan 21, 2024

Hello,

We just set up a new 3-node cluster and, obviously, we wanted to take backups of the VMs. All was well until a few days ago, when we started to get an error:

Code:

ERROR: can't acquire lock '/var/run/vzdump.lock' - got timeout

I know that this happens because another backup is already underway, but my question is: how are you guys running backups in large environments? Right now, we have around 50 VMs and, while the backups are set to run at different hours, the backup operations sometimes overlap, depending on how much data has changed in a VM since the last run. I'm thinking that, if we get to 100-200 VMs, I'll have a really hard time running backups properly.
P.S. We are backing up to a Proxmox Backup Server and we're using pool backups. Our backup schedule looks like this:

[screenshot of the backup schedule not preserved]

If I add more pools, which we will need to, I'll have a real issue.

P.P.S. Also, why is only one dump process allowed at a time? From my point of view (obviously I might be missing something), backup jobs could run in parallel, as long as they don't need to back up the same VM.

sb-jw · Jan 21, 2024

I'm not quite sure what you are trying to achieve with the individual jobs. Aren't the first and second rules identical? Or is the information that they target other pools simply missing?

I can't tell whether the jobs treat all VMs the same or differently. In newer versions you can define jobs per tag and thus back up only the affected VMs in a more targeted manner.

Otherwise, you can also consider whether you might streamline the rules a bit and shift them to PBS.

I currently have one job per node, each running at a different time, that backs up the pool. Daily backups are usually sufficient for us. For the most important VMs, backups are also created from inside the guest. Otherwise, customers have to pay for backups and very few do.

hac3ru · Jan 21, 2024

The rules target individual pools. Each rule is targeting a different pool.

sb-jw said:
Otherwise, you can also consider whether you might streamline the rules a bit and shift them to PBS.

I'm sorry, I didn't get this. I am already using Proxmox Backup Server. The problem is when a job is running for a longer time and another job starts in the meantime. The new job is blocked, until the first one finishes.

Each pool represents a client for us. This makes management easier for us, clients can have temporary access to their resources, etc. But it makes backup a pain, as stated before.

Can't do per node rules, as VMs might be migrated from one node to another.

Any clue how we could fix this?

sb-jw · Jan 21, 2024

hac3ru said:
The rules target individual pools. Each rule is targeting a different pool.

hac3ru said:
Each pool represents a client for us. This makes management easier for us, clients can have temporary access to their resources, etc. But it makes backup a pain, as stated before.

It would help if you gave us detailed information about your setup rather than a brief summary that keeps us from seeing the whole picture. Otherwise I inevitably point something out and you reply that it's actually slightly different.

hac3ru said:
Can't do per node rules, as VMs might be migrated from one node to another.

Okay, and what's the problem with that? You say you classify by pool, so there is no problem here. If there are no VMs from the pool on the node, the job is finished after a few seconds. If VMs from the pool are on that node, they are backed up.
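
A minimal sketch of why per-node jobs can still cover a pool after migrations, assuming (as described above) that a pool job running on a node only picks up the pool's guests currently located on that node. The `Guest` record, the `guests_for_job` and `pool_coverage` helpers, and the node and pool names are hypothetical and only serve the illustration; none of them are part of a Proxmox API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Guest:
    vmid: int
    pool: str
    node: str


def guests_for_job(guests, pool, node):
    """VMIDs a pool job would back up when it runs on `node`."""
    return sorted(g.vmid for g in guests if g.pool == pool and g.node == node)


def pool_coverage(guests, pool, nodes):
    """Map each node to the VMIDs its copy of the pool job would handle."""
    return {node: guests_for_job(guests, pool, node) for node in nodes}


NODES = ["pve1", "pve2", "pve3"]
before = [
    Guest(101, "client-a", "pve1"),
    Guest(102, "client-a", "pve2"),
    Guest(201, "client-b", "pve2"),
]
# Migrate VM 101 from pve1 to pve2; its pool membership does not change.
after = [replace(g, node="pve2") if g.vmid == 101 else g for g in before]

print(pool_coverage(before, "client-a", NODES))
print(pool_coverage(after, "client-a", NODES))
```

In this model the job on pve1 simply has nothing to do after the migration, while the job on pve2 picks up VM 101, so the pool stays fully covered without changing any job definition.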

hac3ru said:
I'm sorry, I didn't get this. I am already using Proxmox Backup Server. The problem is when a job is running for a longer time and another job starts in the meantime. The new job is blocked, until the first one finishes.

As I noted above, you're just throwing us a small nugget of information here. I don't have an overall view of your setup, so I can only recommend the obvious based on the information I have. You have to judge whether it suits you or not.
I back up to the PBS and have no retention configured on PVE; retention is handled entirely on the PBS. You could probably drop the monthly jobs and let the PBS retention keep the monthly backups instead. This may also work for some other jobs. If jobs #1 and #2 were, e.g., for the same pool, you could limit #2 to the window 01:00 - 23:00.
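
To illustrate the idea of replacing a dedicated monthly job with retention, here is a simplified model of a keep-monthly rule applied to the snapshots that the daily job already produces. It assumes such a rule keeps the newest snapshot of each of the most recent months that contain a snapshot. This is a sketch of the concept, not PBS's prune implementation, and `keep_monthly` is a hypothetical helper.

```python
from datetime import date


def keep_monthly(snapshots, months):
    """Return the newest snapshot from each of the `months` most recent
    calendar months that contain at least one snapshot."""
    if months <= 0:
        return []
    newest_per_month = {}
    for day in sorted(snapshots):
        # Later days overwrite earlier ones within the same month.
        newest_per_month[(day.year, day.month)] = day
    recent_months = sorted(newest_per_month)[-months:]
    return [newest_per_month[m] for m in recent_months]


daily_snapshots = [
    date(2024, 1, 5),
    date(2024, 1, 31),
    date(2024, 2, 10),
    date(2024, 2, 28),
    date(2024, 3, 3),
]
print(keep_monthly(daily_snapshots, months=2))
```

Under this assumption the monthly copies fall out of the daily snapshots, so no separate monthly job has to compete for the lock.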

hac3ru · Jan 21, 2024

I specified from the beginning that I'm using Proxmox Backup Server. The retention is set to "keep all backups" for now, so I'm sure that's not the issue.

Also, I'm not giving small bits of info:
We got a 3 node cluster using PVE
We got a VM running PBS
We backup on that PBS
The backups are done on a "by pool" basis.
The issue is that, if a lot of data in a pool changes between backups, the backup itself takes so long that the next backup (of a different pool) tries to start while it is still running. Since a backup is already running, the vzdump.lock file is held, so the 2nd backup cannot run. If it can't get the lock within a certain amount of time, the backup fails.
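
The behaviour described here (wait for a lock file, give up after a timeout) can be sketched with an advisory file lock. This is only an illustration of the general pattern on Linux, not vzdump's actual code; the `acquire_lock`, `release_lock` and `demo` helpers, the temporary lock path and the short timeouts are all assumptions made for the example.

```python
import fcntl
import os
import tempfile
import time


def acquire_lock(path, timeout_s, poll_s=0.05):
    """Try to take an exclusive lock on `path`.

    Returns the open file handle on success, or None if the lock
    could not be obtained within `timeout_s` seconds.
    """
    deadline = time.monotonic() + timeout_s
    handle = open(path, "w")
    while True:
        try:
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return handle
        except BlockingIOError:
            # Someone else holds the lock; retry until the deadline passes.
            if time.monotonic() >= deadline:
                handle.close()
                return None
            time.sleep(poll_s)


def release_lock(handle):
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()


def demo():
    lock_path = os.path.join(tempfile.mkdtemp(), "demo.lock")
    first = acquire_lock(lock_path, timeout_s=0.2)   # first job gets the lock
    second = acquire_lock(lock_path, timeout_s=0.2)  # second job times out
    release_lock(first)
    third = acquire_lock(lock_path, timeout_s=0.2)   # lock is free again
    third_ok = third is not None
    if third_ok:
        release_lock(third)
    return first is not None, second is None, third_ok


print(demo())
```

The second caller fails purely because of timing: it never conflicts with the first job's data, it just cannot get the shared lock before its deadline.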

The question is: how do people back up VMs using pool backups in environments with many backup jobs and hundreds of VMs? Because, as I said, right now we have around 50 VMs and about 10 pools (each job from that screenshot is a pool), and I'm seeing the lock issue once or twice a week.

I saw that there's a feature request to somehow enable parallel backups but it's already quite old and looks to be abandoned. This leads me to think that this is not an issue for others, so maybe I'm doing something wrong.

Exploring a simpler but similar scenario:
2 backup jobs for two different pools, saving the backups on the PBS. One job starts at 00:00, the other starts at 05:00. If the first job doesn't finish before 05:00, the 2nd job waits for the vzdump.lock to be freed. If this doesn't happen within 3 hours (from what I read, this is the hard coded timeout), the 2nd job will return an error. Besides setting the 2nd job to a later time, which is not really scalable in a real environment, is there a way to work around this?
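
The scenario can be checked with a small simulation. It is a sketch under stated assumptions: all jobs share one lock on one node, waiting jobs are served in order of their scheduled start, and the wait limit is the 180 minutes mentioned above. The vzdump documentation lists a `lockwait` option (maximal time to wait for the global lock, in minutes, default 180), so the limit may be adjustable rather than hard coded; check the man page of your version. The `simulate` helper and the job names are hypothetical.

```python
def simulate(jobs, lock_wait_min=180):
    """Serialize jobs that share one lock.

    jobs: list of (name, scheduled_start_min, duration_min).
    Returns {name: ("ok", actual_start_min)} or {name: ("timeout", None)}.
    """
    free_at = 0  # minute at which the lock becomes available
    result = {}
    for name, start, duration in sorted(jobs, key=lambda job: job[1]):
        begin = max(start, free_at)
        if begin - start > lock_wait_min:
            # The job gave up waiting and never ran.
            result[name] = ("timeout", None)
            continue
        result[name] = ("ok", begin)
        free_at = begin + duration
    return result


# First job at 00:00 runs 9 hours; second job is scheduled for 05:00.
slow = simulate([("pool-a", 0, 540), ("pool-b", 300, 60)])
# Same schedule, but the first job finishes after 6 h 40 min.
faster = simulate([("pool-a", 0, 400), ("pool-b", 300, 60)])
print(slow)
print(faster)
```

In the first case the second job would have to wait 240 minutes and fails; in the second it waits 100 minutes and starts at minute 400. Feeding real job durations into such a model shows how much headroom a schedule has before the lock wait is exceeded.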

sb-jw · Jan 21, 2024

hac3ru said:
I specified from the beginning that I'm using Proxmox Backup Server. The retention is set to "keep all backups" for now, so I'm sure that's not the issue.

I never said that that was a problem. I also read in the first post that the PBS is used.

hac3ru said:
Also, I'm not giving small bits of info:
We got a 3 node cluster using PVE
We got a VM running PBS
We backup on that PBS
The backups are done on a "by pool" basis.

So I still don't know whether all jobs are for a pool or not, I still don't know which job is for which pool. I still don't know your structure, for example that you create a pool for each customer.
You want support, don't you? If not, then the current information is sufficient, but then I'm out. If you are interested in support, you will have to show us/explain a little about your current settings and what you want to achieve with them. Then we could give you tips on how you can adapt your jobs if necessary to meet your requirements.

hac3ru said:
The question is: how do people back up VMs using pool backups in environments with many backup jobs and hundreds of VMs? Because, as I said, right now we have around 50 VMs and about 10 pools (each job from that screenshot is a pool), and I'm seeing the lock issue once or twice a week.

That was my answer:

sb-jw said:
I currently have one job per node, each running at a different time, that backs up the pool. Daily backups are usually sufficient for us. For the most important VMs, backups are also created from inside the guest. Otherwise, customers have to pay for backups and very few do.

But again, knowing what other people are doing doesn't help you much, because it doesn't solve your specific problem. It is much more effective if you explain your requirements to us. If the others understand your requirements, they can share their solutions with you, and you may be able to solve part of your problem with them.

hac3ru said:
I saw that there's a feature request to somehow enable parallel backups but it's already quite old and looks to be abandoned. This leads me to think that this is not an issue for others, so maybe I'm doing something wrong.

Nobody said you were doing it wrong; you might just have requirements that the integrated solution can't meet today. This is a limitation that can certainly be worked around once your exact requirements are understood.

hac3ru said:
Exploring a simpler but similar scenario:
2 backup jobs for two different pools, saving the backups on the PBS. One job starts at 00:00, the other starts at 05:00. If the first job doesn't finish before 05:00, the 2nd job waits for the vzdump.lock to be freed. If this doesn't happen within 3 hours (from what I read, this is the hard coded timeout), the 2nd job will return an error. Besides setting the 2nd job to a later time, which is not really scalable in a real environment, is there a way to work around this?

As mentioned, you could distribute the jobs differently across time and nodes. You can also try to replace some jobs with retention settings on the PBS, which might save you one or two jobs.

But I'm more surprised that your jobs on a node seem to run for several hours and that you even run into a timeout. If the VMs haven't been restarted, my backups are all done in under 10 minutes. Even if they had been, my infrastructure probably wouldn't be busy with backups for even 2 hours in total.
If you back up a pool several times a day, then the delta should be significantly lower. Then I would worry even more about the long backup time.

You might be able to optimize the configuration here and thus significantly reduce the backup time. Maybe the timeouts are just a symptom of jobs running into each other.
