I specified from the beginning that I'm using Proxmox Backup Server. The retention is set to "keep all backups" for now, so I'm sure that's not the issue.
I never said that was a problem. I also read in the first post that PBS is in use.
Also, I'm not giving small bits of info:
We have a 3-node cluster running PVE
We have a VM running PBS
We back up to that PBS
The backups are done on a per-pool basis.
So I still don't know whether all jobs are for a pool or not, and I still don't know which job is for which pool. I still don't know your structure either, for example whether you create a pool for each customer.
You want support, don't you? If not, then the current information is sufficient, but then I'm out. If you are interested in support, you will have to show us/explain a little about your current settings and what you want to achieve with them. Then we could give you tips on how you can adapt your jobs if necessary to meet your requirements.
The question is: how do people back up VMs using pool-based backups in environments with multiple backup jobs and hundreds of VMs? Because, as I said, right now we have around 50 VMs and about 10 pools (each job from that screenshot is a pool), and I'm seeing the lock issue once or twice a week.
That was my answer
I currently have one job per node, each running at a different time, that is supposed to back up the pool. Daily backups are usually sufficient for us. For the most important VMs, backups are also created inside the VMs themselves. Otherwise, customers have to pay for backups and very few do that.
But again, knowing what other people are doing doesn't help you much, because it doesn't solve your specific problem. It is much more effective if you explain your requirements to us. Once the others understand your requirements, they can share their solutions with you, and you may be able to solve part of your problem with them.
I saw that there's a feature request to somehow enable parallel backups but it's already quite old and looks to be abandoned. This leads me to think that this is not an issue for others, so maybe I'm doing something wrong.
Nobody said you were doing it wrong. You may just have requirements that the integrated solution can't meet today. That limitation can certainly be worked around once it is clear exactly what your requirements are.
Exploring a simpler but similar scenario:
Two backup jobs for two different pools, saving the backups on the PBS. One job starts at 00:00, the other starts at 05:00. If the first job doesn't finish before 05:00, the second job waits for the vzdump.lock to be freed. If this doesn't happen within 3 hours (from what I read, this is the hard-coded timeout), the second job returns an error. Besides setting the second job to a later time, which isn't really scalable in a real environment, is there a way to work around this?
As mentioned, you could distribute the jobs differently across time and nodes. You can also try handling retention on the PBS itself, which might let you drop one or two jobs.
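To make the quoted scenario easier to reason about, here is a minimal Python sketch that models it. It is not how vzdump works internally; it only assumes what was described in this thread: one lock per node, jobs waiting in order of their scheduled start, and a 3-hour wait limit. The names `Job`, `simulate` and `LOCK_TIMEOUT_MIN`, as well as the job durations, are made up for illustration.

```python
from dataclasses import dataclass

# Assumption: the 3-hour lock wait mentioned in the thread, in minutes.
LOCK_TIMEOUT_MIN = 180


@dataclass
class Job:
    name: str
    node: str          # node the job runs on (one lock per node is assumed)
    start_min: int     # scheduled start, minutes after midnight
    duration_min: int  # estimated runtime in minutes


def simulate(jobs, lock_timeout_min=LOCK_TIMEOUT_MIN):
    """Return {job name: (status, actual start)} for a simplified lock model."""
    lock_free_at = {}  # node -> minute at which its lock is released
    results = {}
    # Simplification: jobs acquire the lock in order of scheduled start.
    for job in sorted(jobs, key=lambda j: j.start_min):
        free_at = lock_free_at.get(job.node, 0)
        wait = max(0, free_at - job.start_min)
        if wait > lock_timeout_min:
            # The job gives up and never holds the lock.
            results[job.name] = ("timeout", None)
            continue
        begin = job.start_min + wait
        lock_free_at[job.node] = begin + job.duration_min
        results[job.name] = ("ok", begin)
    return results


if __name__ == "__main__":
    # Hypothetical numbers: the first job needs 9 hours.
    same_node = [
        Job("pool-a", "node1", 0, 540),   # starts 00:00
        Job("pool-b", "node1", 300, 60),  # starts 05:00
    ]
    print(simulate(same_node))

    # Same jobs, but the second one runs on a different node.
    spread = [
        Job("pool-a", "node1", 0, 540),
        Job("pool-b", "node2", 300, 60),
    ]
    print(simulate(spread))
```

With these made-up durations, the second job would have to wait 240 minutes on the same node and is reported as a timeout, while on a different node it starts at its scheduled time. Plugging in your real job durations per node should show which start times or node assignments keep every wait under the limit.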
But I'm more surprised that your jobs on a node seem to run for several hours and that you even hit a timeout. As long as I haven't restarted the VMs, my backups are all done in under 10 minutes. Even when that isn't the case, my infrastructure probably wouldn't be busy with backups for more than 2 hours in total.
If you back up a pool several times a day, then the delta should be significantly smaller. In that case the long backup time would worry me even more.
You might be able to optimize the configuration here and thus significantly reduce the backup time. Maybe the lock issue is just a symptom of jobs running into each other.