unable to acquire lock on snapshot directory - locked by another operation

guerby

Hi,

We have three PBS servers for our PVE, two on-site and one off-site.

All PVE servers back up to the same on-site PBS each night, about 250 VMs and a 60 TiB datastore.

Then remote sync jobs are launched, one on the second on-site PBS and, a bit later, one on the off-site PBS.

So we're following the recommended architecture described in this thread:

https://forum.proxmox.com/threads/process-a-second-backup-or-sync-the-first.139132/

However, about once per month one of the remote syncs fails with:

Code:
...
sync group vm/363 failed - unable to acquire lock on snapshot directory "/mnt/datastore/datastore1/vm/363/2024-02-17T01:43:28Z" - locked by another operation
...
TASK ERROR: sync failed with some errors.

Additional information from our analysis of these failures:
- Only one VM fails to sync each time.
- No other jobs are running on the PBS at that time.

We assume we're unlucky and both remote sync jobs end up trying to sync the same VM snapshot at the same time.

Any suggestion on how to deal with this issue?

Maybe PBS could:
- retry, at the end of the remote sync, the snapshots which have failed due to "locked by another operation"
- if my understanding is correct, remote syncs are read-only on the remote repo and so should not conflict if "multiple reader" locks are used (see the sketch below)
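To illustrate the "multiple reader" idea, here is a minimal Python sketch of shared vs. exclusive flock semantics. This is not PBS's actual locking code (PBS is written in Rust), and the lock file path is made up; it only shows why two concurrent read-only syncs could coexist while a writer excludes both:

Code:
import fcntl

# Demo of "multiple reader" (shared) vs exclusive file locks, the semantics
# suggested above for read-only sync sources. Hypothetical lock file; this
# only illustrates flock behaviour, it is not PBS code.
path = "/tmp/snapshot-dir.lock"

reader_a = open(path, "w")
reader_b = open(path, "w")

# Two shared (read) locks can be held at the same time: both calls succeed.
fcntl.flock(reader_a, fcntl.LOCK_SH | fcntl.LOCK_NB)
fcntl.flock(reader_b, fcntl.LOCK_SH | fcntl.LOCK_NB)

writer = open(path, "w")
try:
    # An exclusive (write) lock conflicts with the shared locks; with
    # LOCK_NB it fails immediately instead of blocking.
    fcntl.flock(writer, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    print("exclusive lock refused while shared locks are held")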

If needed we can open a Bugzilla report or a support ticket (we have basic support on two of the three PBS servers involved).

Thanks!
 
Hi!
could you post the full syslog from both the sending and the pulling side?
Theoretically we do use shared locks for reading, so two sync jobs reading from the same datastore at the same time should work. Maybe check again to be sure there was no backup or other job going on on either datastore.
 

Hi,

Indeed, after looking at the logs of a failed sync I noticed that a backup job to the first on-site PBS was taking much longer than usual and so was interfering with the sync job from the other PBS.

So in this case, only retrying the specific snapshot that failed, either later or at the end of the job, would make the sync successful; a rough sketch of a job-level retry is below.
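Something like this wrapper is what we have in mind as a stop-gap. It is only a sketch, assuming your PBS version exposes `proxmox-backup-manager sync-job run <id>` (verify against `proxmox-backup-manager sync-job help`); the job id and timings are made up:

Code:
import subprocess, sys, time

# Rough sketch of the workaround described above: re-run a sync job when it
# failed only because of the transient snapshot lock. The CLI subcommand is
# an assumption -- check it against your PBS version before relying on it.
JOB_ID = "sync-offsite-1"   # hypothetical sync job id
RETRIES = 3
WAIT_SECONDS = 600          # give the interfering backup time to finish

for attempt in range(1, RETRIES + 1):
    proc = subprocess.run(
        ["proxmox-backup-manager", "sync-job", "run", JOB_ID],
        capture_output=True, text=True,
    )
    if proc.returncode == 0:
        sys.exit(0)
    if "locked by another operation" not in proc.stdout + proc.stderr:
        sys.exit(proc.returncode)   # a different failure: don't mask it
    print(f"attempt {attempt}: snapshot still locked, retrying later")
    time.sleep(WAIT_SECONDS)
sys.exit(1)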

Thanks!
 
Hi, we see the same error: when a sync job starts while a backup is running on the PVE, we receive the error "unable to acquire lock on snapshot directory".

Maybe this could be handled better, for example by letting the PBS know that a backup is running, skipping that backup in the sync (as it is not complete yet) and leaving it for the next sync run, without making the whole sync fail.

What do you think?
 
Hey, I ran into this too.

I have a complex system with Backup and Sync Jobs between a number of PBS servers.
Then there's GC, Pruning, and Verify Jobs.

I posted about it, got some feedback from staff.
They seemed to think it's not usually a problem.
I did a spreadsheet of all my jobs to sort out any overlaps and rescheduled a few.

I don't know if I really fixed it.
It's infrequent enough now that it's dropped off my radar.
 
Hi, my backup runs every 4 hours at minute 00 and the sync jobs every hour at minute 00. I could probably resolve this by setting the sync jobs to minute 30, for example. But if a backup takes more than 30 minutes there is always a chance that this could happen.
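For example, the offset could look like this (illustrative calendar-event values; double-check the syntax against your PVE/PBS version):

Code:
# PVE backup job: every 4 hours, on the hour
0/4:00
# PBS sync job: hourly, at minute 30
*:30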
 
the race window is a lot smaller than "backup task" and "sync task" overlapping, thankfully - there is an issue where a snapshot already becomes visible while it is still locked (at the very end of the backup task). if that exact snapshot is attempted to be pulled, then you get this error (unless I misunderstand what you are doing, and you are mixing syncs and backups to the same target, which is a bad idea!)
 
mixing syncs and backups to the same target, which is a bad idea!

Hmm. Sync doesn't get talked about enough here.

I segregate backup and sync content with per-cluster VMID restrictions, namespaces, and backup ownership (see the layout sketch after this list):
- VMIDs should be unique across my whole enterprise, so we ain't mixing backups that way.
- Namespaces for datacenters and groups, with /sync and /back sub-folders.
- Backups are owned by a backup user. Syncs are owned by a sync user.
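A hypothetical layout along those lines (all names made up for illustration):

Code:
datastore1 (namespaces)
├── dc1/back   <- local PVE backups land here (owner: backup@pbs)
├── dc1/sync   <- pulled from the other site (owner: sync@pbs)
├── dc2/back
└── dc2/sync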

I think I'm good. But maybe not. What did that mean?
 
what I mean is that you shouldn't do the following:

- client (PVE) system does backups for group X in namespace N on PBS A
- (target) PBS A pulls backups for group X in namespace N from (source) PBS B (or B pushed to A)

if you do that, you will often run into locking (and other) issues, as it is essentially two clients attempting to back up into the same target.

if you instead only do the following (with the sync direction switched)

- client (PVE system) does backups for group X in namespace N on PBS A
- PBS B pulls from PBS A (or A pushes to B)

then you should be fine - the occasional locking warning for the last snapshot in the group can occur, if it becomes visible to the sync while the backup is not yet 100% done, but that snapshot will be picked up by the next sync run.
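to make the race concrete, here is a toy Python sketch (not PBS source; all names invented): the newest snapshot is visible before the backup task releases its lock, so a pull landing in that window fails on that one snapshot, succeeds on everything else, and the leftover is picked up next run:

Code:
from dataclasses import dataclass

# Toy model of the race described above: the newest snapshot is visible but
# its writer (the backup task) has not released the lock yet.
@dataclass
class Snapshot:
    name: str
    writer_done: bool   # has the backup task released its lock?

def pull_group(snapshots):
    skipped = []
    for snap in snapshots:
        if not snap.writer_done:
            skipped.append(snap.name)   # "locked by another operation"
            continue
        print(f"synced {snap.name}")
    return skipped                       # the next sync run picks these up

leftover = pull_group([
    Snapshot("2024-02-17T00:43:28Z", writer_done=True),
    Snapshot("2024-02-17T01:43:28Z", writer_done=False),  # race window
])
print("retry next run:", leftover)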
 
Thank you @fabian .

I use a two-tier structure, with primary bare-metal SSD servers taking the backups and secondary virtual servers acting as site-to-site sync and long-term capacity hubs.

It sounds like what I have built is your second, approved case. In fact, I only occasionally run into locking issues, and rescheduling has somewhat alleviated them.

....
Mmm.
And after thinking about it, my /back and /sync sub-folder namespaces should prevent the aforementioned, more serious sort of collision where two jobs try to write to the same space.
And after more consideration, there's no reasonable way to prevent the secondary case on a system that both takes backups and acts as a sync source ... other than studying the conflicts and rescheduling.
 