OK, I know that this is not an issue that most people are dealing with.
Code:
2023-02-02T09:27:20-06:00: sync group vm/105 failed - unable to acquire lock on snapshot directory "/mnt/datastore/PBS-01_ZFS-1/vm/105/2023-02-02T15:30:03Z" - locked by another operation
We take backups of our logging and monitoring solution every 20 minutes, offset by 10 (*:10, *:30, *:50). These backups are perfect, but we have a problem when syncing to our off-site backup in AWS. More often than not, the remote sync runs at the same time as the local backup (or while the backup is being verified). I expect this to happen, but the entire sync job is marked as failed, which is not really true: on the next run that backup will be synced, while the newest one gets flagged as a failure, and the cycle continues.
I tried:
- Adding an offset to the backups to minimize conflicts. It reduced the conflicts from every hour to every few hours, but 13 failures in 24 hours is the best I can get.
- Making a separate sync task for vm/105, but that would still generate errors.
- Disabling verification after backup, which is a setting I prefer to have on. When I scheduled verification at specific times instead, I tended to get two failed sync errors per day, which is better, but it also means that almost every backup fails to sync for 2-3 hours.
I have some thoughts on solutions:
- Retry syncing all locked backups at the end of the sync job. This could give those backups time to be released.
- Allow some sort of timeout for locked items. Maybe this would be paired with #1: start a job that attempts to sync the locked items again after 90 seconds or so, so the locked backups get another try before an error is thrown.
- The ability to flag a backup or sync job as "skippable": if any one job gets skipped more than once, throw an error; otherwise throw a warning.
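To make these ideas a bit more concrete, here is a rough sketch in Python of the behavior I have in mind. It is not PBS code and does not call any PBS API: `sync_group`, `SnapshotLocked`, `skip_counts`, `retry_delay` and `max_skips` are all names I made up for illustration. The caller supplies whatever actually performs the sync.

```python
import time
from typing import Callable, Dict, Iterable, List


class SnapshotLocked(Exception):
    """Hypothetical error: the snapshot is locked by another operation."""


def sync_with_deferred_retry(
    groups: Iterable[str],
    sync_group: Callable[[str], None],  # hypothetical: syncs one backup group
    skip_counts: Dict[str, int],        # kept between runs: consecutive skips
    retry_delay: float = 90.0,          # seconds to wait before the second pass
    max_skips: int = 1,                 # skips tolerated before it is an error
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, List[str]]:
    """Sync every group once, then retry the locked ones after a delay."""
    result: Dict[str, List[str]] = {"synced": [], "warned": [], "failed": []}
    deferred: List[str] = []

    # First pass: a locked group is set aside instead of failing the job.
    for group in groups:
        try:
            sync_group(group)
        except SnapshotLocked:
            deferred.append(group)
        else:
            skip_counts[group] = 0
            result["synced"].append(group)

    # Second pass: wait once, then give the locked groups another try.
    if deferred:
        sleep(retry_delay)
    for group in deferred:
        try:
            sync_group(group)
        except SnapshotLocked:
            skip_counts[group] = skip_counts.get(group, 0) + 1
            # Skipped more than max_skips runs in a row: error, else warning.
            bucket = "failed" if skip_counts[group] > max_skips else "warned"
            result[bucket].append(group)
        else:
            skip_counts[group] = 0
            result["synced"].append(group)

    return result
```

With the defaults, a group that is still locked after the 90-second wait only produces a warning on the first run. It becomes a failure only if it is skipped again on the following run, and a successful sync resets its counter.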
Anyway, these are just my ideas, and I am open to other suggestions.