verification job is taking forever...

Sep 26, 2023
Here's my config.
The server at site A does the backups and also does a sync and verify. All works fine here.
The server at site B does a scheduled pull for backups and also has a scheduled 'verify' job.

I have several large servers that I back up weekly, and some once a month. Those backups might be 300+ GB in size.
The replication of those jobs works without issue - it's just the verify process that takes forever (12+ hours due to the larger server syncs).
For instance, I have a 350 GB file. It was backed up and verified "locally" at the main site in a short time. The sync took a bit longer as it's a larger file.
I have a scheduled 'every 2 days' verification job at the remote site that runs against the 'new' files there. Whenever this happens, I get 'failed backup' job messages because the verify is still running and has that image locked during its verification process. And, of course, the verification of other jobs has to wait until the 'large server' is verified before it can continue.

Question: If I have done a backup locally, and verified it locally, and I am using ZFS for my storage environment... is the verification still needed on the other side? I can't find a way to verify those few large servers individually with another job, unless I do something creative like creating a namespace, putting those servers in it, and creating another job that references those servers.

Is anyone else experiencing this, or does anyone have recommendations for resolving a 'verify' job which takes forever (12+ hours)? I'd thought that the verification was only happening on the 'remote' side once the files had been moved over... but is verification trying to check what is "now" local against what it pulled from the main site? If so, this seems inefficient.
 
Whenever this happens, I get 'failed backup' job messages because the verify is still running and has that image locked during its verification process.
So do I understand you correctly: you are not only pulling in backup snapshots from your Proxmox Backup Server at site A to the remote site B, but are also performing backups directly to the same datastore at site B? If so, maybe you should consider using a namespace to separate the snapshots pulled in by the sync job from those created by the direct backups (note that they will still be deduplicated when placed in different namespaces, as that happens at the datastore level).
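For reference, roughly how that could look on the CLI (the datastore, remote and job names here are made up, and this assumes a PBS version recent enough to have namespaces; the same can be done via the GUI):

    # Create a namespace on the site B datastore to hold the synced snapshots.
    proxmox-backup-client namespace create from-site-a --repository root@pam@localhost:store-b

    # Point the existing sync job at that namespace, so direct backups and
    # pulled snapshots no longer share the root namespace.
    proxmox-backup-manager sync-job update sync-from-a --ns from-site-a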

Question: If I have done a backup locally, and verified it locally, and I am using ZFS for my storage environment... is the verification still needed on the other side?
Well, the snapshot chunks and index files are checksummed during the sync job, so if the ZFS storage correctly persists them to disk, you should be fine without the additional verify, as ZFS will do the data integrity checks for you in that case.
Note however that this will depend on the redundancy of your ZFS storage. If you have no redundancy, then ZFS can detect the corruption, but cannot fix it. In that case a verify job is needed for Proxmox Backup Server to see and flag potentially corrupt chunks, so they can be re-uploaded by a backup or sync job if possible. Also, without the verify job you rely solely on the ZFS implementation; with a verify job, a potential data corruption bug in ZFS might be detected early on.
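If you do rely on ZFS for integrity instead of frequent verify jobs, a regular scrub is the ZFS-side counterpart of a verify run (the pool name here is just an example):

    # Read back all data on the pool and check it against its checksums;
    # with redundancy, detected errors are repaired from the good copies.
    zpool scrub tank

    # Check scrub progress and any errors found so far.
    zpool status tank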

Whenever this happens, I get 'failed backup' job messages because the verify is still running and has that image locked during its verification process.
For what operation do you get this message? Can you post the full task log for this?

 
Not pulling backups in at site A.
Site A has 2 servers, PVE and PBS. PVE does the backups onto the shared storage on the PBS server. Pretty normal.
Site A is also doing a verify of the backups daily to be sure the backups are 'good' (from the PVE environment).

Site B also has 2 servers, PVE and PBS. Site B's PBS (Proxmox Backup Server) does a scheduled pull (hourly) from Site A to the (shared) datastore at Site B.
Site B is also doing a verify against the data within the datastore that was pulled from Site A, to be sure the backups are good. I'd have thought that this process would do the following: re-download the small metadata file associated with the index file and compare it against the "now" downloaded file from the backup pull process to verify it is correct, and then continue on - but I wonder if the whole thing (all 4 files) is being re-downloaded to verify its integrity before continuing on.

I do see 4 files associated with a server: client.log.blob, drive-sata0.img.fidx, index.json.blob, and qemu-server.conf.blob. I think that the index.json.blob has the 'file information' in it and would think that it's the only one that needs to be re-downloaded and verified, perhaps along with the checksum - but I'm not sure.

What do you mean by redundancy? I do have several drives in my ZFS datastore for drive failure, if that's what you mean by redundancy.

Regarding the log of errors - since I'm so far into the currently running verify job (at the Site B location), I'm not going to stop it. The error is coming from Site B, but if I remember correctly it states an issue with a 'locked file' and being unable to access it. Whenever I see this issue, I check whether a verify is running - and it turns out the verify has the 'data' for a specific VM open/locked during verification, so the 'sync' process can't complete. This brings me back to the issue: the 'largest file' in the backup is the drive-xxx.img.fidx file, which seems to be re-downloading the data again for its verification... and with it being 'large' on some of my servers, the latest backup can't be pulled/synced to the remote side. All other servers within the 'pull/sync' process complete without issues, and I just have to rerun the sync again after the 'forever taking' verify job finishes at Site B.

I will upload the latest verify job log after it's done. If there is a way to pull/download a verify job log from an earlier date (mine seem to overwrite themselves), please advise and I'll pull one and attach it to this case.

mark
 
Not pulling backups in at site A.
Site A has 2 servers, PVE and PBS. PVE does the backups onto the shared storage on the PBS server. Pretty normal.
Site A is also doing a verify of the backups daily to be sure the backups are 'good' (from the PVE environment).
My question was rather whether you also back up to site B directly. My intention was to understand where exactly you saw the locking-related error.

Site B is also doing a verify against the data within the datastore that was pulled from Site A, to be sure the backups are good.
As previously stated, the sync job already checks integrity on transfer; verifying makes sense if you don't trust the underlying storage to retain your data's integrity.
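If you do keep the verify job at site B, you can at least stop it from re-checking snapshots that already verified successfully. A sketch with a made-up job ID (both options are also exposed in the verify job settings in the GUI):

    # Skip snapshots with an existing successful verification result, and
    # only re-verify them once that result is older than 30 days.
    proxmox-backup-manager verify-job update verify-store-b \
        --ignore-verified true --outdated-after 30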

What do you mean by redundancy? I do have several drives in my ZFS datastore for drive failure, if that's what you mean by redundancy.
I mean the ZFS redundancy, i.e. which kind of RAID configuration you are using. E.g. you could have multiple vdevs without redundancy in your zpool. But I do assume you have some sort of mirror.
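You can check the layout of your pool like this (pool name an example, output shortened):

    zpool status tank
    #   pool: tank
    #  state: ONLINE
    # config:
    #     NAME        STATE
    #     tank        ONLINE
    #       raidz2-0  ONLINE   <- the redundancy sits at the vdev level
    #         sda     ONLINE
    #         ...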

I'd have thought that this process would do the following: re-download the small metadata file associated with the index file and compare it against the "now" downloaded file from the backup pull process to verify it is correct, and then continue on - but I wonder if the whole thing (all 4 files) is being re-downloaded to verify its integrity before continuing on.

I do see 4 files associated with a server: client.log.blob, drive-sata0.img.fidx, index.json.blob, and qemu-server.conf.blob. I think that the index.json.blob has the 'file information' in it and would think that it's the only one that needs to be re-downloaded and verified, perhaps along with the checksum - but I'm not sure.
Well, no - all of those files are required to have a consistent backup snapshot for your VM. And not only these are required, but also the corresponding chunks located in your datastore's .chunks folder. The verify job makes sure all of these are present and correct.
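To illustrate (paths are made up; the actual datastore root depends on your setup): the snapshot directory only holds the small index and blob files, while the bulk of the data lives as content-addressed chunks under .chunks, shared by all snapshots of the datastore:

    ls /mnt/datastore/store-b/vm/204/2024-08-13T09:00:00Z/
    # client.log.blob  drive-sata0.img.fidx  index.json.blob  qemu-server.conf.blob

    ls /mnt/datastore/store-b/.chunks/ | head -n 3
    # 0000
    # 0001
    # 0002   <- 65536 prefix directories holding the actual chunk files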

This brings me back to the issue: the 'largest file' in the backup is the drive-xxx.img.fidx file, which seems to be re-downloading the data again for its verification...
No, verification does not download the data from the remote; it reads and checks the chunks of your local PBS datastore. But depending on your underlying storage, that may take time... Please share more details about the storage used for the datastore at site B. I suspect that it simply does not provide the required IOPS for the verify job.
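To get a feel for whether the disks are the bottleneck, you could measure random read IOPS on the datastore with fio, for example (the test file path is a placeholder; delete the file afterwards):

    # 4k random reads, roughly the access pattern of a verify walking chunks.
    fio --name=chunk-read-test --filename=/mnt/datastore/store-b/fio-test \
        --size=4G --rw=randread --bs=4k --direct=1 \
        --ioengine=libaio --iodepth=32 --runtime=60 --time_based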

I will upload the latest verify job log after it's done. If there is a way to pull/download a verify job log from an earlier date (mine seem to overwrite themselves), please advise and I'll pull one and attach it to this case.
You can select the previous job from the task list and, after double-clicking it, download it via the Download button.
 
By 'all of the files needing to be verified', does this mean that if I have a 60 GB VM containing those 4 files (plus a meg or two for the index/qemu blobs), I have to verify all those files against my local side? I've checked the 'local' files on Site B and it already has all that data, minus whatever might have changed since the last pull. You're probably correct that the process is checking the newly sync'd data against the data which is already local (meaning Site B checking the previously verified job against the newly sync'd data on the same side), but it sure takes quite a long time with server files that are larger than 150 GB.

As far as the server config: it is a single-processor AMD EPYC 16-core with 128 GB of RAM. The underlying hardware consists of 2 SSDs for the OS and 8 SATA 7200 RPM drives in a RAIDZ2 configuration. The CPU rarely goes over 30% and memory hovers around 90 GB all the time. I know the storage platform isn't SSDs, but since this is all it does, it should be pretty adequate.

Here's the log from the last sync, which shows the message I previously mentioned regarding the 'locked file' and the subsequent error on completion. It references, in this case, group vm/204. '204' is a VM with a 160 GB drive in it. VM 202 has a 60 GB drive, so almost 1/3 the size, but its verification is generally done in about 15 minutes - not many hours.

Here are a couple of other screenshots - the PVE and PBS, for reference.

PVE - [screenshot: 1723561732703.png]

PBS - [screenshot: 1723561796509.png]
 

Attachments

  • task-nocpbs-syncjob-2024-08-13T09_00_00Z.log (7.2 KB)
Those jobs take forever... your PBS is virtualised and you're looking at 30% IO delay. What's the underlying hardware (and storage) here?
 
Yes, I know it's virtualized, as that was my only choice initially. I'm working on another server build now so that, in the future, I'll have a dedicated PBS (storage) environment. The info for the main server at site B and the VM is pictured above. That said, and even though it is virtualized, the smaller (less than 100 GB) verify jobs run pretty quickly... just the larger files are giving issues.
Since I'm using ZFS on both sides, and it provides redundancy for files, I might look into zfs send as an option.
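In case I go that route, an incremental zfs send/receive between the sites would look roughly like this (pool/dataset names and the remote host are placeholders; this would replicate the datastore as a whole, outside of PBS's own sync and verify machinery):

    # Snapshot the dataset backing the PBS datastore on site A...
    zfs snapshot tank/pbs-store@2024-08-14

    # ...and send only the delta since the previous snapshot to site B.
    zfs send -i tank/pbs-store@2024-08-13 tank/pbs-store@2024-08-14 \
        | ssh pbs-site-b zfs receive -F tank/pbs-store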
 
