Verifying a snapshot takes 4 to 5 times longer than the backup

Kodey

Member
Oct 26, 2021
I have one snapshot on my device which takes 4.21 hours to back up on ZFS and 4.30 hours on XFS.
In both cases, verifying that single snapshot takes over 20 hours on ZFS and 17 on XFS.
Is this normal? How is this possible, and is there any way to improve this setup?

Also, even opening the log in the task viewer takes several minutes, during which the browser becomes unresponsive until the log is displayed.
This highlights two things about the logs: first, there is excessive, unnecessary logging; and second, the web page that displays the log is synchronous, which is sub-optimal for displaying large quantities of data. There is also the possibility that logging introduces lag into the backup process.
 
Hi,
I have one snapshot on my device which takes 4.21 hours to back up on ZFS and 4.30 hours on XFS.
In both cases, verifying that single snapshot takes over 20 hours on ZFS and 17 on XFS.
Is this normal? How is this possible, and is there any way to improve this setup?
Verification needs to read all the data and compute checksums. That is a very expensive operation.
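To illustrate why that is so expensive, here is a minimal sketch (not the actual PBS implementation) of what a verify pass has to do in a content-addressed chunk store. The assumed layout, where each chunk file is named after its SHA-256 hex digest, is a simplification for illustration:

```python
import hashlib
from pathlib import Path

def verify_chunk(path: Path) -> bool:
    """Re-read the chunk in full, recompute its SHA-256, and compare
    it to the digest in its file name (hypothetical layout: the file
    is named after its hex digest, as in a content-addressed store)."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest == path.name

def verify_datastore(chunk_dir: Path) -> int:
    """Return the number of corrupt chunks. Every chunk is read in
    full, so total I/O equals the whole datastore size, regardless
    of how little changed since the last backup."""
    bad = 0
    for chunk in chunk_dir.iterdir():
        if not verify_chunk(chunk):
            bad += 1
    return bad
```

A backup, by contrast, only has to write chunks that are new, which is why verify can easily touch far more data than the backup that produced the snapshot.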
Also, even opening the log in the task viewer takes several minutes, during which the browser becomes unresponsive until the log is displayed.
This highlights two things about the logs: first, there is excessive, unnecessary logging; and second, the web page that displays the log is synchronous, which is sub-optimal for displaying large quantities of data. There is also the possibility that logging introduces lag into the backup process.
What does the load on the server look like during verification? Maybe it's simply overloaded.

What kind of physical disks do you have?
 
Verification needs to read all the data and compute checksums. That is a very expensive operation.
Read operations are normally about twice as fast as writes. This is much slower; so slow, in fact, that it makes verification impractical.
Is this really to be expected? Four to five times slower than writing the backup?
What does the load on the server look like during verification? Maybe it's simply overloaded.
Load looks fine. I'm seeing averages of 10% I/O, 5% CPU, and 60% RAM while 2 VMs and 2 appliances are running.

What kind of physical disks do you have?
These are 14 TB USB 3.0 disks. I know they're slow, but I'm really only asking about the difference between backup and verification times, which shouldn't depend on the media per se.

I'm thinking either my tuning is horribly wrong or there is some kind of bug in the verify process.
Either way, there must be something I can do about it. Can you help?
 
Read operations are normally about twice as fast as writes. This is much slower; so slow, in fact, that it makes verification impractical.
Is this really to be expected? Four to five times slower than writing the backup?
Computing the checksum is usually the expensive part. And for a backup, only new chunks need to be written, while for a verify, everything needs to be read and checked. You can also run proxmox-backup-client benchmark --repository <your repository> to get an idea of how fast the different operations are.

Load looks fine. I'm seeing averages of 10% I/O, 5% CPU, and 60% RAM while 2 VMs and 2 appliances are running.
On the PBS server?
 
Computing the checksum is usually the expensive part. And for a backup, only new chunks need to be written, while for a verify, everything needs to be read and checked. You can also run proxmox-backup-client benchmark --repository <your repository> to get an idea of how fast the different operations are.
Code:
Uploaded 1090 chunks in 5 seconds.
Time per request: 4595 microseconds.
TLS speed: 912.71 MB/s
SHA256 speed: 1874.68 MB/s
Compression speed: 537.08 MB/s
Decompress speed: 634.02 MB/s
AES256/GCM speed: 1390.32 MB/s
Verify speed: 474.38 MB/s
┌───────────────────────────────────┬────────────────────┐
│ Name                              │ Value              │
╞═══════════════════════════════════╪════════════════════╡
│ TLS (maximal backup upload speed) │ 912.71 MB/s (74%)  │
├───────────────────────────────────┼────────────────────┤
│ SHA256 checksum computation speed │ 1874.68 MB/s (93%) │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 compression speed    │ 537.08 MB/s (71%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 decompression speed  │ 634.02 MB/s (53%)  │
├───────────────────────────────────┼────────────────────┤
│ Chunk verification speed          │ 474.38 MB/s (63%)  │
├───────────────────────────────────┼────────────────────┤
│ AES256 GCM encryption speed       │ 1390.32 MB/s (38%) │
└───────────────────────────────────┴────────────────────┘

At 474.38 MB/s, verify runs at about 0.000474 TB/s, i.e. roughly 2110 seconds per TB, or about 35 minutes per TB.
At that rate, 4.2 TB should verify in about 2.46 hours.
But mine takes 20+ hours.
The XFS USB disk gives very similar results.

With 10% I/O and 5% CPU, how can I identify the bottleneck?
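The back-of-the-envelope calculation above can be checked directly (the 474.38 MB/s figure is the "Chunk verification speed" from the benchmark output, and 1 TB is taken as 10^6 MB):

```python
verify_mb_s = 474.38   # benchmark "Chunk verification speed"
snapshot_tb = 4.2      # approximate snapshot size in TB

seconds_per_tb = 1_000_000 / verify_mb_s          # ~2108 s per TB
hours_total = seconds_per_tb * snapshot_tb / 3600 # ~2.46 h for 4.2 TB

print(f"{seconds_per_tb:.0f} s/TB, {hours_total:.2f} h total")
# prints: 2108 s/TB, 2.46 h total
```

So the CPU-side verify speed alone predicts roughly 2.5 hours; a 20+ hour run means the bottleneck is elsewhere, most likely in reading the chunks from disk.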
 
I've started again, and I see it takes 19 hours for the initial snapshot and only 4.2 hours for subsequent snapshots.
The trouble is that it always takes 21 hours to verify a snapshot, and I'm wondering why it doesn't skip already-verified chunks the same way it skips backing up identical chunks.

For example, if a storage contains two snapshots of the same VM and the first one has already been successfully verified, the scheduled verification should only need to verify the chunks that have changed since the first snapshot's verification started.
Instead, it verifies the whole second snapshot again.

Maybe there could be an option to re-verify either all chunks, or only those that haven't been verified within the 'Re-Verify after (days)' window (or within some user-supplied interval for manual verification).
 
I've started again, and I see it takes 19 hours for the initial snapshot and only 4.2 hours for subsequent snapshots.
The trouble is that it always takes 21 hours to verify a snapshot, and I'm wondering why it doesn't skip already-verified chunks the same way it skips backing up identical chunks.
Because then you couldn't say that a snapshot has been verified at a given point in time, since only part of it was.

For example, if a storage contains two snapshots of the same VM and the first one has already been successfully verified, the scheduled verification should only need to verify the chunks that have changed since the first snapshot's verification started.
Instead, it verifies the whole second snapshot again.
Are these two different verification tasks? If so, they do not keep track of what the other already checked.

Maybe there could be an option to re-verify either all chunks, or only those that haven't been verified within the 'Re-Verify after (days)' window (or within some user-supplied interval for manual verification).
In principle, yes, but it might go against some design decisions. Having a verified snapshot means that all chunks of that snapshot were checked (at a given time). You can already configure after how much time a snapshot is re-verified. If you start re-checking only some chunks after a given time, then you can't attach a date to when the snapshot as a whole was last verified.
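The trade-off can be made concrete with a small sketch. Note this is a hypothetical per-chunk tracking scheme, not how PBS works today (PBS records verification state per snapshot, not per chunk):

```python
# Hypothetical per-chunk verification tracking.
# Maps chunk digest -> unix timestamp of its last successful check.
last_verified: dict[str, float] = {}

REVERIFY_AFTER = 30 * 24 * 3600  # "Re-Verify after (days)" = 30

def chunks_to_check(snapshot_chunks: list[str], now: float) -> list[str]:
    """Return only the chunks whose last check falls outside the
    re-verify window; recently verified chunks are skipped."""
    return [c for c in snapshot_chunks
            if now - last_verified.get(c, 0.0) > REVERIFY_AFTER]

def mark_verified(chunks: list[str], now: float) -> None:
    for c in chunks:
        last_verified[c] = now
```

With this scheme a verify run on a mostly unchanged snapshot would touch only a handful of chunks, but the guarantee weakens to "every chunk was checked within the last 30 days" rather than "this snapshot was fully verified at time T", which is exactly the design tension described in this exchange.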
 
you can't attach a date to when the snapshot was last verified.
True, I see the dilemma. Attaching a verified date to a snapshot is a worthwhile goal, but I'd sacrifice it in a flash for the ability to drastically reduce verify times. The downside seems small.
I know it's a paradigm shift, but it's a worthy achievement if you can still guarantee that all chunks have been verified within the last 30 days, or whatever 'Re-Verify after (days)' is set to.
The resources consumed in continuously re-verifying the same chunks seem like overkill, considering that an image that hasn't been verified for 30 days may contain only a few chunks that weren't verified recently.
All that extra activity also causes more disk wear and power consumption.

Maybe there could be an option to re-verify either all chunks of an image/snapshot, or only those not verified within the last 'Re-Verify after (days)' window?
 
AFAIU, you verify new backups directly or have a very frequent verification job schedule. If you reduce the schedule frequency, so that each verify job only runs after multiple new backups were taken, you can avoid most of the duplicate work. Within a given verify job, even if multiple snapshots reference the same chunk, it should only be checked once.
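The within-job deduplication described here can be sketched as follows (an illustrative model of the behavior, not PBS code; the snapshot names and helper are made up):

```python
def verify_job(snapshots: dict[str, list[str]], verify_chunk) -> int:
    """One verify task over several snapshots. Chunks shared between
    snapshots (the common case for successive backups of the same VM)
    are checked only once per job. `snapshots` maps snapshot name ->
    list of chunk digests; `verify_chunk` checks a single digest."""
    seen: set[str] = set()
    checked = 0
    for name, digests in snapshots.items():
        for d in digests:
            if d in seen:
                continue  # already verified earlier in this job
            verify_chunk(d)
            seen.add(d)
            checked += 1
    return checked
```

This is why running one less-frequent job over several new snapshots costs far less than verifying each snapshot in its own task: separate tasks each start with an empty `seen` set.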

You can open a feature request on the bug tracker if you want. If enough people are interested, it can be evaluated more closely then: https://bugzilla.proxmox.com/
 
