Verifying a snapshot takes 4 to 5 times longer than the backup

Kodey

Member
Oct 26, 2021
I have one snapshot on my device which takes 4.21 hours to back up on ZFS and 4.30 hours on XFS.
In both cases, verifying that single snapshot takes over 20 hours on ZFS and 17 on XFS.
Is this normal? How is this possible, and is there any way to improve this setup?

Also, even opening the log in the task viewer takes several minutes, during which the browser becomes unresponsive until the log is displayed.
This highlights two things about the logs: first, there is excessive, unnecessary logging; and second, the web page that displays the log is synchronous, which is sub-optimal for displaying large quantities of data. There is also the possibility that logging introduces lag into the backup process.
 
Hi,
I have one snapshot on my device which takes 4.21 hours to back up on ZFS and 4.30 hours on XFS.
In both cases, verifying that single snapshot takes over 20 hours on ZFS and 17 on XFS.
Is this normal? How is this possible, and is there any way to improve this setup?
Verification needs to read all the data and compute checksums. That is a very expensive operation.
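To illustrate why that is so expensive, here is a minimal sketch (not the actual PBS implementation) of what a verify pass has to do in a content-addressed chunk store. The assumed layout, where each chunk file is named after its SHA-256 hex digest, is a simplification for illustration:

```python
import hashlib
from pathlib import Path

def verify_chunk(path: Path) -> bool:
    """Re-read the chunk in full, recompute its SHA-256, and compare
    it to the digest in its file name (hypothetical layout: the file
    is named after its hex digest, as in a content-addressed store)."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest == path.name

def verify_datastore(chunk_dir: Path) -> int:
    """Return the number of corrupt chunks. Every chunk is read in
    full, so total I/O equals the whole datastore size, regardless
    of how little changed since the last backup."""
    bad = 0
    for chunk in chunk_dir.iterdir():
        if not verify_chunk(chunk):
            bad += 1
    return bad
```

A backup, by contrast, only has to write chunks that are new, which is why verify can easily touch far more data than the backup that produced the snapshot.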
Also, even opening the log in the task viewer takes several minutes, during which the browser becomes unresponsive until the log is displayed.
This highlights two things about the logs: first, there is excessive, unnecessary logging; and second, the web page that displays the log is synchronous, which is sub-optimal for displaying large quantities of data. There is also the possibility that logging introduces lag into the backup process.
What does the load on the server look like during verification? Maybe it's simply overloaded.

What kind of physical disks do you have?
 
Verification needs to read all the data and compute checksums. That is a very expensive operation.
Read operations are normally about twice as fast as writes. This is much slower; so slow, in fact, that it makes verification impractical.
Is this really to be expected? Four to five times slower than writing the backup?
What does the load on the server look like during verification? Maybe it's simply overloaded.
Load looks fine. I'm seeing averages of 10% I/O, 5% CPU, and 60% RAM while 2 VMs and 2 appliances are running.

What kind of physical disks do you have?
These are 14 TB USB 3.0 disks. I know they're slow, but I'm really only asking about the difference between backup and verification times, which shouldn't depend on the media per se.

I'm thinking either my tuning is horribly wrong or there is some kind of bug in the verify process.
Either way, there must be something I can do about it. Can you help?
 
Read operations are normally about twice as fast as writes. This is much slower; so slow, in fact, that it makes verification impractical.
Is this really to be expected? Four to five times slower than writing the backup?
Computing the checksum is usually the expensive part. And for a backup, only new chunks need to be written, while for a verify, everything needs to be read and checked. You can also run proxmox-backup-client benchmark --repository <your repository> to get an idea of how fast the different operations are.

Load looks fine. I'm seeing averages of 10% I/O, 5% CPU, and 60% RAM while 2 VMs and 2 appliances are running.
On the PBS server?
 
Computing the checksum is usually the expensive part. And for a backup, only new chunks need to be written, while for a verify, everything needs to be read and checked. You can also run proxmox-backup-client benchmark --repository <your repository> to get an idea of how fast the different operations are.
Code:
Uploaded 1090 chunks in 5 seconds.
Time per request: 4595 microseconds.
TLS speed: 912.71 MB/s
SHA256 speed: 1874.68 MB/s
Compression speed: 537.08 MB/s
Decompress speed: 634.02 MB/s
AES256/GCM speed: 1390.32 MB/s
Verify speed: 474.38 MB/s
┌───────────────────────────────────┬────────────────────┐
│ Name                              │ Value              │
╞═══════════════════════════════════╪════════════════════╡
│ TLS (maximal backup upload speed) │ 912.71 MB/s (74%)  │
├───────────────────────────────────┼────────────────────┤
│ SHA256 checksum computation speed │ 1874.68 MB/s (93%) │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 compression speed    │ 537.08 MB/s (71%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 decompression speed  │ 634.02 MB/s (53%)  │
├───────────────────────────────────┼────────────────────┤
│ Chunk verification speed          │ 474.38 MB/s (63%)  │
├───────────────────────────────────┼────────────────────┤
│ AES256 GCM encryption speed       │ 1390.32 MB/s (38%) │
└───────────────────────────────────┴────────────────────┘

At 474.38 MB/s, verify runs at about 0.000474 TB/s, i.e. roughly 2110 seconds per TB, or about 35 minutes per TB.
At that rate, 4.2 TB should verify in about 2.46 hours.
But mine takes 20+ hours.
The XFS USB disk gives very similar results.

With 10% I/O and 5% CPU, how can I identify the bottleneck?
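The back-of-the-envelope calculation above can be checked directly (the 474.38 MB/s figure is the "Chunk verification speed" from the benchmark output, and 1 TB is taken as 10^6 MB):

```python
verify_mb_s = 474.38   # benchmark "Chunk verification speed"
snapshot_tb = 4.2      # approximate snapshot size in TB

seconds_per_tb = 1_000_000 / verify_mb_s          # ~2108 s per TB
hours_total = seconds_per_tb * snapshot_tb / 3600 # ~2.46 h for 4.2 TB

print(f"{seconds_per_tb:.0f} s/TB, {hours_total:.2f} h total")
# prints: 2108 s/TB, 2.46 h total
```

So the CPU-side verify speed alone predicts roughly 2.5 hours; a 20+ hour run means the bottleneck is elsewhere, most likely in reading the chunks from disk.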
 
I've started again, and I see it takes 19 hours for the initial snapshot and only 4.2 hours for subsequent snapshots.
The trouble is that it always takes 21 hours to verify a snapshot, and I'm wondering why it doesn't skip already-verified chunks the same way it skips backing up identical chunks.

For example, if a storage contains two snapshots of the same VM and the first one has already been successfully verified, the scheduled verification should only need to verify the chunks that have changed since the first snapshot's verification started.
Instead, it verifies the whole second snapshot again.

Maybe there could be an option to re-verify either all chunks, or only those that haven't been verified within the 'Re-Verify after (days)' window (or within some user-supplied interval for manual verification).
 
I've started again, and I see it takes 19 hours for the initial snapshot and only 4.2 hours for subsequent snapshots.
The trouble is that it always takes 21 hours to verify a snapshot, and I'm wondering why it doesn't skip already-verified chunks the same way it skips backing up identical chunks.
Because then you couldn't say that a snapshot has been verified at a given point in time, since only part of it was.

For example, if a storage contains two snapshots of the same VM and the first one has already been successfully verified, the scheduled verification should only need to verify the chunks that have changed since the first snapshot's verification started.
Instead, it verifies the whole second snapshot again.
Are these two different verification tasks? If so, they do not keep track of what the other already checked.

Maybe there could be an option to re-verify either all chunks, or only those that haven't been verified within the 'Re-Verify after (days)' window (or within some user-supplied interval for manual verification).
In principle, yes, but it might go against some design decisions. Having a verified snapshot means that all chunks of that snapshot were checked (at a given time). You can already configure after how much time a snapshot is re-verified. If you start re-checking only some chunks after a given time, then you can't attach a date to when the snapshot as a whole was last verified.
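The trade-off can be made concrete with a small sketch. Note this is a hypothetical per-chunk tracking scheme, not how PBS works today (PBS records verification state per snapshot, not per chunk):

```python
# Hypothetical per-chunk verification tracking.
# Maps chunk digest -> unix timestamp of its last successful check.
last_verified: dict[str, float] = {}

REVERIFY_AFTER = 30 * 24 * 3600  # "Re-Verify after (days)" = 30

def chunks_to_check(snapshot_chunks: list[str], now: float) -> list[str]:
    """Return only the chunks whose last check falls outside the
    re-verify window; recently verified chunks are skipped."""
    return [c for c in snapshot_chunks
            if now - last_verified.get(c, 0.0) > REVERIFY_AFTER]

def mark_verified(chunks: list[str], now: float) -> None:
    for c in chunks:
        last_verified[c] = now
```

With this scheme a verify run on a mostly unchanged snapshot would touch only a handful of chunks, but the guarantee weakens to "every chunk was checked within the last 30 days" rather than "this snapshot was fully verified at time T", which is exactly the design tension described in this exchange.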
 
you can't attach a date to when the snapshot was last verified.
True, I see the dilemma. Attaching a verified date to a snapshot is a worthwhile goal, but I'd sacrifice it in a flash for the ability to drastically reduce verify times. The downside seems small.
I know it's a paradigm shift, but it's a worthy achievement if you can still guarantee that all chunks have been verified within the last 30 days, or whatever 'Re-Verify after (days)' is set to.
The resources consumed in continuously re-verifying the same chunks seem like overkill, considering that an image that hasn't been verified for 30 days may contain only a few chunks that weren't verified recently.
All that extra activity also causes more disk wear and power consumption.

Maybe there could be an option to re-verify either all chunks of an image/snapshot, or only those not verified within the last 'Re-Verify after (days)' window?
 
AFAIU, you verify new backups directly or have a very frequent verification job schedule. If you reduce the schedule frequency, so that each verify job only runs after multiple new backups were taken, you can avoid most of the duplicate work. Within a given verify job, even if multiple snapshots reference the same chunk, it should only be checked once.
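The within-job deduplication described here can be sketched as follows (an illustrative model of the behavior, not PBS code; the snapshot names and helper are made up):

```python
def verify_job(snapshots: dict[str, list[str]], verify_chunk) -> int:
    """One verify task over several snapshots. Chunks shared between
    snapshots (the common case for successive backups of the same VM)
    are checked only once per job. `snapshots` maps snapshot name ->
    list of chunk digests; `verify_chunk` checks a single digest."""
    seen: set[str] = set()
    checked = 0
    for name, digests in snapshots.items():
        for d in digests:
            if d in seen:
                continue  # already verified earlier in this job
            verify_chunk(d)
            seen.add(d)
            checked += 1
    return checked
```

This is why running one less-frequent job over several new snapshots costs far less than verifying each snapshot in its own task: separate tasks each start with an empty `seen` set.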

You can open a feature request on the bug tracker if you want. If enough people are interested, it can be evaluated more closely then: https://bugzilla.proxmox.com/
 
