Help - Snapshot restore fails with short vma error

sbs

Member
Mar 30, 2021
Hi,

I needed to restore a VM today, and to my surprise the restore failed with a "short vma" error:
Code:
_30-00_07_13.vma.zst : Decoding error (36) : Restored data doesn't match checksum
vma: restore failed - short vma extent (608768 < 655872)

My problem is that I have several snapshots, and they all seem to fail in the same way.

I have tried all of the backups, restoring to a new VM on the same node and to a new VM on another node, but I always get that short vma error.

It seems surprising that they all fail with the same error, even though the backup logs show no errors.

The backup log only points out that the VM is running and has snapshots that will not be backed up.

FWIW, the log looks like this:
Code:
2024-06-23 00:07:17 INFO: Starting Backup of VM 112 (qemu)
2024-06-23 00:07:17 INFO: status = running
2024-06-23 00:07:17 INFO: VM Name: VM-TEST-WIN10-1
2024-06-23 00:07:17 INFO: include disk 'ide0' 'local-lvm:vm-112-disk-0' 75G
2024-06-23 00:07:17 INFO: backup mode: snapshot
2024-06-23 00:07:17 INFO: ionice priority: 7
2024-06-23 00:07:17 INFO: skip unused drive 'lvm_n1:vm-112-disk-0' (not included into backup)
2024-06-23 00:07:17 INFO: snapshots found (not included into backup)
2024-06-23 00:07:17 INFO: creating vzdump archive '/mnt/pve/DS2_VMBackup/dump/vzdump-qemu-112-2024_06_23-00_07_17.vma.zst'
2024-06-23 00:07:17 INFO: issuing guest-agent 'fs-freeze' command
2024-06-23 00:07:19 INFO: issuing guest-agent 'fs-thaw' command
2024-06-23 00:07:19 INFO: started backup task '171369dd-fcd6-47b8-811b-4e2bcbfd95fc'
2024-06-23 00:07:19 INFO: resuming VM again
2024-06-23 00:07:22 INFO:   0% (739.5 MiB of 75.0 GiB) in 3s, read: 246.5 MiB/s, write: 235.7 MiB/s
2024-06-23 00:07:25 INFO:   1% (1.3 GiB of 75.0 GiB) in 6s, read: 211.5 MiB/s, write: 210.2 MiB/s
2024-06-23 00:07:28 INFO:   2% (1.9 GiB of 75.0 GiB) in 9s, read: 204.2 MiB/s, write: 197.3 MiB/s
2024-06-23 00:07:31 INFO:   3% (2.5 GiB of 75.0 GiB) in 12s, read: 192.1 MiB/s, write: 191.6 MiB/s
2024-06-23 00:07:34 INFO:   4% (3.3 GiB of 75.0 GiB) in 15s, read: 268.2 MiB/s, write: 265.1 MiB/s
2024-06-23 00:07:37 INFO:   5% (3.9 GiB of 75.0 GiB) in 18s, read: 222.3 MiB/s, write: 221.2 MiB/s
2024-06-23 00:07:4 <SNIP>
</SNIP>GiB) in 4m 24s, read: 267.0 MiB/s, write: 267.0 MiB/s
2024-06-23 00:11:47 INFO:  77% (57.8 GiB of 75.0 GiB) in 4m 28s, read: 184.9 MiB/s, write: 118.5 MiB/s
2024-06-23 00:11:50 INFO:  78% (59.2 GiB of 75.0 GiB) in 4m 31s, read: 461.6 MiB/s, write: 393.9 MiB/s
2024-06-23 00:11:53 INFO:  99% (74.8 GiB of 75.0 GiB) in 4m 34s, read: 5.2 GiB/s, write: 172.3 MiB/s
2024-06-23 00:11:55 INFO: 100% (75.0 GiB of 75.0 GiB) in 4m 36s, read: 80.1 MiB/s, write: 79.1 MiB/s
2024-06-23 00:11:55 INFO: backup is sparse: 17.63 GiB (23%) total zero data
2024-06-23 00:11:55 INFO: transferred 75.00 GiB in 276 seconds (278.3 MiB/s)
2024-06-23 00:13:06 INFO: archive file size: 33.81GB
2024-06-23 00:13:06 INFO: prune older backups with retention: keep-last=5
2024-06-23 00:13:06 INFO: pruned 0 backup(s)
2024-06-23 00:13:06 INFO: Finished Backup of VM 112 (00:05:49)
Is there something I am missing in this case?
 
the decoding error is likely from zstd and would imply a corrupt file.. I assume the storage in question is some sort of NAS? I'd check its disks and memory health..
 
Ok, I can do that. The snapshots are stored on a brand new NAS with raid5 BTRFS. I would not expect file corruption on 4 different snapshots.

This means that my backups are corrupted despite reporting success.
I guess this means:
- the VM is lost,
- I should check all my VM backups?

Is there a command or a GUI tool to check snapshot integrity?
Is there a command to restore while bypassing integrity checks? With a little bit of luck, the corrupted disk image will boot and can be repaired from the guest OS.

Regards,
 
if the error is systematic, I'd assume some kind of problem with flushing/cache coherency to be honest (if the disks/memory looks healthy otherwise).

you can use `vma verify` to check other backups (you need to pipe the decompressed file into it, and most likely, if other backups are affected, the decompression will already fail).
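On a PVE node, that check could be scripted roughly like this. This is a sketch, not a tested procedure: the dump path is taken from the log above, and whether your version of `vma verify` accepts `-` for stdin should be confirmed locally (the restore path in PVE pipes into `vma` the same way).

```shell
# Sketch: verify each vzdump VMA archive by decompressing it and piping
# the stream into `vma verify` ("-" = read from stdin). If the compressed
# file itself is corrupt, zstd will already fail at this step.
DUMPDIR=/mnt/pve/DS2_VMBackup/dump   # dump directory from the log above

for f in "$DUMPDIR"/*.vma.zst; do
    echo "checking $f"
    zstd -q -d -c "$f" | vma verify - || echo "FAILED: $f"
done
```

This has to run on a Proxmox VE node, since the `vma` tool ships with qemu-server.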

Is there a command to restore while bypassing integrity checks? With a little bit of luck, the corrupted disk image will boot and can be repaired from the guest OS.

I don't think there is, but in this case probably the vma file is truncated, so it would just need to be patched to not abort in that case..

https://git.proxmox.com/?p=qemu-ser...adda8b6d66287b964744216c028c612;hb=HEAD#l7623

if you comment out this line here (in /usr/share/perl5/PVE/QemuServer.pm) a failing restore from a VMA file should leave the partially restored disk image around. "apt install --reinstall qemu-server" will restore the stock version if you mess up ;)
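A cautious way to do that edit and undo it afterwards might look like the following. The file path and the reinstall command come from the post above; the exact line to comment out is only identified by the linked source and is not reproduced here.

```shell
# Keep a copy of the stock file before hand-editing it
cp /usr/share/perl5/PVE/QemuServer.pm /root/QemuServer.pm.bak

# ...comment out the cleanup line referenced in the linked source,
#    then retry the restore and salvage the partially restored image...

# Afterwards, put the packaged version back:
apt install --reinstall qemu-server
```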
 
Hi,

For information: I have extracted the raw file from the zst despite it being too short, renamed it appropriately, copied it to an image storage, and ran a qm rescan.

The raw disk image appeared in my VM and, after reattachment, it booted successfully. As far as I can see my VM now runs, but I lost the snapshots (which do not seem to be backed up).

I still need to check whether all my backups are bad or just the ones for this machine.
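For readers landing here later: the exact commands were not posted, but one plausible reconstruction of the salvage steps, with placeholder image and storage names, is:

```shell
# Plausible reconstruction of the salvage described above; the VM id,
# file names, and storage paths are placeholders, not from the thread.
cd /mnt/pve/DS2_VMBackup/dump

# Decompress; zstd may complain about the truncated frame but can still
# write out the data it managed to decode
zstd -d -f vzdump-qemu-112-2024_06_23-00_07_17.vma.zst || true

# Pull the disk image(s) out of the VMA container (this may abort on the
# short extent; partially written images can remain in the target dir)
vma extract vzdump-qemu-112-2024_06_23-00_07_17.vma /tmp/vma-extract

# Rename/copy the image into a directory storage for VM 112...
mv /tmp/vma-extract/disk-drive-ide0.raw /var/lib/vz/images/112/vm-112-disk-1.raw

# ...and let Proxmox pick it up as an unused disk on the VM
qm rescan --vmid 112
```

After the rescan the image shows up as an unused disk in the VM's hardware tab and can be reattached there, matching what the poster describes.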
 
snapshots are never part of the backup, that is kind of orthogonal to your issue ;)
 
@fabian : Is there a doc/manual that explains how to properly work with backups & snapshots (as well as how to manage snapshots when moving VMs around)?

We use snapshot for CI and automated testing. However, not having restorable (*) backups or not being able to move the VMs seems like an issue to us.

For now we will try to:
- Setup the VM fully installed ready for its CI/Test operations
- Make a protected backup
- Make a snapshot
- Run CI/Test operations, restoring to snapshot before each new cycle

If something bad happens, restore protected backup, recreate snapshot and restart CI/Test operations.


(*) By "not restorable" I mean that restoring the backup will not put the VM back into its state at backup time, including the snapshots.
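The cycle above can be sketched with the stock CLI tools; the VM id, storage name, and snapshot name below are placeholders:

```shell
# One-time setup: protected baseline backup, then a snapshot to cycle on
vzdump 112 --storage DS2_VMBackup --mode snapshot --protected 1
qm snapshot 112 clean-state --description "pristine CI/test image"

# Per CI/test cycle: roll back to the snapshot and start fresh
qm rollback 112 clean-state
qm start 112
# ...run tests, shut down, repeat...

# Disaster recovery: restore the protected backup, then recreate
# the snapshot with the qm snapshot command above
```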
 
that approach sounds sensible. snapshots are basically meant for that purpose - you take one before you do some potentially destructive operation inside the guest, so that you can rollback if need be.

backups are for long-term archival.

those are two very different/orthogonal purposes. including snapshots in the backup is not really possible (snapshots are done on the storage level, backups by qemu in the case of VMs) and would blow up the backup sizes anyway.
 
