Backup succeeds but ends up failing verification

pallingerpeter

New Member
Jan 17, 2024
For all of my VMs, backups always succeed.
For some of them, however, the result is bad:
  • they fail verification
  • the first stored disk cannot be downloaded; the HTTP response is:
    • unable to read fixed index <filename> - Unable to open fixed index <filename> - got unknown magic number
  • of course, the file restore function also gives an error:
    • Starting VM failed. See output above for more information. (500)
    • I failed to find any further information in the logs (or in the response).
The error only occurs for some of the VMs, and even then not completely consistently. The only thing they have in common is that all such VMs have at least two disks.
Other observations:
  • Repeated backups after a bad result usually (but not always) also create a bad result.
  • Repeated backups after a good result (even if there was a bad result before that) usually (but not always) also create a good result.
  • Migrating a VM to another node and then doing a new backup sometimes creates a good result.
  • Setting up replication (and waiting for it to finish) for a VM and then doing a new backup usually creates a good result.
I cannot give a complete report yet, as re-doing a backup takes about 20 minutes plus 60 minutes of verification (the VMs are about 200 GiB; there were some smaller ones, but they decided to fix themselves after a manual test backup and have been producing good backups ever since :/ ).

Has anyone seen something similar?
If yes, what may be the problem?
If no, does anyone know how to debug a backup or verification process?
 
For some more information, here is the output of a verification job:
Code:
2024-06-19T12:46:54+02:00: Automatically verifying newly added snapshot
2024-06-19T12:46:54+02:00: verify backup-cephfs:vm/440/2024-06-19T10:32:19Z
2024-06-19T12:46:54+02:00:   check qemu-server.conf.blob
2024-06-19T12:46:54+02:00:   check drive-virtio1.img.fidx
2024-06-19T13:18:24+02:00:   verified 62320.32/194312.00 MiB in 1890.00 seconds, speed 32.97/102.81 MiB/s (0 errors)
2024-06-19T13:18:24+02:00:   check drive-virtio0.img.fidx
2024-06-19T13:18:24+02:00: verify backup-cephfs:vm/440/2024-06-19T10:32:19Z/drive-virtio0.img.fidx failed: Unable to open fixed index "/mnt/backup-cephfs/ns/backup/vm/440/2024-06-19T10:32:19Z/drive-virtio0.img.fidx" - got unknown magic number
2024-06-19T13:18:24+02:00: TASK ERROR: verification failed - please check the log for details

I can send drive-virtio0.img.fidx if anyone is interested, it really starts with 32 zero bytes, which is most probably not a magic number.
Code:
head -c100 /mnt/backup-cephfs/ns/backup/vm/440/2024-06-19T10\:32\:19Z/drive-virtio0.img.fidx | hexdump
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000020 1856 2b7b b8cc 033f 47b9 c406 5c4b 0633
0000030 6e10 0427 fa5d 07cc 720e 7836 9c66 661d
0000040 0000 0000 0000 0000 0000 0000 0000 0000
*
0000060 0000 0000                              
0000064
 
please post the output of od --format=u1 --read-bytes=8 /mnt/backup-cephfs/ns/backup/vm/440/2024-06-19T10\:32\:19Z/drive-virtio0.img.fidx

The output should match exactly 0000000 47 127 65 237 145 253 15 205

If not (and it seems it does not), your fixed index is corrupted or is something other than a fixed index. The magic number is an 8-byte identifier of the file type.
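The magic check above can be scripted. A minimal sketch, assuming the expected fixed-index magic is the 8 bytes quoted above (47 127 65 237 145 253 15 205) and GNU coreutils `od`; the function name is hypothetical:

```shell
# Hedged helper: compare the first 8 bytes of a .fidx file against the
# expected fixed-index magic quoted above (47 127 65 237 145 253 15 205).
# check_fidx_magic is a made-up name; pass the path to your .fidx file.
check_fidx_magic() {
  expected="47 127 65 237 145 253 15 205"
  # od -An suppresses the address column; normalize whitespace for comparison
  actual=$(od -An --format=u1 --read-bytes=8 "$1" | tr -s ' ' | sed 's/^ *//; s/ *$//')
  if [ "$actual" = "$expected" ]; then
    echo "magic OK"
  else
    echo "magic MISMATCH: got '$actual'"
  fi
}
```

Run it as e.g. `check_fidx_magic drive-virtio0.img.fidx` inside the snapshot directory; a file that starts with 32 zero bytes, as posted above, would report a mismatch.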

Does the time stamp of your fixed index match the date/time when the backup was created?
 
First of all, thank you for replying!
please post the output of od --format=u1 --read-bytes=8 /mnt/backup-cephfs/ns/backup/vm/440/2024-06-19T10\:32\:19Z/drive-virtio0.img.fidx
Code:
0000000   0   0   0   0   0   0   0   0
0000010
If not (and it seems it does not), your fixed index is corrupted or is something other than a fixed index. The magic number is an 8-byte identifier of the file type.
Yeah, I get that. The question remains: why is it generated so, and how can I avoid that? How can I debug the backup process?
Does the time stamp of your fixed index match the date/time when the backup was created?
Code:
ls -la /mnt/backup-cephfs/ns/backup/vm/440/2024-06-19T10\:32\:19Z/
total 1592
drwxr-xr-x  2 backup backup       5 Jun 19 13:18 .
drwxr-xr-x 26 backup backup      25 Jun 19 12:32 ..
-rw-r--r--  1 backup backup    2369 Jun 19 12:46 client.log.blob
-rw-r--r--  1 backup backup   20992 Jun 19 12:46 drive-virtio0.img.fidx
-rw-r--r--  1 backup backup 1604096 Jun 19 12:46 drive-virtio1.img.fidx
-rw-r--r--  1 backup backup     583 Jun 19 13:18 index.json.blob
-rw-r--r--  1 backup backup     536 Jun 19 12:32 qemu-server.conf.blob
Yes, the fidx date matches the final time found in the client log.
 
Update: I ran a lot of new backups, and observed the happenings in the backup metadata directory with the following script:

Code:
while sleep 10; do for i in 0 1; do od --format=u1 drive-virtio$i*fidx | head -n1; od --format=u1 drive-virtio$i*fidx | wc -l; done; done
This prints the first 16 bytes (which contain the magic number) and the approximate number of non-zero 16-byte chunks for both fidx files every 10 seconds.
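The same estimate can be taken as a one-shot snapshot for a single index file. A sketch assuming GNU coreutils (`od`, `stat`); `fidx_fill` is a made-up name, and the count is approximate for the same reason as above, since `od` collapses runs of identical lines into a `*` marker:

```shell
# Hedged one-shot variant of the watch loop above: report roughly how many
# of a file's 16-byte chunks are non-zero (fidx_fill is a hypothetical name).
fidx_fill() {
  total=$(( ( $(stat -c%s "$1") + 15 ) / 16 ))   # total 16-byte chunks in the file
  # Count od output lines, excluding '*' collapse markers and the final
  # address-only line; collapsed runs (e.g. long zero regions) count once.
  nonzero=$(od --format=u1 --width=16 "$1" | awk '$1 != "*" { n++ } END { print n - 1 }')
  echo "$nonzero / $total chunks non-zero"
}
```

For example, `fidx_fill drive-virtio1.img.tmp_fidx` run periodically should show the non-zero fraction growing as the disk's backup progresses.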

In the beginning, both drive-virtio0.img.tmp_fidx and drive-virtio1.img.tmp_fidx were created with a good magic number and not much else (qemu-server.conf.blob was also created, but that is not relevant).
Then drive-virtio1.img.tmp_fidx was filled up with non-zeroes as virtio1 was being backed up according to the server job log, with roughly the same percentage non-zero as the disk's completion percentage in the log. The output looks like this:
Code:
0000000  47 127  65 237 145 253  15 205  55 172 206  80 104 185  68 104
8
0000000  47 127  65 237 145 253  15 205 151 195   6 194 119 143  77 230
54286

Then, at the end, drive-virtio0.img.tmp_fidx was filled with data. Unfortunately, both of my bad-backup VMs decided to back up nicely while I was observing them (Schrödinger's backup :p ). I will continue trying to reproduce the error while observing...
 
Hi,
please provide the output of pveversion -v for the Proxmox VE nodes and of proxmox-backup-manager versions --verbose for the Proxmox Backup Server host.

These intermittent errors might indicate an underlying hardware issue. So please also check the health of the disks backing the storage on the PBS side, and test the system memory with a prolonged memory test. What storage disks are you using?
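For the disk-health part, a starting point could be scanning `smartctl` attribute output for reallocated or pending sectors. A minimal sketch, assuming smartmontools is installed and the standard `smartctl -A` column layout (attribute name in column 2, raw value in column 10); `check_smart` is a made-up name and `/dev/sda` is only an example device:

```shell
# Hedged sketch: flag suspicious SMART attributes on the PBS host.
# Usage (device name is an example): smartctl -A /dev/sda | check_smart
check_smart() {
  awk 'tolower($2) ~ /reallocated_sector|current_pending|offline_uncorrect/ && $10 + 0 > 0 {
         bad = 1; print "WARNING:", $2, "raw value", $10
       }
       END { if (!bad) print "no reallocated/pending/uncorrectable sectors reported" }'
}
# For RAM, a prolonged test from boot media (e.g. memtest86+) is more
# thorough than anything run inside the live system.
```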
 
