No previous backup found, cannot do incremental backup

Hi Chris,

the Ceph cluster is healthy.
For the other 150 VMs, the backup works fine.

Here is the requested output:

Code:
root@pve:~# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.8-1-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-13
proxmox-kernel-6.8: 6.8.8-1
proxmox-kernel-6.8.8-1-pve-signed: 6.8.8-1
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
pve-kernel-5.15.152-1-pve: 5.15.152-1
pve-kernel-5.15.149-1-pve: 5.15.149-1
pve-kernel-5.15.143-1-pve: 5.15.143-1
pve-kernel-5.15.131-2-pve: 5.15.131-3
pve-kernel-5.15.131-1-pve: 5.15.131-2
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.107-1-pve: 5.15.107-1
pve-kernel-5.15.104-1-pve: 5.15.104-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
amd64-microcode: 3.20230808.1.1~deb12u1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.3
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.4-1
proxmox-backup-file-restore: 3.2.4-1
proxmox-firewall: 0.4.2
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.12-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1

Code:
root@pbs:~# proxmox-backup-manager versions --verbose
proxmox-backup                    3.2.0        running kernel: 6.8.4-3-pve
proxmox-backup-server             3.2.6-1      running version: 3.2.4    
proxmox-kernel-helper             8.1.0                                  
proxmox-kernel-6.8                6.8.8-2                                
proxmox-kernel-6.8.8-1-pve-signed 6.8.8-1                                
proxmox-kernel-6.8.4-3-pve-signed 6.8.4-3                                
proxmox-kernel-6.8.4-2-pve-signed 6.8.4-2                                
ifupdown2                         3.2.0-1+pmx8                          
libjs-extjs                       7.0.0-4                                
proxmox-backup-docs               3.2.6-1                                
proxmox-backup-client             3.2.6-1                                
proxmox-mail-forward              0.2.3                                  
proxmox-mini-journalreader        1.4.0                                  
proxmox-offline-mirror-helper     0.6.6                                  
proxmox-widget-toolkit            4.2.3                                  
pve-xtermjs                       5.3.0-3                                
smartmontools                     7.3-pve1                              
zfsutils-linux                    2.2.4-pve1

Thank you & regards
Simon
Thanks for the output, could you also check the header as requested in my previous post? (Maybe you overlooked this, as I only edited it in afterwards - sorry if you missed it.)
 
Hi Chris,

you are right, I had overlooked it...

Code:
root@pbs:~# ll /pbs-backup/vm/168/2024-06-24T21:02:37Z/drive-efidisk0.img.fidx
-rw-r--r-- 1 backup backup 4128 Jun 24 23:39 /pbs-backup/vm/168/2024-06-24T21:02:37Z/drive-efidisk0.img.fidx

Code:
root@pbs:~# head -c8 /pbs-backup/vm/168/2024-06-24T21:02:37Z/drive-efidisk0.img.fidx | hexdump
0000000 0000 0000 0000 0000                   
0000008

So I think something is wrong there.

Regards
Simon
 
Yes, it seems that this one snapshot got corrupted, but this seems to have happened after the backup. PBS does calculate checksums and digests to verify data integrity during backup.

Therefore, my suggestion is to perform a verification of all the backups currently stored on that datastore, and if other datastores are located on the same storage backend, you should verify these as well. That way, you can rule out that other snapshots are corrupt too.
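On the PBS side, a one-off verification of the whole datastore can be started like this (a sketch; the datastore name `pbs-backup` is assumed from the paths shown earlier, adjust to yours):

```shell
# Run on the PBS host: start a verification task covering every
# backup group and snapshot in the datastore "pbs-backup".
proxmox-backup-manager verify pbs-backup
```

The task progress can also be followed in the GUI under the datastore's task log.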

Could you also share some more details about what your setup looks like, so we can try to reproduce the issue?
E.g. is the Ceph cluster managed by Proxmox VE (and if so, which version), and how did you mount the CephFS on the Proxmox Backup Server side? Is the Proxmox Backup Server a bare-metal installation, or is it running inside a VM/CT?

Thanks!
 
Hi Chris,

thank you for your fast reply.

I have around 150 VMs on my PVE and only 11 are affected.
Some of them are backed up to other PBS datastores.
I've tried to remove the complete PBS datastore and recreate it - same problem with the same VMs.
None of the other VMs have any problem. The PBS verify job runs like a charm.

I have an external Ceph cluster (not managed by PVE). PBS runs as a VM inside PVE and the storage is mounted via CephFS:
:/pbs-backup/ /pbs-backup ceph name=pbs-backup,noatime,_netdev 0 0

The client key is stored in /etc/ceph/ceph.client.pbs-backup.keyring.

I know running PBS as a VM is not the best-practice way - but it basically works fine.

Regards
Simon
 
I have around 150 VMs on my PVE and only 11 are affected.
Are all of these located on the CephFS storage backend of your PBS? Do all of them show the same zero sequence when dumping the index magic number?
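Checking the magic bytes across every index file can be scripted; a minimal sketch (`scan_zeroed_fidx` is a hypothetical helper name, pass your datastore mount point):

```shell
# Print every .fidx file whose first 8 bytes (the magic number) are all
# zero, i.e. the corruption pattern shown in the hexdump above.
scan_zeroed_fidx() {
    find "$1" -name '*.fidx' | while read -r idx; do
        magic=$(head -c8 "$idx" | od -An -tx1 | tr -d ' \n')
        if [ "$magic" = "0000000000000000" ]; then
            echo "ZEROED HEADER: $idx"
        fi
    done
}

# Usage: scan_zeroed_fidx /pbs-backup
```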

I've tried to remove the complete PBS datastore and recreate it - same problem with the same VMs.
Well, removing the storage on the Proxmox VE side and re-adding it will not affect the snapshots, so this is not surprising. What will help is removing the affected snapshots, so that new backup runs will not try to load the index file found in the previous snapshot. You should, however, consider whether that is what you want.
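Removing such a snapshot can be done via the GUI, or from the command line with something like the following (a sketch; the repository name and snapshot path are examples based on the listing earlier in the thread):

```shell
# On the PBS host: forget (delete) the snapshot whose index is corrupt.
# Snapshot path matches the earlier listing; adjust VM ID and timestamp.
proxmox-backup-client snapshot forget "vm/168/2024-06-24T21:02:37Z" \
    --repository localhost:pbs-backup
```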

The PBS verify job runs like a charm.
Are you sure you are verifying all the snapshots? Do the affected snapshots verify without issues? Maybe they were already verified in the past?

I have an external Ceph cluster (not managed by PVE). PBS runs as a VM inside PVE and the storage is mounted via CephFS:
:/pbs-backup/ /pbs-backup ceph name=pbs-backup,noatime,_netdev 0 0

The client key is stored in /etc/ceph/ceph.client.pbs-backup.keyring.

I know running PBS as a VM is not the best-practice way - but it basically works fine.
Thanks for this information. Please make sure that the data is consistent there, and maybe perform a deep scrub. Also, do not use the noatime flag: garbage collection requires atime updates, see https://pbs.proxmox.com/docs/backup-client.html#garbage-collection
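Concretely, dropping noatime from the fstab entry quoted above would leave it as (a sketch of just that adjusted line):

```
:/pbs-backup/ /pbs-backup ceph name=pbs-backup,_netdev 0 0
```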
 
I have around 150 VMs on my PVE and only 11 are affected.
Is there anything special about these VMs that the other VMs don't have?
 
Are all of these located on the CephFS storage backend of your PBS? Do all of them show the same zero sequence when dumping the index magic number?
Yes, all of them are located on the CephFS, and no, only those 11 VMs have the problem with the index magic number.

Well removing the storage on Proxmox VE side and re-adding it will not affect the snapshots, so this is not surprising. What will help is removing the affected snapshots, so the new backup runs will not try to load the index file found in the previous snapshot. You should however consider if that is what you want.
Sorry, I expressed myself incorrectly. I removed the complete backup store from PVE, from PBS and from Ceph. Then I created a new CephFS share, mounted the new share on the PBS, initialized a new PBS datastore and integrated the new backup store in PVE.
I've also tried to remove the snapshots with the index error. Then some backups work again - sometimes.

Are you sure you are verifying all the snapshots? Do the affected snapshots verify without issues? Maybe they were already verified in the past?
I'm pretty sure I'm verifying all the snapshots. Only the snapshots of those few VMs have the problem.

Thanks for this information. Please make sure that the data is consistent there, and maybe perform a deep scrub. Also, do not use the noatime flag: garbage collection requires atime updates, see https://pbs.proxmox.com/docs/backup-client.html#garbage-collection
Thank you, I've changed the mount flag. Ceph tells me that all data is consistent, and a deep scrub runs periodically.

Kind regards
Simon
 
Sorry, I expressed myself incorrectly. I removed the complete backup store from PVE, from PBS and from Ceph. Then I created a new CephFS share, mounted the new share on the PBS, initialized a new PBS datastore and integrated the new backup store in PVE.
I've also tried to remove the snapshots with the index error. Then some backups work again - sometimes.
So you are getting a different error then? If your datastore is empty, then there should be no issues with a previous snapshot containing a corrupt index.
 
So you are getting a different error then? If your datastore is empty, then there should be no issues with a previous snapshot containing a corrupt index.
The first backup is created, and PVE and PBS say everything is fine.
The next backup fails with the error.

Kind Regards
Simon
 
The first backup is created, and PVE and PBS say everything is fine.
The next backup fails with the error.
Is it always the same disk for which it fails? I.e. is it always the EFI disk of VM 168 that gets the got unknown magic number error? What about the other affected VMs? Which disks are problematic for them? Anything in common between those virtual disks?
 
In addition to the questions asked by @fiona: could you also try to rule out that this is related to the CephFS storage by creating and backing up to a local PBS datastore, since you seem to be able to reproduce the issue easily.
Unfortunately I did not manage to reproduce the issue so far on my side, even with a datastore located on CephFS.
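Setting up a throwaway local datastore for this test could look like the following (a sketch; the name `local-test` and the path are example values):

```shell
# On the PBS host: back a new datastore with a local directory.
mkdir -p /mnt/local-test
proxmox-backup-manager datastore create local-test /mnt/local-test
```

The new datastore can then be added as a storage target on the PVE side for a test backup of an affected VM.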
 
Hi Chris,
hi Fiona,

thank you for your reply.
I've checked the logs just now: the affected disks are mostly EFI disks, but for some VMs also normal data disks.
I'll check whether the issue appears on a local datastore too.

Regards
Simon
 
Hi Chris,

I've tried the following:
- Shutdown VM, remove EFI disk, create new EFI disk, create backup: same problem
- Add a new local datastore to PBS, create a backup of an affected VM: backup works.

So I suspect a correlation between the disks and CephFS.
Any further ideas?
As I wrote before: Ceph is healthy, there are no errors, and backups of other VMs can be created and are consistent.

Regards
Simon
 
Hi,
thanks for the continuous feedback!

After looking at the code more closely without finding any issues, and since this cannot be reproduced on a local datastore, this sounds more like an issue/possible bug in CephFS.

Edit: As @fabian pointed out to me there is this upstream fix https://git.kernel.org/pub/scm/linu...h?id=b372e96bd0a32729d55d27f613c8bc80708a82e1, which could be related (although it might be a long shot).
Could you nevertheless try to boot the PBS instance with a more recent kernel, and also use a recent kernel version in your Ceph cluster, e.g. the latest 6.8.8 kernel available in the no-subscription repository?

Also, could you please dump the full first 4k of a corrupt fixed index file and send it to me? Best in a private message, as it could potentially contain sensitive information. You can dump it via head -c4096 <path-to-fixed-image>.img.fidx > fixed-index-header.dump.

Could you provide more information about the Ceph cluster you are using, as well as the client side on PBS?
 
Hi Chris,

after upgrading to kernel 6.8.8-2, the problem has disappeared for some more VMs.

Also, could you please dump the full first 4k of a corrupt fixed index file and send it to me? Best in a private message, as it could potentially contain sensitive information. You can dump it via head -c4096 <path-to-fixed-image>.img.fidx > fixed-index-header.dump.
I think by fixed index file you mean the index file on the local PBS store?

Could you provide more information about the Ceph cluster you are using, as well as the client side on PBS?
Could you tell me which information you need? Then I'll provide it.

Thank you & best regards
Simon
 
after upgrading to kernel 6.8.8-2, the problem has disappeared for some more VMs.
Hmm, it should have been fixed for all of them if the above-mentioned issue was the cause? (I assume you got rid of the already corrupted snapshots.)

I think by fixed index file you mean the index file on the local PBS store?
No, I mean the corrupt fixed index file for the disk in the snapshot, just like in https://forum.proxmox.com/threads/n...annot-do-incremental-backup.78340/post-678151 but this time the full 4k header dump, not just the first few bytes.

Could you tell me which information you need? Then I'll provide it.
Well, the Ceph version and running kernel, the OS used, and the CephFS settings might be useful to reproduce and/or narrow down the issue. Also the client version used on the PBS side and the mount parameters (which you already provided).
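For reference, these commands collect most of that information (a sketch; run the first on a Ceph node, the rest inside the PBS VM):

```shell
# On a Ceph cluster node: overall and per-daemon versions.
ceph versions

# Inside the PBS VM: running kernel and OS release.
uname -r
cat /etc/os-release

# Installed Ceph client packages on the PBS side.
dpkg -l | grep -i ceph

# CephFS mounts with the mount options as actually applied.
mount -t ceph
```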
 
Hi Chris,

the PBS (Ceph client) had Ceph 16.x installed. On the Ceph server, Ceph 18.x is installed.
I've upgraded the PBS to 18.x.
For some more VMs the backup now works fine - but not for all VMs.
I'll keep an eye on it over the next few days.

Best regards
Simon
 
Thanks for the provided input; so far I have not been able to reproduce the issue on my side.

Please keep us posted if the issue persists even after the upgrades.
 
