No previous backup found, cannot do incremental backup

Hi Chris,

the Ceph cluster is healthy.
For the other 150 VMs, the backup works fine.

Here is the requested output:

Code:
root@pve:~# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.8-1-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-13
proxmox-kernel-6.8: 6.8.8-1
proxmox-kernel-6.8.8-1-pve-signed: 6.8.8-1
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
pve-kernel-5.15.152-1-pve: 5.15.152-1
pve-kernel-5.15.149-1-pve: 5.15.149-1
pve-kernel-5.15.143-1-pve: 5.15.143-1
pve-kernel-5.15.131-2-pve: 5.15.131-3
pve-kernel-5.15.131-1-pve: 5.15.131-2
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.107-1-pve: 5.15.107-1
pve-kernel-5.15.104-1-pve: 5.15.104-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
amd64-microcode: 3.20230808.1.1~deb12u1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.3
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.4-1
proxmox-backup-file-restore: 3.2.4-1
proxmox-firewall: 0.4.2
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.12-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1

Code:
root@pbs:~# proxmox-backup-manager versions --verbose
proxmox-backup                    3.2.0        running kernel: 6.8.4-3-pve
proxmox-backup-server             3.2.6-1      running version: 3.2.4    
proxmox-kernel-helper             8.1.0                                  
proxmox-kernel-6.8                6.8.8-2                                
proxmox-kernel-6.8.8-1-pve-signed 6.8.8-1                                
proxmox-kernel-6.8.4-3-pve-signed 6.8.4-3                                
proxmox-kernel-6.8.4-2-pve-signed 6.8.4-2                                
ifupdown2                         3.2.0-1+pmx8                          
libjs-extjs                       7.0.0-4                                
proxmox-backup-docs               3.2.6-1                                
proxmox-backup-client             3.2.6-1                                
proxmox-mail-forward              0.2.3                                  
proxmox-mini-journalreader        1.4.0                                  
proxmox-offline-mirror-helper     0.6.6                                  
proxmox-widget-toolkit            4.2.3                                  
pve-xtermjs                       5.3.0-3                                
smartmontools                     7.3-pve1                              
zfsutils-linux                    2.2.4-pve1

Thank you & regards
Simon
Thanks for the output, could you also check the header as requested in my previous post? (Maybe you overlooked this, as I only edited it in afterwards - sorry if you missed it.)
 
Hi Chris,

you are right, I had overlooked it...

Code:
root@pbs:~# ll /pbs-backup/vm/168/2024-06-24T21:02:37Z/drive-efidisk0.img.fidx
-rw-r--r-- 1 backup backup 4128 Jun 24 23:39 /pbs-backup/vm/168/2024-06-24T21:02:37Z/drive-efidisk0.img.fidx

Code:
root@pbs:~# head -c8 /pbs-backup/vm/168/2024-06-24T21:02:37Z/drive-efidisk0.img.fidx | hexdump
0000000 0000 0000 0000 0000                   
0000008

So I think something is wrong there.

Regards
Simon
 
Yes, it seems that this one snapshot got corrupted, but this seems to have happened after the backup. PBS does calculate checksums and digests to verify data integrity during backup.

Therefore, my suggestion is to perform a verification of all the backups currently stored on that datastore, and if other datastores are located on the same storage backend, you should verify these as well. That way, you can rule out that other snapshots are corrupt too.
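On the PBS side, a one-off verification of the whole datastore can be started like this (a sketch; the datastore name `pbs-backup` is assumed from the paths shown earlier, adjust to yours):

```shell
# Run on the PBS host: start a verification task covering every
# backup group and snapshot in the datastore "pbs-backup".
proxmox-backup-manager verify pbs-backup
```

The task progress can also be followed in the GUI under the datastore's task log.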

Could you also share some more details about what your setup looks like, so we can try to reproduce the issue?
E.g. is the Ceph cluster managed by Proxmox VE (and if so, which version), and how did you mount the CephFS on the Proxmox Backup Server side? Is the Proxmox Backup Server a bare-metal installation, or is it running inside a VM/CT?

Thanks!
 
Hi Chris,

thank you for your fast reply.

I have around 150 VMs on my PVE and only 11 are affected.
Some of them are backed up to other PBS datastores.
I've tried to remove the complete PBS datastore and recreate it - same problem with the same VMs.
None of the other VMs have any problem. The PBS verify job runs like a charm.

I have an external Ceph cluster (not managed by PVE). PBS runs as a VM inside PVE and the storage is mounted via CephFS:
:/pbs-backup/ /pbs-backup ceph name=pbs-backup,noatime,_netdev 0 0

The client key is stored in /etc/ceph/ceph.client.pbs-backup.keyring.

I know running PBS as a VM is not the best-practice way - but it basically works fine.

Regards
Simon
 
I have around 150 VMs on my PVE and only 11 are affected.
Are all of these located on the CephFS storage backend of your PBS? Do all of them show the same zero sequence when dumping the index magic number?
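Checking the magic bytes across every index file can be scripted; a minimal sketch (`scan_zeroed_fidx` is a hypothetical helper name, pass your datastore mount point):

```shell
# Print every .fidx file whose first 8 bytes (the magic number) are all
# zero, i.e. the corruption pattern shown in the hexdump above.
scan_zeroed_fidx() {
    find "$1" -name '*.fidx' | while read -r idx; do
        magic=$(head -c8 "$idx" | od -An -tx1 | tr -d ' \n')
        if [ "$magic" = "0000000000000000" ]; then
            echo "ZEROED HEADER: $idx"
        fi
    done
}

# Usage: scan_zeroed_fidx /pbs-backup
```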

I've tried to remove the complete PBS datastore and recreate it - same problem with the same VMs.
Well, removing the storage on the Proxmox VE side and re-adding it will not affect the snapshots, so this is not surprising. What will help is removing the affected snapshots, so that new backup runs will not try to load the index file found in the previous snapshot. You should, however, consider whether that is what you want.
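Removing such a snapshot can be done via the GUI, or from the command line with something like the following (a sketch; the repository name and snapshot path are examples based on the listing earlier in the thread):

```shell
# On the PBS host: forget (delete) the snapshot whose index is corrupt.
# Snapshot path matches the earlier listing; adjust VM ID and timestamp.
proxmox-backup-client snapshot forget "vm/168/2024-06-24T21:02:37Z" \
    --repository localhost:pbs-backup
```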

The PBS verify job runs like a charm.
Are you sure you are verifying all the snapshots? Do the affected snapshots verify without issues? Maybe they were already verified in the past?

I have an external Ceph cluster (not managed by PVE). PBS runs as a VM inside PVE and the storage is mounted via CephFS:
:/pbs-backup/ /pbs-backup ceph name=pbs-backup,noatime,_netdev 0 0

The client key is stored in /etc/ceph/ceph.client.pbs-backup.keyring.

I know running PBS as a VM is not the best-practice way - but it basically works fine.
Thanks for this information. Please make sure that the data is consistent there, and maybe perform a deep scrub. Also, do not use the noatime flag: garbage collection requires atime updates, see https://pbs.proxmox.com/docs/backup-client.html#garbage-collection
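Concretely, dropping noatime from the fstab entry quoted above would leave it as (a sketch of just that adjusted line):

```
:/pbs-backup/ /pbs-backup ceph name=pbs-backup,_netdev 0 0
```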
 
I have around 150 VMs on my PVE and only 11 are affected.
Is there anything special about these VMs that the other VMs don't have?
 
Are all of these located on the CephFS storage backend of your PBS? Do all of them show the same zero sequence when dumping the index magic number?
Yes, all of them are located on the CephFS, and no, only those 11 VMs have the problem with the index magic number.

Well removing the storage on Proxmox VE side and re-adding it will not affect the snapshots, so this is not surprising. What will help is removing the affected snapshots, so the new backup runs will not try to load the index file found in the previous snapshot. You should however consider if that is what you want.
Sorry, I expressed myself incorrectly. I removed the complete backup store from PVE, from PBS and from Ceph. Then I created a new CephFS share, mounted the new share on the PBS, initialized a new PBS datastore and integrated the new backup store in PVE.
I've also tried to remove the snapshots with the index error. Then some backups work again - sometimes.

Are you sure you are verifying all the snapshots? Do the affected snapshots verify without issues? Maybe they were already verified in the past?
I'm pretty sure I'm verifying all the snapshots. Only the snapshots of those few VMs have the problem.

Thanks for this information. Please make sure that the data is consistent there, and maybe perform a deep scrub. Also, do not use the noatime flag: garbage collection requires atime updates, see https://pbs.proxmox.com/docs/backup-client.html#garbage-collection
Thank you, I've changed the mount flag. Ceph tells me that all data is consistent, and a deep scrub runs periodically.

Kind regards
Simon
 
Sorry, I expressed myself incorrectly. I removed the complete backup store from PVE, from PBS and from Ceph. Then I created a new CephFS share, mounted the new share on the PBS, initialized a new PBS datastore and integrated the new backup store in PVE.
I've also tried to remove the snapshots with the index error. Then some backups work again - sometimes.
So you are getting a different error then? If your datastore is empty, then there should be no issues with a previous snapshot containing a corrupt index.
 
So you are getting a different error then? If your datastore is empty, then there should be no issues with a previous snapshot containing a corrupt index.
The first backup is created, and PVE and PBS say everything is fine.
The next backup fails with the error.

Kind Regards
Simon
 
The first backup is created, and PVE and PBS say everything is fine.
The next backup fails with the error.
Is it always the same disk for which it fails? I.e. is it always the EFI disk of VM 168 that gets the got unknown magic number error? What about the other affected VMs? Which disks are problematic for them? Anything in common between those virtual disks?
 
In addition to the questions asked by @fiona: could you also try to rule out that this is related to the CephFS storage by creating and backing up to a local PBS datastore, since you seem to be able to reproduce the issue easily.
Unfortunately I did not manage to reproduce the issue so far on my side, even with a datastore located on CephFS.
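Setting up a throwaway local datastore for this test could look like the following (a sketch; the name `local-test` and the path are example values):

```shell
# On the PBS host: back a new datastore with a local directory.
mkdir -p /mnt/local-test
proxmox-backup-manager datastore create local-test /mnt/local-test
```

The new datastore can then be added as a storage target on the PVE side for a test backup of an affected VM.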
 
Hi Chris,
hi Fiona,

thank you for your reply.
I've checked the logs just now: the affected disks are mostly EFI disks, but for some VMs also normal data disks.
I'll check whether the issue appears on a local datastore too.

Regards
Simon
 
Hi Chris,

I've tried the following:
- Shutdown VM, remove EFI disk, create new EFI disk, create backup: same problem
- Add a new local datastore to PBS, create a backup of an affected VM: backup works.

So I suspect a correlation between the disks and CephFS.
Any further ideas?
As I wrote before: Ceph is healthy, there are no errors, and backups of other VMs can be created and are consistent.

Regards
Simon
 
Hi,
thanks for the continuous feedback!

After looking at the code more closely without finding any issues, and since this cannot be reproduced on a local datastore, this sounds more like an issue/possible bug in CephFS.

Edit: As @fabian pointed out to me there is this upstream fix https://git.kernel.org/pub/scm/linu...h?id=b372e96bd0a32729d55d27f613c8bc80708a82e1, which could be related (although it might be a long shot).
Could you nevertheless try to boot the PBS instance with a more recent kernel, and also use a recent kernel version in your Ceph cluster, e.g. the latest 6.8.8 kernel available in the no-subscription repository?

Also, could you please dump the full first 4k of a corrupt fixed index file and send it to me? Best in a private message, as it could potentially contain sensitive information. You can dump it via head -c4096 <path-to-fixed-image>.img.fidx > fixed-index-header.dump.

Could you provide more information about the Ceph cluster you are using, as well as the client side on PBS?
 
Hi Chris,

after upgrading to kernel 6.8.8-2, the problem has disappeared for some more VMs.

Also, could you please dump the full first 4k of a corrupt fixed index file and send it to me? Best in a private message, as it could potentially contain sensitive information. You can dump it via head -c4096 <path-to-fixed-image>.img.fidx > fixed-index-header.dump.
I think by fixed index file you mean the index file on the local PBS store?

Could you provide more information about the Ceph cluster you are using, as well as the client side on PBS?
Could you tell me which information you need? Then I'll provide it.

Thank you & best regards
Simon
 
after upgrading to kernel 6.8.8-2, the problem has disappeared for some more VMs.
Hmm, it should have been fixed for all of them if the above-mentioned issue was the cause? (I assume you got rid of the already corrupted snapshots.)

I think by fixed index file you mean the index file on the local PBS store?
No, I mean the corrupt fixed index file for the disk in the snapshot, just like in https://forum.proxmox.com/threads/n...annot-do-incremental-backup.78340/post-678151 but this time the full 4k header dump, not just the first few bytes.

Could you tell me which information you need? Then I'll provide it.
Well, the Ceph version and running kernel, the OS used, and the CephFS settings might be useful to reproduce and/or narrow down the issue. Also the client version used on the PBS side and the mount parameters (which you already provided).
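For reference, these commands collect most of that information (a sketch; run the first on a Ceph node, the rest inside the PBS VM):

```shell
# On a Ceph cluster node: overall and per-daemon versions.
ceph versions

# Inside the PBS VM: running kernel and OS release.
uname -r
cat /etc/os-release

# Installed Ceph client packages on the PBS side.
dpkg -l | grep -i ceph

# CephFS mounts with the mount options as actually applied.
mount -t ceph
```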
 
Hi Chris,

the PBS (Ceph client) had Ceph 16.x installed. On the Ceph server, Ceph 18.x is installed.
I've upgraded the PBS to 18.x.
For some more VMs the backup now works fine - but not for all VMs.
I'll keep an eye on it over the next few days.

Best regards
Simon
 
Thanks for the provided input; so far I have not been able to reproduce the issue on my side.

Please keep us posted if the issue persists even after the upgrades.
 
