Backup succeeds but ends up failing verification

Thanks, the garbage collection should however be unrelated.

Let me give some more insight into the current hypothesis: the suspicion is a CephFS bug related to caching, which unfortunately also makes this harder to reproduce and debug. It would also explain why the backup itself succeeds: the checksums and sizes written to disk, as seen by the Proxmox Backup client, match the expected values, while the underlying filesystem apparently never persists the initial write of the header and only writes the final bytes when the index checksum is updated. That is why the full 4 KiB page ends up zeroed, except for the checksum written when finishing and closing the fixed index file. So far this also seems to happen only with VMs that have more than one disk attached, which indicates that only a certain sequence of operations and timing triggers it.
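As an illustration of that failure mode, a short script can flag .fidx files whose first 4 KiB page is all zeros except for a small run of bytes (the checksum written on close). This is only a rough sketch: the assumption that a healthy header starts with a non-zero magic byte and the 64-byte tolerance for the checksum region are heuristics for illustration, not taken from the on-disk format specification.

```python
import sys

HEADER_SIZE = 4096  # the corrupted region reported above is one 4 KiB page

def header_mostly_zeroed(path):
    """Return True if the first 4 KiB of the file is all zeros apart from
    a small non-zero region (e.g. a checksum written when the fixed
    writer was closed), matching the corruption pattern described above."""
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    nonzero = [i for i, b in enumerate(header) if b != 0]
    # Heuristic: a healthy header starts with a magic number, so byte 0 is
    # non-zero; the corrupted pattern has only a short non-zero run.
    return len(nonzero) > 0 and header[0] == 0 and len(nonzero) <= 64

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(path, "looks corrupted" if header_mostly_zeroed(path) else "looks ok")
```

Running it over the fidx files of a failing snapshot should quickly show whether it matches the zeroed-header pattern.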

So comparing the dynamic debug output of a bogus run with that of a good run should help pinpoint where in the source code things go wrong.


After doing some testing I was able to get the verification errors again with my Home Assistant VM backups using the kernel Ceph client. I ran the bash script and grabbed log dumps; however, with debugging enabled I just get a checksum error. I also spun up a small 2 GB Alpine Linux VM and grabbed logs from it as well. I got one log with a checksum error and one without. I'll send them to you in a PM.

ERROR: Backup of VM 200 failed - backup close image failed: command error: fixed writer 'drive-scsi0.img.fidx' close failed - got unexpected checksum

Bash:
INFO: starting new backup job: vzdump 200 --mode snapshot --storage proxmox-backup-server --notes-template '{{guestname}}' --remove 0 --notification-mode auto --node proxmox01
INFO: Starting Backup of VM 200 (qemu)
INFO: Backup started at 2024-08-06 16:04:35
INFO: status = running
INFO: VM Name: HomeAssistant
INFO: include disk 'scsi0' 'ceph-replicated:vm-200-disk-2' 40G
INFO: include disk 'efidisk0' 'ceph-replicated:vm-200-disk-1' 1M
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/200/2024-08-06T23:04:35Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '991ac964-b21b-4cbe-adff-5d591cbf4c1f'
INFO: resuming VM again
INFO: efidisk0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: scsi0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO:   2% (1.0 GiB of 40.0 GiB) in 3s, read: 350.7 MiB/s, write: 60.0 MiB/s
INFO:   3% (1.2 GiB of 40.0 GiB) in 26s, read: 8.3 MiB/s, write: 3.5 MiB/s
INFO:   4% (1.6 GiB of 40.0 GiB) in 51s, read: 16.6 MiB/s, write: 3.4 MiB/s
INFO:   5% (2.0 GiB of 40.0 GiB) in 1m 20s, read: 13.7 MiB/s, write: 3.9 MiB/s
INFO:   6% (2.4 GiB of 40.0 GiB) in 1m 40s, read: 20.2 MiB/s, write: 7.2 MiB/s
INFO:   7% (2.8 GiB of 40.0 GiB) in 2m 24s, read: 10.4 MiB/s, write: 3.4 MiB/s
INFO:   8% (3.2 GiB of 40.0 GiB) in 2m 50s, read: 14.0 MiB/s, write: 945.2 KiB/s
INFO:   9% (3.6 GiB of 40.0 GiB) in 2m 54s, read: 105.0 MiB/s, write: 41.0 MiB/s
INFO:  10% (4.0 GiB of 40.0 GiB) in 3m 18s, read: 16.7 MiB/s, write: 3.0 MiB/s
INFO:  11% (4.4 GiB of 40.0 GiB) in 3m 34s, read: 25.5 MiB/s, write: 7.8 MiB/s
INFO:  12% (4.9 GiB of 40.0 GiB) in 3m 49s, read: 31.7 MiB/s, write: 4.0 MiB/s
INFO:  13% (5.2 GiB of 40.0 GiB) in 3m 56s, read: 50.9 MiB/s, write: 4.0 MiB/s
INFO:  14% (5.6 GiB of 40.0 GiB) in 4m 2s, read: 73.3 MiB/s, write: 4.0 MiB/s
INFO:  15% (6.0 GiB of 40.0 GiB) in 4m 6s, read: 93.0 MiB/s, write: 6.0 MiB/s
INFO:  16% (6.4 GiB of 40.0 GiB) in 4m 16s, read: 43.6 MiB/s, write: 2.8 MiB/s
INFO:  17% (6.8 GiB of 40.0 GiB) in 4m 20s, read: 98.0 MiB/s, write: 5.0 MiB/s
INFO:  18% (7.2 GiB of 40.0 GiB) in 4m 26s, read: 66.0 MiB/s, write: 6.7 MiB/s
INFO:  19% (7.7 GiB of 40.0 GiB) in 4m 32s, read: 76.7 MiB/s, write: 4.0 MiB/s
INFO:  20% (8.0 GiB of 40.0 GiB) in 4m 38s, read: 60.0 MiB/s, write: 9.3 MiB/s
INFO:  21% (8.4 GiB of 40.0 GiB) in 4m 43s, read: 83.2 MiB/s, write: 6.4 MiB/s
INFO:  22% (8.9 GiB of 40.0 GiB) in 4m 49s, read: 80.7 MiB/s, write: 10.7 MiB/s
INFO:  23% (9.3 GiB of 40.0 GiB) in 5m 6s, read: 23.1 MiB/s, write: 2.1 MiB/s
INFO:  24% (9.6 GiB of 40.0 GiB) in 5m 11s, read: 78.4 MiB/s, write: 2.4 MiB/s
INFO:  25% (10.0 GiB of 40.0 GiB) in 5m 18s, read: 54.9 MiB/s, write: 3.4 MiB/s
INFO:  26% (10.4 GiB of 40.0 GiB) in 5m 24s, read: 64.7 MiB/s, write: 1.3 MiB/s
INFO:  27% (10.8 GiB of 40.0 GiB) in 5m 30s, read: 72.0 MiB/s, write: 5.3 MiB/s
INFO:  28% (11.2 GiB of 40.0 GiB) in 5m 37s, read: 58.3 MiB/s, write: 2.3 MiB/s
INFO:  29% (11.7 GiB of 40.0 GiB) in 5m 42s, read: 89.6 MiB/s, write: 10.4 MiB/s
INFO:  30% (12.0 GiB of 40.0 GiB) in 5m 49s, read: 50.9 MiB/s, write: 3.4 MiB/s
INFO:  31% (12.4 GiB of 40.0 GiB) in 5m 55s, read: 75.3 MiB/s, write: 12.7 MiB/s
INFO:  32% (12.8 GiB of 40.0 GiB) in 6m 43s, read: 7.7 MiB/s, write: 1.2 MiB/s
INFO:  33% (13.2 GiB of 40.0 GiB) in 7m 18s, read: 12.1 MiB/s, write: 702.2 KiB/s
INFO:  34% (13.6 GiB of 40.0 GiB) in 7m 23s, read: 86.4 MiB/s, write: 5.6 MiB/s
INFO:  35% (14.0 GiB of 40.0 GiB) in 7m 31s, read: 47.5 MiB/s, write: 5.0 MiB/s
INFO:  36% (14.5 GiB of 40.0 GiB) in 7m 39s, read: 56.0 MiB/s, write: 9.0 MiB/s
INFO:  37% (14.8 GiB of 40.0 GiB) in 7m 44s, read: 79.2 MiB/s, write: 7.2 MiB/s
INFO:  38% (15.2 GiB of 40.0 GiB) in 7m 49s, read: 75.2 MiB/s, write: 4.8 MiB/s
INFO:  39% (15.6 GiB of 40.0 GiB) in 7m 56s, read: 61.1 MiB/s, write: 1.7 MiB/s
INFO:  40% (16.0 GiB of 40.0 GiB) in 8m 1s, read: 80.0 MiB/s, write: 19.2 MiB/s
INFO:  41% (16.4 GiB of 40.0 GiB) in 8m 10s, read: 48.9 MiB/s, write: 13.3 MiB/s
INFO:  42% (16.8 GiB of 40.0 GiB) in 8m 23s, read: 28.3 MiB/s, write: 8.9 MiB/s
INFO:  43% (17.3 GiB of 40.0 GiB) in 8m 39s, read: 34.8 MiB/s, write: 1.5 MiB/s
INFO:  44% (17.7 GiB of 40.0 GiB) in 8m 45s, read: 58.0 MiB/s, write: 2.0 MiB/s
INFO:  45% (18.0 GiB of 40.0 GiB) in 8m 50s, read: 72.8 MiB/s, write: 3.2 MiB/s
INFO:  46% (18.4 GiB of 40.0 GiB) in 8m 56s, read: 61.3 MiB/s, write: 8.0 MiB/s
INFO:  47% (18.8 GiB of 40.0 GiB) in 9m 1s, read: 88.0 MiB/s, write: 12.0 MiB/s
INFO:  48% (19.3 GiB of 40.0 GiB) in 9m 9s, read: 54.0 MiB/s, write: 4.0 MiB/s
INFO:  49% (19.6 GiB of 40.0 GiB) in 9m 15s, read: 64.7 MiB/s, write: 0 B/s
INFO:  50% (20.0 GiB of 40.0 GiB) in 9m 20s, read: 78.4 MiB/s, write: 3.2 MiB/s
INFO:  51% (20.4 GiB of 40.0 GiB) in 9m 24s, read: 110.0 MiB/s, write: 6.0 MiB/s
INFO:  52% (21.1 GiB of 40.0 GiB) in 9m 27s, read: 234.7 MiB/s, write: 9.3 MiB/s
INFO:  53% (21.5 GiB of 40.0 GiB) in 9m 30s, read: 141.3 MiB/s, write: 2.7 MiB/s
INFO:  56% (22.4 GiB of 40.0 GiB) in 9m 33s, read: 304.0 MiB/s, write: 49.3 MiB/s
INFO:  57% (22.8 GiB of 40.0 GiB) in 9m 52s, read: 19.8 MiB/s, write: 7.8 MiB/s
INFO:  61% (24.8 GiB of 40.0 GiB) in 10m 12s, read: 101.8 MiB/s, write: 614.4 KiB/s
INFO:  66% (26.5 GiB of 40.0 GiB) in 10m 15s, read: 574.7 MiB/s, write: 6.7 MiB/s
INFO:  67% (26.8 GiB of 40.0 GiB) in 10m 20s, read: 70.4 MiB/s, write: 6.4 MiB/s
INFO:  68% (27.2 GiB of 40.0 GiB) in 10m 27s, read: 57.1 MiB/s, write: 0 B/s
INFO:  69% (27.6 GiB of 40.0 GiB) in 10m 33s, read: 72.7 MiB/s, write: 11.3 MiB/s
INFO:  70% (28.1 GiB of 40.0 GiB) in 10m 40s, read: 61.1 MiB/s, write: 7.4 MiB/s
INFO:  71% (28.4 GiB of 40.0 GiB) in 10m 44s, read: 89.0 MiB/s, write: 4.0 MiB/s
INFO:  72% (28.8 GiB of 40.0 GiB) in 10m 51s, read: 61.1 MiB/s, write: 5.7 MiB/s
INFO:  73% (29.3 GiB of 40.0 GiB) in 10m 57s, read: 80.7 MiB/s, write: 14.0 MiB/s
INFO:  74% (29.7 GiB of 40.0 GiB) in 11m 7s, read: 37.2 MiB/s, write: 9.6 MiB/s
INFO:  75% (30.0 GiB of 40.0 GiB) in 11m 18s, read: 32.7 MiB/s, write: 10.9 MiB/s
INFO:  76% (30.4 GiB of 40.0 GiB) in 11m 33s, read: 26.9 MiB/s, write: 4.5 MiB/s
INFO:  77% (30.9 GiB of 40.0 GiB) in 11m 42s, read: 51.1 MiB/s, write: 4.0 MiB/s
INFO:  78% (31.3 GiB of 40.0 GiB) in 11m 48s, read: 74.0 MiB/s, write: 14.7 MiB/s
INFO:  79% (31.6 GiB of 40.0 GiB) in 11m 56s, read: 43.0 MiB/s, write: 4.0 MiB/s
INFO:  80% (32.1 GiB of 40.0 GiB) in 12m 2s, read: 73.3 MiB/s, write: 2.7 MiB/s
INFO:  81% (32.4 GiB of 40.0 GiB) in 12m 8s, read: 63.3 MiB/s, write: 1.3 MiB/s
INFO:  82% (33.0 GiB of 40.0 GiB) in 12m 11s, read: 186.7 MiB/s, write: 9.3 MiB/s
INFO:  94% (37.8 GiB of 40.0 GiB) in 12m 14s, read: 1.6 GiB/s, write: 5.3 MiB/s
INFO:  98% (39.2 GiB of 40.0 GiB) in 12m 17s, read: 490.7 MiB/s, write: 0 B/s
INFO:  99% (39.6 GiB of 40.0 GiB) in 12m 20s, read: 138.7 MiB/s, write: 0 B/s
INFO: 100% (40.0 GiB of 40.0 GiB) in 12m 29s, read: 42.7 MiB/s, write: 58.7 KiB/s
ERROR: backup close image failed: command error: fixed writer 'drive-scsi0.img.fidx' close failed - got unexpected checksum
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 200 failed - backup close image failed: command error: fixed writer 'drive-scsi0.img.fidx' close failed - got unexpected checksum
INFO: Failed at 2024-08-06 16:17:36
INFO: Backup job finished with errors
INFO: notified via target `mail-to-root`
TASK ERROR: job errors
 
Thanks a lot for your efforts and providing the traces. Not sure if this will help as the error is not the same anymore, but I will have a closer look.

Please do also try to see if the issue is present when booting into kernel 6.5.13-6-pve on the PBS, using the in-kernel cephfs client.
 
I don't see kernel 6.5.13-6-pve in the apt-get list. The highest version of the 6.5.13 kernel I see is proxmox-kernel-6.5.13-5-pve. Do I need to add a specific repo to get that version, or will proxmox-kernel-6.5.13-5-pve work for testing?
 
I've run a bunch of backups on proxmox-kernel-6.5.13-5-pve without issue. I'll keep running random and scheduled backups and see if the issue comes back on this kernel version.
 
Hi,

there is currently a CephFS issue under investigation related to the readahead size, which is configurable when mounting the Ceph filesystem via the rasize option [0]. Please check whether your issues persist even when the CephFS backing the datastore is mounted with an additional rasize=0 while using an affected kernel. This would help gain confidence that the two issues are indeed related, thx!

[0] https://docs.ceph.com/en/reef/man/8/mount.ceph/#advanced
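For reference, a manual mount with readahead disabled would look roughly like this; the monitor addresses, filesystem name, mount point and credential path below are placeholder values, only the rasize=0 option is the relevant part:

```shell
# Mount CephFS with client readahead disabled (rasize=0), as suggested above.
# Monitor addresses, fs name, mount point and secretfile are example values.
mount -t ceph 10.0.0.1,10.0.0.2,10.0.0.3:/ /mnt/datastore \
    -o name=admin,secretfile=/etc/ceph/admin.secret,fs=cephfs,rasize=0
```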
 
Will do! I'll switch back to the current kernel and test it out. As a note, proxmox-kernel-6.5.13-5-pve has remained issue-free since switching to it on the 7th.
 
I think you are on the right track. With rasize=0 I haven't been able to recreate the issue. When I removed rasize and ran a backup, I got the issue again.

No errors with this in fstab:
10.10.10.14,10.10.10.13,10.10.10.12:/ /mnt/cephfs/hdd-ceph-fs ceph name={username},secret={secret},fs=hdd-ceph-fs,rasize=0,_netdev 0 0

I get "Unable to open fixed index" with this in fstab:
10.10.10.14,10.10.10.13,10.10.10.12:/ /mnt/cephfs/hdd-ceph-fs ceph name={username},secret={secret},fs=hdd-ceph-fs,_netdev 0 0

I'll run a few more tests just to make sure, but it does appear to be the issue with the readahead size.
 
I opened https://bugzilla.proxmox.com/show_bug.cgi?id=5683, where the Proxmox team has posted that they have narrowed this down to a kernel bug introduced in 6.8. I've therefore downgraded my kernel to 6.5 to avoid the speed penalty from rasize=0.

Code:
hi  proxmox-default-kernel             1.0.1                               all          Default Proxmox Kernel Image
ii  proxmox-kernel-6.5                 6.5.13-5                            all          Latest Proxmox Kernel Image
ii  proxmox-kernel-6.5.13-5-pve-signed 6.5.13-5                            amd64        Proxmox Kernel Image (signed)

Code:
apt-mark hold proxmox-default-kernel
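The two dpkg states above (6.5 installed, the default-kernel meta-package on hold) can be reached with a few commands. This is only a sketch; the proxmox-boot-tool pin step applies only on systems that boot via proxmox-boot-tool:

```shell
# Stay on the 6.5 kernel series until the 6.8 regression is fixed.
apt install proxmox-kernel-6.5             # pull in the 6.5 meta-package
apt-mark hold proxmox-default-kernel       # stop apt from moving the default to 6.8
proxmox-boot-tool kernel pin 6.5.13-5-pve  # boot this kernel by default
reboot
```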
 
My situation also evolved somewhat:
- There were some backup errors that I did not manage to catch either :(
- After that, I moved the ns folder to a separate block device (backed by the same Ceph that provides the CephFS).
- There have been no more such errors (i.e. empty .fidx files) during backup since then.

Maybe cephfs-fuse could also have fixed the problem, but cephfs-fuse is slower than the in-kernel driver (which is itself somewhat of a bottleneck currently), and using it for saving chunks would be really painful. Or I would have needed a separate CephFS...
 
Hi, I don't think it's a CephFS problem; I have a PVE 8.2.4 and PBS 3.2-7 server without a cluster and without using CephFS, and the same thing happens to me.

With kernel 6.5 (PBS) and 6.8 (PVE) the same thing happens. Should I put 6.5 on both ends?

I upgraded from PBS 2 to 3 and now I have these errors, which I never had in PBS 2.x.

What logs can I provide to help? :D
 
Okay, so you basically eliminated the CephFS client by moving to an RBD disk; that is a possible workaround. We are currently in contact with the kernel developers to debug the issue.
 
Hi,
thanks for joining in, but your issue seems to be somewhat different. As a first step, can you please check whether the fixed index file is indeed corrupted in the same manner by inspecting the hexdump as described in https://forum.proxmox.com/threads/backup-suceeds-but-ends-up-failing-verification.149249/post-675825?

Does the issue also appear if you delete the corrupt snapshot and recreate a new backup snapshot?
 
Moving the ns directory (which contains the fidx files) to another filesystem is a working workaround. What you _must_ not do is symlink the ns directory (or else garbage collection will delete most of your chunks, see https://forum.proxmox.com/threads/garbage-collection-phase1-does-not-mark-m-any-chunks.155114/); you must mount it directly!

The original error was still there a few days ago when I tried to move the ns folder back to the cephfs, but the current workaround (i.e. having the ns folder not on cephfs) works.
 
Hi,
please try to see if the fixed index file corruption persists on the latest proxmox-kernel in version 6.8.12-3, which at the time of writing is available in the PBS test repository [0]. This version includes a tentative fix for a related issue.
Required steps to get the latest kernel are:
  1. Adapt the sources to enable the pbstest repo as described in [0]
  2. apt update
  3. apt install proxmox-kernel-6.8
  4. reboot
Please do not forget to reboot into the new kernel after installing the upgrade. Feedback is very much appreciated!

[0] https://pbs.proxmox.com/docs/installation.html#proxmox-backup-test-repository
 
I upgraded to 6.8.12-3-pve, moved the ns folder back to cephfs, and backed up all of the smaller VMs that have multiple disks. For now, these seem to have no errors (I tested it with the file restore listing, which usually fails for a bad backup). I will wait for at least a week before declaring victory. :)
 
Hi there !
I had the same issue (or at least a similar one), where my backup verification jobs were failing for my Linux Mint VM.
After switching to the test repo and updating to the new kernel 6.8.12-3 (from 6.8.12-2), I took a new backup and verified it: all is good now!
 
