Backup suceeds but ends up failing verification

Thanks, the garbage collection should however be unrelated.

Let me give some more insights on the current hypothesis: The suspicion is that there is a cephfs bug related to caching, making this unfortunately also more difficult to reproduce and debug. This is also why the backup is successful, as checksums and sizes written to disk as seen by the proxmox backup client do match the expected ones, while the underlying filesystem seems to not persist the initial writing of the header, only writing the final bytes when updating the index checksum. This is why the full 4k page is zeroed, except for the checksum written when finishing and closing the fixed index file. Also, so far this seems to only happen with VMs having more than 1 disk attached, which indicates that only a certain sequence of operations and timings leads to this.

So comparing the dynamic debug output of the boguos run with a good run should help to see in the source code where things go wrong.


After doing some testing I was able to get the verification errors again with my homeassistant VM backups using kernal ceph. I ran the bash script and grabbed log dumps, however when the debugging is enabled I just get a checksum error. I also spun up a 2gb small alpine linux vm and grabbed logs form it as well. I got one log with a checksum error and one without. I'll send them to you in a PM.

ERROR: Backup of VM 200 failed - backup close image failed: command error: fixed writer 'drive-scsi0.img.fidx' close failed - got unexpected checksum

Bash:
INFO: starting new backup job: vzdump 200 --mode snapshot --storage proxmox-backup-server --notes-template '{{guestname}}' --remove 0 --notification-mode auto --node proxmox01
INFO: Starting Backup of VM 200 (qemu)
INFO: Backup started at 2024-08-06 16:04:35
INFO: status = running
INFO: VM Name: HomeAssistant
INFO: include disk 'scsi0' 'ceph-replicated:vm-200-disk-2' 40G
INFO: include disk 'efidisk0' 'ceph-replicated:vm-200-disk-1' 1M
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/200/2024-08-06T23:04:35Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '991ac964-b21b-4cbe-adff-5d591cbf4c1f'
INFO: resuming VM again
INFO: efidisk0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: scsi0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO:   2% (1.0 GiB of 40.0 GiB) in 3s, read: 350.7 MiB/s, write: 60.0 MiB/s
INFO:   3% (1.2 GiB of 40.0 GiB) in 26s, read: 8.3 MiB/s, write: 3.5 MiB/s
INFO:   4% (1.6 GiB of 40.0 GiB) in 51s, read: 16.6 MiB/s, write: 3.4 MiB/s
INFO:   5% (2.0 GiB of 40.0 GiB) in 1m 20s, read: 13.7 MiB/s, write: 3.9 MiB/s
INFO:   6% (2.4 GiB of 40.0 GiB) in 1m 40s, read: 20.2 MiB/s, write: 7.2 MiB/s
INFO:   7% (2.8 GiB of 40.0 GiB) in 2m 24s, read: 10.4 MiB/s, write: 3.4 MiB/s
INFO:   8% (3.2 GiB of 40.0 GiB) in 2m 50s, read: 14.0 MiB/s, write: 945.2 KiB/s
INFO:   9% (3.6 GiB of 40.0 GiB) in 2m 54s, read: 105.0 MiB/s, write: 41.0 MiB/s
INFO:  10% (4.0 GiB of 40.0 GiB) in 3m 18s, read: 16.7 MiB/s, write: 3.0 MiB/s
INFO:  11% (4.4 GiB of 40.0 GiB) in 3m 34s, read: 25.5 MiB/s, write: 7.8 MiB/s
INFO:  12% (4.9 GiB of 40.0 GiB) in 3m 49s, read: 31.7 MiB/s, write: 4.0 MiB/s
INFO:  13% (5.2 GiB of 40.0 GiB) in 3m 56s, read: 50.9 MiB/s, write: 4.0 MiB/s
INFO:  14% (5.6 GiB of 40.0 GiB) in 4m 2s, read: 73.3 MiB/s, write: 4.0 MiB/s
INFO:  15% (6.0 GiB of 40.0 GiB) in 4m 6s, read: 93.0 MiB/s, write: 6.0 MiB/s
INFO:  16% (6.4 GiB of 40.0 GiB) in 4m 16s, read: 43.6 MiB/s, write: 2.8 MiB/s
INFO:  17% (6.8 GiB of 40.0 GiB) in 4m 20s, read: 98.0 MiB/s, write: 5.0 MiB/s
INFO:  18% (7.2 GiB of 40.0 GiB) in 4m 26s, read: 66.0 MiB/s, write: 6.7 MiB/s
INFO:  19% (7.7 GiB of 40.0 GiB) in 4m 32s, read: 76.7 MiB/s, write: 4.0 MiB/s
INFO:  20% (8.0 GiB of 40.0 GiB) in 4m 38s, read: 60.0 MiB/s, write: 9.3 MiB/s
INFO:  21% (8.4 GiB of 40.0 GiB) in 4m 43s, read: 83.2 MiB/s, write: 6.4 MiB/s
INFO:  22% (8.9 GiB of 40.0 GiB) in 4m 49s, read: 80.7 MiB/s, write: 10.7 MiB/s
INFO:  23% (9.3 GiB of 40.0 GiB) in 5m 6s, read: 23.1 MiB/s, write: 2.1 MiB/s
INFO:  24% (9.6 GiB of 40.0 GiB) in 5m 11s, read: 78.4 MiB/s, write: 2.4 MiB/s
INFO:  25% (10.0 GiB of 40.0 GiB) in 5m 18s, read: 54.9 MiB/s, write: 3.4 MiB/s
INFO:  26% (10.4 GiB of 40.0 GiB) in 5m 24s, read: 64.7 MiB/s, write: 1.3 MiB/s
INFO:  27% (10.8 GiB of 40.0 GiB) in 5m 30s, read: 72.0 MiB/s, write: 5.3 MiB/s
INFO:  28% (11.2 GiB of 40.0 GiB) in 5m 37s, read: 58.3 MiB/s, write: 2.3 MiB/s
INFO:  29% (11.7 GiB of 40.0 GiB) in 5m 42s, read: 89.6 MiB/s, write: 10.4 MiB/s
INFO:  30% (12.0 GiB of 40.0 GiB) in 5m 49s, read: 50.9 MiB/s, write: 3.4 MiB/s
INFO:  31% (12.4 GiB of 40.0 GiB) in 5m 55s, read: 75.3 MiB/s, write: 12.7 MiB/s
INFO:  32% (12.8 GiB of 40.0 GiB) in 6m 43s, read: 7.7 MiB/s, write: 1.2 MiB/s
INFO:  33% (13.2 GiB of 40.0 GiB) in 7m 18s, read: 12.1 MiB/s, write: 702.2 KiB/s
INFO:  34% (13.6 GiB of 40.0 GiB) in 7m 23s, read: 86.4 MiB/s, write: 5.6 MiB/s
INFO:  35% (14.0 GiB of 40.0 GiB) in 7m 31s, read: 47.5 MiB/s, write: 5.0 MiB/s
INFO:  36% (14.5 GiB of 40.0 GiB) in 7m 39s, read: 56.0 MiB/s, write: 9.0 MiB/s
INFO:  37% (14.8 GiB of 40.0 GiB) in 7m 44s, read: 79.2 MiB/s, write: 7.2 MiB/s
INFO:  38% (15.2 GiB of 40.0 GiB) in 7m 49s, read: 75.2 MiB/s, write: 4.8 MiB/s
INFO:  39% (15.6 GiB of 40.0 GiB) in 7m 56s, read: 61.1 MiB/s, write: 1.7 MiB/s
INFO:  40% (16.0 GiB of 40.0 GiB) in 8m 1s, read: 80.0 MiB/s, write: 19.2 MiB/s
INFO:  41% (16.4 GiB of 40.0 GiB) in 8m 10s, read: 48.9 MiB/s, write: 13.3 MiB/s
INFO:  42% (16.8 GiB of 40.0 GiB) in 8m 23s, read: 28.3 MiB/s, write: 8.9 MiB/s
INFO:  43% (17.3 GiB of 40.0 GiB) in 8m 39s, read: 34.8 MiB/s, write: 1.5 MiB/s
INFO:  44% (17.7 GiB of 40.0 GiB) in 8m 45s, read: 58.0 MiB/s, write: 2.0 MiB/s
INFO:  45% (18.0 GiB of 40.0 GiB) in 8m 50s, read: 72.8 MiB/s, write: 3.2 MiB/s
INFO:  46% (18.4 GiB of 40.0 GiB) in 8m 56s, read: 61.3 MiB/s, write: 8.0 MiB/s
INFO:  47% (18.8 GiB of 40.0 GiB) in 9m 1s, read: 88.0 MiB/s, write: 12.0 MiB/s
INFO:  48% (19.3 GiB of 40.0 GiB) in 9m 9s, read: 54.0 MiB/s, write: 4.0 MiB/s
INFO:  49% (19.6 GiB of 40.0 GiB) in 9m 15s, read: 64.7 MiB/s, write: 0 B/s
INFO:  50% (20.0 GiB of 40.0 GiB) in 9m 20s, read: 78.4 MiB/s, write: 3.2 MiB/s
INFO:  51% (20.4 GiB of 40.0 GiB) in 9m 24s, read: 110.0 MiB/s, write: 6.0 MiB/s
INFO:  52% (21.1 GiB of 40.0 GiB) in 9m 27s, read: 234.7 MiB/s, write: 9.3 MiB/s
INFO:  53% (21.5 GiB of 40.0 GiB) in 9m 30s, read: 141.3 MiB/s, write: 2.7 MiB/s
INFO:  56% (22.4 GiB of 40.0 GiB) in 9m 33s, read: 304.0 MiB/s, write: 49.3 MiB/s
INFO:  57% (22.8 GiB of 40.0 GiB) in 9m 52s, read: 19.8 MiB/s, write: 7.8 MiB/s
INFO:  61% (24.8 GiB of 40.0 GiB) in 10m 12s, read: 101.8 MiB/s, write: 614.4 KiB/s
INFO:  66% (26.5 GiB of 40.0 GiB) in 10m 15s, read: 574.7 MiB/s, write: 6.7 MiB/s
INFO:  67% (26.8 GiB of 40.0 GiB) in 10m 20s, read: 70.4 MiB/s, write: 6.4 MiB/s
INFO:  68% (27.2 GiB of 40.0 GiB) in 10m 27s, read: 57.1 MiB/s, write: 0 B/s
INFO:  69% (27.6 GiB of 40.0 GiB) in 10m 33s, read: 72.7 MiB/s, write: 11.3 MiB/s
INFO:  70% (28.1 GiB of 40.0 GiB) in 10m 40s, read: 61.1 MiB/s, write: 7.4 MiB/s
INFO:  71% (28.4 GiB of 40.0 GiB) in 10m 44s, read: 89.0 MiB/s, write: 4.0 MiB/s
INFO:  72% (28.8 GiB of 40.0 GiB) in 10m 51s, read: 61.1 MiB/s, write: 5.7 MiB/s
INFO:  73% (29.3 GiB of 40.0 GiB) in 10m 57s, read: 80.7 MiB/s, write: 14.0 MiB/s
INFO:  74% (29.7 GiB of 40.0 GiB) in 11m 7s, read: 37.2 MiB/s, write: 9.6 MiB/s
INFO:  75% (30.0 GiB of 40.0 GiB) in 11m 18s, read: 32.7 MiB/s, write: 10.9 MiB/s
INFO:  76% (30.4 GiB of 40.0 GiB) in 11m 33s, read: 26.9 MiB/s, write: 4.5 MiB/s
INFO:  77% (30.9 GiB of 40.0 GiB) in 11m 42s, read: 51.1 MiB/s, write: 4.0 MiB/s
INFO:  78% (31.3 GiB of 40.0 GiB) in 11m 48s, read: 74.0 MiB/s, write: 14.7 MiB/s
INFO:  79% (31.6 GiB of 40.0 GiB) in 11m 56s, read: 43.0 MiB/s, write: 4.0 MiB/s
INFO:  80% (32.1 GiB of 40.0 GiB) in 12m 2s, read: 73.3 MiB/s, write: 2.7 MiB/s
INFO:  81% (32.4 GiB of 40.0 GiB) in 12m 8s, read: 63.3 MiB/s, write: 1.3 MiB/s
INFO:  82% (33.0 GiB of 40.0 GiB) in 12m 11s, read: 186.7 MiB/s, write: 9.3 MiB/s
INFO:  94% (37.8 GiB of 40.0 GiB) in 12m 14s, read: 1.6 GiB/s, write: 5.3 MiB/s
INFO:  98% (39.2 GiB of 40.0 GiB) in 12m 17s, read: 490.7 MiB/s, write: 0 B/s
INFO:  99% (39.6 GiB of 40.0 GiB) in 12m 20s, read: 138.7 MiB/s, write: 0 B/s
INFO: 100% (40.0 GiB of 40.0 GiB) in 12m 29s, read: 42.7 MiB/s, write: 58.7 KiB/s
ERROR: backup close image failed: command error: fixed writer 'drive-scsi0.img.fidx' close failed - got unexpected checksum
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 200 failed - backup close image failed: command error: fixed writer 'drive-scsi0.img.fidx' close failed - got unexpected checksum
INFO: Failed at 2024-08-06 16:17:36
INFO: Backup job finished with errors
INFO: notified via target `mail-to-root`
TASK ERROR: job errors
 
  • Like
Reactions: BriBriBri
Thanks a lot for your efforts and providing the traces. Not sure if this will help as the error is not the same anymore, but I will have a closer look.

Please do also try to see if the issue is present when booting into kernel 6.5.13-6-pve on the PBS, using the in-kernel cephfs client.
 
Thanks a lot for your efforts and providing the traces. Not sure if this will help as the error is not the same anymore, but I will have a closer look.

Please do also try to see if the issue is present when booting into kernel 6.5.13-6-pve on the PBS, using the in-kernel cephfs client.
I don't see kernel 6.5.13-6-pve in the apt-get list. The highest version I see is of the 6.5.13 kernel is proxmox-kernel-6.5.13-5-pve. Do I need to add a specific repo to get that version? Or will proxmox-kernel-6.5.13-5-pve work to test?
 
Last edited:
Thanks a lot for your efforts and providing the traces. Not sure if this will help as the error is not the same anymore, but I will have a closer look.

Please do also try to see if the issue is present when booting into kernel 6.5.13-6-pve on the PBS, using the in-kernel cephfs client.
I've run a bunch of backups on proxmox-kernel-6.5.13-5-pve without issue. I'll keep running random backup and scheduled backups and see if the issue comes back on this version of the kernel.
 
Hi,

there is currently a CephFS issue under investigation which is related to the readahead size, as configurable when mounting the ceph filesystem via the rasize option [0]. Please check if your issues persist even if the CephFS backing the datastore is mounted by adding an additional rasize=0 while using an affected kernel. This would help in gaining confidence that the two issues are indeed related, thx!

[0] https://docs.ceph.com/en/reef/man/8/mount.ceph/#advanced
 
Hi,

there is currently a CephFS issue under investigation which is related to the readahead size, as configurable when mounting the ceph filesystem via the rasize option [0]. Please check if your issues persist even if the CephFS backing the datastore is mounted by adding an additional rasize=0 while using an affected kernel. This would help in gaining confidence that the two issues are indeed related, thx!

[0] https://docs.ceph.com/en/reef/man/8/mount.ceph/#advanced
Will do! I'll switch back to the current kernel and test it out. As a note proxmox-kernel-6.5.13-5-pve remained issue free since switching to it on the 7th.
 
Last edited:
Hi,

there is currently a CephFS issue under investigation which is related to the readahead size, as configurable when mounting the ceph filesystem via the rasize option [0]. Please check if your issues persist even if the CephFS backing the datastore is mounted by adding an additional rasize=0 while using an affected kernel. This would help in gaining confidence that the two issues are indeed related, thx!

[0] https://docs.ceph.com/en/reef/man/8/mount.ceph/#advanced
I think you are on the right track. With rasize=0 I haven't be able to recreate the issue. When I removed rasize and did a backup I got the issue again.

No error with this with this in fstab
10.10.10.14,10.10.10.13,10.10.10.12:/ /mnt/cephfs/hdd-ceph-fs ceph name={username},secret={secret},fs=hdd-ceph-fs,rasize=0,_netdev 0 0

I get "Unable to open fixed index" with this in fstab
10.10.10.14,10.10.10.13,10.10.10.12:/ /mnt/cephfs/hdd-ceph-fs ceph name={username},secret={secret},fs=hdd-ceph-fs,_netdev 0 0

I'll run a few more tests just to make sure, but it does appear to be the issue with the readahead size.
 
Last edited:
I opened https://bugzilla.proxmox.com/show_bug.cgi?id=5683, where the Proxmox team have posted that they have narrowed this down to a kernel bug introduced in 6.8. Therefore, I've downgraded my kernel to 6.5 to avoid the speed penalty from rasize=0

Code:
hi  proxmox-default-kernel             1.0.1                               all          Default Proxmox Kernel Image
ii  proxmox-kernel-6.5                 6.5.13-5                            all          Latest Proxmox Kernel Image
ii  proxmox-kernel-6.5.13-5-pve-signed 6.5.13-5                            amd64        Proxmox Kernel Image (signed)

Code:
apt-mark hold  proxmox-default-kernel
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!