Backup succeeds but ends up failing verification

pallingerpeter

For all of my VMs, backups always succeed.
For some of them, however, the result is bad:
  • they fail verification
  • the first stored disk cannot be downloaded, the http response is:
    • unable to read fixed index <filename> - Unable to open fixed index <filename> - got unknown magic number
  • of course, the file restore function also gives an error:
    • Starting VM failed. See output above for more information. (500)
    • I failed to find any further information in the logs (or in the response).
The error occurs only for some of the VMs, and even for those, not completely consistently. The only thing the affected VMs have in common is that they all have at least two disks.
Other observations:
  • Repeated backups after a bad result usually (but not always) also create a bad result.
  • Repeated backups after a good result (even if there was a bad result before that) usually (but not always) also create a good result.
  • Migrating a VM to another node and then doing a new backup sometimes creates a good result.
  • Setting up replication (and waiting for it to finish) for a VM and then doing a new backup usually creates a good result.
I cannot give a complete report yet, as re-doing a backup takes about 20 minutes plus 60 minutes of verification (the affected VMs are about 200 GiB; there were some smaller ones, but they decided to fix themselves after a manual test backup and have been producing good backups ever since :/ ).

Has anyone seen something similar?
If yes, what may be the problem?
If no, does anyone know how to debug a backup or verification process?
 
For some more information, here is the output of a verification job:
Code:
2024-06-19T12:46:54+02:00: Automatically verifying newly added snapshot
2024-06-19T12:46:54+02:00: verify backup-cephfs:vm/440/2024-06-19T10:32:19Z
2024-06-19T12:46:54+02:00:   check qemu-server.conf.blob
2024-06-19T12:46:54+02:00:   check drive-virtio1.img.fidx
2024-06-19T13:18:24+02:00:   verified 62320.32/194312.00 MiB in 1890.00 seconds, speed 32.97/102.81 MiB/s (0 errors)
2024-06-19T13:18:24+02:00:   check drive-virtio0.img.fidx
2024-06-19T13:18:24+02:00: verify backup-cephfs:vm/440/2024-06-19T10:32:19Z/drive-virtio0.img.fidx failed: Unable to open fixed index "/mnt/backup-cephfs/ns/backup/vm/440/2024-06-19T10:32:19Z/drive-virtio0.img.fidx" - got unknown magic number
2024-06-19T13:18:24+02:00: TASK ERROR: verification failed - please check the log for details

I can send drive-virtio0.img.fidx if anyone is interested; it really starts with 32 zero bytes, which is most probably not a valid magic number.
Code:
head -c100 /mnt/backup-cephfs/ns/backup/vm/440/2024-06-19T10\:32\:19Z/drive-virtio0.img.fidx | hexdump
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000020 1856 2b7b b8cc 033f 47b9 c406 5c4b 0633
0000030 6e10 0427 fa5d 07cc 720e 7836 9c66 661d
0000040 0000 0000 0000 0000 0000 0000 0000 0000
*
0000060 0000 0000                              
0000064
 
please post the output of od --format=u1 --read-bytes=8 /mnt/backup-cephfs/ns/backup/vm/440/2024-06-19T10\:32\:19Z/drive-virtio0.img.fidx

The output should match exactly 0000000 47 127 65 237 145 253 15 205

If not (and it seems like it), your fixed index is corrupted or is something other than a fixed index. The magic number is an 8-byte identifier of the file type.
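
As a quick way to check a whole datastore, something along these lines (an untested sketch; adjust the datastore path to yours) would list all fixed index files whose first 8 bytes do not match that magic:
Bash:
# sketch: report every .fidx file whose first 8 bytes differ from the expected fixed-index magic
DATASTORE=/mnt/backup-cephfs/ns/backup   # example path, adjust to your datastore/namespace
EXPECTED="47 127 65 237 145 253 15 205"
find "$DATASTORE" -name '*.fidx' -print0 | while IFS= read -r -d '' f; do
    magic=$(od -An --format=u1 --read-bytes=8 "$f" | tr -s ' ' | sed 's/^ //; s/ $//')
    [ "$magic" = "$EXPECTED" ] || echo "bad magic: $f ($magic)"
done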

Does the time stamp of your fixed index match the date/time when the backup was created?
 
First of all, thank you for replying!
please post the output of od --format=u1 --read-bytes=8 /mnt/backup-cephfs/ns/backup/vm/440/2024-06-19T10\:32\:19Z/drive-virtio0.img.fidx
Code:
0000000   0   0   0   0   0   0   0   0
0000010
If not (and it seems like it), your fixed index is corrupted or is something other than a fixed index. The magic number is an 8-byte identifier of the file type.
Yeah, I get that. The question remains: why is it generated like this, and how can I avoid it? How can I debug the backup process?
Does the time stamp of your fixed index match the date/time when the backup was created?
Code:
ls -la /mnt/backup-cephfs/ns/backup/vm/440/2024-06-19T10\:32\:19Z/
total 1592
drwxr-xr-x  2 backup backup       5 Jun 19 13:18 .
drwxr-xr-x 26 backup backup      25 Jun 19 12:32 ..
-rw-r--r--  1 backup backup    2369 Jun 19 12:46 client.log.blob
-rw-r--r--  1 backup backup   20992 Jun 19 12:46 drive-virtio0.img.fidx
-rw-r--r--  1 backup backup 1604096 Jun 19 12:46 drive-virtio1.img.fidx
-rw-r--r--  1 backup backup     583 Jun 19 13:18 index.json.blob
-rw-r--r--  1 backup backup     536 Jun 19 12:32 qemu-server.conf.blob
Yes, the fidx date matches the final time found in the client log.
 
Update: I ran a lot of new backups and observed what happens in the backup metadata directory with the following script:

Code:
while sleep 10; do for i in 0 1; do od --format=u1 drive-virtio$i*fidx | head -n1; od --format=u1 drive-virtio$i*fidx | wc -l; done; done
This prints the first 16 bytes (which contain the magic number) and the approximate number of non-zero 16-byte chunks (od collapses repeated identical lines, so its line count roughly tracks how much of the file is non-zero) for both fidx files every 10 seconds.

In the beginning, both drive-virtio0.img.tmp_fidx and drive-virtio1.img.tmp_fidx were created, with a good magic number and not much else (qemu-server.conf.blob was also created, but that is not relevant).
Then drive-virtio1.img.tmp_fidx was gradually filled with non-zero data while virtio1 was being backed up according to the server job log, with roughly the same percentage non-zero as the disk's completion percentage in the log. The output looks like this:
Code:
0000000  47 127  65 237 145 253  15 205  55 172 206  80 104 185  68 104
8
0000000  47 127  65 237 145 253  15 205 151 195   6 194 119 143  77 230
54286

Then, at the end, drive-virtio0.img.tmp_fidx is filled with data. Unfortunately, both of my bad-backup VMs decided that they back up nicely when I observe them (Schrödinger's backup :p ). I will continue trying to reproduce the error while observing...
 
Hi,
please provide the output of pveversion -v for the Proxmox VE nodes and the proxmox-backup-manager versions --verbose of the Proxmox Backup Server host.

These intermittent errors might indicate an underlying hardware issue, so please also check the health of the disks backing the storage on the PBS side, as well as the system memory, by performing a prolonged memory test. What storage disks are you using?
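
For example (a rough sketch, not a definitive procedure; /dev/sdX is a placeholder and memtester may need to be installed first):
Bash:
# SMART health of a disk backing the PBS datastore (replace /dev/sdX with the actual device)
smartctl -a /dev/sdX
# start a long SMART self-test; check the result later with `smartctl -l selftest /dev/sdX`
smartctl -t long /dev/sdX

# for memory: ideally boot the host into memtest86+ for a prolonged offline test,
# or, as a weaker online check, test a chunk of RAM with memtester (size/loops are examples)
memtester 4G 3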
 
This may not be that useful, as I recently upgraded all PVE nodes:
proxmox-ve: 8.2.0 (running kernel: 6.8.8-2-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.8: 6.8.8-2
proxmox-kernel-6.8.8-2-pve-signed: 6.8.8-2
proxmox-kernel-6.8.8-1-pve-signed: 6.8.8-1
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5: 6.5.13-5
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph: 17.2.7-pve3
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.3
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.4-1
proxmox-backup-file-restore: 3.2.4-1
proxmox-firewall: 0.4.2
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.12-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 9.0.0-3
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1
proxmox-backup 3.2.0 running kernel: 6.8.8-2-pve
proxmox-backup-server 3.2.6-1 running version: 3.2.6
proxmox-kernel-helper 8.1.0
pve-kernel-6.2 8.0.5
proxmox-kernel-6.8 6.8.8-2
proxmox-kernel-6.8.8-2-pve-signed 6.8.8-2
proxmox-kernel-6.8.8-1-pve-signed 6.8.8-1
proxmox-kernel-6.2.16-20-pve 6.2.16-20
proxmox-kernel-6.2 6.2.16-20
pve-kernel-6.2.16-3-pve 6.2.16-3
ifupdown2 3.2.0-1+pmx8
libjs-extjs 7.0.0-4
proxmox-backup-docs 3.2.6-1
proxmox-backup-client 3.2.6-1
proxmox-mail-forward 0.2.3
proxmox-mini-journalreader 1.4.0
proxmox-offline-mirror-helper unknown
proxmox-widget-toolkit 4.2.3
pve-xtermjs 5.3.0-3
smartmontools 7.3-pve1
zfsutils-linux 2.2.4-pve1

These intermittent errors might indicate an underlying hardware issue, so please also check the health of the disks backing the storage on the PBS side, as well as the system memory, by performing a prolonged memory test. What storage disks are you using?
Hardware failure is quite possible, though disk failure is unlikely, as the data is on a Ceph cluster with 6+4 redundancy. I recently replaced the memory (increased it, which required replacement; of course the "new" memory could still be bad) and ran some memory testing, but only online so far.
It could still be a CPU/motherboard failure, or a Ceph or other software failure.

There have been some errors since the memory replacement, but I could not pin them down and have not been able to reproduce them yet. I extended my watcher script to detect new backups and to monitor and log the state of the fidx files, so maybe I will be able to catch the error later (a simplified sketch of the watcher is below).
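
The extended watcher is roughly along these lines (a simplified, untested sketch; the group path is just an example):
Bash:
# sketch: poll the newest snapshot directory of a VM group every 10 seconds and log
# the first od line (magic number) and the od line count of every index file in it
GROUP=/mnt/backup-cephfs/ns/backup/vm/440   # example group, adjust to your datastore/VM
LOG=/root/fidx-watch.log
while sleep 10; do
    snap=$(ls -1d "$GROUP"/*Z 2>/dev/null | tail -n1)   # newest snapshot directory
    [ -n "$snap" ] || continue
    for f in "$snap"/*fidx; do                          # matches both *.fidx and *.tmp_fidx
        [ -e "$f" ] || continue
        {
            echo "$(date -Is) $f"
            od --format=u1 "$f" | head -n1
            od --format=u1 "$f" | wc -l
        } >> "$LOG"
    done
done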
 
Thanks for the feedback, please also provide some details about your Ceph cluster and how you mount it on the PBS side.

There was another user reporting this issue; the common factor for both of you is the datastore being located on CephFS. Please have a look at https://forum.proxmox.com/threads/n...annot-do-incremental-backup.78340/post-676129 and following...
 
Well, the cluster consists of a single machine (for now; this was supposed to be a proof of concept that could be expanded later), with 12× 4 TB HDDs and 2× 100 GB SSDs for the system.

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.xx.xx.xx/20
fsid = 14c82bad-3859-4add-b8aa-45646687dc5a
mon_allow_pool_delete = true
mon_host = 10.xx.xx.xx 10.xx.xx.xx
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.xx.xx.xx/20

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.pvebackup1]
host = pvebackup1
mds_standby_for_name = pve

[mon.pvebackup]
addr = 10.xx.xx.xx
host = pvebackup1

[mon.pvebackup1-4]
public_addr = 10.xx.xx.xx

[screenshot attachment: Ceph configuration database]

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host pvebackup {
id -3 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
# weight 0.00000
alg straw2
hash 0 # rjenkins1
}
host pvebackup1 {
id -5 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
# weight 43.65839
alg straw2
hash 0 # rjenkins1
item osd.1 weight 3.63820
item osd.0 weight 3.63820
item osd.2 weight 3.63820
item osd.3 weight 3.63820
item osd.5 weight 3.63820
item osd.6 weight 3.63820
item osd.7 weight 3.63820
item osd.8 weight 3.63820
item osd.9 weight 3.63820
item osd.10 weight 3.63820
item osd.11 weight 3.63820
item osd.4 weight 3.63820
}
root default {
id -1 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
# weight 43.65839
alg straw2
hash 0 # rjenkins1
item pvebackup weight 0.00000
item pvebackup1 weight 43.65839
}

# rules
rule replicated_rule {
id 0
type replicated
step take default
step chooseleaf firstn 0 type host
step emit
}
rule ecpool {
id 1
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default
step chooseleaf indep 0 type host
step emit
}
rule replicated_rule_osd {
id 2
type replicated
step take default
step choose firstn 0 type osd
step emit
}
rule 10_4_ecpool {
id 3
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default
step choose indep 0 type osd
step emit
}

# end crush map

10.xx.xx.xx:6789:/ /mnt/backup-cephfs ceph name=backup,relatime,_netdev,secretfile=/etc/ceph/ceph.client.backup.secret 0 2
 
Thanks for the feedback; unfortunately, I am not able to reproduce this. Please keep us posted if the problem persists after the upgrades.
 
It recurred again, but I again could not reproduce the bug while observing. Truly a quantum bug. :p
It very well may be a CephFS problem; I will move the fidx files to another filesystem to rule that out. The strange thing is that the chunk files do not produce such errors (otherwise verifications would fail, I assume).
 
You could also try the ceph-fuse client instead of the in-kernel one; this would help narrow down the issue. See https://docs.ceph.com/en/latest/cephfs/mount-using-fuse/
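
For example, based on the fstab line you posted earlier, a fuse-based mount could look roughly like this (a sketch only; note that ceph-fuse reads the key from a keyring file, e.g. /etc/ceph/ceph.client.backup.keyring, rather than from the secretfile= option used by the kernel mount):
Code:
# kernel client (current fstab entry):
# 10.xx.xx.xx:6789:/ /mnt/backup-cephfs ceph name=backup,relatime,_netdev,secretfile=/etc/ceph/ceph.client.backup.secret 0 2

# ceph-fuse instead, mounted manually:
ceph-fuse -n client.backup -m 10.xx.xx.xx:6789 /mnt/backup-cephfs

# or as an fstab entry:
none /mnt/backup-cephfs fuse.ceph ceph.id=backup,ceph.conf=/etc/ceph/ceph.conf,_netdev,defaults 0 0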
 
I'm having the same issue, also using kernel-based CephFS as a datastore. The issue seems to start on VMs whose volumes I have expanded. It's inconsistent on some VMs, but consistent on a Home Assistant VM I have running: prior to expanding the volume, its backups verified 100% of the time; after expanding the volume, the backups have failed verification 100% of the time. I'll run a test to see whether a fresh install of HA, a backup, then expanding the volume and doing another backup results in the same behavior.
 
Just to double check, could you also share the exact error message you get and share the hexdump for that index file, as shown in https://forum.proxmox.com/threads/backup-suceeds-but-ends-up-failing-verification.149249/post-675825?

Also, please share your kernel version (uname -a) as currently running on the PBS host, and check if the issue persists when mounting the CephFS via the fuse client. Thanks!
 
Please do also check the systemd journal or dmesg output on the PBS host for errors related to CephFS around the time of the backups.
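
For example (just a sketch; adjust the time window to when the backup ran):
Bash:
# kernel ring buffer messages from the ceph/libceph modules, with readable timestamps
dmesg -T | grep -iE 'ceph|libceph'

# or the kernel messages from the systemd journal around the backup window
journalctl -k --since "2 hours ago" | grep -iE 'ceph|libceph'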
 

Switching to ceph-fuse seems to fix the issue!


I tested switching back and forth several times: the kernel-based Ceph mount errors on verification every time, whereas ceph-fuse does not. Verification takes significantly longer using ceph-fuse (from 20 minutes up to 80 minutes), but my cluster is running on low-power, limited-memory hardware. I'll keep using ceph-fuse and see if I run into any other issues.

uname -a
Linux pbs 6.8.8-3-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.8-3 (2024-07-16T16:16Z) x86_64 GNU/Linux

Kernel ceph --------

HexDump

head -c100 /mnt/cephfs/hdd-ceph-fs/proxmox/proxmox-backup-server/vm/200/2024-08-03T00:11:01Z/drive-efidisk0.img.fidx | hexdump

Bash:
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000020 3bc2 6ead 1c0d 2bdf 9463 acc3 2c80 818c
0000030 4a24 bbe8 c862 1b0b 7a93 5bd9 983c e58d
0000040 0000 0000 0000 0000 0000 0000 0000 0000
*
0000060 0000 0000                         
0000064

od --format=u1 --read-bytes=8 /mnt/cephfs/hdd-ceph-fs/proxmox/proxmox-backup-server/vm/200/2024-08-03T00:11:01Z/drive-efidisk0.img.fidx

Bash:
0000000   0   0   0   0   0   0   0   0
0000010

systemd journal

No errors for Ceph, just logs showing the volume being mounted. I did, however, get a lot of PBS "unable to parse worker status" messages.

journalctl -e | grep ceph
Bash:
Aug 02 17:09:42 pbs systemd[1]: Reached target ceph.target - ceph target allowing to start/stop all ceph*@.service instances at once.
Aug 02 17:09:49 pbs systemd[1]: Mounting mnt-cephfs-hdd\x2dceph\x2dfs.mount - /mnt/cephfs/hdd-ceph-fs...
Aug 02 17:09:49 pbs kernel: Key type ceph registered
Aug 02 17:09:49 pbs kernel: libceph: loaded (mon/osd proto 15/24)
Aug 02 17:09:49 pbs kernel: ceph: loaded (mds proto 32)
Aug 02 17:09:49 pbs kernel: libceph: mon1 (1)10.10.10.13:6789 session established
Aug 02 17:09:49 pbs kernel: libceph: client83638182 fsid 3e4a2951-1a66-42a9-8767-2d207d6eb310
Aug 02 17:09:51 pbs systemd[1]: Mounted mnt-cephfs-hdd\x2dceph\x2dfs.mount - /mnt/cephfs/hdd-ceph-fs.
Aug 02 17:11:14 pbs proxmox-backup-proxy[587]: add blob "/mnt/cephfs/hdd-ceph-fs/proxmox/proxmox-backup-server/vm/200/2024-08-03T00:11:01Z/qemu-server.conf.blob" (472 bytes, comp: 472)
Aug 02 17:15:01 pbs proxmox-backup-proxy[587]: add blob "/mnt/cephfs/hdd-ceph-fs/proxmox/proxmox-backup-server/vm/200/2024-08-03T00:11:01Z/index.json.blob" (379 bytes, comp: 379)

Bash:
Aug 02 17:33:41 pbs proxmox-backup-proxy[587]: unable to parse worker status 'UPID:pUPID:pbs:00000268:00000DD5:00000050:6690D470:logrotate::root@pam: 6690D471 OK' - unable to pars>
Aug 02 17:33:56 pbs proxmox-backup-proxy[587]: unable to parse worker status 'UPID:pUPID:pbs:00000268:00000DD5:00000050:6690D470:logrotate::root@pam: 6690D471 OK' - unable to pars>
Aug 02 17:34:11 pbs proxmox-backup-proxy[587]: unable to parse worker status 'UPID:pUPID:pbs:00000268:00000DD5:00000050:6690D470:logrotate::root@pam: 6690D471 OK' - unable to pars>
Aug 02 17:34:26 pbs proxmox-backup-proxy[587]: unable to parse worker status 'UPID:pUPID:pbs:00000268:00000DD5:00000050:6690D470:logrotate::root@pam: 6690D471 OK' - unable to pars>
Aug 02 17:34:41 pbs proxmox-backup-proxy[587]: unable to parse worker status 'UPID:pUPID:pbs:00000268:00000DD5:00000050:6690D470:logrotate::root@pam: 6690D471 OK' - unable to pars>
Aug 02 17:34:57 pbs proxmox-backup-proxy[587]: unable to parse worker status 'UPID:pUPID:pbs:00000268:00000DD5:00000050:6690D470:logrotate::root@pam: 6690D471 OK' - unable to pars>
Aug 02 17:35:12 pbs proxmox-backup-proxy[587]: unable to parse worker status 'UPID:pUPID:pbs:00000268:00000DD5:00000050:6690D470:logrotate::root@pam: 6690D471 OK' - unable to pars>
Aug 02 17:35:27 pbs proxmox-backup-proxy[587]: unable to parse worker status 'UPID:pUPID:pbs:00000268:00000DD5:00000050:6690D470:logrotate::root@pam: 6690D471 OK' - unable to pars>
Aug 02 17:35:42 pbs proxmox-backup-proxy[587]: unable to parse worker status 'UPID:pUPID:pbs:00000268:00000DD5:00000050:6690D470:logrotate::root@pam: 6690D471 OK' - unable to pars>
Aug 02 17:35:57 pbs proxmox-backup-proxy[587]: unable to parse worker status 'UPID:pUPID:pbs:00000268:00000DD5:00000050:6690D470:logrotate::root@pam: 6690D471 OK' - unable to pars>
Aug 02 17:36:18 pbs proxmox-backup-proxy[587]: unable to parse worker status 'UPID:pUPID:pbs:00000268:00000DD5:00000050:6690D470:logrotate::root@pam: 6690D471 OK' - unable to pars>
Aug 02 17:36:38 pbs login[941]: pam_unix(login:session): session opened for user root(uid=0) by root(uid=0)

Backup Job Proxmox

Bash:
INFO: starting new backup job: vzdump 200 --notes-template '{{guestname}}' --mode snapshot --storage proxmox-backup-server --node proxmox01 --notification-mode auto --remove 0
INFO: Starting Backup of VM 200 (qemu)
INFO: Backup started at 2024-08-02 17:11:01
INFO: status = running
INFO: VM Name: HomeAssistant
INFO: include disk 'scsi0' 'ceph-replicated:vm-200-disk-1' 40G
INFO: include disk 'efidisk0' 'ceph-replicated:vm-200-disk-2' 1M
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: skip unused drive 'ceph-replicated:vm-200-disk-0' (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/200/2024-08-03T00:11:01Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 200 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: started backup task '95a4b67b-d73a-493e-b82a-c08422bd2d4a'
INFO: resuming VM again
INFO: efidisk0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: scsi0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: 100% (40.0 GiB of 40.0 GiB) in 48s, read: 64.8 MiB/s, write: 64.8 MiB/s
INFO: backup is sparse: 12.39 GiB (30%) total zero data
INFO: backup was done incrementally, reused 12.68 GiB (31%)
INFO: transferred 40.00 GiB in 48 seconds (853.3 MiB/s)
INFO: adding notes to backup
INFO: Finished Backup of VM 200 (00:04:01)
INFO: Backup finished at 2024-08-02 17:15:02
INFO: Backup job finished successfully
INFO: notified via target `mail-to-root`
TASK OK

Backup Job PBS

Bash:
2024-08-02T17:11:14-07:00: starting new backup on datastore 'backups' from ::ffff:10.10.10.12: "vm/200/2024-08-03T00:11:01Z"
2024-08-02T17:11:14-07:00: GET /previous: 400 Bad Request: no valid previous backup
2024-08-02T17:11:14-07:00: created new fixed index 1 ("vm/200/2024-08-03T00:11:01Z/drive-efidisk0.img.fidx")
2024-08-02T17:11:14-07:00: created new fixed index 2 ("vm/200/2024-08-03T00:11:01Z/drive-scsi0.img.fidx")
2024-08-02T17:11:14-07:00: add blob "/mnt/cephfs/hdd-ceph-fs/proxmox/proxmox-backup-server/vm/200/2024-08-03T00:11:01Z/qemu-server.conf.blob" (472 bytes, comp: 472)
2024-08-02T17:15:01-07:00: Upload statistics for 'drive-scsi0.img.fidx'
2024-08-02T17:15:01-07:00: UUID: d0fbd2ac375a47329f24f469df9d061a
2024-08-02T17:15:01-07:00: Checksum: 22c66c0c914bfbf992b355db884919652648955692d847e635d81cf5bfa8a41c
2024-08-02T17:15:01-07:00: Size: 42949672960
2024-08-02T17:15:01-07:00: Chunk count: 10240
2024-08-02T17:15:01-07:00: Upload size: 29343350784 (68%)
2024-08-02T17:15:01-07:00: Duplicates: 3244+6702 (97%)
2024-08-02T17:15:01-07:00: Compression: 39%
2024-08-02T17:15:01-07:00: successfully closed fixed index 2
2024-08-02T17:15:01-07:00: Upload statistics for 'drive-efidisk0.img.fidx'
2024-08-02T17:15:01-07:00: UUID: 880cfa6ccc6f4826af60531cf443ef45
2024-08-02T17:15:01-07:00: Checksum: c23bad6e0d1cdf2b6394c3ac802c8c81244ae8bb62c80b1b937ad95b3c988de5
2024-08-02T17:15:01-07:00: Size: 540672
2024-08-02T17:15:01-07:00: Chunk count: 1
2024-08-02T17:15:01-07:00: Upload size: 4734976 (875%)
2024-08-02T17:15:01-07:00: Duplicates: 0+2 (200%)
2024-08-02T17:15:01-07:00: Compression: 0%
2024-08-02T17:15:01-07:00: successfully closed fixed index 1
2024-08-02T17:15:01-07:00: add blob "/mnt/cephfs/hdd-ceph-fs/proxmox/proxmox-backup-server/vm/200/2024-08-03T00:11:01Z/index.json.blob" (379 bytes, comp: 379)
2024-08-02T17:15:01-07:00: successfully finished backup
2024-08-02T17:15:01-07:00: backup finished successfully
2024-08-02T17:15:01-07:00: TASK OK

Verify

Bash:
2024-08-02T17:22:45-07:00: verify backups:vm/200/2024-08-03T00:11:01Z
2024-08-02T17:22:45-07:00:   check qemu-server.conf.blob
2024-08-02T17:22:45-07:00:   check drive-scsi0.img.fidx
2024-08-02T17:34:57-07:00:   verified 11082.30/28312.00 MiB in 731.73 seconds, speed 15.15/38.69 MiB/s (0 errors)
2024-08-02T17:34:57-07:00:   check drive-efidisk0.img.fidx
2024-08-02T17:34:57-07:00: verify backups:vm/200/2024-08-03T00:11:01Z/drive-efidisk0.img.fidx failed: Unable to open fixed index "/mnt/cephfs/hdd-ceph-fs/proxmox/proxmox-backup-server/vm/200/2024-08-03T00:11:01Z/drive-efidisk0.img.fidx" - got unknown magic number
2024-08-02T17:34:57-07:00: Failed to verify the following snapshots/groups:
2024-08-02T17:34:57-07:00:     vm/200/2024-08-03T00:11:01Z
2024-08-02T17:34:57-07:00: TASK ERROR: verification failed - please check the log for details


Using ceph-fuse ------

HexDump

head -c100 /mnt/cephfs/hdd-ceph-fs/proxmox/proxmox-backup-server/vm/200/2024-08-03T00\:56\:18Z/drive-efidisk0.img.fidx | hexdump
Code:
0000000 7f2f ed41 fd91 cd0f e845 cc5f 6ea7 b44a
0000010 9c82 5c0c afb6 0750 8034 66ad 0000 0000
0000020 3bc2 6ead 1c0d 2bdf 9463 acc3 2c80 818c
0000030 4a24 bbe8 c862 1b0b 7a93 5bd9 983c e58d
0000040 4000 0008 0000 0000 0000 0040 0000 0000
0000050 0000 0000 0000 0000 0000 0000 0000 0000
0000060 0000 0000                         
0000064

od --format=u1 --read-bytes=8 /mnt/cephfs/hdd-ceph-fs/proxmox/proxmox-backup-server/vm/200/2024-08-03T00\:56\:18Z/drive-efidisk0.img.fidx
Code:
0000000  47 127  65 237 145 253  15 205
0000010


journalctl -e | grep ceph

Code:
Aug 02 17:53:03 pbs systemd[1]: Reached target ceph-fuse.target - ceph target allowing to start/stop all ceph-fuse@.service instances at once.
Aug 02 17:53:03 pbs systemd[1]: Reached target ceph.target - ceph target allowing to start/stop all ceph*@.service instances at once.
Aug 02 17:53:09 pbs systemd[1]: Mounting mnt-cephfs-hdd\x2dceph\x2dfs.mount - /mnt/cephfs/hdd-ceph-fs...
Aug 02 17:53:10 pbs mount[590]: ceph-fuse[590]: starting ceph client
Aug 02 17:53:10 pbs proxmox-backup-proxy[555]: read fs info on "/mnt/cephfs/hdd-ceph-fs/proxmox/proxmox-backup-server" failed - ENOENT: No such file or directory
Aug 02 17:53:10 pbs mount[590]: ceph-fuse[590]: starting fuse
Aug 02 17:53:10 pbs systemd[1]: Mounted mnt-cephfs-hdd\x2dceph\x2dfs.mount - /mnt/cephfs/hdd-ceph-fs.

Bash:
Aug 02 17:59:58 pbs proxmox-backup-proxy[555]: unable to parse worker status 'UPID:pUPID:pbs:00000268:00000DD5:00000050:6690D470:logrotate::root@pam: 6690D471 OK' - unable to parse UPID 'UPID:pUPID:pbs:00000268:00000DD5:00000050:6690D470:logrotate::root@pam:'

Backup Job Proxmox


Bash:
INFO: starting new backup job: vzdump 200 --remove 0 --notification-mode auto --node proxmox01 --notes-template '{{guestname}}' --mode snapshot --storage proxmox-backup-server
INFO: Starting Backup of VM 200 (qemu)
INFO: Backup started at 2024-08-02 17:56:18
INFO: status = running
INFO: VM Name: HomeAssistant
INFO: include disk 'scsi0' 'ceph-replicated:vm-200-disk-1' 40G
INFO: include disk 'efidisk0' 'ceph-replicated:vm-200-disk-2' 1M
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: skip unused drive 'ceph-replicated:vm-200-disk-0' (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/200/2024-08-03T00:56:18Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '1e507365-31bc-4863-aa30-8f70b0bbd466'
INFO: resuming VM again
INFO: efidisk0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: scsi0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: 100% (40.0 GiB of 40.0 GiB) in 6m 15s, read: 71.3 MiB/s, write: 71.3 MiB/s
INFO: backup is sparse: 12.39 GiB (30%) total zero data
INFO: backup was done incrementally, reused 12.66 GiB (31%)
INFO: transferred 40.00 GiB in 376 seconds (108.9 MiB/s)
INFO: adding notes to backup
INFO: Finished Backup of VM 200 (00:06:25)
INFO: Backup finished at 2024-08-02 18:02:43
INFO: Backup job finished successfully
INFO: notified via target `mail-to-root`
TASK OK

Backup Job PBS

Code:
2024-08-02T17:56:20-07:00: starting new backup on datastore 'backups' from ::ffff:10.10.10.12: "vm/200/2024-08-03T00:56:18Z"
2024-08-02T17:56:20-07:00: GET /previous: 400 Bad Request: no valid previous backup
2024-08-02T17:56:20-07:00: created new fixed index 1 ("vm/200/2024-08-03T00:56:18Z/drive-efidisk0.img.fidx")
2024-08-02T17:56:20-07:00: created new fixed index 2 ("vm/200/2024-08-03T00:56:18Z/drive-scsi0.img.fidx")
2024-08-02T17:56:20-07:00: add blob "/mnt/cephfs/hdd-ceph-fs/proxmox/proxmox-backup-server/vm/200/2024-08-03T00:56:18Z/qemu-server.conf.blob" (472 bytes, comp: 472)
2024-08-02T18:02:42-07:00: Upload statistics for 'drive-scsi0.img.fidx'
2024-08-02T18:02:42-07:00: UUID: e357b7ffd0174a7b977193e61007214b
2024-08-02T18:02:42-07:00: Checksum: ab1261730c4f822a10d7ef147c88fa54e135b71147681f58a7bd1cc6b8bcc938
2024-08-02T18:02:42-07:00: Size: 42949672960
2024-08-02T18:02:42-07:00: Chunk count: 10240
2024-08-02T18:02:42-07:00: Upload size: 29355933696 (68%)
2024-08-02T18:02:42-07:00: Duplicates: 3241+6528 (95%)
2024-08-02T18:02:42-07:00: Compression: 39%
2024-08-02T18:02:42-07:00: successfully closed fixed index 2
2024-08-02T18:02:42-07:00: Upload statistics for 'drive-efidisk0.img.fidx'
2024-08-02T18:02:42-07:00: UUID: 45e85fcca76e4ab4829c0c5cb6af5007
2024-08-02T18:02:42-07:00: Checksum: c23bad6e0d1cdf2b6394c3ac802c8c81244ae8bb62c80b1b937ad95b3c988de5
2024-08-02T18:02:42-07:00: Size: 540672
2024-08-02T18:02:42-07:00: Chunk count: 1
2024-08-02T18:02:42-07:00: Upload size: 4734976 (875%)
2024-08-02T18:02:42-07:00: Duplicates: 0+2 (200%)
2024-08-02T18:02:42-07:00: Compression: 0%
2024-08-02T18:02:42-07:00: successfully closed fixed index 1
2024-08-02T18:02:42-07:00: add blob "/mnt/cephfs/hdd-ceph-fs/proxmox/proxmox-backup-server/vm/200/2024-08-03T00:56:18Z/index.json.blob" (380 bytes, comp: 380)
2024-08-02T18:02:42-07:00: successfully finished backup
2024-08-02T18:02:42-07:00: backup finished successfully
2024-08-02T18:02:42-07:00: TASK OK

Verify

Code:
2024-08-02T18:05:38-07:00: verify backups:vm/200/2024-08-03T00:56:18Z
2024-08-02T18:05:38-07:00:   check qemu-server.conf.blob
2024-08-02T18:05:38-07:00:   check drive-scsi0.img.fidx
2024-08-02T19:38:54-07:00:   verified 11093.97/28300.00 MiB in 5595.32 seconds, speed 1.98/5.06 MiB/s (0 errors)
2024-08-02T19:38:54-07:00:   check drive-efidisk0.img.fidx
2024-08-02T19:38:54-07:00:   verified 0.01/0.52 MiB in 0.07 seconds, speed 0.10/7.87 MiB/s (0 errors)
2024-08-02T19:38:54-07:00: TASK OK
 
Thanks a lot for the feedback, so this now clearly points towards an issue with the in-kernel cephfs client.

If you are willing, please try to debug the issue further by enabling dynamic debugging output. To do this, please use a non-production PBS instance with enough free space, as otherwise you might run into issues because the root disk of the instance fills up. Nothing else should interact with the CephFS client, so that you get a clean debug trace. Also, if you can use a VM with small disks, and therefore a reduced backup runtime, that would help limit the amount of output.

First, reproduce the issue there; then you can clear the snapshots and enable the dynamic debugging output for ceph and libceph via a small shell script like this:
Bash:
#!/bin/sh -x

p() {
 echo "$*" > /sys/kernel/debug/dynamic_debug/control
}

# raise the console log level via sysrq so the debug messages reach the kernel log
echo 9 > /proc/sysrq-trigger
# enable dynamic debug output for the ceph and libceph kernel modules
p 'module ceph +p'
p 'module libceph +p'

This is based on the script as found here [0, 1].

Then start the backup run. Once it has finished, please immediately disable the dynamic debugging output again via another script:
Bash:
#!/bin/sh -x

p() {
 echo "$*" > /sys/kernel/debug/dynamic_debug/control
}

# lower the console log level again via sysrq
echo 0 > /proc/sysrq-trigger
# disable dynamic debug output for the ceph and libceph kernel modules
p 'module ceph -p'
p 'module libceph -p'

Finally, please collect the output of the systemd journal for the full backup run by dumping it into a file via
Bash:
journalctl -S today -k > journalctl.dump
and send me the dump, ideally as a private message.

I do hope that will help to further narrow down where the issue is.

[0] https://docs.ceph.com/en/latest/cephfs/troubleshooting/#dynamic-debugging
[1] https://github.com/ceph/ceph/blob/main/src/script/kcon_all.sh

 
I switched back to kernel Ceph this morning and the error is no longer happening. Between the issue happening and now, a scheduled garbage collection ran on PBS (the backups with verification issues had previously prevented the garbage collection from completing). I'll do some tests and see if I can get it to start failing verification again, to get a debug log for you.
 
Thanks, the garbage collection should however be unrelated.

Let me give some more insight into the current hypothesis: the suspicion is that there is a CephFS bug related to caching, which unfortunately also makes this harder to reproduce and debug. This is also why the backup is successful: the checksums and sizes written to disk, as seen by the Proxmox backup client, match the expected ones, while the underlying filesystem seems not to persist the initial write of the header, only writing the final bytes when the index checksum is updated. This is why the full 4k page is zeroed except for the checksum that is written when finishing and closing the fixed index file. Also, so far this seems to happen only with VMs that have more than one disk attached, which indicates that only a certain sequence of operations and timings leads to this.
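
To illustrate (my reading of the fixed index header layout, inferred from the hexdumps above: 8-byte magic, 16-byte UUID, 8-byte ctime, 32-byte index checksum at offset 32, then size and chunk size at offsets 64 and 72; take the exact offsets as an assumption), the individual header fields can be inspected like this:
Bash:
FIDX=/path/to/drive-efidisk0.img.fidx                       # example path
od -An --format=u1 --read-bytes=8 "$FIDX"                   # magic, expected: 47 127 65 237 145 253 15 205
od -An --format=x1 --skip-bytes=32 --read-bytes=32 "$FIDX"  # index checksum (the only non-zero part of the bad headers)
od -An --format=u8 --skip-bytes=64 --read-bytes=16 "$FIDX"  # size and chunk size, in bytes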

So comparing the dynamic debug output of a bogus run with that of a good run should help to see where in the source code things go wrong.
 