ISOs uploaded to PVE via GUI became corrupted on cephfs. . .

We uploaded two ISOs on June 11 via the Proxmox GUI and used the checksum-matching feature while doing so.

Code:
root@pve1:/var/log/pve/tasks# pvenode task list --type imgcopy
┌──────────────────────────────────────────────────────────────┬─────────┬────┬───────────────┬────────────┬────────────┬────────┐
│ UPID                                                         │ Type    │ ID │ User          │  Starttime │    Endtime │ Status │
╞══════════════════════════════════════════════════════════════╪═════════╪════╪═══════════════╪════════════╪════════════╪════════╡
│ UPID:pve1:0012779F:277C022A:66686D36:imgcopy::redacted@pve:  │ imgcopy │    │ redacted@pve  │ 1718119734 │ 1718119750 │ OK     │
├──────────────────────────────────────────────────────────────┼─────────┼────┼───────────────┼────────────┼────────────┼────────┤
│ UPID:pve1:0012CDF8:277DA7EC:6668716D:imgcopy::redacted@pve:  │ imgcopy │    │ redacted@pve  │ 1718120813 │ 1718120824 │ OK     │
└──────────────────────────────────────────────────────────────┴─────────┴────┴───────────────┴────────────┴────────────┴────────┘

Code:
root@pve1:/var/log/pve/tasks# pvenode task log UPID:pve1:0012779F:277C022A:66686D36:imgcopy::redacted@pve:
starting file import from: /var/tmp/pveupload-7199be1779e22d8a81f11dac849eda05
calculating checksum...OK, checksum verified
target node: pve1
target file: /mnt/pve/cephfs/template/iso/Rocky-8.10-x86_64-minimal.iso
file size is: 2694053888
command: cp -- /var/tmp/pveupload-7199be1779e22d8a81f11dac849eda05 /mnt/pve/cephfs/template/iso/Rocky-8.10-x86_64-minimal.iso
finished file import successfully
TASK OK

Code:
root@pve1:/var/log/pve/tasks# pvenode task log UPID:pve1:0012CDF8:277DA7EC:6668716D:imgcopy::redacted@pve:
starting file import from: /var/tmp/pveupload-7c584847348af7323f4b9506856bb773
calculating checksum...OK, checksum verified
target node: pve1
target file: /mnt/pve/cephfs/template/iso/Rocky-9.4-x86_64-minimal.iso
file size is: 1829634048
command: cp -- /var/tmp/pveupload-7c584847348af7323f4b9506856bb773 /mnt/pve/cephfs/template/iso/Rocky-9.4-x86_64-minimal.iso
finished file import successfully
TASK OK
  • The Proxmox GUI uploader reported that the hashes matched at the time of the original upload.
  • We have located the original ISOs that were uploaded; they are still intact and match the checksums published by Rocky Linux (a quick sketch of that check is below), so the originals are good.
Yet, upon first use of the ISOs later (July 15), we noticed that they are not, in fact, intact.
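
For completeness, verifying the originals amounted to something like the following (the directory is just wherever the original downloads live, and CHECKSUM is the file Rocky Linux publishes alongside the ISO):

Code:
cd /path/to/original/downloads                  # example location of the original ISOs
sha256sum Rocky-8.10-x86_64-minimal.iso         # compare by eye against the published value, or:
sha256sum --ignore-missing -c CHECKSUM          # verify only the files present locally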

When comparing the corrupted ISOs to the verified/original ISOs, we see the following files differ:

Rocky-8.10-x86_64-minimal.iso
Code:
media.repo
BaseOS/repodata/5a3a9e9fc6a304fdf3a12a4fc8f37fd4efd76524fcd808a060139147308d7a41-primary.xml.gz
BaseOS/repodata/6e26cc2b8c46d5e2c47fe9892f436e48353c750873082a3b9b07132b09abcb40-other.xml.gz
BaseOS/repodata/71f62d6dadfbf3238ce701da43cb69958ce4c546cc370f92e70ba933f3193c23-comps-BaseOS.x86_64.xml
BaseOS/repodata/e105891d2832b712e68b45a603e895845e4df1c99d988936f02d3e899f68b5e5-comps-BaseOS.x86_64.xml.xz
BaseOS/repodata/repomd.xml
Minimal/repodata/0a0ee3d6de957f97960893014ede3f247303f7770819f3ecf9ae30beed45675e-comps-Minimal.x86_64.xml.xz
Minimal/repodata/1cb61ea996355add02b1426ed4c1780ea75ce0c04c5d1107c025c3fbd7d8bcae-primary.xml.gz
Minimal/repodata/22305a97eed1bed923f2cfa37086b208bc9ebcc1e4426384efff558576f40edd-other.sqlite.xz
Minimal/repodata/2b13cd3f9d81647fd31aa16de1b16b582efd9566f8c4334e4561a030f3777c37-comps-Minimal.x86_64.xml
Minimal/repodata/3e3eaeee784726c6a95c8b0b4b776eeb0adef3c9f88bc94df600e571dd030e0c-primary.sqlite.xz
Minimal/repodata/8a1d161ad47cce30bb3c704a541481224c9d490f98f9edb3980d1793922df099-filelists.sqlite.xz
Minimal/repodata/95a4415d859d7120efb6b3cf964c07bebbff9a5275ca673e6e74a97bcbfb2a5f-filelists.xml.gz
Minimal/repodata/ef3e20691954c3d1318ec3071a982da339f4ed76967ded668b795c9e070aaab6-other.xml.gz
Minimal/repodata/repomd.xml

Rocky-9.4-x86_64-minimal.iso
Code:
minimal/repodata/bd201f63f99e67d65f859f38ab472022f055238d74c78c6dd407ef57c4f0f90d-primary.sqlite.bz2
minimal/repodata/d250f7f881bb991be3648c021fb305dd6085b902321b26f52033500ebff7cae1-x86_64.xml.gz
minimal/repodata/repomd.xml

In both cases, the XML files listed above had their contents replaced by null bytes (^@^@^@^@^@^@^@) while retaining their original file size, and they end without a final newline ("noeol"). None of the compressed files could be decompressed fully, but zcat-ing them to a text file revealed that some were partially intact before truncating many lines prematurely, again while retaining the original file size. For example, "5a3a9e9fc6a304fdf3a12a4fc8f37fd4efd76524fcd808a060139147308d7a41-primary.xml.gz", when zcat-ed out, ends abruptly at line 120,607, whereas the original file is 137,636 lines. Presumably, null bytes make up the difference (and zcat doesn't output them).
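
For anyone who wants to reproduce the comparison, it boils down to loop-mounting both images read-only and diffing them, roughly like this (the mount points and the location of the original download are examples; the repomd.xml path is one of the differing files listed above):

Code:
mkdir -p /mnt/iso-orig /mnt/iso-ceph
mount -o loop,ro /root/Rocky-8.10-x86_64-minimal.iso /mnt/iso-orig
mount -o loop,ro /mnt/pve/cephfs/template/iso/Rocky-8.10-x86_64-minimal.iso /mnt/iso-ceph
diff -rq /mnt/iso-orig /mnt/iso-ceph           # list every file whose contents differ
# inspect one differing file: same size, but a long run of NUL bytes in the cephfs copy
cmp -l /mnt/iso-orig/BaseOS/repodata/repomd.xml /mnt/iso-ceph/BaseOS/repodata/repomd.xml | head
tr -cd '\000' < /mnt/iso-ceph/BaseOS/repodata/repomd.xml | wc -c   # count NUL bytes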

Test uploads performed since we noticed this have been successful and verified intact. However, we upgraded our Proxmox environment to the latest version on June 18, so we are no longer testing on the same version of Proxmox that performed the original uploads.

Obviously, this gives us concern. The uploads apparently succeeded and the checksums matched. Can we trust that checksum-matching code? Even worse to consider: what could possibly cause these ISOs, sitting in cephfs, to become corrupted spontaneously, in situ?

Are there any known bugs or issues with the uploader, the checksum matching, or Ceph that might account for this? Otherwise, we're quite concerned about what appears to be spontaneous data corruption in a Ceph cluster that reports as healthy and is otherwise giving us no trouble.

Thanks!

--Brian
 
Located another corrupt ISO today. It was uploaded on Feb 17, 2024. I can't say when the corruption occurred, but given the findings above, I see no reason to doubt that it happened after the file was uploaded and its checksum verified, like the other two.

Having found this, I have now sha256sum-ed our entire /mnt/pve/cephfs/template/iso directory and audited all the ISOs in it, and I'll keep an eye on it to make sure nothing else happens. As it stands, no files uploaded before Feb 17, 2024 or after June 11 have become corrupted. So I'm totally at a loss to explain what happened during that window, but files uploaded before and after it appear fine and have remained fine.
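
For what it's worth, the ongoing audit is nothing fancier than a checksum manifest that gets re-checked periodically, along these lines (the manifest location is just an example):

Code:
cd /mnt/pve/cephfs/template/iso
sha256sum *.iso > /root/iso-manifest.sha256       # one-time baseline of everything in the ISO store
sha256sum --quiet -c /root/iso-manifest.sha256    # later runs: prints only files that no longer match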
 
Hi,
this might be unrelated, but we are currently investigating a rather rare and hard-to-reproduce issue with the in-kernel CephFS client; the hypothesis so far is that some pages which were written and cached are not being persisted to disk. So far there have only been reports of failing backup verifications when using Proxmox Backup Server with the datastore located on a CephFS. In that case as well, the initially calculated checksums are fine, but data corruption is detected when the file is accessed later. Please let me refer you to the debugging efforts in this thread [0].

A possible workaround for the time being is to use the CephFS FUSE client rather than the in-kernel client, which can be set in the storage configuration [1].
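
In case it helps others reading along, switching an existing CephFS storage to the FUSE client could look roughly like this in /etc/pve/storage.cfg (the storage ID and content types below are placeholders; see [1] for the authoritative option reference):

Code:
cephfs: cephfs
        path /mnt/pve/cephfs
        content iso,vztmpl,backup
        fuse 1

Equivalently, pvesm set cephfs --fuse 1 should apply the same option from the CLI; either way, the storage has to be re-mounted for the change to take effect.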

[0] https://forum.proxmox.com/threads/backup-suceeds-but-ends-up-failing-verification.149249/
[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#storage_cephfs
 
Hi,
is this a PVE-managed Ceph cluster or an external one? Please share the output of pveversion -v at least from the node you made the uploads with.
 
Hello!

Thank you for the replies!

Yes, this is a PVE-managed Ceph cluster.

pveversion output follows. . .

Code:
root@pve1:~# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.4-3-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-9
pve-kernel-5.13: 7.1-9
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5: 6.5.13-5
proxmox-kernel-6.5.13-1-pve-signed: 6.5.13-1
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
pve-kernel-5.4: 6.4-10
pve-kernel-5.3: 6.1-6
pve-kernel-5.15.131-2-pve: 5.15.131-3
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.4.151-1-pve: 5.4.151-1
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 17.2.7-pve3
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.2
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libqb0: 1.0.5-1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.3-1
proxmox-backup-file-restore: 3.2.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2

That, of course, is the version today. I can't say exactly what version was in place on the days we uploaded the affected ISOs. However, I can provide this timeline so that it might be possible to reconstruct the versions likely in place:
  • Dec 28, 2023 -- upgraded all PVE nodes to pve 8.x from pve 7.x.
  • Feb 17, 2024 -- virtio-win-0.1.240.iso uploaded, later found to be corrupt.
  • March 26, 2024 -- All PVE nodes updated to latest subscription version.
  • June 11, 2024 -- two Rocky ISOs uploaded, later found to be corrupt.
  • June 18, 2024 -- All PVE nodes updated to latest subscription version. < we are here
Again, thank you for the replies! It's nice to know that there's a possible explanation for what we've seen!

--B
 
