ceph 19.2.1 - syslog shows "bad crc/signature" when KRBD is activated in /etc/pve/storage.cfg

S.H.

Hey, we've run into some unexpected behavior with our Ceph cluster running Squid 19.2.1 with BlueStore.

We have two Ceph storage pools running in a stretch cluster configuration across two defined datacenters, "datacenter1" and "datacenter2", with 3 nodes per datacenter. The cluster network runs over a 25G SFP+ uplink bond in LACP mode on its own VLAN. At first there were no issues at all.

Then we pulled a disk to simulate a disk failure, put it back in, and the cluster kept running fine. The next morning a colleague wanted to RDP into one of the freshly installed Windows Server 2025 VMs we had set up the day before.

After some checking in the Proxmox web GUI we saw that the guest was unresponsive and found multiple errors:
pve-ha-lrm[1314815]: can't map rbd volume vm-4022-disk-0: rbd: sysfs write failed
serverB2 pve-ha-lrm[1314634]: <root@pam> end task UPID:serverB2:00140F4B:028CE634:686E35F9:qmstart:4022:root@pam: can't map rbd volume vm-4022-disk-0: rbd: sysfs write failed
It was not only this VM; other guests running on this particular host showed I/O errors as well.
The physical disk itself has no issues or problems. I checked it with nvme-cli, and iDRAC also says everything is in good condition.
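For reference, the drive check was done with nvme-cli along these lines (the device path is just a placeholder, not necessarily the real device behind the affected OSD):

Code:
# list NVMe devices and their basic state
nvme list
# SMART/health data for the suspected drive (example device path)
nvme smart-log /dev/nvme0n1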

We spent multiple hours troubleshooting: restarting the host, migrating VMs, removing this particular OSD and adding it back again. We had different weights in our bucket, so we reweighted this particular OSD, and then the entire bucket, back to its original weight (roughly as sketched below). Everything looked fine and Ceph started to rebalance/recover.
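The reweighting itself was nothing special; the OSD id and weight here are just example values, not our exact numbers:

Code:
# set the CRUSH weight of the affected OSD back to its original value (example values)
ceph osd crush reweight osd.13 3.49269
# verify the resulting weights
ceph osd tree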

We migrated the Windows Server 2025 VM back to serverB2 and continued with our tasks. 5 to 10 minutes later (I honestly don't remember exactly) the RDP connection was lost and the same symptoms showed up again. Syslog started to be spammed continuously with the following error messages:

serverX kernel: [428232.120381] libceph: osdX (1)x.x.x.x:6809 bad crc/signature
serverX kernel: [428232.120376] libceph: read_partial_message 00000000da4b413f data crc 3653550545 != exp. 3231240447
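In case it helps, the messages can be correlated with a specific OSD and host like this (the OSD id is just a placeholder taken from a log line like the one above):

Code:
# count how often the kernel client reports the checksum mismatch
journalctl -k | grep -c 'bad crc/signature'
# map the reported OSD id to its host and CRUSH location
ceph osd find 13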

Then we disabled KRBD in storage.cfg via the web GUI for this particular pool (since we have two), just to check whether the error persists. We left it like that overnight and checked in today: no issues at all.
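For clarity, the relevant part of /etc/pve/storage.cfg now looks roughly like this; the storage and pool names are placeholders, and only the second entry still uses the kernel client:

Code:
rbd: ceph-pool1
        content images
        krbd 0
        pool pool1

rbd: ceph-pool2
        content images
        krbd 1
        pool pool2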

The second pool has some features set that we need, so KRBD stays enabled there.

We would also like to use those features on the first pool, where KRBD is now disabled.
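If it matters, this is how we would check which features are enabled on the images; pool and image names are placeholders, and the feature names in the comment are just the usual candidates the kernel client can be picky about:

Code:
# show the features enabled on a given RBD image
rbd info pool1/vm-4022-disk-0
# if the kernel client refuses to map an image, features can be disabled per image, e.g.
# rbd feature disable pool1/vm-4022-disk-0 object-map fast-diff deep-flatten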

Is there a way to check whether this is a bug, or just a Ceph error because the cluster was in a state it should not have been in?
We've checked all Ceph logs on that host "serverB2" - besides OSD.13, which was the one that led to all this(?), nothing stood out.
The Ceph cluster is healthy now and was healthy yesterday too.
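That is based on the usual status commands; forcing a deep scrub of the suspicious OSD would be one more way to double-check the data sitting on it (generic commands, OSD id as above):

Code:
ceph -s
ceph health detail
# ask the suspicious OSD to deep-scrub the PGs it holds
ceph osd deep-scrub osd.13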


All 6 nodes plus the quorum-only node (7 in total) use the same enterprise repository and are configured identically.

Code:
proxmox-ve: 8.4.0 (running kernel: 6.8.12-11-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8.12-11-pve-signed: 6.8.12-11
proxmox-kernel-6.8: 6.8.12-11
proxmox-kernel-6.8.12-10-pve-signed: 6.8.12-10
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph: 19.2.1-pve3
ceph-fuse: 19.2.1-pve3
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.1-1
proxmox-backup-file-restore: 3.4.1-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.11
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: 4.2025.02-3
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.1
pve-firmware: 3.15-4
pve-ha-manager: 4.0.7
pve-i18n: 3.4.4
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2

Maybe I'm missing something.

Thanks a lot! If there are any questions or anything is unclear from my writing, please ask!
 