Ceph errors in dmesg: "bad crc/signature", "socket closed", "read_partial_message ... data crc". Network issue?

Dec 30, 2024
Dear community members!

I wonder if I need to worry about these Ceph messages in dmesg on all three nodes of the cluster:

[Screenshot: dmesg output with libceph "bad crc/signature", "socket closed" and "read_partial_message ... data crc" messages]

These are all DL385 Gen10 servers with SAS SSDs on an internal SAS controller (P408i) in HBA mode.
Interestingly, pve-s2 only shows "socket error on write" and "socket closed" errors, while s3 and s4 report both "bad crc/signature" and "read_partial_message" errors.
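
In case it is useful, this is roughly how I am collecting these messages on each node (the grep pattern only matches the error strings I have seen so far, adjust as needed):

Code:
# recent libceph messages from the kernel ring buffer, with readable timestamps
dmesg -T | grep -E 'libceph: .*(bad crc/signature|socket closed|socket error|read_partial_message)'
# the same from the journal, including earlier boots
journalctl -k --since "-24h" | grep libceph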

Here is my system information:

pveversion -v

Code:
proxmox-ve: 8.4.0 (running kernel: 6.8.12-11-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8.12-11-pve-signed: 6.8.12-11
proxmox-kernel-6.8: 6.8.12-11
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
ceph: 19.2.1-pve3
ceph-fuse: 19.2.1-pve3
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
frr-pythontools: 10.2.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.1-1
proxmox-backup-file-restore: 3.4.1-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.11
pve-cluster: 8.1.0
pve-container: 5.2.7
pve-docs: 8.4.0
pve-edk2-firmware: 4.2025.02-3
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.2
pve-firmware: 3.15-4
pve-ha-manager: 4.0.7
pve-i18n: 3.4.4
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.14
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2

ceph versions

Code:
{
    "mon": {
        "ceph version 19.2.1 (c783d93f19f71de89042abf6023076899b42259d) squid (stable)": 3
    },
    "mgr": {
        "ceph version 19.2.1 (c783d93f19f71de89042abf6023076899b42259d) squid (stable)": 3
    },
    "osd": {
        "ceph version 19.2.1 (c783d93f19f71de89042abf6023076899b42259d) squid (stable)": 12
    },
    "mds": {
        "ceph version 19.2.1 (c783d93f19f71de89042abf6023076899b42259d) squid (stable)": 3
    },
    "overall": {
        "ceph version 19.2.1 (c783d93f19f71de89042abf6023076899b42259d) squid (stable)": 21
    }
}


Ceph backend networking is done with dual 25GbE (LACP trunk):

pve-s2: both backend links on the same Mellanox ConnectX-5 NIC
pve-s3 & s4: one link on a Mellanox ConnectX-5 NIC, the other on a Broadcom BCM57414 NetXtreme-E 10Gb/25Gb NIC

Code:
auto bond2
iface bond2 inet static
    address 10.1.28.22/24
    bond-slaves ens10f0np0 ens2f0np0
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    mtu 9000
#Ceph #1 25GbE
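
To rule out a flapping LACP bond or a bad link, I intend to check the bond state and the per-slave counters as well (interface names as in the config above; suggestions for better counters to watch are welcome):

Code:
# 802.3ad/LACP state, per-slave status and link failure counts
cat /proc/net/bonding/bond2
# kernel-level RX/TX error and drop counters per slave
ip -s link show ens10f0np0
ip -s link show ens2f0np0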

While writing this post I noticed that one of the key differences is the Broadcom NIC in servers s3 and s4.
It was put there on purpose: we had it available, and mixing NIC vendors was meant to avoid a single driver or hardware failure becoming a single point of failure (SPOF).

Attached is the (rather large) output of "ethtool -S" for each of the 25GbE interfaces.
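
The full output is huge, so for comparing the Mellanox and Broadcom ports I would reduce it to driver/firmware info and the error-related counters roughly like this (interface names taken from my bond config):

Code:
# driver, driver version and NIC firmware per port
ethtool -i ens10f0np0
ethtool -i ens2f0np0
# only the counters that look like errors, drops or CRC problems
ethtool -S ens10f0np0 | grep -iE 'err|drop|discard|crc'
ethtool -S ens2f0np0 | grep -iE 'err|drop|discard|crc'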

I have a strong feeling that the problem is network-related, but I have not yet found the root cause or a solution.

Thanks for your assistance and guidance!

Best regards,
Bernhard
 

Attachments

  • 1752520821560.png (71.7 KB)
  • 1752520845367.png (71.7 KB)
  • ethtool_stats.txt (109.5 KB)

Update: the errors persist with kernel 6.8.12-15-pve:

Code:
[ 1911.603381] libceph: read_partial_message 00000000ad0a6e6a data crc 3672428950 != exp. 1376189682
[ 1911.603416] libceph: read_partial_message 0000000045900922 data crc 1640475380 != exp. 1097260932
[ 1911.603957] libceph: osd8 (1)10.1.28.21:6811 bad crc/signature
[ 1911.604353] libceph: osd4 (1)10.1.28.23:6811 bad crc/signature

It might be related to backups, which cause heavy I/O on Ceph (beyond 1 GiB/s).
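
To see whether heavy Ceph traffic alone is enough to trigger the errors, my plan is to generate similar load outside the backup window and watch dmesg at the same time. A minimal sketch, assuming a throwaway test pool named "bench" (the pool name is just a placeholder):

Code:
# ~60 seconds of 4 MiB writes with 16 concurrent ops against the test pool
rados bench -p bench 60 write -b 4M -t 16 --no-cleanup
rados -p bench cleanup
# in a second shell on each node: follow new libceph messages live
dmesg -wT | grep libceph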

What are suitable steps to find the root cause of that issue?