Ceph - 'bad crc/signature' and 'socket closed'

Could anyone help me track down possible causes of the following events, which occur on PVE 5.1 with Ceph Luminous 12.2.2?

We are running 6 nodes, each with 4 HDD OSDs and their journals on SSD (2:1 ratio), i.e. 2 x SSDs per node (Proxmox OS on software RAID1 SSD partitions, with 4 partitions on the SSDs used as journals for the spinners).
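
In case it helps with context, the journal layout can be confirmed per node; this is just a sketch, assuming the OSDs were provisioned with ceph-disk as filestore OSDs (the usual layout when journals sit on separate SSD partitions under Luminous):
Code:
# list data partitions and the journal partition each OSD points at (ceph-disk OSDs)
ceph-disk list
# the journal is a symlink inside each OSD's data directory
ls -l /var/lib/ceph/osd/ceph-*/journal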


One node (kvm5d) will report the following:
Code:
/var/log/messages:
Dec 14 10:38:29 kvm5d kernel: [1024830.709128] libceph: osd22 10.254.1.7:6804 socket closed (con state OPEN)


Whilst the node servicing that OSD (kvm5f) reports the following:
Code:
==> /var/log/ceph/ceph-osd.22.log <==
2017-12-14 10:38:29.917858 7fee5b400700  0 bad crc in data 1880483579 != exp 1064005293



I don't believe this relates to networking, as I also see events that appear to be purely local (client and OSD on the same node):
Code:
==> /var/log/ceph/ceph-osd.12.log <==
2017-12-14 10:52:01.648167 7f9cfed00700  0 bad crc in data 649733771 != exp 7543965
--
==> /var/log/messages <==
Dec 14 10:52:01 kvm5d kernel: [1025642.441870] libceph: osd12 10.254.1.5:6801 socket closed (con state OPEN)
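
A quick way to confirm that osd.12 is indeed hosted on the same node that logs the libceph message (kvm5d above) would be something along these lines:
Code:
# report the host and address currently serving osd.12
ceph osd find 12
# compare against the node that logged the kernel message
hostname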



PS: I ran the following on our nodes concurrently to catch events:
Code:
tail -f /var/log/messages /var/log/ceph/ceph-osd.*.log | grep -B 1 --color 'crc\|socket closed'
 
All available updates are installed on all 6 nodes; two have been rebooted and are running kernel 4.13.8-3-pve, whereas the other 4 are still running 4.13.8-2-pve.

Code:
[admin@kvm5e ~]# pveversion -v
proxmox-ve: 5.1-30 (running kernel: 4.13.8-2-pve)
pve-manager: 5.1-38 (running version: 5.1-38/1e9bc777)
pve-kernel-4.13.8-2-pve: 4.13.8-28
pve-kernel-4.13.8-3-pve: 4.13.8-30
libpve-http-server-perl: 2.0-7
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-22
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-3
pve-container: 2.0-17
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.2-pve1
 
These errors occur roughly every 2 minutes and appear to be randomly distributed across nodes and OSDs.
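
Something like the following can put a rough number on that, using the same log paths as above (just a sketch, run on each node):
Code:
# total 'bad crc' events per OSD log on this node
grep -c 'bad crc' /var/log/ceph/ceph-osd.*.log
# kernel-side 'socket closed' events bucketed per minute
grep 'socket closed' /var/log/messages | awk '{print $1, $2, substr($3, 1, 5)}' | uniq -c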

Herewith Ceph health status:
Code:
[root@kvm5b ~]# ceph -s
  cluster:
    id:     a3f1c21f-f883-48e0-9bd2-4f869c72b17d
    health: HEALTH_WARN
            noout flag(s) set

  services:
    mon: 3 daemons, quorum 1,2,3
    mgr: kvm5c(active), standbys: kvm5b, kvm5d
    mds: cephfs-1/1/1 up  {0=kvm5c=up:active}, 2 up:standby
    osd: 24 osds: 24 up, 24 in
         flags noout

  data:
    pools:   3 pools, 592 pgs
    objects: 1366k objects, 5272 GB
    usage:   15787 GB used, 28903 GB / 44690 GB avail
    pgs:     592 active+clean

  io:
    client:   1189 kB/s rd, 15487 kB/s wr, 491 op/s rd, 1773 op/s wr
 
Are you connecting to a pool with krbd? It could come from the differing kernel versions; try to bring all of your machines to the same kernel version.
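
A minimal way to check both points, assuming the default Proxmox storage configuration path and shell access on every node:
Code:
# show storage definitions that have the krbd option set
grep -B 5 krbd /etc/pve/storage.cfg
# confirm the running kernel on each node
uname -r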
 
Hi David,

I know this thread is old, but are you still experiencing this behaviour on your cluster? We are seeing very similar behaviour on our setup.

The KVM hosts are running the latest kernels, and the Ceph cluster is running Ceph Jewel on CentOS 7.
 
