Ceph - 'bad crc/signature' and 'socket closed'

Could anyone help me track down possible causes of the following events, which occur on PVE 5.1 with Ceph Luminous 12.2.2?

We are running 6 nodes, each with 4 HDD OSDs and their journals on SSD (2:1 ratio), i.e. 2 x SSDs per node (Proxmox OS on software RAID1 SSD partitions, with 4 partitions on the SSDs used as journals for the spinners).
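
In case it helps with context, the journal layout can be confirmed per node; this is just a sketch, assuming the OSDs were provisioned with ceph-disk as filestore OSDs (the usual layout when journals sit on separate SSD partitions under Luminous):
Code:
# list data partitions and the journal partition each OSD points at (ceph-disk OSDs)
ceph-disk list
# the journal is a symlink inside each OSD's data directory
ls -l /var/lib/ceph/osd/ceph-*/journal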


One node (kvm5d) will report the following:
Code:
/var/log/messages:
Dec 14 10:38:29 kvm5d kernel: [1024830.709128] libceph: osd22 10.254.1.7:6804 socket closed (con state OPEN)


Whilst the node servicing that OSD (kvm5f) reports the following:
Code:
==> /var/log/ceph/ceph-osd.22.log <==
2017-12-14 10:38:29.917858 7fee5b400700  0 bad crc in data 1880483579 != exp 1064005293



I don't believe this relates to networking, as I also see events that appear to be purely local (client and OSD on the same node):
Code:
==> /var/log/ceph/ceph-osd.12.log <==
2017-12-14 10:52:01.648167 7f9cfed00700  0 bad crc in data 649733771 != exp 7543965
--
==> /var/log/messages <==
Dec 14 10:52:01 kvm5d kernel: [1025642.441870] libceph: osd12 10.254.1.5:6801 socket closed (con state OPEN)
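
A quick way to confirm that osd.12 is indeed hosted on the same node that logs the libceph message (kvm5d above) would be something along these lines:
Code:
# report the host and address currently serving osd.12
ceph osd find 12
# compare against the node that logged the kernel message
hostname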



PS: I ran the following on our nodes concurrently to catch events:
Code:
tail -f /var/log/messages /var/log/ceph/ceph-osd.*.log | grep -B 1 --color 'crc\|socket closed'
 
All available updates are installed on all 6 nodes; two have been rebooted and are running kernel 4.13.8-3-pve, whereas the other 4 are still running 4.13.8-2-pve.

Code:
[admin@kvm5e ~]# pveversion -v
proxmox-ve: 5.1-30 (running kernel: 4.13.8-2-pve)
pve-manager: 5.1-38 (running version: 5.1-38/1e9bc777)
pve-kernel-4.13.8-2-pve: 4.13.8-28
pve-kernel-4.13.8-3-pve: 4.13.8-30
libpve-http-server-perl: 2.0-7
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-22
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-3
pve-container: 2.0-17
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.2-pve1
 
These errors occur roughly every 2 minutes and appear to be randomly distributed across nodes and OSDs.
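
Something like the following can put a rough number on that, using the same log paths as above (just a sketch, run on each node):
Code:
# total 'bad crc' events per OSD log on this node
grep -c 'bad crc' /var/log/ceph/ceph-osd.*.log
# kernel-side 'socket closed' events bucketed per minute
grep 'socket closed' /var/log/messages | awk '{print $1, $2, substr($3, 1, 5)}' | uniq -c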

Herewith Ceph health status:
Code:
[root@kvm5b ~]# ceph -s
  cluster:
    id:     a3f1c21f-f883-48e0-9bd2-4f869c72b17d
    health: HEALTH_WARN
            noout flag(s) set

  services:
    mon: 3 daemons, quorum 1,2,3
    mgr: kvm5c(active), standbys: kvm5b, kvm5d
    mds: cephfs-1/1/1 up  {0=kvm5c=up:active}, 2 up:standby
    osd: 24 osds: 24 up, 24 in
         flags noout

  data:
    pools:   3 pools, 592 pgs
    objects: 1366k objects, 5272 GB
    usage:   15787 GB used, 28903 GB / 44690 GB avail
    pgs:     592 active+clean

  io:
    client:   1189 kB/s rd, 15487 kB/s wr, 491 op/s rd, 1773 op/s wr
 
Are you connecting to a pool with krbd? It could come from the differing kernel versions; try to bring all of your machines to the same kernel version.
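
A minimal way to check both points, assuming the default Proxmox storage configuration path and shell access on every node:
Code:
# show storage definitions that have the krbd option set
grep -B 5 krbd /etc/pve/storage.cfg
# confirm the running kernel on each node
uname -r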
 
Hi David,

I know this thread is old, but are you still experiencing this behaviour on your cluster? We are seeing very similar behaviour on our setup.

The KVM hosts are running the latest kernels, and the Ceph cluster is running Ceph Jewel on CentOS 7.
 
