Could anyone provide assistance for me to track down possible causes of the following events occurring on PVE 5.1 with Ceph Luminous 12.2.2?
We are running 6 nodes, each with 4 HDD OSDs and their journals on SSD (2:1 HDD-to-SSD ratio), i.e. 2 x SSD per node: the Proxmox OS sits on software RAID1 SSD partitions, and 4 further partitions on those SSDs are used as journals for the spinners.
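For reference, the layout on each node can be double-checked with something like the following (assuming the OSDs were created with ceph-disk, which as far as I know is what pveceph used on PVE 5.1; adjust device names to your setup):
Code:
# list block devices and their partitions (the SSD journal partitions show up here)
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
# show which OSD data partition maps to which journal partition
ceph-disk list
# confirm the OSD distribution across the 6 nodes
ceph osd tree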
One node (kvm5d) will report the following:
Code:
/var/log/messages:
Dec 14 10:38:29 kvm5d kernel: [1024830.709128] libceph: osd22 10.254.1.7:6804 socket closed (con state OPEN)
Whilst the node servicing that OSD (kvm5f) reports the following:
Code:
==> /var/log/ceph/ceph-osd.22.log <==
2017-12-14 10:38:29.917858 7fee5b400700 0 bad crc in data 1880483579 != exp 1064005293
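For context, osd22 can be mapped back to its host and address to confirm it really is the OSD on kvm5f at 10.254.1.7 (standard Ceph commands, nothing specific to my setup):
Code:
# reports the OSD's address and CRUSH location (host)
ceph osd find 22
# shows the same address as seen in the libceph message
ceph osd dump | grep '^osd.22 '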
I don't believe this is related to networking, as I also see events where the client and the OSD are on the same node, i.e. purely local:
Code:
==> /var/log/ceph/ceph-osd.12.log <==
2017-12-14 10:52:01.648167 7f9cfed00700 0 bad crc in data 649733771 != exp 7543965
--
==> /var/log/messages <==
Dec 14 10:52:01 kvm5d kernel: [1025642.441870] libceph: osd12 10.254.1.5:6801 socket closed (con state OPEN)
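To confirm that these really are local (10.254.1.5 being kvm5d's own address and osd12 being hosted on kvm5d), something like this can be used:
Code:
# is the address from the libceph message configured on this node?
ip -o addr show | grep '10\.254\.1\.5/'
# which host and address does Ceph report for osd.12?
ceph osd find 12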
PS: I ran the following on all of our nodes concurrently to catch these events:
Code:
tail -f /var/log/messages /var/log/ceph/ceph-osd.*.log | grep -B 1 --color 'crc\|socket closed'
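If it helps, a variant of the same watch that tags each match with the hostname and keeps a per-node copy on disk would look something like this (the output path is arbitrary):
Code:
# same watch, but prefix each matched line with the hostname and keep a copy per node
tail -f /var/log/messages /var/log/ceph/ceph-osd.*.log \
  | grep --line-buffered -B 1 'crc\|socket closed' \
  | sed -u "s/^/$(hostname): /" \
  | tee -a /root/ceph-crc-events.log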