Ceph OSD keeps changing status (down/up)

Addspin

Hi, on one of the Ceph cluster nodes the message "1 osds down" appeared. It keeps coming and going: the OSD's status constantly flaps between down and up. What can be done about it?
The SMART status of the disk shows OK.
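
A minimal sketch of how one might watch the flapping and gather more detail, assuming the affected OSD id and the backing disk device are substituted for the placeholders below:

# Watch cluster health and the up/down state of all OSDs (run on any node with the admin keyring)
watch -n 5 'ceph -s; ceph osd tree'

# Detailed health output names the OSD that is flapping
ceph health detail

# Follow the log of the affected OSD on the host that carries it (11 is a placeholder OSD id)
journalctl -fu ceph-osd@11

# Re-check the SMART attributes of that OSD's data disk (/dev/sdX is a placeholder)
smartctl -a /dev/sdX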

Package versions:
proxmox-ve: 5.1-32 (running kernel: 4.13.13-2-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.13-2-pve: 4.13.13-32
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.5-pve1

Ceph log attached.
 

Hi,

your system is quite outdated. We only support the current Proxmox VE version, so please upgrade to it.
 
That would be nice, but unfortunately there is no opportunity to upgrade quickly. Can you tell me what I should look for?
 
2019-04-17 09:50:38.814992 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336538 : cluster [INF] osd.11 failed (root=default,host=s1300pve02) (connection refused reported by osd.13)
2019-04-17 09:50:39.228431 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336572 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-04-17 09:50:40.264827 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336574 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2019-04-17 09:50:41.667272 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336576 : cluster [WRN] mon.2 100.13.100.3:6789/0 clock skew 0.0599787s > max 0.05s
2019-04-17 09:50:42.294795 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336577 : cluster [WRN] Health check failed: Degraded data redundancy: 40343/2272848 objects degraded (1.775%), 27 pgs degraded (PG_DEGRADED)
2019-04-17 09:50:46.177357 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336578 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 4 pgs peering)
2019-04-17 09:50:48.543115 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336579 : cluster [WRN] Health check update: Degraded data redundancy: 130699/2272848 objects degraded (5.750%), 88 pgs degraded (PG_DEGRADED)
2019-04-17 09:51:10.125635 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336582 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-04-17 09:51:10.169342 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336583 : cluster [INF] osd.11 100.13.100.2:6808/1111592 boot
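
Two things stand out in this log: osd.11 was marked down because osd.13 reported "connection refused", and mon.2 shows a clock skew just above the 0.05 s limit. A minimal sketch of checks for both, using the addresses and OSD ids from the log above:

# Monitor time synchronisation (the skew warning concerns mon.2 at 100.13.100.3)
ceph time-sync-status
timedatectl status   # run on every node; NTP should be active everywhere

# Basic connectivity between the hosts on the Ceph network
ping -c 5 100.13.100.2
ping -c 5 100.13.100.3

# Heartbeat / connection errors in the logs of the two OSDs involved
journalctl -u ceph-osd@11 --since "1 hour ago" | grep -iE 'heartbeat|refused|fault'
journalctl -u ceph-osd@13 --since "1 hour ago" | grep -iE 'heartbeat|refused|fault'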
This looks like failing hardware. Please check that the NIC, PSU, HDD ... of that server are functioning properly.
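
As a starting point, a minimal sketch of those hardware-side checks, to be run on the host that carries osd.11 (the interface and disk names are placeholders):

# NIC: error/drop counters on the Ceph network interface (eth0 is a placeholder)
ip -s link show eth0
ethtool -S eth0 | grep -iE 'err|drop|crc'

# Kernel log: disk, controller and link-flap events usually show up here
dmesg -T | grep -iE 'error|fail|reset|link'

# Disk: run a long SMART self-test in addition to the attribute summary (/dev/sdX is a placeholder)
smartctl -t long /dev/sdX
smartctl -a /dev/sdX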