Ceph OSD changes status (down/up)

Addspin

Hi, on one of the Ceph cluster nodes the message "1 osds down" appeared. It appears and then disappears; the OSD's status constantly flips between down and up. What can be done about it?
The SMART status of the disk shows OK.

Package version:
proxmox-ve: 5.1-32 (running kernel: 4.13.13-2-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.13-2-pve: 4.13.13-32
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.5-pve1

Ceph log attached.
 

Attachments

  • log.txt
    3.1 KB
Hi,

your system is quite outdated.
We only support the current Proxmox VE version, so please upgrade to the current release.
 
That would be nice, but unfortunately there is no way to upgrade quickly at the moment. Can you tell me what to look for in the meantime?
 
2019-04-17 09:50:38.814992 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336538 : cluster [INF] osd.11 failed (root=default,host=s1300pve02) (connection refused reported by osd.13)
2019-04-17 09:50:39.228431 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336572 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-04-17 09:50:40.264827 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336574 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2019-04-17 09:50:41.667272 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336576 : cluster [WRN] mon.2 100.13.100.3:6789/0 clock skew 0.0599787s > max 0.05s
2019-04-17 09:50:42.294795 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336577 : cluster [WRN] Health check failed: Degraded data redundancy: 40343/2272848 objects degraded (1.775%), 27 pgs degraded (PG_DEGRADED)
2019-04-17 09:50:46.177357 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336578 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 4 pgs peering)
2019-04-17 09:50:48.543115 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336579 : cluster [WRN] Health check update: Degraded data redundancy: 130699/2272848 objects degraded (5.750%), 88 pgs degraded (PG_DEGRADED)
2019-04-17 09:51:10.125635 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336582 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-04-17 09:51:10.169342 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336583 : cluster [INF] osd.11 100.13.100.2:6808/1111592 boot
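
To get an overview of how often the OSD is flapping, one can summarize the "failed"/"boot" lines from the monitor cluster log shown above. A minimal Python sketch, assuming a local copy of that log (for example the attached log.txt saved as ceph-cluster.log); the file name and the helper flapping_events are made up for illustration, this is not an official Ceph tool:

#!/usr/bin/env python3
# Count "failed" / "boot" events per OSD in a Ceph monitor cluster log.
# Assumes log lines in the format shown above (mon cluster log).
import re
import sys
from collections import defaultdict

# Timestamp, OSD id, and whether the line reports a "failed" or "boot" event.
EVENT = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+).*'
    r'\bosd\.(?P<osd>\d+)\b.*\b(?P<event>failed|boot)\b'
)

def flapping_events(path):
    """Collect (timestamp, event) pairs per OSD from a mon cluster log."""
    events = defaultdict(list)
    with open(path) as log:
        for line in log:
            m = EVENT.search(line)
            if m:
                events[int(m.group('osd'))].append((m.group('ts'), m.group('event')))
    return events

if __name__ == '__main__':
    path = sys.argv[1] if len(sys.argv) > 1 else 'ceph-cluster.log'
    for osd, evs in sorted(flapping_events(path).items()):
        downs = sum(1 for _, e in evs if e == 'failed')
        ups = sum(1 for _, e in evs if e == 'boot')
        print(f'osd.{osd}: {downs} failed / {ups} boot events')
        for ts, e in evs:
            print(f'  {ts}  {e}')

Run against the log above, this would report osd.11 with one failed and one boot event roughly 30 seconds apart; many such pairs in a short window confirm the flapping.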
This looks like failing hardware. Please check that the NIC, PSU, HDD, etc. of that server are functioning properly.
 
