Ceph OSD changes status (down/up)

Addspin

Active Member
Dec 20, 2016
Hi, on one of the Ceph cluster nodes the message "1 osds down" appeared. It shows up and then disappears again; the OSD status constantly flips between down and up. What can be done about it?
The SMART status of the disk shows OK.
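For context, the sort of commands that could be used to dig further (the OSD ID and device path here are placeholders, adjust them to the affected OSD/disk):

  ceph -s
  ceph osd tree
  journalctl -u ceph-osd@<id> --since "1 hour ago"
  smartctl -a /dev/sdX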

Package version:
proxmox-ve: 5.1-32 (running kernel: 4.13.13-2-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.13-2-pve: 4.13.13-32
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.5-pve1

Ceph log attached.
 

Attachments

  • log.txt (3.1 KB)
Hi,

your system is quite outdated.
We only support the current Proxmox VE version,
so please upgrade to the current version.
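For reference, a minimal sketch of checking the installed versions and pulling updates within the currently configured release (a major-version upgrade should follow the official upgrade guide instead):

  pveversion -v
  apt update
  apt full-upgrade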
 
That would be nice, but unfortunately there is no way to upgrade quickly right now. Can you tell me what to look for in the meantime?
 
2019-04-17 09:50:38.814992 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336538 : cluster [INF] osd.11 failed (root=default,host=s1300pve02) (connection refused reported by osd.13)
2019-04-17 09:50:39.228431 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336572 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-04-17 09:50:40.264827 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336574 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2019-04-17 09:50:41.667272 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336576 : cluster [WRN] mon.2 100.13.100.3:6789/0 clock skew 0.0599787s > max 0.05s
2019-04-17 09:50:42.294795 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336577 : cluster [WRN] Health check failed: Degraded data redundancy: 40343/2272848 objects degraded (1.775%), 27 pgs degraded (PG_DEGRADED)
2019-04-17 09:50:46.177357 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336578 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 4 pgs peering)
2019-04-17 09:50:48.543115 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336579 : cluster [WRN] Health check update: Degraded data redundancy: 130699/2272848 objects degraded (5.750%), 88 pgs degraded (PG_DEGRADED)
2019-04-17 09:51:10.125635 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336582 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-04-17 09:51:10.169342 mon.s1300pve01 mon.0 100.13.100.1:6789/0 1336583 : cluster [INF] osd.11 100.13.100.2:6808/1111592 boot
This looks like failing hardware. Please check that the NIC, PSU, HDD, ... of that server are functioning properly.
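For example, a few generic checks along those lines (the interface name and device path are placeholders; osd.11 is taken from the log above):

  dmesg -T | grep -iE 'error|fail'         # kernel messages about disk/NIC problems
  smartctl -a /dev/sdX                     # full SMART details, not just the overall status
  ethtool -S eth0 | grep -iE 'err|drop'    # NIC error/drop counters
  timedatectl status                       # the log also shows a clock skew warning, so verify time sync
  journalctl -u ceph-osd@11 -b             # why osd.11 keeps going down / restarting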
 
