Ceph OSD: Health check failed

farway

New Member
Apr 1, 2018
Hi everyone,
We have three hosts running the Proxmox VE operating system. Each host has four hard drives: two configured as a hardware RAID 1 holding the operating system, and two used as OSD disks. The Ceph pool size is set to 3 and min_size to 2.
Today the OSDs were found to be down.
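For reference, the pool replication settings can be confirmed from any node with the standard Ceph CLI (a sketch only; the first command lists the pools, so no pool name is assumed):

# list all pools with their replication settings (size, min_size)
ceph osd pool ls detail

# or query a single setting for a given pool
ceph osd pool get <pool-name> size
ceph osd pool get <pool-name> min_size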

The Proxmox version is:
5.1-3

The Ceph version is:
ceph version 12.2.4 (4832b6f0acade977670a37c20ff5dbe69e727416) luminous (stable)



The Ceph config is:

[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 11.11.11.0/26
fsid = de946c12-f7c1-43c3-9f1e-b8b7ffd63309
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
public network = 11.11.11.0/26

[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.node03]
host = node03
mon addr = 11.11.11.3:6789

[mon.node01]
host = node01
mon addr = 11.11.11.1:6789

[mon.node02]
host = node02
mon addr = 11.11.11.2:6789

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host node01 {
id -3 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
# weight 0.545
alg straw2
hash 0 # rjenkins1
item osd.0 weight 0.273
item osd.1 weight 0.273
}
host node02 {
id -5 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
# weight 0.545
alg straw2
hash 0 # rjenkins1
item osd.2 weight 0.273
item osd.3 weight 0.273
}
host node03 {
id -7 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
# weight 0.545
alg straw2
hash 0 # rjenkins1
item osd.4 weight 0.273
item osd.5 weight 0.273
}
root default {
id -1 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
# weight 1.636
alg straw2
hash 0 # rjenkins1
item node01 weight 0.545
item node02 weight 0.545
item node03 weight 0.545
}

# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map

The log is:
2018-04-01 00:00:00.000133 mon.node01 mon.0 11.11.11.1:6789/0 30733 : cluster [WRN] overall HEALTH_WARN clock skew detected on mon.node02, mon.node03
2018-04-01 00:03:08.478121 mon.node01 mon.0 11.11.11.1:6789/0 30734 : cluster [INF] osd.4 marked down after no beacon for 902.439619 seconds
2018-04-01 00:03:08.478159 mon.node01 mon.0 11.11.11.1:6789/0 30735 : cluster [INF] osd.5 marked down after no beacon for 904.439972 seconds
2018-04-01 00:03:08.479389 mon.node01 mon.0 11.11.11.1:6789/0 30736 : cluster [WRN] Health check failed: 2 osds down (OSD_DOWN)
2018-04-01 00:03:08.479415 mon.node01 mon.0 11.11.11.1:6789/0 30737 : cluster [WRN] Health check failed: 1 host (2 osds) down (OSD_HOST_DOWN)
2018-04-01 00:03:10.338185 mon.node01 mon.0 11.11.11.1:6789/0 30739 : cluster [WRN] Health check failed: Reduced data availability: 108 pgs peering (PG_AVAILABILITY)
2018-04-01 00:03:16.337051 mon.node01 mon.0 11.11.11.1:6789/0 30740 : cluster [WRN] Health check update: Reduced data availability: 216 pgs peering (PG_AVAILABILITY)
2018-04-01 00:03:22.338847 mon.node01 mon.0 11.11.11.1:6789/0 30741 : cluster [WRN] Health check update: Reduced data availability: 222 pgs peering (PG_AVAILABILITY)
2018-04-01 00:03:23.480597 mon.node01 mon.0 11.11.11.1:6789/0 30742 : cluster [INF] osd.2 marked down after no beacon for 903.436737 seconds
2018-04-01 00:03:23.480644 mon.node01 mon.0 11.11.11.1:6789/0 30743 : cluster [INF] osd.3 marked down after no beacon for 903.436837 seconds
2018-04-01 00:03:23.481937 mon.node01 mon.0 11.11.11.1:6789/0 30744 : cluster [WRN] Health check update: 4 osds down (OSD_DOWN)
2018-04-01 00:03:23.481970 mon.node01 mon.0 11.11.11.1:6789/0 30745 : cluster [WRN] Health check update: 2 hosts (4 osds) down (OSD_HOST_DOWN)
2018-04-01 00:03:28.482648 mon.node01 mon.0 11.11.11.1:6789/0 30747 : cluster [WRN] Health check update: Reduced data availability: 236 pgs peering (PG_AVAILABILITY)
2018-04-01 00:03:30.345950 mon.node01 mon.0 11.11.11.1:6789/0 30748 : cluster [WRN] Health check failed: 1 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:03:33.482927 mon.node01 mon.0 11.11.11.1:6789/0 30749 : cluster [WRN] Health check update: Reduced data availability: 237 pgs peering (PG_AVAILABILITY)
2018-04-01 00:03:36.344846 mon.node01 mon.0 11.11.11.1:6789/0 30750 : cluster [WRN] Health check update: 2 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:03:46.348479 mon.node01 mon.0 11.11.11.1:6789/0 30751 : cluster [WRN] Health check update: Reduced data availability: 240 pgs peering (PG_AVAILABILITY)
2018-04-01 00:03:46.348509 mon.node01 mon.0 11.11.11.1:6789/0 30752 : cluster [WRN] Health check update: 42 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:03:52.350224 mon.node01 mon.0 11.11.11.1:6789/0 30753 : cluster [WRN] Health check update: Reduced data availability: 243 pgs peering (PG_AVAILABILITY)
2018-04-01 00:03:53.483983 mon.node01 mon.0 11.11.11.1:6789/0 30754 : cluster [WRN] Health check update: 44 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:03:57.861486 mon.node01 mon.0 11.11.11.1:6789/0 30755 : cluster [WRN] mon.2 11.11.11.3:6789/0 clock skew 6.8905s > max 0.05s
2018-04-01 00:03:57.861531 mon.node01 mon.0 11.11.11.1:6789/0 30756 : cluster [WRN] mon.1 11.11.11.2:6789/0 clock skew 5.01288s > max 0.05s
2018-04-01 00:03:58.484235 mon.node01 mon.0 11.11.11.1:6789/0 30757 : cluster [WRN] Health check update: Reduced data availability: 246 pgs peering (PG_AVAILABILITY)
2018-04-01 00:03:58.484264 mon.node01 mon.0 11.11.11.1:6789/0 30758 : cluster [WRN] Health check update: 46 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:04:03.485080 mon.node01 mon.0 11.11.11.1:6789/0 30759 : cluster [WRN] Health check update: Reduced data availability: 92 pgs inactive, 253 pgs peering (PG_AVAILABILITY)
2018-04-01 00:04:03.485134 mon.node01 mon.0 11.11.11.1:6789/0 30760 : cluster [WRN] Health check update: 48 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:04:08.485449 mon.node01 mon.0 11.11.11.1:6789/0 30761 : cluster [WRN] Health check update: Reduced data availability: 92 pgs inactive, 256 pgs peering (PG_AVAILABILITY)
2018-04-01 00:04:08.485476 mon.node01 mon.0 11.11.11.1:6789/0 30762 : cluster [WRN] Health check update: 50 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:04:13.485753 mon.node01 mon.0 11.11.11.1:6789/0 30763 : cluster [WRN] Health check update: 52 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:04:18.360329 mon.node01 mon.0 11.11.11.1:6789/0 30764 : cluster [WRN] Health check update: Reduced data availability: 256 pgs inactive, 256 pgs peering (PG_AVAILABILITY)
2018-04-01 00:04:18.486125 mon.node01 mon.0 11.11.11.1:6789/0 30765 : cluster [WRN] Health check update: 54 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:04:23.486471 mon.node01 mon.0 11.11.11.1:6789/0 30766 : cluster [WRN] Health check update: 56 slow requests are blocked > 32 sec (REQUEST_SLOW)
 
Did one of the nodes (node01?) reboot or reset? Or is the time not in sync on all the servers?
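To narrow that down, something like the following could be checked on each node (a sketch assuming the stock Ceph CLI and systemd tooling; which time daemon is in use, e.g. ntpd, chrony, or systemd-timesyncd, depends on the setup):

# how long each node has been up (a recent reboot would explain the missing OSD beacons)
uptime

# clock skew as seen by the monitors
ceph time-sync-status

# local clock and NTP synchronization state on each host
timedatectl status

# which OSDs/hosts the cluster currently sees as down
ceph osd tree

# logs of one of the affected OSD daemons, e.g. osd.4
journalctl -u ceph-osd@4 --since "2018-03-31"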
 
