Hi everyone,
We have three hosts running the PVE operating system. Each host has 4 hard drives: two in a hardware RAID 1 holding the operating system, and two used as OSD disks. The Ceph pool size is set to 3 and min_size to 2.
Today we found that the OSDs were marked down.
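For reference, this is how the replication settings and OSD status can be checked (a sketch; the pool name "rbd" below is just an example, substitute your own pool):
ceph osd tree                      # shows which OSDs are up/down and their CRUSH placement
ceph osd pool get rbd size         # should report 3
ceph osd pool get rbd min_size     # should report 2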
The Proxmox version is:
5.1-3
The Ceph version is:
ceph version 12.2.4 (4832b6f0acade977670a37c20ff5dbe69e727416) luminous (stable)
The Ceph config is:
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 11.11.11.0/26
fsid = de946c12-f7c1-43c3-9f1e-b8b7ffd63309
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
public network = 11.11.11.0/26
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
[mon.node03]
host = node03
mon addr = 11.11.11.3:6789
[mon.node01]
host = node01
mon addr = 11.11.11.1:6789
[mon.node02]
host = node02
mon addr = 11.11.11.2:6789
The CRUSH map is:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host node01 {
id -3 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
# weight 0.545
alg straw2
hash 0 # rjenkins1
item osd.0 weight 0.273
item osd.1 weight 0.273
}
host node02 {
id -5 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
# weight 0.545
alg straw2
hash 0 # rjenkins1
item osd.2 weight 0.273
item osd.3 weight 0.273
}
host node03 {
id -7 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
# weight 0.545
alg straw2
hash 0 # rjenkins1
item osd.4 weight 0.273
item osd.5 weight 0.273
}
root default {
id -1 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
# weight 1.636
alg straw2
hash 0 # rjenkins1
item node01 weight 0.545
item node02 weight 0.545
item node03 weight 0.545
}
# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
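(For anyone who wants to reproduce this dump: the text above can be obtained with the standard tools, roughly like so.)
ceph osd getcrushmap -o crushmap.bin      # dump the compiled CRUSH map
crushtool -d crushmap.bin -o crushmap.txt # decompile it to the text form shown above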
The log is:
2018-04-01 00:00:00.000133 mon.node01 mon.0 11.11.11.1:6789/0 30733 : cluster [WRN] overall HEALTH_WARN clock skew detected on mon.node02, mon.node03
2018-04-01 00:03:08.478121 mon.node01 mon.0 11.11.11.1:6789/0 30734 : cluster [INF] osd.4 marked down after no beacon for 902.439619 seconds
2018-04-01 00:03:08.478159 mon.node01 mon.0 11.11.11.1:6789/0 30735 : cluster [INF] osd.5 marked down after no beacon for 904.439972 seconds
2018-04-01 00:03:08.479389 mon.node01 mon.0 11.11.11.1:6789/0 30736 : cluster [WRN] Health check failed: 2 osds down (OSD_DOWN)
2018-04-01 00:03:08.479415 mon.node01 mon.0 11.11.11.1:6789/0 30737 : cluster [WRN] Health check failed: 1 host (2 osds) down (OSD_HOST_DOWN)
2018-04-01 00:03:10.338185 mon.node01 mon.0 11.11.11.1:6789/0 30739 : cluster [WRN] Health check failed: Reduced data availability: 108 pgs peering (PG_AVAILABILITY)
2018-04-01 00:03:16.337051 mon.node01 mon.0 11.11.11.1:6789/0 30740 : cluster [WRN] Health check update: Reduced data availability: 216 pgs peering (PG_AVAILABILITY)
2018-04-01 00:03:22.338847 mon.node01 mon.0 11.11.11.1:6789/0 30741 : cluster [WRN] Health check update: Reduced data availability: 222 pgs peering (PG_AVAILABILITY)
2018-04-01 00:03:23.480597 mon.node01 mon.0 11.11.11.1:6789/0 30742 : cluster [INF] osd.2 marked down after no beacon for 903.436737 seconds
2018-04-01 00:03:23.480644 mon.node01 mon.0 11.11.11.1:6789/0 30743 : cluster [INF] osd.3 marked down after no beacon for 903.436837 seconds
2018-04-01 00:03:23.481937 mon.node01 mon.0 11.11.11.1:6789/0 30744 : cluster [WRN] Health check update: 4 osds down (OSD_DOWN)
2018-04-01 00:03:23.481970 mon.node01 mon.0 11.11.11.1:6789/0 30745 : cluster [WRN] Health check update: 2 hosts (4 osds) down (OSD_HOST_DOWN)
2018-04-01 00:03:28.482648 mon.node01 mon.0 11.11.11.1:6789/0 30747 : cluster [WRN] Health check update: Reduced data availability: 236 pgs peering (PG_AVAILABILITY)
2018-04-01 00:03:30.345950 mon.node01 mon.0 11.11.11.1:6789/0 30748 : cluster [WRN] Health check failed: 1 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:03:33.482927 mon.node01 mon.0 11.11.11.1:6789/0 30749 : cluster [WRN] Health check update: Reduced data availability: 237 pgs peering (PG_AVAILABILITY)
2018-04-01 00:03:36.344846 mon.node01 mon.0 11.11.11.1:6789/0 30750 : cluster [WRN] Health check update: 2 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:03:46.348479 mon.node01 mon.0 11.11.11.1:6789/0 30751 : cluster [WRN] Health check update: Reduced data availability: 240 pgs peering (PG_AVAILABILITY)
2018-04-01 00:03:46.348509 mon.node01 mon.0 11.11.11.1:6789/0 30752 : cluster [WRN] Health check update: 42 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:03:52.350224 mon.node01 mon.0 11.11.11.1:6789/0 30753 : cluster [WRN] Health check update: Reduced data availability: 243 pgs peering (PG_AVAILABILITY)
2018-04-01 00:03:53.483983 mon.node01 mon.0 11.11.11.1:6789/0 30754 : cluster [WRN] Health check update: 44 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:03:57.861486 mon.node01 mon.0 11.11.11.1:6789/0 30755 : cluster [WRN] mon.2 11.11.11.3:6789/0 clock skew 6.8905s > max 0.05s
2018-04-01 00:03:57.861531 mon.node01 mon.0 11.11.11.1:6789/0 30756 : cluster [WRN] mon.1 11.11.11.2:6789/0 clock skew 5.01288s > max 0.05s
2018-04-01 00:03:58.484235 mon.node01 mon.0 11.11.11.1:6789/0 30757 : cluster [WRN] Health check update: Reduced data availability: 246 pgs peering (PG_AVAILABILITY)
2018-04-01 00:03:58.484264 mon.node01 mon.0 11.11.11.1:6789/0 30758 : cluster [WRN] Health check update: 46 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:04:03.485080 mon.node01 mon.0 11.11.11.1:6789/0 30759 : cluster [WRN] Health check update: Reduced data availability: 92 pgs inactive, 253 pgs peering (PG_AVAILABILITY)
2018-04-01 00:04:03.485134 mon.node01 mon.0 11.11.11.1:6789/0 30760 : cluster [WRN] Health check update: 48 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:04:08.485449 mon.node01 mon.0 11.11.11.1:6789/0 30761 : cluster [WRN] Health check update: Reduced data availability: 92 pgs inactive, 256 pgs peering (PG_AVAILABILITY)
2018-04-01 00:04:08.485476 mon.node01 mon.0 11.11.11.1:6789/0 30762 : cluster [WRN] Health check update: 50 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:04:13.485753 mon.node01 mon.0 11.11.11.1:6789/0 30763 : cluster [WRN] Health check update: 52 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:04:18.360329 mon.node01 mon.0 11.11.11.1:6789/0 30764 : cluster [WRN] Health check update: Reduced data availability: 256 pgs inactive, 256 pgs peering (PG_AVAILABILITY)
2018-04-01 00:04:18.486125 mon.node01 mon.0 11.11.11.1:6789/0 30765 : cluster [WRN] Health check update: 54 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-04-01 00:04:23.486471 mon.node01 mon.0 11.11.11.1:6789/0 30766 : cluster [WRN] Health check update: 56 slow requests are blocked > 32 sec (REQUEST_SLOW)
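The log shows a clock skew of 5-7 seconds against the 0.05 second limit, and the OSDs were only marked down after sending no beacon for roughly 900 seconds. A sketch of how time synchronization could be checked on each node (adjust for whichever NTP daemon is actually installed):
ceph time-sync-status                 # per-mon clock skew as seen by the lead monitor
timedatectl status                    # on each node: is NTP active and synchronized?
systemctl status systemd-timesyncd    # or chrony/ntpd, depending on the setup
Could the clock skew alone cause the monitors to miss the OSD beacons, or should we also look at the network between the nodes?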