ceph treatment of failed/failing osds

alexskysilk

This may not be a question for PVE forums but no harm in asking :)

My experience with Ceph's treatment of OSDs is that the subsystem does not fail an OSD unless the device is completely dropped on the bus. Failing disks (e.g. with read failures, even with trapped sense-key errors) do NOT get dropped, and the OSD remains up and in even if it bounces up and down, slowing down all transactions. If I manually mark a failing OSD out, it eventually leads to slow ops, but the OSD STILL does not get failed out and I have to go and manually mark it down.
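For context, the manual intervention I'm describing is roughly the following (osd.12 is just a placeholder ID; run the systemctl stop on whichever host carries the daemon):

Code:
# mark the OSD out so PGs start backfilling away from it
ceph osd out osd.12
# stop the daemon on the node that hosts it
systemctl stop ceph-osd@12
# if the cluster still reports it up, force it down
ceph osd down osd.12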

Is this behavior controllable at all?
 
To answer my own question, and for future search results:

Pre-Reef: monitor syslog and manually stop/out OSDs.
Reef and later:
Code:
ceph config set osd osd_fail_on_smart_health true
ceph config set osd bluestore_disk_thread_iops_avg_min 10
ceph config set osd bluestore_disk_thread_high_latency_us 500000 # select a value relevant to the type of device
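If it helps anyone, you can sanity-check what the OSDs actually picked up with ceph config get (the latency value above is only an example; tune it to your spinners/SSDs):

Code:
ceph config get osd osd_fail_on_smart_health
ceph config get osd bluestore_disk_thread_high_latency_us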
 