ceph treatment of failed/failing osds

alexskysilk

This may not be a question for PVE forums but no harm in asking :)

My experience with Ceph's treatment of OSDs is that the subsystem does not fail an OSD unless the device is completely dropped on the bus. Failing disks (e.g. with read failures, even with trapped sense-key errors) do NOT get dropped, and the OSD remains up and in even if it bounces up and down, slowing down all transactions. If I manually mark a failing OSD out, it eventually leads to slow ops, but the OSD STILL does not get failed out and I have to go and manually mark it down.
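For context, the manual intervention I'm describing is roughly the following (osd.12 is just a placeholder ID; run the systemctl stop on whichever host carries the daemon):

Code:
# mark the OSD out so PGs start backfilling away from it
ceph osd out osd.12
# stop the daemon on the node that hosts it
systemctl stop ceph-osd@12
# if the cluster still reports it up, force it down
ceph osd down osd.12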

Is this behavior controllable at all?
 
To answer my own question, and for future search results:

Pre-Reef: monitor syslog and manually stop/out OSDs.
Reef and later:
Code:
ceph config set osd osd_fail_on_smart_health true
ceph config set osd bluestore_disk_thread_iops_avg_min 10
ceph config set osd bluestore_disk_thread_high_latency_us 500000 # select a value relevant to the type of device
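If it helps anyone, you can sanity-check what the OSDs actually picked up with ceph config get (the latency value above is only an example; tune it to your spinners/SSDs):

Code:
ceph config get osd osd_fail_on_smart_health
ceph config get osd bluestore_disk_thread_high_latency_us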
 