ceph treatment of failed/failing osds

alexskysilk

This may not be a question for the PVE forums, but there's no harm in asking :)

My experience with Ceph's treatment of OSDs is that the subsystem does not fail an OSD unless it drops off the bus entirely. Failing disks (e.g. ones throwing read failures, even with trapped sense key errors) do NOT get dropped; the OSD remains up and in, even when it bounces up and down and slows every transaction. If I manually out a failing OSD, it eventually leads to slow ops, but the OSD STILL does not get failed out and I have to go and manually down it (commands sketched below).

Is this behavior controllable at all?
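For reference, the manual handling I'm describing looks roughly like this; osd.12 is just a placeholder id, substitute your own:
Code:
# take the OSD out of the data distribution so PGs start remapping
ceph osd out 12
# stop the daemon so it can no longer flap up and down
systemctl stop ceph-osd@12
# mark it down in the map if it is still being reported up
ceph osd down 12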
 
To answer my own question, and for future search results:

pre-Reef: monitor syslog and manually stop/out failing OSDs
Reef and later:
Code:
ceph config set osd osd_fail_on_smart_health true
ceph config set osd bluestore_disk_thread_iops_avg_min 10
ceph config set osd bluestore_disk_thread_high_latency_us 500000 # select a value relevant to the type of device
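If it helps anyone, something like the following should read the values back and show any resulting health warnings (this is just a verification sketch, not part of the setting itself):
Code:
# confirm the options took effect
ceph config get osd osd_fail_on_smart_health
ceph config get osd bluestore_disk_thread_high_latency_us
# watch cluster health for OSD-related warnings
ceph health detail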
 