Ceph does not recover on second node failure after 10 minutes

Hi, I tested a scenario with 5 PVE Ceph nodes:

* 5 PVE Ceph nodes
* 4 OSDs per node
* 5 Ceph MONs
* size 3 / min_size 2
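
For reference, the pool replication settings can be checked on the CLI; the pool name below is just a placeholder:

Code:
# replace "ceph-vm" with your actual pool name
ceph osd pool get ceph-vm size
ceph osd pool get ceph-vm min_size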

If I shut off one of the 5 PVE Ceph nodes, Ceph automatically recovers after 10 minutes and marks the OSDs down & out; everything is green again.
After shutting off a second node, Ceph does NOT automatically recover after 10 minutes and keeps 2 of the 4 OSDs in status down & in.
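
The 10 minutes should be mon_osd_down_out_interval (default 600 seconds); this is what I used to double-check that value and the OSD states, plain Ceph CLI:

Code:
# interval after which a down OSD gets marked out
ceph config get mon mon_osd_down_out_interval
# up/down and in/out state of all OSDs
ceph osd tree
ceph -s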

Screenshot after first node failure and 10 minutes: (see attached image)

Screenshot after second node failure and 10 minutes: (see attached image)
Looks like a bug to me, or am I missing something?
 

[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
have 0; want 1 more
Why do I get this warning if the MDS status shows 1/1?
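
For context, I checked the MDS state with the standard CLI, which lists active and standby daemons:

Code:
ceph fs status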
 
On any Ceph page it will show a rebuilding status in the lower left if you scroll down.

The insufficient standby warning is about a backup (standby) MDS; it's recommending a second one. Was one turned off?

PGs are auto-allocated by default.
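
A sketch of the relevant commands, assuming a standard PVE + Ceph setup (as far as I know, pveceph mds create is the PVE way to add an extra MDS; run it on the node that should host it):

Code:
# add a standby MDS on another node to clear the warning
pveceph mds create
# check what the PG autoscaler is doing per pool
ceph osd pool autoscale-status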
 

I know, but it does not rebuild, and it should.
 
So is the rebuilding section on that page missing, or does it show something?

After the second failure it says 1 host down. Seems like that's your issue? Not sure why it wouldn't show 2…
 
Looks like a bug to me, or am I missing something?
TLDR
mon_osd_min_in_ratio is your friend [1]

Long story
By default it is 0.75, meaning that Ceph will not mark a down OSD out if ~25% of the OSDs are already marked out. That is, a minimum of 75% of the OSDs will remain in even if they are down, hence no recovery will happen for them.
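
You can check the value the MONs are currently using at runtime:

Code:
# default is 0.75
ceph config get mon mon_osd_min_in_ratio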

In your example, once pve-2 is down, 4 OSDs are marked down, so 20% of your OSDs become out after mon_osd_down_out_interval (default 600 secs). Then, pve-3 goes down and Ceph can only mark out 2 more OSDs, leaving ~70% of the OSDs in.
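
Spelled out for this cluster (5 nodes x 4 OSDs = 20 OSDs): after the first node fails, 4/20 = 20% are out and 16/20 = 80% remain in, which is still above 0.75. When the second node fails, marking all 4 of its OSDs out would leave only 12/20 = 60% in, so Ceph stops after 2 more OSDs and keeps 14/20 = 70% in.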

Haven't delved into the code nor really gotten to the root of it, but Ceph seems to always allow one more OSD to be marked out even if that drops slightly below mon_osd_min_in_ratio. I.e. in your example each OSD is exactly 5% of the total OSDs in the cluster: if Ceph complied exactly with the default mon_osd_min_in_ratio of 0.75, it should only allow 5 OSDs to become out. Another example: if you never want Ceph to mark a down OSD out, mon_osd_min_in_ratio must be > 1 (e.g. 1.01).
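
If you want to change it, the ratio can be set at runtime; the values below are only examples:

Code:
# allow automatic mark-out until only 50% of OSDs remain in
ceph config set mon mon_osd_min_in_ratio 0.5
# never mark a down OSD out automatically
ceph config set mon mon_osd_min_in_ratio 1.01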

Given the relatively small size of a PVE+Ceph cluster compared with the typical Ceph cluster, I find mon_osd_min_in_ratio a powerful tool to predict possible failures and how Ceph will behave: the last thing I want if too many OSDs break is for recovery to fill up the remaining OSDs too much.
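
To keep an eye on how full recovery would make the surviving OSDs, the standard tooling is enough:

Code:
# per-OSD utilisation
ceph osd df
# nearfull / backfillfull / full thresholds
ceph osd dump | grep ratio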

[1] https://docs.ceph.com/en/latest/rad...osd-interaction/#confval-mon_osd_min_in_ratio
 

Yes, that's it! Thanks. It was not much of a problem, I marked them out manually. At least we now know why that happened, thanks!
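
For anyone finding this later, marking the down OSDs out by hand is one command per OSD; the IDs below are just examples:

Code:
# mark the down OSDs of the failed node out to trigger recovery
ceph osd out 8
ceph osd out 9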