Ceph does not recover on second node failure after 10 minutes

Hi, I tested a scenario with 5 PVE Ceph nodes:

* 5 PVE Ceph nodes
* 4 OSDs per node
* 5 Ceph MONs
* size 3 / min_size 2
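
For reference, the pool replication settings can be checked on the CLI; the pool name below is just a placeholder:

Code:
# replace "ceph-vm" with your actual pool name
ceph osd pool get ceph-vm size
ceph osd pool get ceph-vm min_size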

If I shut off one of the 5 PVE Ceph nodes, Ceph automatically recovers after 10 minutes and marks the OSDs down & out; everything is green again.
After shutting off a second node, Ceph does NOT automatically recover after 10 minutes and keeps 2 of the 4 OSDs in status down & in.
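
The 10 minutes should be mon_osd_down_out_interval (default 600 seconds); this is what I used to double-check that value and the OSD states, plain Ceph CLI:

Code:
# interval after which a down OSD gets marked out
ceph config get mon mon_osd_down_out_interval
# up/down and in/out state of all OSDs
ceph osd tree
ceph -s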

Screenshot after first node failure and 10 minutes: (see attached image)

Screenshot after second node failure and 10 minutes: (see attached image)
Looks like a bug to me, or am I missing something?
 

[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
have 0; want 1 more
Why do I get this warning if the MDS status shows 1/1?
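
For context, I checked the MDS state with the standard CLI, which lists active and standby daemons:

Code:
ceph fs status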
 
On any Ceph page it will show a rebuilding status in the lower left if you scroll down.

The insufficient standby warning is about a backup (standby) MDS; it's recommending a second one. Was one turned off?

PGs are auto-allocated by default.
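
A sketch of the relevant commands, assuming a standard PVE + Ceph setup (as far as I know, pveceph mds create is the PVE way to add an extra MDS; run it on the node that should host it):

Code:
# add a standby MDS on another node to clear the warning
pveceph mds create
# check what the PG autoscaler is doing per pool
ceph osd pool autoscale-status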
 

I know, but it does not rebuild, and it should.
 
So is the rebuilding section on that page missing, or does it show something?

After the second failure it says 1 host down. Seems like that's your issue? Not sure why it wouldn't show 2…
 
Looks like a bug to me, or am I missing something?
TLDR
mon_osd_min_in_ratio is your friend [1]

Long story
By default it is 0.75, meaning that Ceph will not mark a down OSD out if ~25% of the OSDs are already marked out. That is, a minimum of 75% of the OSDs will remain in even if they are down, hence no recovery will happen for them.
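
You can check the value the MONs are currently using at runtime:

Code:
# default is 0.75
ceph config get mon mon_osd_min_in_ratio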

In your example, once pve-2 is down, 4 OSDs are marked down, so 20% of your OSDs become out after mon_osd_down_out_interval (default 600 secs). Then, pve-3 goes down and Ceph can only mark out 2 more OSDs, leaving ~70% of the OSDs in.
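
Spelled out for this cluster (5 nodes x 4 OSDs = 20 OSDs): after the first node fails, 4/20 = 20% are out and 16/20 = 80% remain in, which is still above 0.75. When the second node fails, marking all 4 of its OSDs out would leave only 12/20 = 60% in, so Ceph stops after 2 more OSDs and keeps 14/20 = 70% in.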

Haven't delved into the code nor really gotten to the root of it, but Ceph seems to always allow one more OSD to be marked out even if that drops slightly below mon_osd_min_in_ratio. I.e. in your example each OSD is exactly 5% of the total OSDs in the cluster: if Ceph complied exactly with the default mon_osd_min_in_ratio of 0.75, it should only allow 5 OSDs to become out. Another example: if you never want Ceph to mark a down OSD out, mon_osd_min_in_ratio must be > 1 (e.g. 1.01).
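
If you want to change it, the ratio can be set at runtime; the values below are only examples:

Code:
# allow automatic mark-out until only 50% of OSDs remain in
ceph config set mon mon_osd_min_in_ratio 0.5
# never mark a down OSD out automatically
ceph config set mon mon_osd_min_in_ratio 1.01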

Given the relatively small size of a PVE+Ceph cluster compared with the typical Ceph cluster, I find mon_osd_min_in_ratio a powerful tool to predict possible failures and how Ceph will behave: the last thing I want if too many OSDs break is for recovery to fill up the remaining OSDs too much.
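
To keep an eye on how full recovery would make the surviving OSDs, the standard tooling is enough:

Code:
# per-OSD utilisation
ceph osd df
# nearfull / backfillfull / full thresholds
ceph osd dump | grep ratio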

[1] https://docs.ceph.com/en/latest/rad...osd-interaction/#confval-mon_osd_min_in_ratio
 

Yes, that's it! Thanks. It was not much of a problem, I marked them out manually. At least we now know why that happened, thanks!
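
For anyone finding this later, marking the down OSDs out by hand is one command per OSD; the IDs below are just examples:

Code:
# mark the down OSDs of the failed node out to trigger recovery
ceph osd out 8
ceph osd out 9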