[SOLVED] Ceph host down failover time

rustine22

Hi,

in a 3-node Ceph test cluster, when I cut the power of a node, write operations hang for about 30 seconds. I would like to reduce this time to 10-15 seconds if possible (I assume my network is healthy and there is no risk of saturation).

I tried to change these values:

Code:
root@pve01:~#  ceph config set osd osd_heartbeat_interval 5
root@pve01:~#  ceph config set osd osd_heartbeat_grace 12


But no luck, write operations are still only restored after about 30 seconds:

Code:
2024-12-08T15:31:58.249932+0100 mon.pve01 (mon.0) 752 : cluster [INF] osd.2 failed (root=default,host=pve03) (2 reporters from different host after 23.894888 >= grace 20.000000)
2024-12-08T15:31:58.251243+0100 mon.pve01 (mon.0) 753 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2024-12-08T15:31:58.251417+0100 mon.pve01 (mon.0) 754 : cluster [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)

I don't understand why the grace period behind OSD_DOWN / OSD_HOST_DOWN is not reduced.
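
(In case it's useful: a quick way to double-check what the cluster has actually stored for each daemon type is ceph config get — just the commands, I have not pasted the output here.)

Code:
root@pve01:~# ceph config get osd osd_heartbeat_grace
root@pve01:~# ceph config get mon osd_heartbeat_grace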

Thanks
 
Please note I don't use Ceph at all, but I think the settings you have made apply only to the OSDs and not to the monitors.
Upon further research, this is what I found here:

osd_heartbeat_grace
The elapsed time when a Ceph OSD Daemon hasn’t shown a heartbeat that the Ceph Storage Cluster considers it down. This setting must be set in both the [mon] and [osd] or [global] sections so that it is read by both monitor and OSD daemons.

type: int
default: 20


The relevant sentence is the one saying it must be set in both the [mon] and [osd] (or [global]) sections.
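
Going only by that quote (untested on my side, since I don't run Ceph myself), I would expect you also need to set the same options where the monitors read them, e.g. at the global level:

Code:
root@pve01:~# ceph config set global osd_heartbeat_interval 5
root@pve01:~# ceph config set global osd_heartbeat_grace 12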

Hope this may help you.
 
Thanks a lot, it works perfectly!

Code:
root@pve01:~#  ceph config set osd osd_heartbeat_interval 5
root@pve01:~#  ceph config set osd osd_heartbeat_grace 12

root@pve01:~# ceph config set mon osd_heartbeat_interval 5
root@pve01:~# ceph config set mon osd_heartbeat_grace 12

And the result when a node is powered off: about 15 seconds to fail over.

Code:
2024-12-17T19:43:27.408083+0100 mon.pve01 [WRN] Health check failed: 1/3 mons down, quorum pve01,pve02 (MON_DOWN)
2024-12-17T19:43:27.459193+0100 mon.pve01 [INF] osd.2 failed (root=default,host=pve03) (2 reporters from different host after 14.994204 >= grace 7.846142)
2024-12-17T19:43:27.527541+0100 mon.pve01 [WRN] Health detail: HEALTH_WARN 1/3 mons down, quorum pve01,pve02
2024-12-17T19:43:27.527618+0100 mon.pve01 [WRN] [WRN] MON_DOWN: 1/3 mons down, quorum pve01,pve02
2024-12-17T19:43:27.527683+0100 mon.pve01 [WRN]     mon.pve03 (rank 2) addr [v2:10.3.200.103:3300/0,v1:10.3.200.103:6789/0] is down (out of quorum)
2024-12-17T19:43:28.384476+0100 mon.pve01 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2024-12-17T19:43:28.384544+0100 mon.pve01 [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
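
For anyone else trying this: the value a running daemon actually uses can be checked with ceph config show (just a sketch — mon.pve01 is from my logs above, osd.0 is only an example id). Also keep in mind that a very short grace makes it easier for brief network hiccups to get OSDs wrongly marked down.

Code:
root@pve01:~# ceph config show mon.pve01 osd_heartbeat_grace
root@pve01:~# ceph config show osd.0 osd_heartbeat_grace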
 
You are fast! Happy to help.

Since it appears you have solved your issue, maybe mark this thread as solved: at the top of the thread, choose the Edit thread button, then from the (no prefix) dropdown choose Solved.
 
