[SOLVED] Ceph host down failover time

rustine22

Hi,

in a 3-node Ceph test cluster, when I cut the power of a node, write operations hang for about 30 seconds. I would like to reduce this time to 10-15 seconds if possible (I assume my network is healthy and there is no risk of saturation).

I tried to change these values:

Code:
root@pve01:~#  ceph config set osd osd_heartbeat_interval 5
root@pve01:~#  ceph config set osd osd_heartbeat_grace 12


But no luck, write operations are still only restored after about 30 seconds:

Code:
2024-12-08T15:31:58.249932+0100 mon.pve01 (mon.0) 752 : cluster [INF] osd.2 failed (root=default,host=pve03) (2 reporters from different host after 23.894888 >= grace 20.000000)
2024-12-08T15:31:58.251243+0100 mon.pve01 (mon.0) 753 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2024-12-08T15:31:58.251417+0100 mon.pve01 (mon.0) 754 : cluster [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)

I don't understand why the grace period behind OSD_DOWN / OSD_HOST_DOWN is not reduced.
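
(In case it's useful: a quick way to double-check what the cluster has actually stored for each daemon type is ceph config get — just the commands, I have not pasted the output here.)

Code:
root@pve01:~# ceph config get osd osd_heartbeat_grace
root@pve01:~# ceph config get mon osd_heartbeat_grace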

Thanks
 
Please note I don't use Ceph at all, but I think the settings you have made apply only to the OSDs and not to the monitors.
Upon further research, this is what I found here:

osd_heartbeat_grace
The elapsed time when a Ceph OSD Daemon hasn’t shown a heartbeat that the Ceph Storage Cluster considers it down. This setting must be set in both the [mon] and [osd] or [global] sections so that it is read by both monitor and OSD daemons.

type: int
default: 20


The relevant sentence is the one saying it must be set in both the [mon] and [osd] (or [global]) sections.
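
Going only by that quote (untested on my side, since I don't run Ceph myself), I would expect you also need to set the same options where the monitors read them, e.g. at the global level:

Code:
root@pve01:~# ceph config set global osd_heartbeat_interval 5
root@pve01:~# ceph config set global osd_heartbeat_grace 12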

Hope this may help you.
 
Thanks a lot, it works perfectly!

Code:
root@pve01:~#  ceph config set osd osd_heartbeat_interval 5
root@pve01:~#  ceph config set osd osd_heartbeat_grace 12

root@pve01:~# ceph config set mon osd_heartbeat_interval 5
root@pve01:~# ceph config set mon osd_heartbeat_grace 12

And the result when a node is powered off: about 15 seconds to fail over.

Code:
2024-12-17T19:43:27.408083+0100 mon.pve01 [WRN] Health check failed: 1/3 mons down, quorum pve01,pve02 (MON_DOWN)
2024-12-17T19:43:27.459193+0100 mon.pve01 [INF] osd.2 failed (root=default,host=pve03) (2 reporters from different host after 14.994204 >= grace 7.846142)
2024-12-17T19:43:27.527541+0100 mon.pve01 [WRN] Health detail: HEALTH_WARN 1/3 mons down, quorum pve01,pve02
2024-12-17T19:43:27.527618+0100 mon.pve01 [WRN] [WRN] MON_DOWN: 1/3 mons down, quorum pve01,pve02
2024-12-17T19:43:27.527683+0100 mon.pve01 [WRN]     mon.pve03 (rank 2) addr [v2:10.3.200.103:3300/0,v1:10.3.200.103:6789/0] is down (out of quorum)
2024-12-17T19:43:28.384476+0100 mon.pve01 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2024-12-17T19:43:28.384544+0100 mon.pve01 [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
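
For anyone else trying this: the value a running daemon actually uses can be checked with ceph config show (just a sketch — mon.pve01 is from my logs above, osd.0 is only an example id). Also keep in mind that a very short grace makes it easier for brief network hiccups to get OSDs wrongly marked down.

Code:
root@pve01:~# ceph config show mon.pve01 osd_heartbeat_grace
root@pve01:~# ceph config show osd.0 osd_heartbeat_grace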
 
You are fast! Happy to help.

Since it appears you have solved your issue, maybe mark this thread as solved: at the top of the thread, choose the Edit thread button, then from the (no prefix) dropdown choose Solved.
 
