Hi,
we are testing some failure scenarios and noticed an unexpectedly long delay before disk access becomes available again:
Scenario: 3 x PVE hosts, each with mgr, mon and 2 OSDs, replica 3/2
1] hard power loss on one node
result: ~24 s before disks are available again (grace period >= 20 seconds)
I did not expect a single node failure to have such an impact, so I looked up these defaults:
osd_heartbeat_grace = 20 #changed to 10
osd_heartbeat_interval = 6 #changed to 3
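For reference, a minimal sketch of how such an override can be written in ceph.conf. Placing it in [global] is an assumption here, chosen because the Ceph documentation says the heartbeat grace setting has to be readable by both the monitors and the OSDs; the values are the ones from the test above.

```ini
[global]
# Assumption: [global] so that both mon and osd daemons read the value.
# Seconds without a heartbeat before an OSD is reported down (default 20).
osd_heartbeat_grace = 10
# Seconds between heartbeat pings to peer OSDs (default 6).
osd_heartbeat_interval = 3
```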
I repeated the test and the result is again more than 20 seconds, so some other parameter needs tuning.
Has anybody tuned this? And how? Or is the majority running with defaults? 20+ seconds of all cluster disks being unavailable looks like a huge gap...