TEST SETUP
My test setup is a Proxmox cluster:
Node1 - dedicated OVH production server, reboot time ~110 s
Node2 - dedicated OVH backup server
Node3 - OVH cloud VPS, reboot time a few seconds
Shared storage: Ceph
HA groups:
HA12 - node1 priority 2, node2 priority 1, restricted, nofailback unchecked
HA32 - node3 priority 2, node2 priority 1, restricted, nofailback unchecked
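For reference, the equivalent group definitions should look roughly like this in /etc/pve/ha/groups.cfg (a sketch from memory, not pasted from my cluster):

group: HA12
        nodes node1:2,node2:1
        restricted 1
        nofailback 0

group: HA32
        nodes node3:2,node2:1
        restricted 1
        nofailback 0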
Hardware watchdog configured with ipmi_watchdog:
cat /etc/modprobe.d/ipmi_watchdog.conf
options ipmi_watchdog action=power_cycle panic_wdt_timeout=10
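If I remember correctly, the HA stack also has to be told to use this module instead of softdog (sketch from memory, please double-check the file name on your install):

# /etc/default/pve-ha-manager
WATCHDOG_MODULE=ipmi_watchdog

After a reboot, lsmod | grep ipmi_watchdog confirms the module is actually loaded.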
Test 1 (working):
1) Trigger a kernel crash on node1 with echo c > /proc/sysrq-trigger
2) The hardware watchdog manages to reboot the dedicated server (with the same setup, softdog fails)
3) Keep node1 fenced: during the reboot I enter the BIOS setup
4) The VMs restart successfully on node2
5) Reset node1
6) Node1 is active again (it is the production server, more powerful, with nofailback unchecked on purpose in HA12)
7) The VMs return to node1
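To follow where the HA services land at each step I simply watch the HA status from a surviving node; it shows the manager, the LRM state of every node and the current node of each vm:<id> service (the vmid below is just a placeholder):

ha-manager status
# ...
# service vm:100 (node2, started)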
THE ISSUE
In the above test, if I skip step 3 and let the hardware watchdog reboot node1 after the kernel crash, I want the VMs to stay on node1.
This is a genuine reboot attempt, inside the fence delay window.
My problem is that the fence delay is only 60 s while my server's reboot time is 110 s. Before node1 gets a chance to come back, the VMs are restarted on node2.
$fence_delay = 60; in /usr/share/perl5/PVE/HA/NodeStatus.pm
Repeating the same test with node3 (which boots very fast, being a VPS), all the VMs stay on node3 and restart there once it grabs the lock again.
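My understanding of what this variable does (a paraphrased sketch, not the exact upstream code):

my $fence_delay = 60;  # seconds a node may stay in state 'unknown'

# roughly: once a node has been unreachable for longer than $fence_delay,
# it is moved to state 'fence' and its HA services are recovered elsewhere
if ($state eq 'unknown' && (time() - $last_online) > $fence_delay) {
    # node gets fenced -> VMs restart on the next node in the HA group
}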
INCREASE FENCE DELAY
I saw a commit by Dietmar increasing fence_delay from 30 to 60 seconds: https://git.proxmox.com/?p=pve-ha-m...ff;h=ceac1930e8747b758982396949e14d9f0c8b13fd
By the way, should this option be configurable from the GUI?
Now, the option lives in /usr/share/perl5/PVE/HA/NodeStatus.pm. Increasing it on node1/node2/node3 (edited with vim, then rebooted) did not change the test outcome; it does not work.
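What I did on each node was essentially this (sketch; 180 is just an example value, and I assume restarting pve-ha-crm and pve-ha-lrm instead of rebooting would have the same effect):

sed -i 's/\$fence_delay = 60;/\$fence_delay = 180;/' /usr/share/perl5/PVE/HA/NodeStatus.pm
reboot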
Thank you!