PVE 4 HA Simulation and testing

alitvak69

Renowned Member
Oct 2, 2015
For a better understanding of the new HA mechanism, I decided to try your pve-ha-simulator.

I started all nodes and enabled one service, vm:101.
Then I migrated vm:101 to node2; so far so good.
Finally, I disabled the network on node2.
The simulator fenced node2 and started vm:101 on node1, but the whole process took about 3 minutes to complete.
Does that sound normal? Can the timers be adjusted up or down?

Here is the simulator log:

Code:
info    06:41:37     hardware: execute network node2 off
info    06:41:45    node2/lrm: status change active => lost_agent_lock
info    06:41:50    node2/crm: status change master => lost_manager_lock
info    06:41:50    node2/crm: status change lost_manager_lock => wait_for_quorum
info    06:42:31     watchdog: execute power node2 off
info    06:42:31     hardware: crm on node 'node2' killed by poweroff
info    06:42:31     hardware: lrm on node 'node2' killed by poweroff
info    06:42:31     hardware: execute power node2 off
info    06:42:31     hardware: server 'node2' stopped by poweroff (watchdog)
info    06:43:36    node1/crm: got lock 'ha_manager_lock'
info    06:43:36    node1/crm: status change slave => master
info    06:43:36    node1/crm: node 'node2': state changed from 'online' => 'unknown'
info    06:44:36    node1/crm: service 'vm:101': state changed from 'started' to 'fence' 
info    06:44:36    node1/crm: node 'node2': state changed from 'unknown' => 'fence'
info    06:44:36    node1/crm: got lock 'ha_agent_node2_lock'
info    06:44:36    node1/crm: fencing: acknowleged - got agent lock for node 'node2'
info    06:44:36    node1/crm: node 'node2': state changed from 'fence' => 'unknown'
info    06:44:36    node1/crm: service 'vm:101': state changed from 'fence' to 'stopped' 
info    06:44:36    node1/crm: service 'vm:101': state changed from 'stopped' to 'started'  (node = node1)

Thank you,
 
I also tried a node power-off simulation.

I killed node1 with vm:101 running on it.

It took approximately 3 minutes to fence node1; then vm:101 started on node2 almost immediately.

Again, something doesn't compute here: why wait 3 minutes to fence a node that is already dead?

Code:
info    07:01:24     hardware: execute power node1 off
info    07:01:24     hardware: crm on node 'node1' killed by poweroff
info    07:01:24     hardware: lrm on node 'node1' killed by poweroff
info    07:01:24     hardware: execute network node1 off
info    07:03:17    node2/crm: got lock 'ha_manager_lock'
info    07:03:17    node2/crm: status change slave => master
info    07:03:17    node2/crm: node 'node1': state changed from 'online' => 'unknown'
info    07:04:17    node2/crm: service 'vm:101': state changed from 'started' to 'fence' 
info    07:04:17    node2/crm: node 'node1': state changed from 'unknown' => 'fence'
info    07:04:17    node2/crm: got lock 'ha_agent_node1_lock'
info    07:04:17    node2/crm: fencing: acknowleged - got agent lock for node 'node1'
info    07:04:17    node2/crm: node 'node1': state changed from 'fence' => 'unknown'
info    07:04:17    node2/crm: service 'vm:101': state changed from 'fence' to 'stopped' 
info    07:04:17    node2/crm: service 'vm:101': state changed from 'stopped' to 'started'  (node = node2)
info    07:04:21    node2/lrm: got lock 'ha_agent_node2_lock'
info    07:04:21    node2/lrm: status change wait_for_agent_lock => active
 
Again, something doesn't compute here: why wait 3 minutes to fence a node that is already dead?

We use locking with 120-second timeouts. The simulator is not 100% accurate, but you will see very similar values on real hardware.
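You can see this in your first log: the new master gets the 'ha_manager_lock' about 120 seconds after node2 loses the network, and vm:101 is started on node1 about 60 seconds after that, which adds up to the roughly 3 minutes you saw. Just as an illustration (computing the gaps from the timestamps you posted - this is not something the HA stack itself runs), assuming Python 3:

Code:
from datetime import datetime

# Timestamps copied from the first simulator log above (network-failure test).
events = [
    ("network off on node2",       "06:41:37"),
    ("node2 lrm loses agent lock", "06:41:45"),
    ("watchdog powers node2 off",  "06:42:31"),
    ("node1 gets ha_manager_lock", "06:43:36"),
    ("vm:101 started on node1",    "06:44:36"),
]

fmt = "%H:%M:%S"
start = datetime.strptime(events[0][1], fmt)
prev = start
for name, ts in events:
    t = datetime.strptime(ts, fmt)
    # gap since the previous event and total elapsed time since the failure
    print(f"{name:28s} +{(t - prev).seconds:3d}s  total {(t - start).seconds:3d}s")
    prev = t

The last line prints a total of 179 seconds, i.e. just under 3 minutes.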
 
Dietmar,

Thank you for your reply.

So it is 120 seconds for locking and 60 seconds for fencing, correct?

Are there any plans to make those numbers configurable?

Finally, I tested the situation where two out of three nodes go down.

Code:
info    07:34:47     hardware: crm on node 'node2' killed by poweroff
info    07:34:47     hardware: lrm on node 'node2' killed by poweroff
info    07:34:47     hardware: execute network node2 off
info    07:34:55     hardware: execute network node3 off
info    07:34:56    node1/crm: status change slave => wait_for_quorum
info    07:34:56    node3/crm: status change slave => wait_for_quorum
info    07:35:04    node3/lrm: status change active => lost_agent_lock
info    07:35:50     watchdog: execute power node3 off
info    07:35:50     hardware: crm on node 'node3' killed by poweroff
info    07:35:50     hardware: lrm on node 'node3' killed by poweroff
info    07:35:50     hardware: execute power node3 off
info    07:35:50     hardware: server 'node3' stopped by poweroff (watchdog)
info    07:37:59     hardware: execute power node2 on
info    07:37:59     hardware: execute network node2 on
info    07:37:59    node2/crm: status change startup => wait_for_quorum
info    07:37:59    node2/crm: got lock 'ha_manager_lock'
info    07:37:59    node2/crm: status change wait_for_quorum => master
info    07:37:59    node2/crm: node 'node3': state changed from 'online' => 'unknown'
info    07:37:59    node2/lrm: status change startup => wait_for_agent_lock
info    07:37:59    node2/lrm: got lock 'ha_agent_node2_lock'
info    07:37:59    node2/lrm: status change wait_for_agent_lock => active
info    07:38:06    node1/crm: status change wait_for_quorum => slave

I understand that by default the last remaining node will not run services because it has lost quorum (with three nodes, at least two votes are needed), but shouldn't the simulator restart vm:101 once two nodes are back up?

Also, reading the corosync 2 documentation, I see features like LMS (last_man_standing) and ATB (auto_tie_breaker). Will these options work within the Proxmox framework?
I know ATB is not usually recommended, but in some cases I need my services to stay up even if only a single node remains.
 
So it is 120 seconds for locking and 60 seconds for fencing, correct?

But those times can overlap.

Are there any plans to make those numbers configurable?

no

Finally, I tested the situation where two out of three nodes go down.

I understand that by default the last remaining node will not run services because it has lost quorum (with three nodes, at least two votes are needed), but shouldn't the simulator restart vm:101 once two nodes are back up?

Yes, it should.

Also, reading the corosync 2 documentation, I see features like LMS (last_man_standing) and ATB (auto_tie_breaker). Will these options work within the Proxmox framework?

Yes
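For illustration, those are votequorum options set in the quorum section of corosync.conf; a minimal sketch (option names as in the corosync votequorum man page - treat the values as an example, not as a recommended or tested configuration):

Code:
quorum {
  provider: corosync_votequorum
  # keep quorum while nodes fail one at a time
  last_man_standing: 1
  last_man_standing_window: 10000
  # on an even split, the partition containing the lowest nodeid keeps quorum
  auto_tie_breaker: 1
}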
 
I understand that by default the last remaining node will not run services because it has lost quorum (with three nodes, at least two votes are needed), but shouldn't the simulator restart vm:101 once two nodes are back up?

Just tested - it works for me. But it is quite impossible to say what happened from such incomplete logs.
 
