PVE 4 HA Simulation and testing

alitvak69

Renowned Member
Oct 2, 2015
For a better understanding of the new HA mechanism, I decided to try your pve-ha-simulator.

I started all nodes and enabled one service, vm:101.
Then I migrated vm:101 to node2; so far so good.
Finally, I disabled the network on node2.
The simulator fenced node2 and started vm:101 on node1, but the whole process took about 3 minutes to complete.
Does that sound normal? Can the timers be adjusted up or down?

Here is the simulator log:

Code:
info    06:41:37     hardware: execute network node2 off
info    06:41:45    node2/lrm: status change active => lost_agent_lock
info    06:41:50    node2/crm: status change master => lost_manager_lock
info    06:41:50    node2/crm: status change lost_manager_lock => wait_for_quorum
info    06:42:31     watchdog: execute power node2 off
info    06:42:31     hardware: crm on node 'node2' killed by poweroff
info    06:42:31     hardware: lrm on node 'node2' killed by poweroff
info    06:42:31     hardware: execute power node2 off
info    06:42:31     hardware: server 'node2' stopped by poweroff (watchdog)
info    06:43:36    node1/crm: got lock 'ha_manager_lock'
info    06:43:36    node1/crm: status change slave => master
info    06:43:36    node1/crm: node 'node2': state changed from 'online' => 'unknown'
info    06:44:36    node1/crm: service 'vm:101': state changed from 'started' to 'fence' 
info    06:44:36    node1/crm: node 'node2': state changed from 'unknown' => 'fence'
info    06:44:36    node1/crm: got lock 'ha_agent_node2_lock'
info    06:44:36    node1/crm: fencing: acknowleged - got agent lock for node 'node2'
info    06:44:36    node1/crm: node 'node2': state changed from 'fence' => 'unknown'
info    06:44:36    node1/crm: service 'vm:101': state changed from 'fence' to 'stopped' 
info    06:44:36    node1/crm: service 'vm:101': state changed from 'stopped' to 'started'  (node = node1)

Thank you,
 
I also tried a node power-off simulation.

I killed node1 with vm:101 running on it.

It took approximately 3 minutes to fence node1; then vm:101 started on node2 almost immediately.

Again, something doesn't compute here: why wait 3 minutes to fence a node that is already dead?

Code:
info    07:01:24     hardware: execute power node1 off
info    07:01:24     hardware: crm on node 'node1' killed by poweroff
info    07:01:24     hardware: lrm on node 'node1' killed by poweroff
info    07:01:24     hardware: execute network node1 off
info    07:03:17    node2/crm: got lock 'ha_manager_lock'
info    07:03:17    node2/crm: status change slave => master
info    07:03:17    node2/crm: node 'node1': state changed from 'online' => 'unknown'
info    07:04:17    node2/crm: service 'vm:101': state changed from 'started' to 'fence' 
info    07:04:17    node2/crm: node 'node1': state changed from 'unknown' => 'fence'
info    07:04:17    node2/crm: got lock 'ha_agent_node1_lock'
info    07:04:17    node2/crm: fencing: acknowleged - got agent lock for node 'node1'
info    07:04:17    node2/crm: node 'node1': state changed from 'fence' => 'unknown'
info    07:04:17    node2/crm: service 'vm:101': state changed from 'fence' to 'stopped' 
info    07:04:17    node2/crm: service 'vm:101': state changed from 'stopped' to 'started'  (node = node2)
info    07:04:21    node2/lrm: got lock 'ha_agent_node2_lock'
info    07:04:21    node2/lrm: status change wait_for_agent_lock => active
 
Again, something doesn't compute here: why wait 3 minutes to fence a node that is already dead?

We use locking with 120-second timeouts. The simulator is not 100% accurate, but you will see very similar values on real hardware.
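You can see this in your first log: the new master gets the 'ha_manager_lock' about 120 seconds after node2 loses the network, and vm:101 is started on node1 about 60 seconds after that, which adds up to the roughly 3 minutes you saw. Just as an illustration (computing the gaps from the timestamps you posted - this is not something the HA stack itself runs), assuming Python 3:

Code:
from datetime import datetime

# Timestamps copied from the first simulator log above (network-failure test).
events = [
    ("network off on node2",       "06:41:37"),
    ("node2 lrm loses agent lock", "06:41:45"),
    ("watchdog powers node2 off",  "06:42:31"),
    ("node1 gets ha_manager_lock", "06:43:36"),
    ("vm:101 started on node1",    "06:44:36"),
]

fmt = "%H:%M:%S"
start = datetime.strptime(events[0][1], fmt)
prev = start
for name, ts in events:
    t = datetime.strptime(ts, fmt)
    # gap since the previous event and total elapsed time since the failure
    print(f"{name:28s} +{(t - prev).seconds:3d}s  total {(t - start).seconds:3d}s")
    prev = t

The last line prints a total of 179 seconds, i.e. just under 3 minutes.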
 
Dietmar,

Thank you for your reply.

So it is 120 seconds for locking and 60 seconds for fencing, correct?

Are there any plans to make those numbers configurable?

Finally, I tested the situation where two out of three nodes go down.

Code:
info    07:34:47     hardware: crm on node 'node2' killed by poweroff
info    07:34:47     hardware: lrm on node 'node2' killed by poweroff
info    07:34:47     hardware: execute network node2 off
info    07:34:55     hardware: execute network node3 off
info    07:34:56    node1/crm: status change slave => wait_for_quorum
info    07:34:56    node3/crm: status change slave => wait_for_quorum
info    07:35:04    node3/lrm: status change active => lost_agent_lock
info    07:35:50     watchdog: execute power node3 off
info    07:35:50     hardware: crm on node 'node3' killed by poweroff
info    07:35:50     hardware: lrm on node 'node3' killed by poweroff
info    07:35:50     hardware: execute power node3 off
info    07:35:50     hardware: server 'node3' stopped by poweroff (watchdog)
info    07:37:59     hardware: execute power node2 on
info    07:37:59     hardware: execute network node2 on
info    07:37:59    node2/crm: status change startup => wait_for_quorum
info    07:37:59    node2/crm: got lock 'ha_manager_lock'
info    07:37:59    node2/crm: status change wait_for_quorum => master
info    07:37:59    node2/crm: node 'node3': state changed from 'online' => 'unknown'
info    07:37:59    node2/lrm: status change startup => wait_for_agent_lock
info    07:37:59    node2/lrm: got lock 'ha_agent_node2_lock'
info    07:37:59    node2/lrm: status change wait_for_agent_lock => active
info    07:38:06    node1/crm: status change wait_for_quorum => slave

I understand that by default the last remaining node will not run services because it has lost quorum (with three nodes, at least two votes are needed), but shouldn't the simulator restart vm:101 once two nodes are back up?

Also, reading the corosync 2 documentation, I see features like LMS (last_man_standing) and ATB (auto_tie_breaker). Will these options work within the Proxmox framework?
I know ATB is not usually recommended, but in some cases I need my services to stay up even if only a single node remains.
 
So it is 120 seconds for locking and 60 seconds for fencing, correct?

But those times can overlap.

Are there any plans to make those numbers configurable?

no

Finally, I tested the situation where two out of three nodes go down.

I understand that by default the last remaining node will not run services because it has lost quorum (with three nodes, at least two votes are needed), but shouldn't the simulator restart vm:101 once two nodes are back up?

Yes, it should.

Also, reading the corosync 2 documentation, I see features like LMS (last_man_standing) and ATB (auto_tie_breaker). Will these options work within the Proxmox framework?

Yes
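For illustration, those are votequorum options set in the quorum section of corosync.conf; a minimal sketch (option names as in the corosync votequorum man page - treat the values as an example, not as a recommended or tested configuration):

Code:
quorum {
  provider: corosync_votequorum
  # keep quorum while nodes fail one at a time
  last_man_standing: 1
  last_man_standing_window: 10000
  # on an even split, the partition containing the lowest nodeid keeps quorum
  auto_tie_breaker: 1
}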
 
I understand that by default the last remaining node will not run services because it has lost quorum (with three nodes, at least two votes are needed), but shouldn't the simulator restart vm:101 once two nodes are back up?

Just tested - it works for me. But it is quite impossible to say what happened from such incomplete logs.
 
