Quorum dissolved HA VM shutdown

  • Thread starter: Ninjix

Is this expected behavior?

I just restarted two of my four PVE 2.0 RC1 test nodes at the same time. rgmanager reported that the quorum dissolved as node-04 was shut down. The HA machine (VM 100) was shut down by the cluster on node-01 even though that node remained online. Once node-03 completed its restart and the quorum was regained, VM 100 was restarted on node-02. Any ideas on what may have caused this?

Before:

[TABLE="class: grid, width: 300"]
[TR]
[TD]node-01[/TD]
[TD]Online[/TD]
[TD]VM 100 running[/TD]
[/TR]
[TR]
[TD]node-02[/TD]
[TD]Online[/TD]
[TD][/TD]
[/TR]
[TR]
[TD]node-03[/TD]
[TD]Online[/TD]
[TD][/TD]
[/TR]
[TR]
[TD]node-04[/TD]
[TD]Online[/TD]
[TD][/TD]
[/TR]
[/TABLE]

During:

[TABLE="class: grid, width: 300"]
[TR]
[TD]node-01[/TD]
[TD]Online[/TD]
[TD]VM 100 shutdown[/TD]
[/TR]
[TR]
[TD]node-02[/TD]
[TD]Online[/TD]
[TD][/TD]
[/TR]
[TR]
[TD]node-03[/TD]
[TD]Restarting[/TD]
[TD][/TD]
[/TR]
[TR]
[TD]node-04[/TD]
[TD]Restarting[/TD]
[TD][/TD]
[/TR]
[/TABLE]


After:

[TABLE="class: grid, width: 300"]
[TR]
[TD]node-01[/TD]
[TD]Online[/TD]
[TD][/TD]
[/TR]
[TR]
[TD]node-02[/TD]
[TD]Online[/TD]
[TD]VM 100 running[/TD]
[/TR]
[TR]
[TD]node-03[/TD]
[TD]Online[/TD]
[TD][/TD]
[/TR]
[TR]
[TD]node-04[/TD]
[TD]Online[/TD]
[TD][/TD]
[/TR]
[/TABLE]

Here is a copy of the rgmanager log from node-01.

Code:
...
Mar 05 15:00:07 rgmanager [pvevm] VM 100 is running
Mar 05 15:00:08 rgmanager [pvevm] VM 100 is running
Mar 05 15:00:38 rgmanager [pvevm] VM 100 is running
Mar 05 15:00:49 rgmanager #1: Quorum Dissolved
Mar 05 15:00:51 rgmanager [pvevm] Task still active, waiting
Mar 05 15:00:52 rgmanager [pvevm] Task still active, waiting
Mar 05 15:02:33 rgmanager Quorum Regained
Mar 05 15:02:34 rgmanager State change: Local UP
Mar 05 15:02:34 rgmanager State change: node-02 UP
Mar 05 15:02:34 rgmanager Loading Service Data
Mar 05 15:02:36 rgmanager Skipping stop-before-start: overridden by administrator
Mar 05 15:02:36 rgmanager [pvevm] VM 100 is not running
Mar 05 15:02:40 rgmanager State change: node-03 UP
Mar 05 15:02:52 rgmanager Migration: pvevm:100 is running on 2
Mar 05 15:02:52 rgmanager [pvevm] VM 100 is not running
Mar 05 15:02:53 rgmanager [pvevm] VM 100 is not running
Mar 05 15:03:23 rgmanager State change: node-04 UP
Mar 05 15:03:23 rgmanager [pvevm] VM 100 is not running
EOF
 
I just repeated the test. This time I live-migrated VM 100 to node-03, then restarted node-01 and node-02. VM 100 was again shut down, and this time it restarted on node-04. Could something be off with an index value in my cluster?
 
OK, here is the full explanation. When a cluster member determines that it is no longer in the cluster quorum, the service manager stops all services and waits for a new quorum to form. That is what happened on node-01 (it stopped the VM).

When the cluster becomes quorate again, rgmanager starts the service again (it chooses any node if you do not specify a fail-over domain).

See 'man clurgmgrd'.
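
To make the sequence above concrete, here is a toy model in Python. It is only a sketch of the behavior described in this post, not rgmanager code: the `ToyCluster` class, its method names, and the one-vote-per-node assumption are all made up for illustration, and the random placement stands in for "chooses any node".

```python
import random


class ToyCluster:
    """Hypothetical model: one HA service, one vote per node (assumption)."""

    def __init__(self, nodes, owner):
        self.nodes = list(nodes)
        self.online = set(nodes)
        self.owner = owner  # node currently running the service, or None

    def quorate(self):
        # Majority rule: more than half of all configured nodes must be online.
        return len(self.online) >= len(self.nodes) // 2 + 1

    def set_online(self, online, chooser=random.choice):
        """Apply a membership change and react the way the post describes."""
        self.online = set(online)
        if not self.quorate():
            # No quorum: the service manager stops all services and waits.
            self.owner = None
        elif self.owner not in self.online:
            # Quorate, but the service is not running on an online node:
            # with no fail-over domain, any online node may be picked.
            self.owner = chooser(sorted(self.online))


cluster = ToyCluster(["node-01", "node-02", "node-03", "node-04"], owner="node-01")

# Two of four nodes restart at once: quorum is lost, VM is stopped.
cluster.set_online({"node-01", "node-02"})
print("during:", cluster.quorate(), cluster.owner)

# A third node returns: quorum is regained, VM starts on some online node.
cluster.set_online({"node-01", "node-02", "node-03"})
print("after:", cluster.quorate(), cluster.owner)
```

In this model the VM does not necessarily come back on the node where it was stopped, which matches what was observed in the two tests.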
 
Thank you, Dietmar. Your explanation is consistent with my test results. I've also learned from the rgmanager docs that quorum = (0.5 * n) + 1, which for n = 4 means 3 nodes. That explains why shutting down 2 of 4 nodes broke the quorum and the service manager stopped the VM on node-01. Now that I understand how this works, I am planning to test HA with ordered fail-over domains. From there I am going to set up some Zabbix monitoring rules for these conditions.
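
For anyone else sizing a test cluster, the arithmetic can be checked with a few lines of Python. This is a rough sketch that assumes every node has exactly one vote and that the 0.5 * n term is rounded down for odd node counts; the function names are my own, not anything from rgmanager.

```python
def quorum_votes(total_nodes):
    """Votes needed for quorum, assuming one vote per node."""
    return total_nodes // 2 + 1


def tolerated_failures(total_nodes):
    """How many nodes can be down before quorum is lost."""
    return total_nodes - quorum_votes(total_nodes)


for n in range(2, 8):
    print(f"{n} nodes: quorum={quorum_votes(n)}, "
          f"can lose {tolerated_failures(n)}")
```

With four nodes the quorum is three, so only one node can be down at a time. A five-node cluster would tolerate two.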