Quorum dissolved HA VM shutdown

Ninjix · Mar 5, 2012

Is this expected behavior?

I just restarted two of my four PVE 2.0 RC1 test nodes (node-03 and node-04) at the same time. rgmanager reported the quorum as dissolved as node-04 was shut down. The HA machine (VM 100) was shut down by the cluster on node-01 even though that node remained online. Once node-03 completed its restart and quorum was regained, VM 100 was restarted on node-02. Any ideas on what may have caused this?

Before:

Node     Status      VM
node-01  Online      VM 100 running
node-02  Online
node-03  Online
node-04  Online

During:

Node     Status      VM
node-01  Online      VM 100 shut down
node-02  Online
node-03  Restarting
node-04  Restarting

After:

Node     Status      VM
node-01  Online
node-02  Online      VM 100 running
node-03  Online
node-04  Online

Here is a copy of the rgmanager log from node-01.

Code:

...
Mar 05 15:00:07 rgmanager [pvevm] VM 100 is running
Mar 05 15:00:08 rgmanager [pvevm] VM 100 is running
Mar 05 15:00:38 rgmanager [pvevm] VM 100 is running
Mar 05 15:00:49 rgmanager #1: Quorum Dissolved
Mar 05 15:00:51 rgmanager [pvevm] Task still active, waiting
Mar 05 15:00:52 rgmanager [pvevm] Task still active, waiting
Mar 05 15:02:33 rgmanager Quorum Regained
Mar 05 15:02:34 rgmanager State change: Local UP
Mar 05 15:02:34 rgmanager State change: node-02 UP
Mar 05 15:02:34 rgmanager Loading Service Data
Mar 05 15:02:36 rgmanager Skipping stop-before-start: overridden by administrator
Mar 05 15:02:36 rgmanager [pvevm] VM 100 is not running
Mar 05 15:02:40 rgmanager State change: node-03 UP
Mar 05 15:02:52 rgmanager Migration: pvevm:100 is running on 2
Mar 05 15:02:52 rgmanager [pvevm] VM 100 is not running
Mar 05 15:02:53 rgmanager [pvevm] VM 100 is not running
Mar 05 15:03:23 rgmanager State change: node-04 UP
Mar 05 15:03:23 rgmanager [pvevm] VM 100 is not running
EOF
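
For anyone who wants to measure how long a node sat without quorum, here is a rough Python sketch that scans rgmanager log lines like the ones above. It assumes the timestamp format shown in this log (month, day, time, no year); the year constant and the function names are my own placeholders, not anything rgmanager provides.

```python
from datetime import datetime

# Assumption: the log lines carry no year, so one is supplied here
# (taken from the date of this post).
LOG_YEAR = 2012


def parse_stamp(line):
    """Parse the leading 'Mar 05 15:00:49' style timestamp of a log line."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(
        f"{LOG_YEAR} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
    )


def quorum_outages(lines):
    """Return (lost_at, regained_at) pairs, one per quorum loss."""
    outages = []
    lost_at = None
    for line in lines:
        if "Quorum Dissolved" in line:
            lost_at = parse_stamp(line)
        elif "Quorum Regained" in line and lost_at is not None:
            outages.append((lost_at, parse_stamp(line)))
            lost_at = None
    return outages


sample = [
    "Mar 05 15:00:38 rgmanager [pvevm] VM 100 is running",
    "Mar 05 15:00:49 rgmanager #1: Quorum Dissolved",
    "Mar 05 15:00:51 rgmanager [pvevm] Task still active, waiting",
    "Mar 05 15:02:33 rgmanager Quorum Regained",
]

for lost, regained in quorum_outages(sample):
    seconds = (regained - lost).total_seconds()
    print(f"no quorum for {seconds:.0f} s")
```

With the four sample lines taken from the log above, this reports a gap of 104 seconds between "Quorum Dissolved" and "Quorum Regained".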

Ninjix · Mar 5, 2012

I just ran another test. This time I live migrated VM 100 to node-03, then restarted node-01 and node-02. VM 100 was again shut down. It restarted this time on node-04. Could something be off with an index value in my cluster?

dietmar · Mar 6, 2012

Ninjix said:
It restarted this time on node-04.

Where do you see a restart? It looks more like an online migration (or was the VM really stopped)?

dietmar · Mar 6, 2012

Ninjix said:
Is this expected behavior?

I agree this is not optimal, but you should really avoid losing quorum. Can you please try specifying a failover domain to test whether the behavior is better (see 'man rgmanager')?

dietmar · Mar 6, 2012

OK, here is the full explanation. When a cluster member determines that it is no longer in the cluster quorum, the service manager stops all services and waits for a new quorum to form. That is what happened on node-01 (the VM was stopped).

When the cluster becomes quorate again, rgmanager starts the service again (it chooses any node if you do not specify a failover domain).

See 'man clurgmgrd'.

Ninjix · Mar 6, 2012

Thank you, Dietmar. Your explanation is consistent with my test results. I've also learned from the rgmanager docs that quorum = (0.5 * n) + 1, which is 3 for my four nodes. That explains why shutting down 2 of 4 nodes broke the quorum and the service manager stopped the VM on node-01. Now that I understand how this works, I am planning to test HA with ordered failover domains. From there I am going to set up some Zabbix monitoring rules for these conditions.
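
To make the arithmetic concrete, here is a small Python sketch of the quorum formula above. It assumes one vote per node and that the half is rounded down (integer division); the function names are just illustrative, not part of rgmanager or cman.

```python
def quorum_threshold(total_votes):
    """Votes needed for quorum: (0.5 * n) + 1, with the half rounded down."""
    return total_votes // 2 + 1


def is_quorate(online_votes, total_votes):
    """True if the online members hold enough votes to form a quorum."""
    return online_votes >= quorum_threshold(total_votes)


# Assumption: four nodes with one vote each, as in my test cluster.
TOTAL_NODES = 4

for online in range(TOTAL_NODES, 0, -1):
    state = "quorate" if is_quorate(online, TOTAL_NODES) else "NOT quorate"
    print(f"{online} of {TOTAL_NODES} nodes online: {state}")
```

Under these assumptions the threshold for four nodes is 3, so the cluster stays quorate with one node down but not with two, which matches what I saw when restarting two nodes at once.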
