Quorum dissolved HA VM shutdown


Ninjix

Guest
Is this expected behavior?

I just restarted two of my four PVE 2.0 RC1 test nodes at the same time. rgmanager reported the quorum as dissolved when node-04 went down. The HA machine (VM 100) was then shut down by the cluster on node-01, even though that node remained online. Once node-03 completed its restart and quorum was regained, VM 100 was restarted on node-02. Any ideas on what may have caused this?

Before:

node-01  Online  VM 100 running
node-02  Online
node-03  Online
node-04  Online

During:

node-01  Online  VM 100 shut down
node-02  Online
node-03  Restarting
node-04  Restarting


After:

node-01  Online
node-02  Online  VM 100 running
node-03  Online
node-04  Online

Here is a copy of the rgmanager log from node-01.

Code:
...
Mar 05 15:00:07 rgmanager [pvevm] VM 100 is running
Mar 05 15:00:08 rgmanager [pvevm] VM 100 is running
Mar 05 15:00:38 rgmanager [pvevm] VM 100 is running
Mar 05 15:00:49 rgmanager #1: Quorum Dissolved
Mar 05 15:00:51 rgmanager [pvevm] Task still active, waiting
Mar 05 15:00:52 rgmanager [pvevm] Task still active, waiting
Mar 05 15:02:33 rgmanager Quorum Regained
Mar 05 15:02:34 rgmanager State change: Local UP
Mar 05 15:02:34 rgmanager State change: node-02 UP
Mar 05 15:02:34 rgmanager Loading Service Data
Mar 05 15:02:36 rgmanager Skipping stop-before-start: overridden by administrator
Mar 05 15:02:36 rgmanager [pvevm] VM 100 is not running
Mar 05 15:02:40 rgmanager State change: node-03 UP
Mar 05 15:02:52 rgmanager Migration: pvevm:100 is running on 2
Mar 05 15:02:52 rgmanager [pvevm] VM 100 is not running
Mar 05 15:02:53 rgmanager [pvevm] VM 100 is not running
Mar 05 15:03:23 rgmanager State change: node-04 UP
Mar 05 15:03:23 rgmanager [pvevm] VM 100 is not running
EOF
 
I just repeated the test. This time I live migrated VM 100 to node-03, then restarted node-01 and node-02. VM 100 was again shut down, and this time it restarted on node-04. Could something be off with an index value in my cluster?
 
Ok, here is the full explanation. When a cluster member determines that it is no longer part of the cluster quorum, the service manager stops all services and waits for a new quorum to form. That is what happened on node-01 (it stopped the VM).

When the cluster becomes quorate again, rgmanager starts the service again (it chooses any node if you do not specify a fail-over domain).

see 'man clurgmgrd'
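
To pin a service to preferred nodes, a fail-over domain can be declared in /etc/pve/cluster.conf. The sketch below follows the standard rgmanager schema; the domain name, node names, and priorities are illustrative examples, not values taken from this cluster:

```
<rm>
  <failoverdomains>
    <!-- ordered="1": members are tried in priority order (lower = preferred) -->
    <failoverdomain name="prefer_node01" ordered="1" restricted="0">
      <failoverdomainnode name="node-01" priority="1"/>
      <failoverdomainnode name="node-02" priority="2"/>
    </failoverdomain>
  </failoverdomains>
  <!-- bind the HA-managed VM to the domain -->
  <pvevm autostart="1" vmid="100" domain="prefer_node01"/>
</rm>
```

With restricted="0", the service can still run outside the listed nodes if none of them is available; set restricted="1" to forbid that.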
 
Thank you, Dietmar. Your explanation is consistent with my test results. I've also learned from the rgmanager docs that quorum = floor(n/2) + 1 votes, which explains why shutting down 2 of 4 nodes broke the quorum and the service manager stopped the VM on node-01. Now that I understand how this works, I am planning to test HA with ordered fail-over domains. From there I'm going to set up some Zabbix monitoring rules for these conditions.
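
The quorum arithmetic above can be sketched as follows (function names are illustrative, assuming the default of one vote per node as in cman/corosync on PVE 2.0):

```python
def quorum(total_votes: int) -> int:
    """Minimum votes needed for quorum: floor(total/2) + 1."""
    return total_votes // 2 + 1

def is_quorate(nodes_up: int, total_nodes: int) -> bool:
    """True if the surviving members still hold a majority of votes."""
    return nodes_up >= quorum(total_nodes)

# 4-node cluster: quorum is 3 votes.
print(quorum(4))         # 3
print(is_quorate(2, 4))  # False -> rgmanager stops HA services
print(is_quorate(3, 4))  # True  -> services can be restarted
```

This matches the test: with only node-01 and node-02 up, 2 < 3 votes, so the remaining members stop their HA services even though they themselves are healthy.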
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.