Two-Node High Availability Cluster with a quorum disk

maximeg

New Member
Jun 17, 2015
Hi,

I'm trying to build a two-node cluster with a quorum disk.

I can migrate a container to another node. When a node fails, the other one detects it.
The fencing works as expected and the quorum disk too.

Everything works perfectly except the failover of my containers: when a node fails, I can't get its containers relocated to the surviving node. Do you have an idea? I'm out of ideas...
Thank you!

Code:
root@bare2:~# tail -f /var/log/cluster/qdiskd.log
Jun 17 22:20:43 qdiskd Writing eviction notice for node 2
Jun 17 22:20:44 qdiskd Node 2 evicted

Code:
root@bare2:~# tail -f /var/log/cluster/fenced.log
Jun 17 12:05:23 fenced fencing node bare1
Jun 17 12:06:56 fenced fence bare1 success



My cluster.conf file:

Code:
root@bare2:~# cat /etc/pve/cluster.conf
<?xml version="1.0"?>
<cluster config_version="23" name="XXX">
<cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
<quorumd allow_kill="0" interval="1" label="proxmox_qdisk" tko="10" votes="1"/>
<totem token="54000"/>
<fencedevices>
...
</fencedevices>
<clusternodes>
<clusternode name="bare2" nodeid="1" votes="1">
<fence>
<method name="1">
<device action="off" name="fence001"/>
</method>
</fence>
</clusternode>
<clusternode name="bare1" nodeid="2" votes="1">
<fence>
<method name="1">
<device action="off" name="fence002"/>
</method>
</fence>
</clusternode>
</clusternodes>
<rm>
<failoverdomains>
<failoverdomain name="failover1" nofailback="0" ordered="1" restricted="1">
<failoverdomainnode name="bare1" priority="1"/>
<failoverdomainnode name="bare2" priority="2"/>
</failoverdomain>
</failoverdomains>
<pvevm autostart="1" domain="failover1" recovery="relocate" vmid="101"/>
</rm>
</cluster>
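As a side note for anyone editing a config like the one above: on the PVE 3.x / cman stack it is worth validating cluster.conf before activating it, and the `config_version` attribute must be incremented or the cluster will ignore the new file. A rough sketch of that workflow (the `.new` filename is just an illustration):

```shell
# Sketch, assuming the PVE 3.x cman stack. Edit a copy of cluster.conf,
# remembering to bump config_version (here 23 -> 24), validate it,
# then activate it cluster-wide.
cp /etc/pve/cluster.conf /etc/pve/cluster.conf.new
vi /etc/pve/cluster.conf.new                        # edit; increment config_version
ccs_config_validate -f /etc/pve/cluster.conf.new    # schema check before activation
mv /etc/pve/cluster.conf.new /etc/pve/cluster.conf
cman_tool version -r                                # push the new config to all nodes
```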


And of course:

Code:
root@bare2:~# clustat
Cluster Status for XXX @ Wed Jun 17 22:16:11 2015
Member Status: Quorate


 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 bare2                                                               1 Online, Local
 bare1                                                               2 Online
 /dev/block/8:33                                                     0 Online, Quorum Disk


Code:
root@bare2:~# pvecm status
Version: 6.2.0
Config Version: 23
Cluster Name: XXX
Cluster Id: 3251
Cluster Member: Yes
Cluster Generation: 292
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Quorum device votes: 1
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 7
Flags:
Ports Bound: 0 178
Node name: bare2
Node ID: 1
Multicast addresses: 239.192.12.191
Node addresses: 172.16.0.2
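The vote arithmetic in this output is what makes the setup work: two nodes with one vote each plus one quorum-disk vote gives 3 total votes, and quorum is a strict majority, floor(total / 2) + 1 = 2. That is why one node plus the qdisk stays quorate when the other node dies. A trivial sketch of the calculation:

```shell
# Vote math from the pvecm output above: 2 node votes + 1 qdisk vote.
node_votes=2
qdisk_votes=1
total=$(( node_votes + qdisk_votes ))
quorum=$(( total / 2 + 1 ))     # strict majority
echo "total=$total quorum=$quorum"
# prints: total=3 quorum=2
```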
 
Hi, quick footnote on this,

I rebooted both hosts, just for fun, and things appear to be working precisely as I'd hoped. I didn't change anything.

To test, I then

-- added a new NFS shared storage, and it went into the config fine, with no error or timeout
-- powered off one host, so we lose the two-node-only quorum (the iSCSI quorum disk is still attached to the host remaining online)
-- and things are good: I still have quorum on the one online host, life is good
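For anyone repeating this test: one way to double-check the surviving node's view (assuming the cman tools shipped with PVE 3.x) is to look at the vote lines directly. "Total votes: 2" with "Quorum: 2" on the lone node means the node plus the qdisk still form a majority:

```shell
# On the surviving node, after powering off its peer:
cman_tool status | grep -E 'Total votes|Quorum'   # expect votes from node + qdisk
clustat -x                                        # XML cluster state, scriptable
```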

I am not sure why things were fiddly initially. I will stress-test it a bit more, but it seems in good order now.

Tim
 

Thank you Tim.
I rebooted both nodes and it works. I don't know why...but thank you!
 
I am glad this helped you out! Even funnier is that I posted my footnote in the wrong thread (same topic) and it helped you in the process :-) Sorry the wording of my message was so confusing - I was confused, shall we say (rushing a bit too much; too busy today!).

Maybe now we have a good open question for the ProxVE developers: is this kind of behaviour known? We appear to have two data points here involving quorum disks and things getting locked up until a clean reboot is performed.

Tim