Quorum Disk Reboots cause nodes to fence

adamb

I have a 2-node cluster using an iSCSI quorum device. When I reboot or shut down the quorum device, the two nodes fence each other off, or in some situations only one node gets fenced. This is odd because both nodes can still see each other, which means they should still have quorum.

I am using a heuristic in my cluster.conf, and I am not sure whether that is the cause. I am still trying to pin this down.

I am running with dual fence devices.

Code:
<?xml version="1.0"?>
<cluster config_version="6" name="testprox">
  <cman expected_votes="3" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <quorumd allow_kill="1" interval="3" label="cluster_qdisk" tko="10">
    <heuristic interval="3" program="multipath -ll | grep sd* | grep active" score="3" tko="5"/>
  </quorumd>
  <totem token="54000"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="10.80.8.92" lanplus="1" login="Administrator" name="ipmi1" passwd="VQVZBCJQ" power_wait="5"/>
    <fencedevice agent="fence_ipmilan" ipaddr="10.80.8.93" lanplus="1" login="Administrator" name="ipmi2" passwd="VSNBIYWD" power_wait="5"/>
    <fencedevice agent="fence_apc" ipaddr="10.80.8.94" login="device" name="apc" passwd="medent1"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="testprox1" nodeid="1" votes="1">
      <fence>
        <method name="acp">
          <device name="apc" port="1,2" secure="on"/>
        </method>
        <method name="1">
          <device name="ipmi1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="testprox2" nodeid="2" votes="1">
      <fence>
        <method name="acp">
          <device name="apc" port="3,4" secure="on"/>
        </method>
        <method name="1">
          <device name="ipmi2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="100"/>
  </rm>
</cluster>
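
For anyone wanting to double-check the vote math at runtime, this is roughly what to look at (assuming the standard cman tools on PVE 2.x; just a sketch, not output from my cluster):
Code:
# 2 node votes + 1 qdisk vote should give 3 expected votes
cman_tool status | grep -i votes    # Expected votes / Total votes
clustat                             # the qdisk should show up as a member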
 
Here is my latest cluster.conf with the changes from above added.

Code:
<?xml version="1.0"?>
<cluster config_version="8" name="testprox">
  <cman expected_votes="3" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <quorumd votes="1" allow_kill="0" interval="1" label="cluster_qdisk" tko="10">
    <heuristic interval="3" program="multipath -ll | grep sd* | grep active" score="3" tko="5"/>
  </quorumd>
  <totem token="54000"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="10.80.8.92" lanplus="1" login="Administrator" name="ipmi1" passwd="VQVZBCJQ" power_wait="5"/>
    <fencedevice agent="fence_ipmilan" ipaddr="10.80.8.93" lanplus="1" login="Administrator" name="ipmi2" passwd="VSNBIYWD" power_wait="5"/>
    <fencedevice agent="fence_apc" ipaddr="10.80.8.94" login="device" name="apc" passwd="medent1"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="testprox1" nodeid="1" votes="1">
      <fence>
        <method name="acp">
          <device name="apc" port="1,2" secure="on"/>
        </method>
        <method name="1">
          <device name="ipmi1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="testprox2" nodeid="2" votes="1">
      <fence>
        <method name="acp">
          <device name="apc" port="3,4" secure="on"/>
        </method>
        <method name="1">
          <device name="ipmi2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="100"/>
  </rm>
</cluster>
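
If I am reading qdisk(5) right, the timing side should also be fine, since the totem token timeout is well above the qdiskd eviction window:
Code:
qdiskd eviction window   = interval * tko            = 1 s * 10 = 10 s
heuristic failure window = heuristic interval * tko  = 3 s * 5  = 15 s
totem token              = 54000 ms                  = 54 s (well above 2 * 10 s)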
 
Still pulling my hair out on this one. No doubt this is being caused by my heuristic; I am just unsure what I am doing wrong.

Jan 08 10:33:07 qdiskd qdiskd: read (system call) has hung for 5 seconds
Jan 08 10:33:07 qdiskd In 5 more seconds, we will be evicted
Jan 08 10:33:12 qdiskd Heuristic: 'multipath -ll | grep sd* | grep active' DOWN - Exceeded timeout of 9 seconds

It seems that the qdisk being gone is what causes the heuristic to fail. Even though the heuristic should technically still pass, since my multipath devices are still there, it appears to fail simply because the qdisk isn't there to write to.
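
A manual check along these lines should show whether the heuristic command itself stalls while the qdisk target is offline (just a sketch of what I would run on a node):
Code:
# Run on a node while the qdisk's iSCSI target is down; if this takes longer
# than interval * tko for the heuristic, qdiskd will declare it DOWN.
time sh -c 'multipath -ll | grep sd* | grep active'
echo "exit status: $?"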
 
It seems you are missing two_node="1" in cman... at least, that is the correct config for a 2-node cluster with quorum.
 
It seems you are missing two_node="1" in cman... at least, that is the correct config for a 2-node cluster with quorum.

That is incorrect.

http://pve.proxmox.com/wiki/Two-Node_High_Availability_Cluster


  • Edit and make the appropriate changes to the file, a) increment the "config_version" by one number, b) remove two_node="1", c) add the quorumd definition and totem timeout, something like this:

I do appreciate the input though!
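
For reference, the distinction the wiki draws (as I read it) is roughly this; with a qdisk you drop two_node and count the qdisk's vote instead. A rough sketch, not a full config:
Code:
<!-- 2-node cluster without a qdisk (the stock two-node setup): -->
<cman two_node="1" expected_votes="1" keyfile="/var/lib/pve-cluster/corosync.authkey"/>

<!-- 2-node cluster with a qdisk (what I am running, heuristic omitted here): -->
<cman expected_votes="3" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
<quorumd votes="1" allow_kill="0" interval="1" label="cluster_qdisk" tko="10"/>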
 
Try changing from
Code:
<quorumd votes="1" allow_kill="0" interval="1" label="cluster_qdisk" tko="10">
<heuristic interval="3" program="multipath -ll | grep sd* | grep active" score="3" tko="5"/>
</quorumd>
to
Code:
<quorumd votes="1" allow_kill="0" interval="3" label="cluster_qdisk" tko="10">
<heuristic interval="3" program="multipath -ll | grep sd* | grep active" score="3" tko="5"/>
</quorumd>
I think the interval of the quorumd element needs to match the interval of the heuristic element.
 
Try changing from
Code:
<quorumd votes="1" allow_kill="0" interval="1" label="cluster_qdisk" tko="10">
<heuristic interval="3" program="multipath -ll | grep sd* | grep active" score="3" tko="5"/>
</quorumd>
to
Code:
<quorumd votes="1" allow_kill="0" interval="3" label="cluster_qdisk" tko="10">
<heuristic interval="3" program="multipath -ll | grep sd* | grep active" score="3" tko="5"/>
</quorumd>
I think the interval of the quorumd element needs to match the interval of the heuristic element.

If you look at my very first post, you will see that I originally had those values matching and still had the issue.
 
So I decided to change my heuristic to something simpler (a ping to our gateway) and I no longer have the issue. So this is something specific to my heuristic. What kills me is that this heuristic works great right up until the qdisk is gone. The heuristic also works great when I pull my storage device from the nodes, so I figured that meant it was OK. I wonder what is wrong with it.
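
For anyone curious, the simpler heuristic is just a ping along these lines (the gateway IP below is a placeholder, and the interval/score/tko values are illustrative, not necessarily my exact ones):
Code:
<heuristic interval="3" program="ping -c1 -w1 192.0.2.1" score="3" tko="5"/>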
 
I moved my heuristic into a small bash script which exits with either a 0 or a 1.

0=success
1=failure

I can see the heuristic is UP and working 100% while the qdisk is part of the cluster.

Jan 08 15:44:26 qdiskd Heuristic: '/usr/bin/multipath_test' UP
Jan 08 15:44:52 qdiskd Initial score 1/1
Jan 08 15:44:52 qdiskd Initialization complete

Once I pull the quorum disk from the cluster I see this.

Jan 08 15:48:31 qdiskd qdiskd: read (system call) has hung for 15 seconds
Jan 08 15:48:31 qdiskd In 15 more seconds, we will be evicted
Jan 08 15:48:43 qdiskd Heuristic: '/usr/bin/multipath_test' DOWN - Exceeded timeout of 27 seconds

This just doesn't add up: when I pull the qdisk, my multipath_test still returns 0, which according to the qdisk man page is all it needs to succeed.

Just stating the obvious here: my qdisk is not part of my multipath device/central storage. It is on a completely separate node, presented over iSCSI.
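
For reference, a minimal sketch of that kind of wrapper would look like this (the exact multipath check is illustrative, not my production script; the part that matters is the exit status):
Code:
#!/bin/bash
# /usr/bin/multipath_test -- illustrative sketch of a multipath heuristic wrapper.
# qdiskd only looks at the exit status: 0 = heuristic passes, non-zero = fails.
if multipath -ll 2>/dev/null | grep -q 'active'; then
    exit 0    # at least one path reported active
else
    exit 1    # no active paths found
fi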
 
