Quorum Disk Reboots cause nodes to fence

adamb

I have a 2-node cluster using an iSCSI quorum device. When I reboot or shut down the quorum device, the two nodes fence each other off, or in some situations only one node gets fenced. This is odd because both nodes can still see each other, which means they should still have quorum.

I am using a heuristic in my cluster.conf, and I am not sure whether that is the cause. I am still trying to pin this down.

I am running with dual fence devices.

Code:
<?xml version="1.0"?>
<cluster config_version="6" name="testprox">
  <cman expected_votes="3" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <quorumd allow_kill="1" interval="3" label="cluster_qdisk" tko="10">
    <heuristic interval="3" program="multipath -ll | grep sd* | grep active" score="3" tko="5"/>
  </quorumd>
  <totem token="54000"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="10.80.8.92" lanplus="1" login="Administrator" name="ipmi1" passwd="VQVZBCJQ" power_wait="5"/>
    <fencedevice agent="fence_ipmilan" ipaddr="10.80.8.93" lanplus="1" login="Administrator" name="ipmi2" passwd="VSNBIYWD" power_wait="5"/>
    <fencedevice agent="fence_apc" ipaddr="10.80.8.94" login="device" name="apc" passwd="medent1"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="testprox1" nodeid="1" votes="1">
      <fence>
        <method name="acp">
          <device name="apc" port="1,2" secure="on"/>
        </method>
        <method name="1">
          <device name="ipmi1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="testprox2" nodeid="2" votes="1">
      <fence>
        <method name="acp">
          <device name="apc" port="3,4" secure="on"/>
        </method>
        <method name="1">
          <device name="ipmi2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="100"/>
  </rm>
</cluster>
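
For anyone wanting to double-check the vote math at runtime, this is roughly what to look at (assuming the standard cman tools on PVE 2.x; just a sketch, not output from my cluster):
Code:
# 2 node votes + 1 qdisk vote should give 3 expected votes
cman_tool status | grep -i votes    # Expected votes / Total votes
clustat                             # the qdisk should show up as a member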
 
Here is my latest cluster.conf with the changes from above added.

Code:
<?xml version="1.0"?>
<cluster config_version="8" name="testprox">
  <cman expected_votes="3" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <quorumd votes="1" allow_kill="0" interval="1" label="cluster_qdisk" tko="10">
    <heuristic interval="3" program="multipath -ll | grep sd* | grep active" score="3" tko="5"/>
  </quorumd>
  <totem token="54000"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="10.80.8.92" lanplus="1" login="Administrator" name="ipmi1" passwd="VQVZBCJQ" power_wait="5"/>
    <fencedevice agent="fence_ipmilan" ipaddr="10.80.8.93" lanplus="1" login="Administrator" name="ipmi2" passwd="VSNBIYWD" power_wait="5"/>
    <fencedevice agent="fence_apc" ipaddr="10.80.8.94" login="device" name="apc" passwd="medent1"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="testprox1" nodeid="1" votes="1">
      <fence>
        <method name="acp">
          <device name="apc" port="1,2" secure="on"/>
        </method>
        <method name="1">
          <device name="ipmi1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="testprox2" nodeid="2" votes="1">
      <fence>
        <method name="acp">
          <device name="apc" port="3,4" secure="on"/>
        </method>
        <method name="1">
          <device name="ipmi2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="100"/>
  </rm>
</cluster>
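
If I am reading qdisk(5) right, the timing side should also be fine, since the totem token timeout is well above the qdiskd eviction window:
Code:
qdiskd eviction window   = interval * tko            = 1 s * 10 = 10 s
heuristic failure window = heuristic interval * tko  = 3 s * 5  = 15 s
totem token              = 54000 ms                  = 54 s (well above 2 * 10 s)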
 
Still pulling my hair out on this one. No doubt this is being caused by my heuristic; I am just unsure what I am doing wrong.

Jan 08 10:33:07 qdiskd qdiskd: read (system call) has hung for 5 seconds
Jan 08 10:33:07 qdiskd In 5 more seconds, we will be evicted
Jan 08 10:33:12 qdiskd Heuristic: 'multipath -ll | grep sd* | grep active' DOWN - Exceeded timeout of 9 seconds

It seems that the qdisk being gone is what causes the heuristic to fail. Even though the heuristic should technically still pass, since my multipath devices are still there, it appears to fail simply because the qdisk isn't there to write to.
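
A manual check along these lines should show whether the heuristic command itself stalls while the qdisk target is offline (just a sketch of what I would run on a node):
Code:
# Run on a node while the qdisk's iSCSI target is down; if this takes longer
# than interval * tko for the heuristic, qdiskd will declare it DOWN.
time sh -c 'multipath -ll | grep sd* | grep active'
echo "exit status: $?"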
 
It seems you are missing two_node="1" in cman... at least, that is the correct config for a 2-node cluster with quorum.
 
It seems you are missing two_node="1" in cman... at least, that is the correct config for a 2-node cluster with quorum.

That is incorrect.

http://pve.proxmox.com/wiki/Two-Node_High_Availability_Cluster


  • Edit and make the appropriate changes to the file, a) increment the "config_version" by one number, b) remove two_node="1", c) add the quorumd definition and totem timeout, something like this:

I do appreciate the input though!
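
For reference, the distinction the wiki draws (as I read it) is roughly this; with a qdisk you drop two_node and count the qdisk's vote instead. A rough sketch, not a full config:
Code:
<!-- 2-node cluster without a qdisk (the stock two-node setup): -->
<cman two_node="1" expected_votes="1" keyfile="/var/lib/pve-cluster/corosync.authkey"/>

<!-- 2-node cluster with a qdisk (what I am running, heuristic omitted here): -->
<cman expected_votes="3" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
<quorumd votes="1" allow_kill="0" interval="1" label="cluster_qdisk" tko="10"/>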
 
Try changing from
Code:
<quorumd votes="1" allow_kill="0" interval="1" label="cluster_qdisk" tko="10">
<heuristic interval="3" program="multipath -ll | grep sd* | grep active" score="3" tko="5"/>
</quorumd>
to
Code:
<quorumd votes="1" allow_kill="0" interval="3" label="cluster_qdisk" tko="10">
<heuristic interval="3" program="multipath -ll | grep sd* | grep active" score="3" tko="5"/>
</quorumd>
I think the interval of the quorumd element needs to match the interval of the heuristic element.
 
Try changing from
Code:
<quorumd votes="1" allow_kill="0" interval="1" label="cluster_qdisk" tko="10">
<heuristic interval="3" program="multipath -ll | grep sd* | grep active" score="3" tko="5"/>
</quorumd>
to
Code:
<quorumd votes="1" allow_kill="0" interval="3" label="cluster_qdisk" tko="10">
<heuristic interval="3" program="multipath -ll | grep sd* | grep active" score="3" tko="5"/>
</quorumd>
I think the interval of the quorumd element needs to match the interval of the heuristic element.

If you look at my very first post, you will see that I originally had those values matching and still had the issue.
 
So I decided to change my heuristic to something simpler (a ping to our gateway) and I no longer have the issue. So this is something specific to my heuristic. What kills me is that this heuristic works great right up until the qdisk is gone. The heuristic also works great when I pull my storage device from the nodes, so I figured that meant it was OK. I wonder what is wrong with it.
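
For anyone curious, the simpler heuristic is just a ping along these lines (the gateway IP below is a placeholder, and the interval/score/tko values are illustrative, not necessarily my exact ones):
Code:
<heuristic interval="3" program="ping -c1 -w1 192.0.2.1" score="3" tko="5"/>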
 
I moved my heuristic into a small bash script which exits with either a 0 or a 1.

0=success
1=failure

I can see the heuristic is UP and working 100% while the qdisk is part of the cluster.

Jan 08 15:44:26 qdiskd Heuristic: '/usr/bin/multipath_test' UP
Jan 08 15:44:52 qdiskd Initial score 1/1
Jan 08 15:44:52 qdiskd Initialization complete

Once I pull the quorum disk from the cluster I see this.

Jan 08 15:48:31 qdiskd qdiskd: read (system call) has hung for 15 seconds
Jan 08 15:48:31 qdiskd In 15 more seconds, we will be evicted
Jan 08 15:48:43 qdiskd Heuristic: '/usr/bin/multipath_test' DOWN - Exceeded timeout of 27 seconds

This just doesn't add up: when I pull the qdisk, my multipath_test still returns 0, which according to the qdisk man page is all it needs to succeed.

Just stating the obvious here: my qdisk is not part of my multipath device/central storage. It is on a completely separate node, presented over iSCSI.
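
For reference, a minimal sketch of that kind of wrapper would look like this (the exact multipath check is illustrative, not my production script; the part that matters is the exit status):
Code:
#!/bin/bash
# /usr/bin/multipath_test -- illustrative sketch of a multipath heuristic wrapper.
# qdiskd only looks at the exit status: 0 = heuristic passes, non-zero = fails.
if multipath -ll 2>/dev/null | grep -q 'active'; then
    exit 0    # at least one path reported active
else
    exit 1    # no active paths found
fi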
 
