Another "red node" cluster question - 2 node HA cluster

ntblade

Renowned Member
Apr 29, 2011
Hi all,
I'm trying again to set up a 2-node HA cluster. My setup is two identical nodes, each with a fresh 3.4 install and:
1 x 60GB SSD for Proxmox and 1 x 1TB drive for DRBD
3 x 1G NICs (one bond0 / balance-rr) for the DRBD network
1 x 10MB iSCSI target on an Ubuntu host for the quorum disk (created roughly as sketched below)
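That quorum disk was set up along these lines (the portal and device path are just examples and will differ on another system):

Code:
# example only - portal and device path will differ
iscsiadm -m discovery -t sendtargets -p <ubuntu-host>
iscsiadm -m node --login
mkqdisk -c /dev/sdc -l pveqdisk   # label must match the quorumd entry in cluster.conf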
Fencing is by SNMP to an HP 1920G switch. When a node is fenced, all the switch ports it is connected to are disabled, as if the power had been pulled on that node.
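The agent can be tested by hand against the switch with something like this (check fence_ifmib -h for the exact flags, I may be misremembering them):

Code:
# SNMP status check of switch port 11 (pve1's first port)
fence_ifmib -a 172.16.12.250 -c fencing -n 11 -o status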

My problem is that I keep getting one node marked red in the web GUI. I've tried restarting services, rebooting, etc., and I've read lots of similar threads here, but I can't see any reason for the cluster to be failing.
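By "restarting services" I mean things roughly like this (on a stock 3.4 install), run on the node that shows red:

Code:
# check membership and quorum first
pvecm status
clustat
# restart the daemons behind the web GUI status, then the cluster filesystem
service pvestatd restart
service pvedaemon restart
service pve-cluster restart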

Here's my cluster.conf:

Code:
<?xml version="1.0"?>
<cluster name="athomeinel" config_version="4">
  <cman expected_votes="3" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <quorumd votes="1" allow_kill="0" interval="1" label="pveqdisk" tko="10"/>
  <totem token="54000"/>
  <clusternodes>
    <clusternode name="pve1" nodeid="1" votes="1">
        <fence>
          <method name="fence">
            <device action="off" name="HP1910" port="11"/>
            <device action="off" name="HP1910" port="13"/>
            <device action="off" name="HP1910" port="15"/>
          </method>
        </fence>
    </clusternode>
    <clusternode name="pve2" nodeid="2" votes="1">
        <fence>
          <method name="fence">
            <device action="off" name="HP1910" port="12"/>
            <device action="off" name="HP1910" port="14"/>
            <device action="off" name="HP1910" port="16"/>
          </method>
        </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_ifmib" community="fencing" ipaddr="172.16.12.250" name="HP1910" snmp_version="2c"/>
  </fencedevices>
  <rm>
  </rm>
</cluster>
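In case it's relevant, I believe the config can be sanity-checked before activating it from the HA tab with something along these lines (flags from memory, so double-check):

Code:
# validate the proposed cluster.conf before activating it via the web GUI
ccs_config_validate -v -f /etc/pve/cluster.conf.new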

My DRBD global_common.conf:
Code:
global { usage-count no; }
common {
        syncer { rate 30M; verify-alg md5; }
        handlers { out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root"; }
}
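With verify-alg set and the out-of-sync handler in place, an online verify can be kicked off with something like:

Code:
# start an online verify of r0 and watch its progress
drbdadm verify r0
watch -n1 cat /proc/drbd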

My DRBD r0.res:
Code:
resource r0 {
        protocol C;
        startup {
                wfc-timeout  0;     # non-zero wfc-timeout can be dangerous (http://forum.proxmox.com/threads/3465-Is-it-safe-to-use-wfc-timeout-in-DRBD-configuration)
                degr-wfc-timeout 60;
                become-primary-on both;
        }
        net {
                cram-hmac-alg sha1;
                shared-secret "my-secret";
                allow-two-primaries;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
                #data-integrity-alg crc32c;     # has to be enabled only for test and disabled for production use (check man drbd.conf, section "NOTES ON DATA INTEGRITY")
        }
        on pve1 {
                device /dev/drbd0;
                disk /dev/sdb1;
                address 172.16.12.251:7788;
                meta-disk internal;
        }
        on pve2 {
                device /dev/drbd0;
                disk /dev/sdb1;
                address 172.16.12.252:7788;
                meta-disk internal;
        }
    disk {
        # no-disk-barrier and no-disk-flushes should be applied only to systems with non-volatile (battery backed) controller caches.
        # Follow links for more information:
        # http://www.drbd.org/users-guide-8.3/s-throughput-tuning.html#s-tune-disable-barriers
        # http://www.drbd.org/users-guide/s-throughput-tuning.html#s-tune-disable-barriers
        no-disk-barrier;
        no-disk-flushes;
    }
}
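And roughly how I check DRBD after any change (output omitted here):

Code:
# re-apply the resource config and check connection/role/disk states
drbdadm adjust r0
cat /proc/drbd
drbd-overview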

Could someone please help me understand what's going wrong? I'm going round in circles :(

Thanks for reading,

NTB