Proxmox High Availability (HA) cluster - is it working at all for anyone?

newacecorp

I've been playing with HA on a 3-node Proxmox 3.3 (now 3.4) cluster for about a month now.

My conclusion is that the Red Hat components handling fencing and resource management are mostly broken, and where they can be made to work, they work inconsistently.

Here's my scenario:

I'm using a 3-node cluster with one of the nodes also acting as an NFS server. All three nodes are Dell R610s with 48 GB ECC RAM, RAID 10 SSDs on a PERC H700, redundant power supplies, and iDRAC6 Enterprise with IPMI over LAN enabled on a dedicated Ethernet port. I'm using IPMILAN for fencing. Since each server is fed from two different UPSes, it's as good as any other fencing solution.

Issues:

(1) Fencing - I can't make fence_node work out of the box; it always comes back with "agent error". I traced the issue to the check_input function in /usr/share/fence/fencing.py and removed the last section, where the check for device_opts is performed. Now unfencing (fence_node -U <node>) and fencing (fence_node <node>) work as expected, and fence_node -S <node> returns the proper status from any of the nodes. (A sketch for testing the agent directly follows the issues below.)

(2) CMAN / RGMANAGER don't start consistently when a node is powered up - no solution so far other than logging in and starting them manually.

(3) When everything appears to be working, "fence_tool ls", "pvecm status", and "/etc/pve/.members" report the correct information on all nodes and the cluster is quorate, purposely disabling a node that has an HA VM on it does absolutely nothing. If I go to the node running the HA VM and disable the network interface (e.g. ifconfig vmbr1 down), the other nodes report the node as down after a short period, but it never gets fenced (shut down) and the HA VM does not migrate. If I manually issue "fence_node <node>" from one of the other running nodes, the node with the HA VM shuts down, but nothing happens to the HA VM: it stays attached to the node that was fenced.

On top of this, the remaining nodes start behaving erratically. Issuing "pvecm status" takes 10-20 seconds, opening a console on a running VM doesn't work, and the web interface (which I'm accessing through the public interface, vmbr0, of one of the running nodes) becomes unresponsive as well.
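For anyone hitting the same "agent error" on issue (1), it's worth testing the agent directly outside of fenced (flags as per the RHEL6-era fence_ipmilan; IP and credentials below are the ones from my cluster.conf):
Code:
# test the agent directly, bypassing fenced and cluster.conf
fence_ipmilan -a 10.2.2.5 -l admin -p 'XXXXXXXX' -P -C 1 -o status

# then the cluster-level wrapper against a node name:
fence_node -S m5   # power status via the configured fence device
fence_node -U m5   # unfence (power on)
fence_node m5      # fence (power off)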

What I expect:

(1) If I isolate a node that's part of the cluster, I expect it to be fenced (turned off) within 5-10 seconds and the HA VMs that were running on it to be promptly migrated.

(2) Unfencing a node should power it up and have it rejoin the cluster automatically, without my having to force a "service cman start" and "service rgmanager start".
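For what it's worth, whether the relevant init scripts are even enabled at boot can be checked with plain Debian tooling (runlevel 2 is the Wheezy default):
Code:
ls /etc/rc2.d/ | grep -E 'cman|rgmanager|pve-cluster'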


Has anyone actually made all this work, as expected? Any comments would be appreciated.

Regards,

Stephan.
 
1) For a start you could show your cluster.conf.
2) When you pull the network do you still have access to IPMILAN?
3) When a node is to be fenced, which network is used? The IPMILAN?
 
IPMI is a totally unreliable fence method (by design) and only suitable for use as a second fence device.
I suggest using a power-based fence device instead.
 
1) See cluster.conf below
2) When I pull the network on a node, that node does NOT have access to IPMILAN, but the other nodes still do. I expect one of the remaining nodes (the master?) to issue the appropriate "fence_node" command to isolate the node whose network was pulled. What happens if a node locks up? I would expect the other nodes in the cluster to issue the fence_node command rather than relying on the node to do this itself (as it may be locked up and unable to). Thinking about it further, I would expect the node that is to receive the HA VM to issue the fence_node command against the node the HA VM is being taken from.
3) It's using an internal network attached to vmbr1

<?xml version="1.0"?>
<cluster config_version="41" name="newace">
  <cman two_node="0" expected_votes="4" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" cipher="1" ipaddr="10.2.2.5" lanplus="1" login="admin" name="m5-ipmi" passwd="XXXXXXXX" power_wait="5" method="onoff"/>
    <fencedevice agent="fence_ipmilan" cipher="1" ipaddr="10.2.2.6" lanplus="1" login="admin" name="m6-ipmi" passwd="XXXXXXXX" power_wait="5" method="onoff"/>
    <fencedevice agent="fence_ipmilan" cipher="1" ipaddr="10.2.2.7" lanplus="1" login="admin" name="m7-ipmi" passwd="XXXXXXXX" power_wait="5" method="onoff"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="m7" nodeid="2" votes="3">
      <fence>
        <method name="1">
          <device action="off" name="m7-ipmi"/>
        </method>
      </fence>
      <unfence>
        <device action="on" name="m7-ipmi"/>
      </unfence>
    </clusternode>
    <clusternode name="m5" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="m5-ipmi"/>
        </method>
      </fence>
      <unfence>
        <device action="on" name="m5-ipmi"/>
      </unfence>
    </clusternode>
    <clusternode name="m6" votes="1" nodeid="1">
      <fence>
        <method name="1">
          <device action="off" name="m6-ipmi"/>
        </method>
      </fence>
      <unfence>
        <device action="on" name="m6-ipmi"/>
      </unfence>
    </clusternode>
  </clusternodes>
  <rm>
    <failoverdomains>
      <failoverdomain name="nodefailover" nofailback="0" ordered="0" restricted="1">
        <failoverdomainnode name="m5"/>
        <failoverdomainnode name="m6"/>
        <failoverdomainnode name="m7"/>
      </failoverdomain>
    </failoverdomains>
    <pvevm autostart="1" vmid="100" domain="nodefailover" recovery="relocate"/>
    <pvevm autostart="1" vmid="201" domain="nodefailover" recovery="relocate"/>
  </rm>
</cluster>
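
For anyone following along, the config and the fence domain can be sanity-checked with the standard cman tooling (ccs_config_validate should come with the cman packages on Proxmox 3.x, if I recall correctly):
Code:
ccs_config_validate      # XML/schema check of the active cluster.conf
cman_tool status         # votes, expected votes, quorum
fence_tool ls            # fence domain membership
fence_node -S m7         # power status of a node via its configured fence device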
 
Hello Dietmar,

I keep hearing this, but I don't understand the argument. We have redundant power supplies, fed from two separate UPSes. Why would fencing through a network switch be any different, especially since most switches only have a single power supply? Your comments would be appreciated.

Regards,

Stephan.
 
Dietmar,

I should clarify that we're using IPMI over LAN to a DRAC6 Enterprise card with a dedicated Ethernet port connection.

Let me know what issues you see with this.

Regards,

Stephan.
 
When you pull the network, the fence daemon tries to fence your node over the cluster network, but since your network is disconnected there is no way the fence daemon can reliably detect whether the node is really down or not. This is why the VMs on the 'fenced' node are not brought up on another node. An alternative to Dietmar's proposal could be fencing through SNMP to your cluster switch, in which case the fence daemon instructs the switch to shut down the port(s) connecting your node to the cluster. This method will always work, even when you pull the network.
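For reference, switch-port fencing of that kind is normally done with the fence_ifmib agent; a rough cluster.conf sketch (switch address, community string and port name below are made up):
Code:
<!-- hypothetical values; the agent shuts the node's switch port via SNMP (IF-MIB) -->
<fencedevice agent="fence_ifmib" name="sw1-snmp" ipaddr="10.2.2.250" snmp_version="2c" community="private"/>
...
<clusternode name="m5" nodeid="3" votes="1">
  <fence>
    <method name="1">
      <device name="sw1-snmp" port="Gi0/5" action="off"/>
    </method>
  </fence>
</clusternode>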
 
Hello Mir,

Thanks for your comments. I think there may be a misunderstanding. I'm not using software IPMI to a particular node. I'm using IPMI to a DRAC6 installed on each of the nodes. Therefore, if I pull the network on one node, the IPMI interface for that node is still accessible because it is provided by the DRAC of that node. In fact, once I pull the network on that node, I can go to any other node and query the chassis power status of the node whose network I've pulled.

Therefore, this is exactly the same as if I were to fence off a network port. Please let me know your thoughts.

Regards,

Stephan.
 
In your cluster.conf you configured 4 votes:

<cman two_node="0" expected_votes="4" keyfile="/var/lib/pve-cluster/corosync.authkey"/>

I'm no expert, so I don't know for sure, but how can you get 4 votes in a three-node cluster? Even with a quorum disk, you won't get quorum if one node fails.

Please correct me if I'm wrong.
 
This is done on purpose. One of the nodes (m7) is the NFS server; it has 3 votes, while the other nodes only have 1 vote each. This means that to reach quorum, m7 and at least one other node must be active.
 
I'm not using software IPMI to a particular node. I'm using IPMI to a DRAC6 installed on each of the nodes. Therefore, if I pull the network on one node, the IPMI interface for that node is still accessible because it is provided by the DRAC of that node.

But it still fails if you lose power on that node!
 
But how is that different from losing power to a switch that is used for fencing? In this case, we're using redundant power supplies connected to two different UPSes. We're actually using two switches, one powered by power source A and the other by power source B (i.e. A&B power).
 
I've not yet seen any responses from anyone successfully using a Proxmox cluster in an HA production environment. Can anyone share their experiences? I'm eager to use this in our production environment and would really like to hear from others who have deployed it.
 
I have two Proxmox clusters: one is running a few HA VMs, the other is not currently running HA VMs, but I have tested doing so and fencing does work on it, which has been quite helpful when there was the occasional kernel panic or lockup.

In both clusters I use APC PDUs. Some of my nodes have two power supplies; for those I simply specify the fencing like this:
Code:
  <clusternode name="vm18" votes="1" nodeid="20">
    <fence>
      <method name="power-dual">
        <device name="pdu7" port="1" secure="on" action="off"/>
        <device name="pdu4" port="5" secure="on" action="off"/>
        <device name="pdu7" port="1" secure="on" action="on"/>
        <device name="pdu4" port="5" secure="on" action="on"/>
      </method>
    </fence>
  </clusternode>
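
The matching fencedevice entries look something like this (addresses and credentials here are placeholders; secure="on" makes fence_apc connect over SSH):
Code:
  <fencedevices>
    <fencedevice agent="fence_apc" name="pdu4" ipaddr="192.168.0.4" login="apc" passwd="secret" secure="on"/>
    <fencedevice agent="fence_apc" name="pdu7" ipaddr="192.168.0.7" login="apc" passwd="secret" secure="on"/>
  </fencedevices>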

To make things easier to diagnose, forget about the HA VM for the moment and get fencing working first.
If a node disappears from the cluster, the cluster will fence it; an HA VM is not necessary for fencing to work.

Have you looked at the logs in /var/log/cluster/?
There should be some sort of clue in there as to what the problem is, particularly in /var/log/cluster/fenced.log.

What's the output of clustat before and after a node failure?
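For example, on each node around the time of the failure:
Code:
tail -n 50 /var/log/cluster/fenced.log /var/log/cluster/rgmanager.log
clustat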
 
I managed to get the cluster back up, but not without some inconsistencies.

I had to restart cman, rgmanager and pve-cluster services several times on each node in order to get everything synced back up.
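
The per-node sequence was roughly as follows (rgmanager stopped first, since it depends on cman):
Code:
service rgmanager stop
service cman restart
service pve-cluster restart
service rgmanager start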

All nodes now show the same information when clustat, pvecm status, fence_tool ls are run on the respective nodes.

However, node m5 is showing "red" / offline in the web interface, and trying to migrate an HA VM to it yields:

root@m6:~# clusvcadm -M pvevm:100 -m m5
Trying to migrate pvevm:100 to m5...Failed; service running on original owner


but trying to migrate it to m7 works

root@m6:~# clusvcadm -M pvevm:100 -m m7
Trying to migrate pvevm:100 to m7...Success

So, something is still borked, despite the fact that the various tools say that everything should be OK.

I checked fenced.log but there's nothing interesting and rgmanager.log contains a bunch of errors from my attempts to get the cluster re-synced after shutting down one node.

root@m6:~# clustat
Cluster Status for newace @ Wed Feb 25 00:04:31 2015
Member Status: Quorate


Member Name                       ID   Status
------ ----                       ---- ------
m6                                   1 Online, Local, rgmanager
m7                                   2 Online, rgmanager
m5                                   3 Online, rgmanager


Service Name              Owner (Last)              State
------- ----              ----- ------              -----
pvevm:100                 m6                        started
pvevm:201                 m6                        started


root@m6:~# pvecm status
Version: 6.2.0
Config Version: 41
Cluster Name: newace
Cluster Id: 6775
Cluster Member: Yes
Cluster Generation: 2096
Membership state: Cluster-Member
Nodes: 3
Expected votes: 4
Total votes: 5
Node votes: 1
Quorum: 3
Active subsystems: 6
Flags:
Ports Bound: 0 177
Node name: m6
Node ID: 1
Multicast addresses: 239.192.26.145
Node addresses: 10.2.0.6

root@m6:~# fence_tool ls
fence domain
member count 3
victim count 0
victim now 0
master nodeid 1
wait state none
members 1 2 3
 
I tried doing a "service pvestatd restart" on m5, but this is Proxmox 3.4 and issues relating to that service stopping have long been fixed, so it didn't do anything.

I had to actually do a "service pve-cluster restart" on m5 in order for it to show the node as "green" in the web interface. Not sure why this failed to start when the node was cleanly rebooted.

It seems like a lot of manual intervention is necessary to get the cluster back up and running, which is not something you want in a production environment.
 
Isolating a node (M5):

root@m5:~# ifconfig vmbr1 down
Message from syslogd@m5 at Feb 25 00:25:50 ...
rgmanager[6573]: #1: Quorum Dissolved

After a few seconds... the other nodes notice:

Feb 25 00:25:52 m7 corosync[3306]: [CLM ] CLM CONFIGURATION CHANGE
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] New Configuration:
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] #011r(0) ip(10.2.0.6)
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] #011r(0) ip(10.2.0.7)
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] Members Left:
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] #011r(0) ip(10.2.0.5)
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] Members Joined:
Feb 25 00:25:52 m7 corosync[3306]: [QUORUM] Members[2]: 1 2
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] CLM CONFIGURATION CHANGE
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] New Configuration:
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] #011r(0) ip(10.2.0.6)
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] #011r(0) ip(10.2.0.7)
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] Members Left:
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] Members Joined:
Feb 25 00:25:52 m7 corosync[3306]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 25 00:25:52 m7 rgmanager[6410]: State change: m5 DOWN


root@m7:~# clustat
Cluster Status for newace @ Wed Feb 25 00:30:05 2015
Member Status: Quorate


Member Name                       ID   Status
------ ----                       ---- ------
m6                                   1 Online
m7                                   2 Online, Local
m5                                   3 Offline



However, M5 is not powered off. The real problem is that the entire cluster becomes slow: running "pvecm status" on the remaining active nodes (M6 and M7) takes forever, and the web interface doesn't update anymore. I should point out that I'm accessing the web interface through the public interface of one of the remaining nodes. The whole cluster is basically brought to its knees by the disappearance of one node.

As an FYI, I can run "fence_node m5" from either M6 or M7 and M5 does get powered down.

Any suggestions as to why the cluster is becoming unresponsive?

Regards,

Stephan.
 
Dietmar,

I should clarify that we're using IPMI over LAN to a DRAC6 Enterprise card with a dedicated Ethernet port connection.

Let me know what issues you see with this.

Regards,

Stephan.

I have never had this kind of problem with DRAC6 as a fencing device; it works out of the box.

But I'm not using ipmilan fencing, I'm using fence_drac5, which uses SSH to connect to the DRAC card.
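
A fencedevice entry for that typically looks something like this (values below are placeholders; secure="on" makes the agent connect over SSH):
Code:
<fencedevice agent="fence_drac5" name="m5-drac" ipaddr="10.2.2.5" login="admin" passwd="XXXXXXXX" secure="on"/>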
 
You can enable IPMI over LAN on the DRAC card so that it listens on UDP port 623, and then use fence_ipmilan. It works exactly the same as fence_drac, except it is much faster because it doesn't need the whole SSH handshake.

As far as I'm concerned, IPMI to a DRAC card is as good as switch-based fencing. I've yet to hear a decent argument to the contrary.
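
If it helps anyone, the lanplus interface can be checked from any node with plain ipmitool before involving the fence agents (IP and credentials as in my cluster.conf):
Code:
# IPMI over LAN listens on UDP 623; query node m5's DRAC from any other node
ipmitool -I lanplus -H 10.2.2.5 -U admin -P 'XXXXXXXX' chassis power status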

Regards,

Stephan.
 
