Proxmox High Availability (HA) cluster - is it working at all for anyone?

newacecorp

I've been playing with HA on a 3-node Proxmox 3.3 (now 3.4) cluster for about a month now.

My conclusion is that the Red Hat components handling fencing and resource management are mostly broken, and where they can be made to work, they work inconsistently.

Here's my scenario:

I'm using a 3-node cluster with one of the nodes also acting as an NFS server. All three nodes are Dell R610s with 48 GB ECC RAM, RAID 10 SSDs on a PERC H700, redundant power supplies, and iDRAC6 Enterprise with IPMI over LAN enabled on a dedicated Ethernet port. I'm using IPMILAN for fencing. Since each server is fed from two different UPSes, it's as good as any other fencing solution.

Issues:

(1) Fencing - I can't make fence_node work out of the box; it always comes back with "agent error". I traced the issue to the check_input function in /usr/share/fence/fencing.py and removed the last section, where the check for device_opts is performed. Now unfencing (fence_node -U <node>) and fencing (fence_node <node>) work as expected, and fence_node -S <node> returns the proper status from any of the nodes. (A sketch for testing the agent directly follows the issues below.)

(2) CMAN / RGMANAGER don't start consistently when a node is powered up - no solution so far other than logging in and starting them manually.

(3) When everything appears to be working, "fence_tool ls", "pvecm status", and "/etc/pve/.members" report the correct information on all nodes and the cluster is quorate, purposely disabling a node that has an HA VM on it does absolutely nothing. If I go to the node running the HA VM and disable the network interface (e.g. ifconfig vmbr1 down), the other nodes report the node as down after a short period, but it never gets fenced (shut down) and the HA VM does not migrate. If I manually issue "fence_node <node>" from one of the other running nodes, the node with the HA VM shuts down, but nothing happens to the HA VM: it stays attached to the node that was fenced.

On top of this, the remaining nodes start behaving erratically. Issuing "pvecm status" takes 10-20 seconds, opening a console on a running VM doesn't work, and the web interface (which I'm accessing through the public interface, vmbr0, of one of the running nodes) becomes unresponsive as well.
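For anyone hitting the same "agent error" on issue (1), it's worth testing the agent directly outside of fenced (flags as per the RHEL6-era fence_ipmilan; IP and credentials below are the ones from my cluster.conf):
Code:
# test the agent directly, bypassing fenced and cluster.conf
fence_ipmilan -a 10.2.2.5 -l admin -p 'XXXXXXXX' -P -C 1 -o status

# then the cluster-level wrapper against a node name:
fence_node -S m5   # power status via the configured fence device
fence_node -U m5   # unfence (power on)
fence_node m5      # fence (power off)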

What I expect:

(1) If I isolate a node that's part of the cluster, I expect it to be fenced (turned off) within 5-10 seconds and the HA VMs that were running on it to be promptly migrated.

(2) Unfencing a node should power it up and have it rejoin the cluster automatically, without my having to force a "service cman start" and "service rgmanager start".
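For what it's worth, whether the relevant init scripts are even enabled at boot can be checked with plain Debian tooling (runlevel 2 is the Wheezy default):
Code:
ls /etc/rc2.d/ | grep -E 'cman|rgmanager|pve-cluster'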


Has anyone actually made all this work, as expected? Any comments would be appreciated.

Regards,

Stephan.
 
1) For a start you could show your cluster.conf.
2) When you pull the network do you still have access to IPMILAN?
3) When a node is to be fenced, which network is used? The IPMILAN?
 
IPMI is a totally unreliable fence method (by design) and only suitable for use as a second fence device.
I suggest using a power-based fence device instead.
 
1) See cluster.conf below
2) When I pull the network on a node, that node does NOT have access to IPMILAN, but the other nodes still do. I expect one of the remaining nodes (the master?) to issue the appropriate "fence_node" command to isolate the node whose network was pulled. What happens if a node locks up? I would expect the other nodes in the cluster to issue the fence_node command rather than relying on the node to do this itself (as it may be locked up and unable to). Thinking about it further, I would expect the node that is to receive the HA VM to issue the fence_node command against the node the HA VM is being taken from.
3) It's using an internal network attached to vmbr1

<?xml version="1.0"?>
<cluster config_version="41" name="newace">
  <cman two_node="0" expected_votes="4" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" cipher="1" ipaddr="10.2.2.5" lanplus="1" login="admin" name="m5-ipmi" passwd="XXXXXXXX" power_wait="5" method="onoff"/>
    <fencedevice agent="fence_ipmilan" cipher="1" ipaddr="10.2.2.6" lanplus="1" login="admin" name="m6-ipmi" passwd="XXXXXXXX" power_wait="5" method="onoff"/>
    <fencedevice agent="fence_ipmilan" cipher="1" ipaddr="10.2.2.7" lanplus="1" login="admin" name="m7-ipmi" passwd="XXXXXXXX" power_wait="5" method="onoff"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="m7" nodeid="2" votes="3">
      <fence>
        <method name="1">
          <device action="off" name="m7-ipmi"/>
        </method>
      </fence>
      <unfence>
        <device action="on" name="m7-ipmi"/>
      </unfence>
    </clusternode>
    <clusternode name="m5" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="m5-ipmi"/>
        </method>
      </fence>
      <unfence>
        <device action="on" name="m5-ipmi"/>
      </unfence>
    </clusternode>
    <clusternode name="m6" votes="1" nodeid="1">
      <fence>
        <method name="1">
          <device action="off" name="m6-ipmi"/>
        </method>
      </fence>
      <unfence>
        <device action="on" name="m6-ipmi"/>
      </unfence>
    </clusternode>
  </clusternodes>
  <rm>
    <failoverdomains>
      <failoverdomain name="nodefailover" nofailback="0" ordered="0" restricted="1">
        <failoverdomainnode name="m5"/>
        <failoverdomainnode name="m6"/>
        <failoverdomainnode name="m7"/>
      </failoverdomain>
    </failoverdomains>
    <pvevm autostart="1" vmid="100" domain="nodefailover" recovery="relocate"/>
    <pvevm autostart="1" vmid="201" domain="nodefailover" recovery="relocate"/>
  </rm>
</cluster>
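
For anyone following along, the config and the fence domain can be sanity-checked with the standard cman tooling (ccs_config_validate should come with the cman packages on Proxmox 3.x, if I recall correctly):
Code:
ccs_config_validate      # XML/schema check of the active cluster.conf
cman_tool status         # votes, expected votes, quorum
fence_tool ls            # fence domain membership
fence_node -S m7         # power status of a node via its configured fence device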
 
Hello Dietmar,

I keep hearing this, but I don't understand the argument. We have redundant power supplies, fed from two separate UPSes. Why would fencing through a network switch be any different, especially since most switches only have a single power supply? Your comments would be appreciated.

Regards,

Stephan.
 
Dietmar,

I should clarify that we're using IPMI over LAN to a DRAC6 Enterprise card with a dedicated Ethernet port connection.

Let me know what issues you see with this.

Regards,

Stephan.
 
When you pull the network, the fence daemon tries to fence your node over the cluster network, but since your network is disconnected there is no way the fence daemon can reliably detect whether the node is really down or not. This is why the VMs on the 'fenced' node are not brought up on another node. An alternative to Dietmar's proposal could be fencing through SNMP to your cluster switch, in which case the fence daemon instructs the switch to shut down the port(s) connecting your node to the cluster. This method will always work, even when you pull the network.
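For reference, switch-port fencing of that kind is normally done with the fence_ifmib agent; a rough cluster.conf sketch (switch address, community string and port name below are made up):
Code:
<!-- hypothetical values; the agent shuts the node's switch port via SNMP (IF-MIB) -->
<fencedevice agent="fence_ifmib" name="sw1-snmp" ipaddr="10.2.2.250" snmp_version="2c" community="private"/>
...
<clusternode name="m5" nodeid="3" votes="1">
  <fence>
    <method name="1">
      <device name="sw1-snmp" port="Gi0/5" action="off"/>
    </method>
  </fence>
</clusternode>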
 
Hello Mir,

Thanks for your comments. I think there may be a misunderstanding. I'm not using software IPMI to a particular node. I'm using IPMI to a DRAC6 installed on each of the nodes. Therefore, if I pull the network on one node, the IPMI interface for that node is still accessible because it is provided by the DRAC of that node. In fact, once I pull the network on that node, I can go to any other node and query the chassis power status of the node whose network I've pulled.

Therefore, this is exactly the same as if I were to fence off a network port. Please let me know your thoughts.

Regards,

Stephan.
 
In your cluster.conf you configured 4 votes:

<cman two_node="0" expected_votes="4" keyfile="/var/lib/pve-cluster/corosync.authkey"/>

I'm no expert, so I don't know for sure, but how can you get 4 votes in a three-node cluster? Even with a quorum disk, you won't get quorum if one node fails.

Please correct me if I'm wrong.
 
This is done on purpose. One of the nodes (m7) is the NFS server; it has 3 votes, while the other nodes only have 1 vote each. This means that to reach quorum, m7 and at least one other node must be active.
 
I'm not using software IPMI to a particular node. I'm using IPMI to a DRAC6 installed on each of the nodes. Therefore, if I pull the network on one node, the IPMI interface for that node is still accessible because it is provided by the DRAC of that node.

But it still fails if you lose power on that node!
 
But how is that different from losing power to a switch that is used for fencing? In this case, we're using redundant power supplies connected to two different UPSes. We're actually using two switches, one powered by power source A and the other by power source B (i.e. A&B power).
 
I've not yet seen any responses from anyone successfully using a Proxmox cluster in an HA production environment. Can anyone share their experiences? I'm eager to use this in our production environment and would really like to hear from others who have deployed it.
 
I have two Proxmox clusters: one is running a few HA VMs, the other is not currently running HA VMs, but I have tested doing so and fencing does work on it, which has been quite helpful when there was the occasional kernel panic or lockup.

In both clusters I use APC PDUs. Some of my nodes have two power supplies; for those I simply specify the fencing like this:
Code:
  <clusternode name="vm18" votes="1" nodeid="20">
    <fence>
      <method name="power-dual">
        <device name="pdu7" port="1" secure="on" action="off"/>
        <device name="pdu4" port="5" secure="on" action="off"/>
        <device name="pdu7" port="1" secure="on" action="on"/>
        <device name="pdu4" port="5" secure="on" action="on"/>
      </method>
    </fence>
  </clusternode>
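
The matching fencedevice entries look something like this (addresses and credentials here are placeholders; secure="on" makes fence_apc connect over SSH):
Code:
  <fencedevices>
    <fencedevice agent="fence_apc" name="pdu4" ipaddr="192.168.0.4" login="apc" passwd="secret" secure="on"/>
    <fencedevice agent="fence_apc" name="pdu7" ipaddr="192.168.0.7" login="apc" passwd="secret" secure="on"/>
  </fencedevices>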

To make things easier to diagnose, forget about the HA VM for the moment and get fencing working first.
If a node disappears from the cluster, the cluster will fence it; an HA VM is not necessary for fencing to work.

Have you looked at the logs in /var/log/cluster/?
There should be some sort of clue in there as to what the problem is, particularly in /var/log/cluster/fenced.log.

What's the output of clustat before and after a node failure?
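For example, on each node around the time of the failure:
Code:
tail -n 50 /var/log/cluster/fenced.log /var/log/cluster/rgmanager.log
clustat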
 
I managed to get the cluster back up, but not without some inconsistencies.

I had to restart cman, rgmanager and pve-cluster services several times on each node in order to get everything synced back up.
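
The per-node sequence was roughly as follows (rgmanager stopped first, since it depends on cman):
Code:
service rgmanager stop
service cman restart
service pve-cluster restart
service rgmanager start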

All nodes now show the same information when clustat, pvecm status, fence_tool ls are run on the respective nodes.

However, node m5 is showing "red" / offline in the web interface, and trying to migrate an HA VM to it yields:

root@m6:~# clusvcadm -M pvevm:100 -m m5
Trying to migrate pvevm:100 to m5...Failed; service running on original owner


but trying to migrate it to m7 works

root@m6:~# clusvcadm -M pvevm:100 -m m7
Trying to migrate pvevm:100 to m7...Success

So, something is still borked, despite the fact that the various tools say that everything should be OK.

I checked fenced.log but there's nothing interesting and rgmanager.log contains a bunch of errors from my attempts to get the cluster re-synced after shutting down one node.

root@m6:~# clustat
Cluster Status for newace @ Wed Feb 25 00:04:31 2015
Member Status: Quorate


Member Name                       ID   Status
------ ----                       ---- ------
m6                                   1 Online, Local, rgmanager
m7                                   2 Online, rgmanager
m5                                   3 Online, rgmanager


Service Name              Owner (Last)              State
------- ----              ----- ------              -----
pvevm:100                 m6                        started
pvevm:201                 m6                        started


root@m6:~# pvecm status
Version: 6.2.0
Config Version: 41
Cluster Name: newace
Cluster Id: 6775
Cluster Member: Yes
Cluster Generation: 2096
Membership state: Cluster-Member
Nodes: 3
Expected votes: 4
Total votes: 5
Node votes: 1
Quorum: 3
Active subsystems: 6
Flags:
Ports Bound: 0 177
Node name: m6
Node ID: 1
Multicast addresses: 239.192.26.145
Node addresses: 10.2.0.6

root@m6:~# fence_tool ls
fence domain
member count 3
victim count 0
victim now 0
master nodeid 1
wait state none
members 1 2 3
 
I tried doing a "service pvestatd restart" on m5, but this is Proxmox 3.4 and issues relating to that service stopping have long been fixed, so it didn't do anything.

I had to actually do a "service pve-cluster restart" on m5 in order for it to show the node as "green" in the web interface. Not sure why this failed to start when the node was cleanly rebooted.

It seems like a lot of manual intervention is necessary to get the cluster back up and running, which is not something you want in a production environment.
 
Isolating a node (M5):

root@m5:~# ifconfig vmbr1 down
Message from syslogd@m5 at Feb 25 00:25:50 ...
rgmanager[6573]: #1: Quorum Dissolved

After a few seconds... the other nodes notice:

Feb 25 00:25:52 m7 corosync[3306]: [CLM ] CLM CONFIGURATION CHANGE
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] New Configuration:
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] #011r(0) ip(10.2.0.6)
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] #011r(0) ip(10.2.0.7)
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] Members Left:
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] #011r(0) ip(10.2.0.5)
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] Members Joined:
Feb 25 00:25:52 m7 corosync[3306]: [QUORUM] Members[2]: 1 2
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] CLM CONFIGURATION CHANGE
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] New Configuration:
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] #011r(0) ip(10.2.0.6)
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] #011r(0) ip(10.2.0.7)
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] Members Left:
Feb 25 00:25:52 m7 corosync[3306]: [CLM ] Members Joined:
Feb 25 00:25:52 m7 corosync[3306]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 25 00:25:52 m7 rgmanager[6410]: State change: m5 DOWN


root@m7:~# clustat
Cluster Status for newace @ Wed Feb 25 00:30:05 2015
Member Status: Quorate


Member Name                       ID   Status
------ ----                       ---- ------
m6                                   1 Online
m7                                   2 Online, Local
m5                                   3 Offline



However, M5 is not powered off. The real problem is that the entire cluster becomes slow: running "pvecm status" on the remaining active nodes (M6 and M7) takes forever, and the web interface doesn't update anymore. I should point out that I'm accessing the web interface through the public interface of one of the remaining nodes. The whole cluster is basically brought to its knees by the disappearance of one node.

As an FYI, I can run "fence_node m5" from either M6 or M7 and M5 does get powered down.

Any suggestions as to why the cluster is becoming unresponsive?

Regards,

Stephan.
 
Dietmar,

I should clarify that we're using IPMI over LAN to a DRAC6 Enterprise card with a dedicated Ethernet port connection.

Let me know what issues you see with this.

Regards,

Stephan.

I have never had this kind of problem with DRAC6 as a fencing device; it works out of the box.

But I'm not using ipmilan fencing, I'm using fence_drac5, which uses SSH to connect to the DRAC card.
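
A fencedevice entry for that typically looks something like this (values below are placeholders; secure="on" makes the agent connect over SSH):
Code:
<fencedevice agent="fence_drac5" name="m5-drac" ipaddr="10.2.2.5" login="admin" passwd="XXXXXXXX" secure="on"/>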
 
You can enable IPMI over LAN on the DRAC card so that it listens on UDP port 623, and then use fence_ipmilan. It works exactly the same as fence_drac, except it is much faster because it doesn't need the whole SSH handshake.

As far as I'm concerned, IPMI to a DRAC card is as good as switch-based fencing. I've yet to hear a decent argument to the contrary.
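
If it helps anyone, the lanplus interface can be checked from any node with plain ipmitool before involving the fence agents (IP and credentials as in my cluster.conf):
Code:
# IPMI over LAN listens on UDP 623; query node m5's DRAC from any other node
ipmitool -I lanplus -H 10.2.2.5 -U admin -P 'XXXXXXXX' chassis power status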

Regards,

Stephan.
 
