Is there an alternative to fencing for HA?

hotwired007

Hi Guys,

I'm looking into setting up a Proxmox HA environment, but I have no fencing-capable devices. I have a selection of Dell PowerEdge 1950s, 860s, and SC1425s with 2 ReadyNAS 2100s, 2 Netgear switches, 2 Cisco PIX firewall/routers and 2 Eaton 5130 UPSes in a rack with 2 basic PDUs (no management interface). I am not able to purchase any further equipment.

My understanding is that fencing requires a SINGLE device that sits and monitors the state of all the servers.

Is there no way HA could be set to use a heartbeat monitor? I have 2 NICs per server; there must be a way for the servers to constantly monitor each other in the cluster and then migrate machines if they lose contact with one of the servers in the cluster? Isn't this how quorum works?

Has anyone else come across this?
 
Is there no way HA could be set to use a heartbeat monitor? I have 2 NICs per server; there must be a way for the servers to constantly monitor each other in the cluster and then migrate machines if they lose contact with one of the servers in the cluster? Isn't this how quorum works?

Monitoring and fencing are two different things, and you need both for HA.
 
I don't understand the need for fencing.

My understanding of HA (3 machine cluster):

Server 1 checks Server 2 & Server 3
Server 2 checks Server 1 & Server 3
Server 3 checks Server 1 & Server 2

If Servers 2 & 3 both cannot connect to Server 1, all VMs that were running on Server 1 are reloaded on Server 2.

When Server 1 comes back online, it reloads the cluster config and is told that it has no VM control. The user can then migrate VMs back to Server 1.
 
I should have phrased the question better - is there an alternative to a fencing device for HA?

I have no devices that appear to be fencing-capable. I have no DRAC cards in any of my Dell servers.
 
You need at least one fence device per host, otherwise HA is not possible.
 
I realize fencing is for more than just trying to restart the failed host, but could you use a lack of ping, then WOL to try the restart, then go into a fail mode after a timeout?
 
Is it possible to get a 'software' fencing device?

In many cases you are not bound to having *power* fencing available, and "network fencing" can be appropriate as well. That means the failed node can be "isolated" from the cluster, e.g. by using a managed network switch to shut down its ports.
In theory, something similar could be achieved by the remaining nodes dropping the network links to the failed node *by software* (unloading the NIC module etc.) - the problem is reliability, because this might fail sometimes...
It might be enough to play around with, though.
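To make the "down the ports" idea a bit more concrete: a managed switch that speaks SNMP lets you administratively disable a port via the standard IF-MIB ifAdminStatus object. Everything below (switch address 192.168.1.250, community string "private", ifIndex 12) is a made-up example, not something from this thread:

Code:
# Isolate the failed node by taking its switch port down via SNMP.
# ifAdminStatus OID: 1.3.6.1.2.1.2.2.1.7.<ifIndex>  (1 = up, 2 = down)
SWITCH=192.168.1.250    # management IP of the switch (placeholder)
COMMUNITY=private       # SNMP write community (placeholder)
IFINDEX=12              # port the failed node is plugged into (placeholder)

# take the port down (isolate the node)
snmpset -v2c -c "$COMMUNITY" "$SWITCH" 1.3.6.1.2.1.2.2.1.7.$IFINDEX i 2

# bring it back up later, once the node has been dealt with
snmpset -v2c -c "$COMMUNITY" "$SWITCH" 1.3.6.1.2.1.2.2.1.7.$IFINDEX i 1

This is essentially what the fence_ifmib agent mentioned further down automates; the important point is that the isolation is done by a device outside the failed node, not by software running on the node itself.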
 
So just to clarify, HA will NOT work without a fence device?

I was looking at the HA offering from VMware:

To monitor physical servers, an agent on each server maintains a heartbeat with the other servers in the resource pool such that a loss of heartbeat automatically initiates the restart of all affected virtual machines on other servers in the resource pool.

Is this not possible to have with Proxmox? Surely the 'agent' would be the Proxmox hypervisor installed on each physical server?
 

Hi Tom,

I've read the requirements, and they specifically state that a fencing device is required. They don't explain why fencing was decided upon.

I'm asking if there is a possibility that you will review this and implement something that doesn't require a fencing device. I can't see why the fencing device is necessary, when the servers could be powered off simply using a command from *nix and, if required, powered back on again using Wake-on-LAN.

In comparison, VMware & Hyper-V require no further hardware than the physical servers and network devices for HA.

To me it seems a backwards step for someone leaving VMware for Proxmox to be forced into buying an additional piece of hardware.
 
Just to clarify, HA clusters done with DRBD as in http://pve.proxmox.com/wiki/DRBD <<DRBD® refers to block devices designed as a building block to form high availability (HA) clusters.>> are not supported anymore, and were not really HA without fencing devices...? I am confused; HA is really something I need to get into...

Thanks
Marco
 
Just to clarify, HA clusters done with DRBD as in http://pve.proxmox.com/wiki/DRBD <<DRBD® refers to block devices designed as a building block to form high availability (HA) clusters.>> are not supported anymore, and were not really HA without fencing devices...? I am confused; HA is really something I need to get into...

Thanks
Marco

HA works fine with DRBD in all of my testing so far.
I am using an APC PDU for fencing.
The fencing also prevents DRBD from becoming split-brain when HA kicks in.


DRBD is a perfect example of why fencing is needed.
Most people set up DRBD replication on a dedicated network.
So in most cases the DRBD replication is a separate isolated network from the Proxmox cluster network.

In our hypothetical setup the Proxmox cluster communicates on eth0 and DRBD communicates on eth1.
Node A is running HA VM 101 and everything is going along fine until someone bumps a cable, unplugging eth0 on Node A.
The Proxmox cluster cannot see the node on eth0, so it is assumed dead.
However it is not really dead and VM 101 is still running and still replicating data via DRBD on eth1.

With fencing here is what happens:
Node A is turned off (fenced)
Then VM 101 is started on Node B
When Node A comes back up DRBD reconnects and life moves on and everyone is happy.
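As a side note, once the fenced node is back you can check on either node that DRBD has actually reconnected and resynced before trusting it again; "r0" below is just an example resource name:

Code:
# connection state of DRBD resource r0 (expect "Connected" once the peer is back)
drbdadm cstate r0
# disk states (expect "UpToDate/UpToDate" after the resync finishes)
drbdadm dstate r0
# or the overall picture
cat /proc/drbd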

Without fencing here is what happens:
VM 101 keeps running on Node A, replicating data via DRBD.
VM 101 is started on Node B.
Now your VM 101 disk is fubar with corruption beyond belief because you have two VMs running at the same time writing to the same disk, ouch.
Instead of HA you now have a disaster on your hands, time to get the backups.

The last thing any of us ever wants is to see the HA system mess up and actually CAUSE a problem rather than prevent one.
That is why fencing is required.
It is imperative that the node you "think" is dead, is actually dead.
Just because you can not ping it does not make it dead.
Just because it is not responding to the cluster does not mean it can not cause problems.

What if cman crashes on Node A?
Or you mess up the network config for the cluster?
Or some other unforeseen odd event happens?

The only 100% positive method to ensure that the VM is not still running, thus making it safe to start it elsewhere, is to fence the node that is no longer responding to the rest of the cluster.
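One practical suggestion: once fencing is configured, test it manually from a healthy node before you rely on HA. "nodeA" below is a placeholder node name:

Code:
# fence a node by hand using whatever fence device cluster.conf defines for it;
# the node should lose power / connectivity immediately
fence_node nodeA

# show fence domain membership as seen by the fence daemon
fence_tool ls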
 
To me it seems a backwards step for someone leaving VMware for Proxmox to be forced into buying an additional piece of hardware.

Fencing is required and needed.
I would be sceptical of any system that did not use fencing.
When dealing with shared resources it is imperative to ensure that only one device is using a specific resource.

It is NOT required that you use iLO/IPMI/an APC PDU for fencing.
What is required is that you set up fencing to ensure that there is never a conflict over shared resources.

Have a managed switch, with all shared storage accessed over that switch?
If yes, maybe you could use fence_ifmib.

Is your shared storage accessed over a Brocade FC switch?
If yes, maybe you could use fence_brocade.
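Just as a sketch (placeholder switch address 192.168.1.250, SNMP community "private", port 12 - adjust to your gear, and check fence_ifmib -h because option names can differ between versions), the agent can be exercised by hand before you wire it into the cluster configuration:

Code:
# query the state of the port the node hangs off (sanity check)
fence_ifmib -a 192.168.1.250 -c private -n 12 -o status

# administratively down that port, cutting the node off from the shared storage network
fence_ifmib -a 192.168.1.250 -c private -n 12 -o off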

There are lots of fence agents available:
Code:
# ls /usr/sbin/fence_*
/usr/sbin/fence_ack_manual   /usr/sbin/fence_ilo_mp
/usr/sbin/fence_alom         /usr/sbin/fence_intelmodular
/usr/sbin/fence_apc          /usr/sbin/fence_ipmilan
/usr/sbin/fence_apc_snmp     /usr/sbin/fence_ldom
/usr/sbin/fence_baytech      /usr/sbin/fence_lpar
/usr/sbin/fence_bladecenter  /usr/sbin/fence_mcdata
/usr/sbin/fence_brocade      /usr/sbin/fence_na
/usr/sbin/fence_bullpap      /usr/sbin/fence_node
/usr/sbin/fence_cisco_mds    /usr/sbin/fence_nss_wrapper
/usr/sbin/fence_cisco_ucs    /usr/sbin/fence_rackswitch
/usr/sbin/fence_cpint        /usr/sbin/fence_rsa
/usr/sbin/fence_drac         /usr/sbin/fence_rsb
/usr/sbin/fence_drac5        /usr/sbin/fence_sanbox2
/usr/sbin/fence_eaton_snmp   /usr/sbin/fence_scsi
/usr/sbin/fence_egenera      /usr/sbin/fence_tool
/usr/sbin/fence_eps          /usr/sbin/fence_vixel
/usr/sbin/fence_ibmblade     /usr/sbin/fence_wti
/usr/sbin/fence_ifmib        /usr/sbin/fence_xcat
/usr/sbin/fence_ilo          /usr/sbin/fence_zvm


You can even set up multiple fence devices per node:
http://docs.redhat.com/docs/en-US/R...#ex-clusterconf-fencing-multi-per-node-cli-CA
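For reference, fence devices are declared per node in cluster.conf (on Proxmox 2.0 that lives at /etc/pve/cluster.conf). The snippet below is only a rough sketch with invented names and addresses - an APC PDU called "pdu1" with node "proxmox1" on outlet 1 - not a drop-in configuration:

Code:
<?xml version="1.0"?>
<cluster name="example" config_version="2">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <!-- one switched PDU shared by all nodes (invented values) -->
    <fencedevice agent="fence_apc" name="pdu1" ipaddr="192.168.1.30" login="apc" passwd="secret"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="proxmox1" votes="1" nodeid="1">
      <fence>
        <method name="power">
          <device name="pdu1" port="1"/>
        </method>
        <!-- a second <method> block here would be tried as a backup fence device -->
      </fence>
    </clusternode>
    <!-- one <clusternode> block per node, each with its own PDU outlet -->
  </clusternodes>
</cluster>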
 
e100 said:
DRBD is a perfect example of why fencing is needed.
...
The only 100% positive method to ensure that the VM is not still running, thus making it safe to start it elsewhere, is to fence the node that is no longer responding to the rest of the cluster.

This is very informative and has helped me understand fencing and why it's needed.

Thanks
 
This is very informative and has helped me understand fencing and why it's needed.

Thanks

Agree, many thanks, e100! :)

There's only one thing I'm still missing. In the above setup:

e100 said:
In our hypothetical setup the Proxmox cluster communicates on eth0 and DRBD communicates on eth1.
Node A is running HA VM 101 and everything is going along fine until someone bumps a cable, unplugging eth0 on Node A.
The Proxmox cluster cannot see the node on eth0, so it is assumed dead.
However it is not really dead and VM 101 is still running and still replicating data via DRBD on eth1.
With fencing here is what happens:
Node A is turned off (fenced)
...

Well, if the eth0 cable is unplugged, both nodes can't see the other one (e.g. in a two-node cluster)... so Node A can also think Node B is unreachable, or am I missing something? :)
Why is Node A switched off by Node B, instead of the reverse? After all, what I have is two cluster nodes that are both up but disconnected; I really have two separated parts of the cluster, so how does one take control over the other? I still miss this part... I've read many docs and the PVE wiki, but I still have no practical experience, and I'm a bit confused...

Thanks, Marco
 
Why is Node A switched off by Node B, instead of the reverse? After all, what I have is two cluster nodes that are both up but disconnected; I really have two separated parts of the cluster, so how does one take control over the other? I still miss this part.

Normally, the partition with quorum takes over control. If you run a two-node cluster, each node tries to fence the other (the faster one wins).
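If it helps, this is roughly how you can see which side has quorum on a 2.0 cluster (standard cman/Proxmox tools; the exact output obviously depends on the setup):

Code:
# cluster summary: expected votes, total votes and the "Quorate" flag
pvecm status

# the same information from cman directly, plus per-node membership
cman_tool status
cman_tool nodes

# which nodes are currently in the fence domain
fence_tool ls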
 
Normally, the partition with quorum takes over control. If you run a two-node cluster, each node tries to fence the other (the faster one wins).

Perfect! Thanks, Dietmar. With three or more nodes I've read there is a "vote" system, but for a (typical DRBD) 2-node setup... I've never read a simple and clear statement like this!
 
I would recommend having at least three nodes so proper quorum is maintained if you really want HA.

I used a very old 1U server we had lying around as a third node.
It is so old it cannot even run KVM, since the CPUs lack the necessary instructions, but it runs Proxmox fine and acts as a cluster member.
Its only purpose is to be the third node so I have proper quorum for my two DRBD nodes.

Once we upgrade the other twelve 1.9 servers to 2.0 we will discard this temporary third node as it will no longer be needed.
 
