[SOLVED] HUGE Fencing problem with IPMI

@offerlam:

In a small business I have this scenario:
Only two PVE nodes in HA for the VMs, and without fence devices.
I can always do manual online migration of the VMs without problems (as long as both nodes are working perfectly).

To get HA I use only manual fencing, i.e. if a PVE node shows erratic behaviour, I manually disconnect the AC power of that node; afterwards, from the CLI on the other PVE node, I run the manual fence, and the VMs of the old node will then start on the node that is still alive.

I have had this scenario for many years in a production environment without problems, but for more security you can configure a backup fence: the "iLO" plus the "manual" fence, and obviously the "iLO" fence must be your first option on both PVE nodes.

Best regards
Cesar

Comment: This post has been re-edited

Ok, so in other words: say you have proxmox00 and proxmox01 in a cluster and the PSU fails on proxmox01, you go to the CLI of proxmox00 and do
Code:
fence_node proxmox01 -vv

and that would cause all the VMs "locked" on proxmox01 to start on proxmox00?

That being said, I would like an automated version of this if possible, and I haven't gotten an answer to my question yet.

I'm confused though... do I or do I NOT need an external fence device in order to have VMs migrate or boot up on another node if a node loses power, has a motherboard failure or some other internal server failure that kills the server? Or can it be done with fencing on the servers themselves?

Thanks!

Casper
 
@offerlam:

I suggest that you should have 2 fence methods for all PVE nodes, in "backup fence" mode:
The first fence method = "IPMI"
The second fence method = "fence_manual"

"Backup fence" means: if the first fence method doesn't work, the cluster will apply the second method (or third, etc.).
In this case "fence_manual", the second method, requires human intervention to work correctly, and it is highly advisable to first disconnect the AC power of the node with problems, to be sure it cannot continue to operate; after that you can safely run, on any PVE node that is still alive:
"/usr/sbin/fence_ack_manual <name-of-your-PVE-Node-that-is-Dead>"

Each fence method works in a different way, and its configuration is also different.

To understand how to configure "only" the "fence_manual" method, please see this post:
http://forum.proxmox.com/threads/8623-Proxmox-2-node-cluster-fencing-device?p=48890#post48890
And note that the command to run is: /usr/sbin/fence_ack_manual <name-of-your-PVE-Node-that-is-Dead>

But you should apply a more complex configuration on all your PVE nodes if you want more security (for example if the IPMI chip on the server is broken, or the PVE node was powered off brutally). In that case the first fence method (IPMI) will not work, because the IPMI board will not respond correctly to your cluster software (so the cluster will consider this method null, since it never receives a positive answer from the "IPMI" board). For this reason I suggest 2 fence methods in "backup fence" mode, and the second method will obviously be "fence_manual".

You should apply both fence methods as a precaution, in case the first method does not work properly.
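
As a rough sketch only (the device names "ipmi1" and "human", the IP and the attributes are placeholders of mine, not your real values; check them against the man pages of the fence agents), the two methods in "backup fence" mode could look more or less like this in cluster.conf:
Code:
  <fencedevices>
    <!-- first device: automatic fencing through the IPMI/iLO board -->
    <fencedevice agent="fence_ipmilan" ipaddr="IP-OF-THE-IPMI-BOARD" login="admin" passwd="XXXXXX" name="ipmi1" lanplus="1"/>
    <!-- second device: manual fencing, acknowledged later with fence_ack_manual -->
    <fencedevice agent="fence_manual" name="human"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="YOUR-PVE-NODE" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="ipmi1" action="reboot"/>
        </method>
        <method name="2">
          <device name="human" nodename="YOUR-PVE-NODE"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>

The cluster tries the methods in order, so method "2" is only used when the "IPMI" method fails.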

Best regards
Cesar
 
Hi all

Ok, here is a little update because we had a breakthrough.

By using fence_ilo3 instead of fence_ipmilan we can now successfully fence servers that are running.

BUT..

If I pull a power cord, VMs are still not migrated or powered on on a different node. The way we understand it, if this was a two-node cluster we would need an external fence device for this to work, but since this is a 3-node cluster this shouldn't be necessary - PLEASE CONFIRM :)

I grabbed the syslog from when we pulled the power cord, using fence_ilo3 instead of fence_ipmilan:

Code:
Jan  7 14:18:10 proxmox00 corosync[2603]:   [CLM   ] New Configuration:
Jan  7 14:18:10 proxmox00 corosync[2603]:   [CLM   ] #011r(0) ip(10.10.99.20) 
Jan  7 14:18:10 proxmox00 corosync[2603]:   [CLM   ] #011r(0) ip(10.10.99.22) 
Jan  7 14:18:10 proxmox00 corosync[2603]:   [CLM   ] Members Left:
Jan  7 14:18:10 proxmox00 corosync[2603]:   [CLM   ] #011r(0) ip(10.10.99.21) 
Jan  7 14:18:10 proxmox00 corosync[2603]:   [CLM   ] Members Joined:
Jan  7 14:18:10 proxmox00 corosync[2603]:   [QUORUM] Members[2]: 1 3
Jan  7 14:18:10 proxmox00 corosync[2603]:   [CLM   ] CLM CONFIGURATION CHANGE
Jan  7 14:18:10 proxmox00 corosync[2603]:   [CLM   ] New Configuration:
Jan  7 14:18:10 proxmox00 corosync[2603]:   [CLM   ] #011r(0) ip(10.10.99.20) 
Jan  7 14:18:10 proxmox00 corosync[2603]:   [CLM   ] #011r(0) ip(10.10.99.22) 
Jan  7 14:18:10 proxmox00 corosync[2603]:   [CLM   ] Members Left:
Jan  7 14:18:10 proxmox00 corosync[2603]:   [CLM   ] Members Joined:
Jan  7 14:18:10 proxmox00 corosync[2603]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan  7 14:18:10 proxmox00 kernel: dlm: closing connection to node 2
Jan  7 14:18:10 proxmox00 rgmanager[2927]: State change: proxmox01 DOWN
Jan  7 14:18:10 proxmox00 corosync[2603]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.99.20) ; members(old:3 left:1)
Jan  7 14:18:10 proxmox00 pmxcfs[2477]: [dcdb] notice: members: 1/2477, 3/2461
Jan  7 14:18:10 proxmox00 pmxcfs[2477]: [dcdb] notice: starting data syncronisation
Jan  7 14:18:10 proxmox00 corosync[2603]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan  7 14:18:10 proxmox00 fenced[2664]: fencing node proxmox01
Jan  7 14:18:10 proxmox00 pmxcfs[2477]: [dcdb] notice: cpg_send_message retried 1 times
Jan  7 14:18:10 proxmox00 pmxcfs[2477]: [dcdb] notice: members: 1/2477, 3/2461
Jan  7 14:18:10 proxmox00 pmxcfs[2477]: [dcdb] notice: starting data syncronisation
Jan  7 14:18:10 proxmox00 pmxcfs[2477]: [dcdb] notice: received sync request (epoch 1/2477/00000002)
Jan  7 14:18:10 proxmox00 pmxcfs[2477]: [dcdb] notice: received sync request (epoch 1/2477/00000002)
Jan  7 14:18:10 proxmox00 pmxcfs[2477]: [dcdb] notice: received all states
Jan  7 14:18:10 proxmox00 pmxcfs[2477]: [dcdb] notice: leader is 1/2477
Jan  7 14:18:10 proxmox00 pmxcfs[2477]: [dcdb] notice: synced members: 1/2477, 3/2461
Jan  7 14:18:10 proxmox00 pmxcfs[2477]: [dcdb] notice: start sending inode updates
Jan  7 14:18:10 proxmox00 pmxcfs[2477]: [dcdb] notice: sent all (0) updates
Jan  7 14:18:10 proxmox00 pmxcfs[2477]: [status] notice: all data is up to date
Jan  7 14:18:10 proxmox00 pmxcfs[2477]: [dcdb] notice: received all states
Jan  7 14:18:11 proxmox00 pmxcfs[2477]: [status] notice: all data is up to date
Jan  7 14:18:11 proxmox00 pmxcfs[2477]: [status] notice: dfsm_deliver_queue: queue length 30
Jan  7 14:18:51 proxmox00 fenced[2664]: fence proxmox01 dev 0.0 agent fence_ilo3 result: error from agent
Jan  7 14:18:51 proxmox00 fenced[2664]: fence proxmox01 failed
Jan  7 14:18:54 proxmox00 fenced[2664]: fencing node proxmox01
Jan  7 14:19:34 proxmox00 fenced[2664]: fence proxmox01 dev 0.0 agent fence_ilo3 result: error from agent
Jan  7 14:19:34 proxmox00 fenced[2664]: fence proxmox01 failed
Jan  7 14:19:37 proxmox00 fenced[2664]: fencing node proxmox01
Jan  7 14:20:17 proxmox00 fenced[2664]: fence proxmox01 dev 0.0 agent fence_ilo3 result: error from agent
Jan  7 14:20:17 proxmox00 fenced[2664]: fence proxmox01 failed

It looks like an agent error. Remember that this works with fence_ilo3, so it should also work when we pull the power cord - PLEASE CONFIRM

We think we may have a syntax error in our cluster.conf file:

Code:
root@proxmox00:~# cat /etc/pve/cluster.conf
<?xml version="1.0"?>
<cluster config_version="27" name="DingITCluster">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey">
  </cman>
  <fencedevices>
    <fencedevice agent="fence_ilo3" ipaddr="10.10.99.30" login="admin" name="ipmi1" passwd="XXXXXX" method="cycle" />
    <fencedevice agent="fence_ilo3" ipaddr="10.10.99.31" login="admin" name="ipmi2" passwd="XXXXXX" method="cycle"/>
    <fencedevice agent="fence_ilo3" ipaddr="10.10.99.32" login="admin" name="ipmi3" passwd="XXXXXX" method="cycle"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="proxmox00" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="ipmi1" action="reboot"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="proxmox01" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="ipmi2" action="reboot"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="proxmox02" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="ipmi3" action="reboot"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="101"/>
    <pvevm autostart="1" vmid="102"/>
    <pvevm autostart="1" vmid="103"/>
    <pvevm autostart="1" vmid="104"/>
    <pvevm autostart="1" vmid="105"/>
    <pvevm autostart="1" vmid="107"/>
    <pvevm autostart="1" vmid="108"/>
    <pvevm autostart="1" vmid="109"/>
    <pvevm autostart="1" vmid="110"/>
    <pvevm autostart="1" vmid="111"/>
  </rm>
</cluster>
root@proxmox00:~#

Any suggestion will be appreciated!!

@Cesar and thheo

Before I take your solution into consideration I would like to know why I get "fence agent failed" when I pull the power plug, but not when I run the fence_ilo3 command.. but thanks for the input.. if I'm f... with this agent failure I will try and implement it.

Thanks for all your help and inputs!

Casper
 
The fence agent fails because it cannot connect to IPMI/iLO: it expects a successful command to be performed on the IPMI device, and since you pulled the power plug from the node, that's not possible.
So the node is physically fenced and it will not interfere with cluster communication, network or storage, but rgmanager will not restart the VMs on the other nodes because it is considered unsafe. Fencing has to be formally successful before the cluster can move further. This is why I imagined a solution where you first try ipmi/ilo fencing and, if that fails, just execute /bin/true (I am not sure if it expects a success return code or something else).

LE: If you need something automatic, and you cannot think of other situations where the setup fails and IPMI fencing wouldn't work (apart from this power-off situation), then you need such a fake script that just returns success. Otherwise what Cesar mentioned seems to be the best option, so that you remain in full control and decide what's best in that situation (but it takes time until you are able to do it).
 
Hi Thheo

SUPER explanation... and this is why a fencing-capable PDU would do the trick, because it can "cut the power" and force the fencing? Or would I NEVER be able to get this automated, i.e. disregarding the /bin/true option, which seems to be extremely discouraged by the Proxmox staff?

Thanks

Casper
 
What is a bit weird to me is: if your server has no power, what are the chances that the PDU has power so you can control it? (Except with some batteries to maintain network connectivity and functionality.) Maybe only if you have a power-supply problem on the server.
I didn't say that you should use only fake fencing; I suggested you first use ipmi fencing and, if it fails, try a fake thing. But going further you can improve this fake script by trying, for example, a ping on a VM (or on something else that can be considered a good test case outside the power problems you mentioned): if the ping succeeds, return failure, and if not, return success, so that fencing can be closed.
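
A very rough sketch of that idea (purely hypothetical; the script name is invented, and a real fence agent additionally has to speak the fenced agent protocol and read its options from stdin, so this is only the decision logic):
Code:
#!/bin/bash
# ping_fallback_fence.sh <ip-to-test>   (hypothetical helper, not a real agent)
# Inverted ping logic as described above: if the test target still answers,
# the failed node may still be alive, so refuse to report the fence as done.
if ping -c 3 -W 2 "$1" >/dev/null 2>&1; then
    exit 1   # still reachable -> fencing must NOT be reported as successful
else
    exit 0   # unreachable -> report success so fenced can finish and rgmanager recovers the VMs
fi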
 
What is a bit weird to me is: if your server has no power, what are the chances that the PDU has power so you can control it? (Except with some batteries to maintain network connectivity and functionality.) Maybe only if you have a power-supply problem on the server.

The reason being: this is located in a rented rack in a datacenter with all the fancy power stuff.. huge UPS and diesel generators which kick in.. I don't think I'm at any point going to lose actual power.. but I may lose a PSU in the server, and since it's a one-PSU server that's a big deal.. I think I'm going to stick with the manual fencing thing for now :)

Thanks!
Casper
 
OK...

I now have a working plan for a PSU failure in my 3-node Proxmox setup...

I went with Cesar's plan to use fence_manual.

But just an FYI to you guys out there: in Proxmox 3 it's not called fence_manual but fence_ack_manual, and the syntax is fence_ack_manual node_name; after pressing enter you need to accept by typing "absolutely".

Also, you can't use fence_ack_manual on a working node; the node needs to be in some kind of failure condition. In my case we had to pull the power cable before we could get a successful fence.

So if I went to node00 and typed fence_ack_manual node01 in the CLI, pressed enter, typed "absolutely" and then enter again, nothing happened.

But if we pulled the power from node01 and then went to node00 and typed the same thing again, it worked. So bear in mind you don't need to do anything to get fence_ack_manual working - no cluster.conf setting, nothing. It works out of the box if your node has a condition like a power failure.
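
So, roughly, the whole thing on the surviving node boils down to this (node names as above):
Code:
# on node00, after node01 has really lost power:
fence_ack_manual node01
# then confirm by typing "absolutely" at the prompt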

Thanks for all your help... this was great and we learned a lot!

Casper
 

But remember that your first choice must be the "IPMI" fence, to get automatic fencing, and if the first option fails (for example because the "IPMI" board is broken), then your second choice of fence must be "fence_ack_manual".

The above will be the best configuration to use in your case in production environments.

Best regards
Cesar
 
And it's exactly how I plan it... in the future I will try to make something work with Centreon, which is my web frontend for Nagios, to maybe automate this.. but for now this is how I'm doing it.
 
