[SOLVED] HUGE Fencing problem with IPMI

offerlam · Jan 5, 2014

thheo said:
1. Fencing means isolating a node that may still be running on its own, outside the cluster's control. The point is to prevent such a node from writing to the
shared storage and possibly overlapping with a node that is entitled (by the cluster) to write in that area. So when a node is outside cluster control for
whatever reason, you trigger a fencing scenario with the help of a device that can take actions like power cycling the box, and hopefully it comes back
online properly into the cluster.
2. The fact that VMs get moved (actually restarted) to the other nodes on reboot is called HA. Fencing is issued as a trigger because of your config, and it's mandatory for the
reasons explained in 1., but in your case it doesn't actually do anything. HA is handled by rgmanager, because you have an <rm> section in your cluster.conf
which contains the proper settings for the VMs. You can check what's going on in /var/log/cluster/rgmanager.log
3. Of course you have to take care of the network connectivity issue you are facing before trying ipmitool to get fencing working:
Why do you bridge eth0 and eth1? Since you have STP off on the bridge, is it possible that you are creating a loop between eth0, eth1 and an
external switch? Please explain what eth0 is connected to and what eth1 is connected to, and what exactly you want to use them for. Also, you mentioned
that you set the BMC to use all the NICs in shared mode with the server, which doesn't sound right given that you are bridging all the interfaces. Better
straighten out your network setup first; it might just solve the reachability issue with the BMC.

I have two LAN switches. eth0 is connected to sw1 and eth1 is connected to sw2. STP is configured on both switches to kill loops.
The reason eth2 and eth3 are not connected is that I ran out of cables,

but I will remove them on all servers to see if that helps.

The only way I can get IPMI/BMC up and running is by using shared NICs. I have a dedicated option, but HP told me it is only to be used if I have a dedicated port for iLO and such, which this server doesn't.

I don't think I'm creating loops as long as STP is set on both switches, right?

My idea was to bundle all 4 physical NICs on each server, with two NICs connected to each switch.

I get your suspicion about looping and this being a network problem, but I just don't see it, since
1. We have had a connection to the BMC controller before. The fact that we now can't came out of the blue.
2. I DO have connectivity to the BMC controller. By that I mean it IS responding to ping AND I can enter the web interface during bootup, until I get to where Proxmox starts CMAN.

offerlam · Jan 5, 2014

OH MY F... GOD!!!!!

Thheo you deserve a medal!!!

So I went and removed eth2 and eth3 from all the Proxmox interfaces, which did nothing after reboot. So I removed eth1 as well, so it was back to the default with only eth0, and it WORKED! Now I can ping both my BMC controller AND the LAN interface...

Talk about not seeing the forest for the trees.. or something

But I don't understand why it didn't work when I added the other interfaces?

EDIT:

And fencing is working

Code:

root@proxmox00:~# fence_node proxmox02 -vv
fence proxmox02 dev 0.0 agent fence_ipmilan result: success
agent args: nodename=proxmox02 agent=fence_ipmilan ipaddr=10.10.99.32 lanplus=1 auth=password login=admin passwd=XXXXXXX power_wait=5 method=cycle
fence proxmox02 success
root@proxmox00:~#

thheo · Jan 5, 2014

I suspect that the way BMC port sharing on all interfaces works (it has to do some internal bridging), together with bridging the interfaces on the host, causes a loop. You could use a different VLAN for the BMC
(I guess it's configurable) and bring up an eth0.vlan with a different subnet just to have control of the BMC. But I would still give up on all-port sharing for the BMC: use only one port, preferably with a VLAN.
To me it doesn't make sense to simply bridge the host interfaces, but I don't know your full plan, so I don't have the full picture. Glad that you have this part sorted out; you can move on to the remaining problems.
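
A rough way to picture the loop suspicion: treat each switch and the host bridge as a node and each cable as a link, then check whether the links form a cycle. The Python sketch below is only an illustration and is not taken from the thread. The names sw1 and sw2 come from the posts above; vmbr0 (the usual Proxmox bridge name) and the idea that the two switches are linked to each other are assumptions. A cycle only means a loop is possible; whether frames actually loop depends on STP and on how the BMC shares the ports.

```python
def has_cycle(links):
    """Return True if the undirected link list contains a cycle (union-find)."""
    parent = {}

    def find(x):
        # Follow parent pointers to the set representative, halving the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # Both ends are already connected, so this link closes a cycle.
            return True
        parent[root_a] = root_b
    return False


# Assumed topology: the two switches are linked to each other (hypothetical).
single_uplink = [("sw1", "sw2"), ("vmbr0", "sw1")]    # only eth0 in the bridge
bridged_uplinks = single_uplink + [("vmbr0", "sw2")]  # eth0 and eth1 both in the bridge

print("eth0 only:", has_cycle(single_uplink))
print("eth0 + eth1 bridged:", has_cycle(bridged_uplinks))
```

With only eth0 in the bridge the model has no cycle; adding eth1 to the same bridge closes one, which is the situation where something (STP on every bridging element, or a bond instead of a plain bridge) has to break the loop.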

offerlam · Jan 5, 2014

About the port sharing.
According to HP support themselves, dedicated is ONLY for when I have a port for it in my server, which I don't, so I have to use NIC sharing.

I seem to remember being able to assign a VLAN for the BMC in the BIOS, but I will need to check that when I get back to the datacenter.

So the way I had set up my network wasn't wrong, I assume?

I haven't gotten to the exact networking setup yet. My plan was to just bond all the interfaces over LACP and do some brutal testing of the redundancy by pulling power cords on different parts of the infrastructure, and other things like LAN cables...

At some point I will hopefully host client systems on this Proxmox solution. I would segment these clients into VLANs. The way I was advised to do it was to bond all the interfaces and then make an individual bridge for each client with a VLAN tag, and use that interface for the client's VMs. Now, I don't remember if it's the other way around (a bridge with multiple bonds, or bonds with multiple bridges), but that was my plan.

But knowing why I couldn't ping my BMC anymore is a HUGE step towards getting this up and running.

thheo · Jan 5, 2014

Maybe I misread this: "You also have to choose shared nics which means you can reach IPMI web interface on all nics in the server"
Did you choose one NIC or all the NICs for IPMI? I have no idea about iLO, so maybe I am asking something stupid and choosing all NICs is not possible, but I would recommend sticking with only
one shared NIC, preferably with a dedicated VLAN. For my iDRAC I also use a shared NIC.
For interface bonding you should have a plan for what you will transfer over the links, and maybe separate things: for example NFS, where you need higher throughput,
over dedicated links and normal traffic on others. It is "best practice" to separate these properly, since you could have, for example, a DDoS targeted at
the VMs, and it would cause congestion on the links between the host and the storage.
I would say it didn't make sense to have eth0 and eth1 bridged, unless you build an Ethernet ring where these interfaces connect to 2 different Ethernet switches
and all the elements inside the ring run STP (including the bridging instance). You should consider a scenario where you have node and path redundancy:
that means that any of the elements and the paths between them can fail and you still have connectivity.

offerlam · Jan 6, 2014

Hi Thheo,

In the BIOS for IPMI I can choose to go with either a dedicated NIC or a shared NIC. It's a simple menu: dedicated or shared. What HP told me was that dedicated won't work unless I have an iLO port present, which I don't, so I can't use that. I therefore have to use shared NICs, which means I can access my BMC interface from all the NICs in the server no matter how each of them is configured network-wise.

My setup is as follows...

I have two WAN connections into my system.

I have two Cyberoam firewalls in an HA configuration for failover on each WAN and the LAN.

I have two network switches where ports 21, 22, 23 and 24 are configured in an LACP active trunk, with STP active.

I was planning to have all 4 NICs on my servers in one big LACP group, with two NICs connected to each switch. Based on what you say about NFS, and I share your attitude, I would probably go with two NICs for the VM LAN, one connected to each switch, and the other two NICs in the server for the NFS network.

This should mean that I could lose one WAN connection and still be going, lose one firewall and still be going, lose one switch and still be going, lose one node and still be going.
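
The "lose one and still be going" claim can be checked on paper by removing one component at a time from a model of the network and testing whether the node still reaches the outside. The Python sketch below is a hypothetical model, not the real cabling: the thread does not say how the firewalls connect to the WANs or the switches, so the link list (each firewall on one WAN and on both switches, the switches linked to each other) and all names are assumptions for illustration.

```python
from collections import deque


def reachable(links, src, dst, failed=None):
    """Breadth-first search from src to dst, ignoring every link that touches `failed`."""
    adjacency = {}
    for a, b in links:
        if failed in (a, b):
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        current = queue.popleft()
        if current == dst:
            return True
        for neighbour in adjacency.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def single_points_of_failure(links, src, dst):
    """List every component whose loss alone cuts src off from dst."""
    components = {name for link in links for name in link} - {src, dst}
    return sorted(c for c in components if not reachable(links, src, dst, failed=c))


# Assumed wiring (hypothetical): one uplink from the node to each switch.
planned = [
    ("internet", "wan1"), ("internet", "wan2"),
    ("wan1", "fw1"), ("wan2", "fw2"),
    ("fw1", "sw1"), ("fw1", "sw2"), ("fw2", "sw1"), ("fw2", "sw2"),
    ("sw1", "sw2"),
    ("sw1", "node"), ("sw2", "node"),
]
# Same wiring, but the node has only one uplink, to sw1.
one_uplink = [link for link in planned if link != ("sw2", "node")]

print("planned:", single_points_of_failure(planned, "node", "internet"))
print("one uplink:", single_points_of_failure(planned=None or one_uplink, src="node", dst="internet")
      if False else single_points_of_failure(one_uplink, "node", "internet"))
```

Under these assumptions the planned wiring has no single point of failure between the node and the outside, while a node with a single uplink depends entirely on the switch it is plugged into.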

Does this make sense?

Thanks

Casper

offerlam · Jan 6, 2014

OK, so today I went back to the datacenter..

All my nodes are now set to use only eth0 for networking, which is how I keep my BMC responding and working. The fence_node command works and all looks great, so I figured I would try pulling the power cord to see what would happen....

NOTHING... the VMs stayed on the node and no migration happened. I think I need the external fencing device mir was talking about...
I have looked into this one, the AP7920 from APC. It's an APC Switched Rack PDU 10A 1U 208/230V, which I believe supports fencing according to this site: https://access.redhat.com/site/articles/28603#15 Am I right? Would this work well as an external fencing device?

Another EUREKA moment for me today: by accident I inspected the NIC a little more closely and noticed that it also had the text "mgmt" next to it. I'm thinking maybe HP was wrong, and that if I set IPMI in the BIOS to dedicated, NIC1 would become my dedicated IPMI/BMC NIC??? I tried it out with both DHCP and a static IP set in the BIOS, but it's not responding... any suggestions?

Thanks all..

Casper

offerlam · Jan 6, 2014

FYI:

After a long, long talk with HP, clearing up misunderstandings and so on, it appears that if I choose dedicated in the BIOS for IPMI I need the management port expansion thing. If I go with shared, it will use NIC1, and the port is shared with whatever IP config I give that NIC. I need to test this, but it makes sense... one more problem is dead.

thheo · Jan 6, 2014

You could use an Ethernet-controlled PDU to achieve fencing. However, I am thinking that if the only reason for your IPMI not to work is a power outage on that node, then
you could also set up a secondary, fake fencing device that returns success no matter what, so that rgmanager would restart the services on the other nodes.
Do you think you need to protect yourself from a situation where that node has no power? In such a case the node doesn't work, so it won't affect the storage, but indeed, if I remember
correctly, if the fence command is not successful then the cluster will not release the lock for rgmanager.
Also, if the node has no power, what is the chance that the PDU has power so you can control it? Do you have dual PSUs in each server?
Could you also check /var/log/cluster/rgmanager.log and the other logs to see what happens when you turn off the power?
Also, when you manually fence a node and it gets brutally restarted, do the VMs start on the other nodes?
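
The pulled-power-cord result follows from the behaviour described above: services are only recovered once some fence method reports success, and a BMC loses power together with its server. The Python sketch below is a toy model of that logic under stated assumptions, not the code of fenced or rgmanager; the function names, the state dictionary and its keys are all made up for illustration.

```python
def ipmi_fence(state):
    # Assumption: the BMC is fed by the server's own power supply,
    # so it cannot answer once the cord is pulled.
    return state["has_power"] and state["bmc_reachable"]


def pdu_fence(state):
    # Assumption: the PDU sits on a separate feed and can cut the outlet.
    return state["pdu_reachable"]


def fence(state, methods):
    """Try each (name, method) pair in order; return the first name that succeeds, else None."""
    for name, attempt in methods:
        if attempt(state):
            return name
    return None


def may_recover_services(state, methods):
    """Model: VMs are restarted elsewhere only after a fence method succeeded."""
    return fence(state, methods) is not None


# A node whose power cord was pulled (hypothetical state).
unplugged = {"has_power": False, "bmc_reachable": True, "pdu_reachable": True}

print("IPMI only:", may_recover_services(unplugged, [("ipmi", ipmi_fence)]))
print("IPMI then PDU:", fence(unplugged, [("ipmi", ipmi_fence), ("pdu", pdu_fence)]))
```

In this model IPMI alone never confirms the fence for an unplugged node, so recovery stays blocked, while a second method behind it lets recovery proceed. A "fake" device that always reports success would unblock it too, at the cost of no longer proving that the node is really down.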

mir · Jan 6, 2014

From product sheet:
"Remote Individual Outlet Control: Remotely manage outlets so users can turn outlets off that are not in use (prevent overloads) or recycle power to locked-up equipment (minimize costly downtime and avoid travel time to equipment)."

So yes, this will do.

cesarpk · Jan 6, 2014

thheo said:
You could use an Ethernet-controlled PDU to achieve fencing. However, I am thinking that if the only reason for your IPMI not to work is a power outage on that node, then
you could also set up a secondary, fake fencing device that returns success no matter what, so that rgmanager would restart the services on the other nodes.
Do you think you need to protect yourself from a situation where that node has no power? In such a case the node doesn't work, so it won't affect the storage, but indeed, if I remember
correctly, if the fence command is not successful then the cluster will not release the lock for rgmanager.
Also, if the node has no power, what is the chance that the PDU has power so you can control it? Do you have dual PSUs in each server?
Could you also check /var/log/cluster/rgmanager.log and the other logs to see what happens when you turn off the power?
Also, when you manually fence a node and it gets brutally restarted, do the VMs start on the other nodes?

Right, thheo. If you want to avoid a single point of failure, it will be necessary to use two UPSs for two PDUs, and the cluster configuration file must have both PDUs configured. Red Hat also has this situation documented; please see this link:
https://access.redhat.com/site/docu...ple_-_Fence_Devices/#Dual_Power_Configuration

Or this, "Multiple Fence Methods per Node" (easier to understand and apply to PVE):
https://access.redhat.com/site/docu..._Administration/s1-config-fencing-cli-CA.html

Or this, "Configuring a Backup Fence Device" (the easiest to understand and apply to PVE):
https://access.redhat.com/site/docu...nistration/s2-backup-fence-config-ccs-CA.html

You must also think about what happens if a hardware component on the mainboard fails, such as a PCIe slot, the RAID card, or any other board. These are not power failures, and in some cases a manual fence with human intervention may be necessary. So I think that, as best practice, it is always necessary to have manual_fence enabled as the third option, with the two previous options being the APC PDUs.

Best regards
Cesar

cesarpk · Jan 6, 2014

dietmar said:
If you use such a device (UPS) you can also use it for fencing.

Hi Dietmar, a pleasure to greet you again.

Please let me ask a question:

In this link I see that Red Hat doesn't mention a UPS as a fence device:
https://access.redhat.com/site/articles/28603

Then why do you say that a UPS may be used as a fence device?
And please, if it can be used as a fence device, give me a link on how to configure it in the cluster configuration file.

Best regards
Cesar

mir · Jan 6, 2014

The wiki is your friend: http://pve.proxmox.com/wiki/Fencing#APC_Switch_Rack_PDU

cesarpk · Jan 7, 2014

mir said:
The wiki is your friend: http://pve.proxmox.com/wiki/Fencing#APC_Switch_Rack_PDU

Hi mir, as always a pleasure to greet you!!!

I know about the cluster configuration with the APC Switch Rack PDU; I have tested it. But I don't know whether a UPS exists that works as a fence device, and if it exists, whether it is supported by the Red Hat cluster software. This is why I put my questions to Dietmar.

Best regards
Cesar

offerlam · Jan 7, 2014

First off, thanks for the huge help so far from all of you!

I'm confused though... do I or do I NOT need an external fence device in order to have VMs migrate or boot up on another node if a node loses power, has a mobo failure, or suffers some other internal failure that kills the server? Or can it be done with fencing on the servers themselves?

I would really like a solution where I wouldn't have to invest in more gear like the PDU or UPS, or whatever.

Therefore what thheo suggests sounds interesting:

secondary fake fencing device

How would you achieve this? And how solid would such a solution be?

thheo wrote:

Do you think you need to protect yourself from a situation where that node has no power? In such a case the node doesn't work, so it won't affect the storage, but indeed, if I remember
correctly, if the fence command is not successful then the cluster will not release the lock for rgmanager.
Also, if the node has no power, what is the chance that the PDU has power so you can control it? Do you have dual PSUs in each server?

That is my worry exactly: my servers only have one PSU! I've got 3 of them, so that's OK IF I can get the VMs to start on another node in case of a power failure. My servers are in a hosted rack in a datacenter, so I really don't think I will lose power at any point in a way that could justify buying a PDU. I need a way to compensate for the fact that my servers only have one PSU...
I know that the server affected by the PSU failure won't affect storage, but those VMs would be unavailable until power is restored, and I can't live with that. I need them to start up on another node.

cesarpk wrote:

Right, thheo. If you want to avoid a single point of failure, it will be necessary to use two UPSs for two PDUs, and the cluster configuration file must have both PDUs configured. Red Hat also has this situation documented; please see this link:
https://access.redhat.com/site/docum..._Configuration

Or this, "Multiple Fence Methods per Node" (easier to understand and apply to PVE):
https://access.redhat.com/site/docum...ng-cli-CA.html

Or this, "Configuring a Backup Fence Device" (the easiest to understand and apply to PVE):
https://access.redhat.com/site/docum...ig-ccs-CA.html

You must also think about what happens if a hardware component on the mainboard fails, such as a PCIe slot, the RAID card, or any other board. These are not power failures, and in some cases a manual fence with human intervention may be necessary. So I think that, as best practice, it is always necessary to have manual_fence enabled as the third option, with the two previous options being the APC PDUs.

Best regards
Cesar

Thanks for the input, Cesar. I will study those sites when I get the time.

@Mir

Thanks, now I have a backup plan in case I can't get this working without more gear!

Thanks for your input once more!

Casper

dietmar · Jan 7, 2014

cesarpk said:
In this link I see that Red Hat doesn't mention a UPS as a fence device:
https://access.redhat.com/site/articles/28603

They call it PDU.

cesarpk · Jan 7, 2014

dietmar said:
They call it PDU.

Thanks, Dietmar, for your reply, but in this link I don't see anything about the autonomy (or runtime) of a battery:
http://www.apc.com/resource/include/techspec_index.cfm?base_sku=ap7921&tab=features

Please show me with a Web link if I'm wrong.

Best regards
Cesar

cesarpk · Jan 7, 2014

@offerlam:

In a small business I have this scenario:
Only two PVE nodes in HA for the VMs, and without fence devices.
I can always do manual online migration of the VMs without problems (as long as the two nodes are working perfectly).

To get HA I use only manual_fence, i.e. if a PVE node shows erratic behavior, I manually disconnect the AC power of that node, and then on the CLI of the other PVE node I run the manual fence. The VMs of the failed node will then start on the node that is alive.

I have had this setup in a production environment for many years without problems, but for more safety you can have both fences, the "ILO" one and the "Manual" one as backup, and obviously the "ILO fence" must be your first option on both PVE nodes.

Best regards
Cesar

Comment: This post has been re-edited

dietmar · Jan 7, 2014

cesarpk said:
Thanks, Dietmar, for your reply, but in this link I don't see anything about the autonomy (or runtime) of a battery:
http://www.apc.com/resource/include/techspec_index.cfm?base_sku=ap7921&tab=features

Please show me with a Web link if I'm wrong.

The APC "UPS Network Management Card" is listed as AP9606. But AFAIK that product is discontinued, and is now called AP9630 ...

cesarpk · Jan 7, 2014

dietmar said:
The APC "UPS Network Management Card" is listed as AP9606. But AFAIK that product is discontinued, and is now called AP9630 ...

Many thanks, Dietmar, I will study this option. When? I do not know.

Best regards
Cesar
