HA does not work as expected... is it feature complete?

kankamuso

Dear proxmox developers,

We have correctly configured HA following your demo video (thanks a lot!). It migrates an HA-managed machine as soon as we stop the rgmanager service. So far so good. Nevertheless, if we try the "classical" approach for testing HA, no failover happens:

- Pull power cord.
- Seconds later, the host appears as powered off.
- Two to three minutes later, the HA-enabled machine appears offline (little black screen).
- After a while... nothing happens.

Does this mean...

- Did we do something wrong? Is there a timer we need to configure?
- Are we expected to script something on our systems (if so, what)?
- Is Proxmox HA feature complete? (e.g. failover domains are not accessible in the GUI, but are they necessary for HA to work as I expect?)

Thanks in advance.

Jose.
 
Most likely you are running a cluster with 2 nodes?

Can you please post the cluster config (/etc/pve/cluster.conf)?
 
Most likely you are running a cluster with 2 nodes?

Can you please post the cluster config (/etc/pve/cluster.conf)?

Yes, we do.

Here is the config:

<?xml version="1.0"?>
<cluster config_version="38" name="murgiHA">
  <cman expected_votes="1" keyfile="/var/lib/pve-cluster/corosync.authkey" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_ilo_mp" ipaddr="192.168.0.102" login="userilo" name="nodeaiLO" passwd="XXXXXX"/>
    <fencedevice agent="fence_ilo_mp" ipaddr="192.168.0.103" login="userilo" name="nodebiLO" passwd="XXXXXX"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="nodea" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="nodeaiLO"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="nodeb" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="nodebiLO"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="104"/>
    <pvevm autostart="1" vmid="101"/>
  </rm>
</cluster>
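
As far as I understand, the usual way to inspect the cluster state on a Proxmox VE 2.x / cman setup like this is to run, on each node:

pvecm status     # quorum and membership as seen by Proxmox
clustat          # rgmanager's view of members and HA-managed services
fence_tool ls    # state of the fence domain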
 
Dear Tom,

As Dietmar stated in the last post of this thread:

http://forum.proxmox.com/threads/7786-Cluster-and-quorum?highlight=qdisk

it is possible to give two votes to one of the nodes. Does this mean no qdisk is needed then? We have enough redundancy in our interconnection network to prevent connection problems.
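
If I understand that suggestion correctly, it would mean dropping two_node and giving one node an extra vote, something like this in cluster.conf (only the changed lines shown; just a sketch on our node names, not something we have tested):

<cman expected_votes="3" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
<clusternode name="nodea" nodeid="1" votes="2">
<clusternode name="nodeb" nodeid="2" votes="1">

With 3 votes in total the quorum is 2, so nodea alone would keep quorum if nodeb dies, but not the other way around.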

Also, this thread on the mailing list suggests it is possible (see the second post in the thread):

http://pve.proxmox.com/pipermail/pve-user/2012-January/003067.html

If a qdisk is a must... we don't have any shared storage other than a DRBD partition on both nodes. Is this setup suitable for creating a qdisk partition (DRBD in active-active mode)?

Regards,

Jose.
 
Your config should work (qdisk is not strictly needed). Maybe there was a problem with fencing? Any hints in the logs?
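
For example (assuming the default Debian log locations on your nodes), something like

grep -E 'fenced|rgmanager' /var/log/syslog
grep -E 'fenced|rgmanager' /var/log/daemon.log

should show whether fencing was attempted and what the fence agent returned.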
 
Your config should work (qdisk is not strictly needed). Maybe there was a problem with fencing? Any hints in the logs?

Thanks, that is good news. Here is the log from the surviving machine:

Jan 18 08:32:33 almerimar corosync[1974]: [QUORUM] Members[1]: 2
Jan 18 08:32:33 almerimar corosync[1974]: [CLM ] CLM CONFIGURATION CHANGE
Jan 18 08:32:33 almerimar corosync[1974]: [CLM ] New Configuration:
Jan 18 08:32:33 almerimar corosync[1974]: [CLM ] #011r(0) ip(192.168.0.3)
Jan 18 08:32:33 almerimar corosync[1974]: [CLM ] Members Left:
Jan 18 08:32:33 almerimar corosync[1974]: [CLM ] Members Joined:
Jan 18 08:32:33 almerimar corosync[1974]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 18 08:32:33 almerimar rgmanager[2507]: State change: elejido DOWN
Jan 18 08:32:33 almerimar kernel: dlm: closing connection to node 1
Jan 18 08:32:33 almerimar corosync[1974]: [CPG ] chosen downlist: sender r(0) ip(192.168.0.3) ; members(old:2 left:1)
Jan 18 08:32:33 almerimar pmxcfs[1700]: [dcdb] members: 2/1700
Jan 18 08:32:33 almerimar pmxcfs[1700]: [status] members: 2/1700
Jan 18 08:32:33 almerimar corosync[1974]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 18 08:32:33 almerimar fenced[2314]: fencing node elejido
Jan 18 08:32:38 almerimar fenced[2314]: fence elejido dev 0.0 agent fence_ilo_mp result: error from agent
Jan 18 08:32:38 almerimar fenced[2314]: fence elejido failed
Jan 18 08:32:41 almerimar fenced[2314]: fencing node elejido
Jan 18 08:32:46 almerimar fenced[2314]: fence elejido dev 0.0 agent fence_ilo_mp result: error from agent
Jan 18 08:32:46 almerimar fenced[2314]: fence elejido failed
Jan 18 08:32:49 almerimar fenced[2314]: fencing node elejido
Jan 18 08:32:55 almerimar fenced[2314]: fence elejido dev 0.0 agent fence_ilo_mp result: error from agent
Jan 18 08:32:55 almerimar fenced[2314]: fence elejido failed

It does indeed seem to be failing on fencing... (the machine is powered off, but fencing should still work, right?). Any hints on how to dig deeper into the problem?
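
In the meantime, one way to dig deeper - assuming fence_ilo_mp takes the usual fence-agent options (-a address, -l login, -p password, -o action) - is to run the agent by hand against the iLO and to exercise the configured fencing path directly:

fence_ilo_mp -a 192.168.0.103 -l userilo -p XXXXXX -o status   # talk to the iLO directly (password redacted as above)
fence_node elejido                                             # fence the node exactly as cluster.conf would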

Thanks a lot for your valuable help!
 
But wait a moment... If I pull the power cord (or the power goes out), fencing is always going to fail... the iLO has no power to answer...
 
corosync is always active and needed, as it is used for pmxcfs. So yes, in any Proxmox VE 2.x cluster, HA or not, multicast is needed.
 
corosync is always active and needed, as it is used for pmxcfs. So yes, in any Proxmox VE 2.x cluster, HA or not, multicast is needed.

But corosync seems to be working. In fact, if I stop rgmanager on one server, machines are migrated to the other node as expected. In any other case they are not migrated. I desperately need to get HA working :-(... We would be delighted to help write the wiki on this once we get it working... is that possible?

Thanks in advance.
 
But wait a moment... If I pull the power cord (or the power goes out), fencing is always going to fail... the iLO has no power to answer...

That explains why it does not work - fencing needs to succeed - otherwise HA can't start the resource on another node.
 
That explains why it does not work - fencing needs to succeed - otherwise HA can't start the resource on another node.

Does that mean HA is not going to work under Proxmox in the event of a power outage? It has also failed when disconnecting the data Ethernet interface (the fencing one stays connected). The other node is detected as offline, but no fencing appears in the log then.
 
It worked!! We had a problem with the fence agent: we were using fence_ilo_mp instead of fence_ilo, and we did not have an "action" attribute in the config.

Hope this helps !!
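
For anyone hitting the same thing, the corrected fence device lines would look roughly like this (a sketch based on the config posted above - passwords redacted, and the right action value may differ for your hardware):

<fencedevice agent="fence_ilo" ipaddr="192.168.0.102" login="userilo" name="nodeaiLO" passwd="XXXXXX" action="reboot"/>
<fencedevice agent="fence_ilo" ipaddr="192.168.0.103" login="userilo" name="nodebiLO" passwd="XXXXXX" action="reboot"/>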
 
...
If a qdisk is a must... we don't have any shared storage other than a DRBD partition on both nodes. Is this setup suitable for creating a qdisk partition (DRBD in active-active mode)?
...
Hi Jose,
I assume that using a DRBD disk as a quorum disk is a bad idea. If the link between the nodes fails, both nodes see their own quorum disk and consider themselves the running part of the cluster - which results in a split cluster (and, with a fencing device, each tries to kill the other node).

Udo
 
Hi Jose,
I assume that using a DRBD disk as a quorum disk is a bad idea. If the link between the nodes fails, both nodes see their own quorum disk and consider themselves the running part of the cluster - which results in a split cluster (and, with a fencing device, each tries to kill the other node).

Udo


Udo - AFAIK DRBD would be more reliable than qdisk, as there is no single point of failure. (I have not used qdisk, but I have read the docs at Red Hat.)
DRBD has been very reliable for us, although it had a steep learning curve for me. However, the DRBD docs have evolved and are now much easier for me to understand.
 
Udo - AFAIK DRBD would be more reliable than qdisk, as there is no single point of failure. (I have not used qdisk, but I have read the docs at Red Hat.)
DRBD has been very reliable for us, although it had a steep learning curve for me. However, the DRBD docs have evolved and are now much easier for me to understand.
Hi,
of course DRBD is a good solution, but as a qdisk only if:
a) the nodes with the DRBD disks are not cluster members (like a NAS which also provides the cluster with storage), and
b) both cluster nodes look at the same qdisk node.

Otherwise you don't need a qdisk, because that defeats exactly what the qdisk is supposed to prevent: you can easily split the cluster (the nodes don't see each other) and both nodes keep running with a valid quorum, because each sees its own part of the qdisk!

I think if you use a qdisk, you must choose the right connection for the disk. AFAIK it's not useful to connect the cluster nodes via Ethernet and use an FC qdisk - if the Ethernet connection fails, the FC connection can still be working...

Udo
 
I've installed Proxmox 2.0 on 2 separate servers joined by a crossover Ethernet cable and it worked properly.
However, when installing Proxmox 2.0 on an HP BL460 blade I've experienced the "offline" issue mentioned by kankamuso in the first post.

I've run some multicast tests and it's working fine.
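
For reference, a simple way to check multicast on a setup like this (assuming omping is installed; hostnames are placeholders) is to run, simultaneously on both nodes:

omping nodea nodeb    # watch for multicast responses from the other node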

Any ideas?

Regards,

Juan Ignacio

 
Also tried with 2.1 and still getting second node offline :(


 
