HA does not work as expected... is it feature complete?

kankamuso

Dear proxmox developers,

We have correctly configured HA following your demo video (thanks a lot!). It migrates an HA-managed machine as soon as we stop the rgmanager service. So far so good. Nevertheless, if we try the "classical" approach for testing HA, no failover happens:

- Pull power cord.
- Seconds later, the host appears as powered off.
- Two to three minutes later, the HA-enabled machine appears offline (little black screen).
- After a while... nothing happens.

Does this mean...

- Did we do something wrong? Is there a timer we need to configure?
- Are we expected to script something on our systems (if so, what)?
- Is Proxmox HA feature complete? (e.g. failover domains are not accessible in the GUI, but are they necessary for HA to work as I expect?)

Thanks in advance.

Jose.
 
Most likely you are running a cluster with 2 nodes?

Can you please post the cluster config (/etc/pve/cluster.conf)?
 
Most likely you are running a cluster with 2 nodes?

Can you please post the cluster config (/etc/pve/cluster.conf)?

Yes, we do.

Here is the config:

<?xml version="1.0"?>
<cluster config_version="38" name="murgiHA">
  <cman expected_votes="1" keyfile="/var/lib/pve-cluster/corosync.authkey" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_ilo_mp" ipaddr="192.168.0.102" login="userilo" name="nodeaiLO" passwd="XXXXXX"/>
    <fencedevice agent="fence_ilo_mp" ipaddr="192.168.0.103" login="userilo" name="nodebiLO" passwd="XXXXXX"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="nodea" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="nodeaiLO"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="nodeb" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="nodebiLO"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="104"/>
    <pvevm autostart="1" vmid="101"/>
  </rm>
</cluster>
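
As far as I understand, the usual way to inspect the cluster state on a Proxmox VE 2.x / cman setup like this is to run, on each node:

pvecm status     # quorum and membership as seen by Proxmox
clustat          # rgmanager's view of members and HA-managed services
fence_tool ls    # state of the fence domain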
 
Dear Tom,

As Dietmar stated in the last post of this thread:

http://forum.proxmox.com/threads/7786-Cluster-and-quorum?highlight=qdisk

it is possible to give two votes to one of the nodes. Does this mean no qdisk is needed then? We have enough redundancy in our interconnection network to prevent connection problems.
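
If I understand that suggestion correctly, it would mean dropping two_node and giving one node an extra vote, something like this in cluster.conf (only the changed lines shown; just a sketch on our node names, not something we have tested):

<cman expected_votes="3" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
<clusternode name="nodea" nodeid="1" votes="2">
<clusternode name="nodeb" nodeid="2" votes="1">

With 3 votes in total the quorum is 2, so nodea alone would keep quorum if nodeb dies, but not the other way around.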

Also, this thread on the mailing list suggests it is possible (see the second post in the thread):

http://pve.proxmox.com/pipermail/pve-user/2012-January/003067.html

If a qdisk is a must... we don't have any shared storage other than a DRBD partition on both nodes. Is this setup suitable for creating a qdisk partition (DRBD in active-active mode)?

Regards,

Jose.
 
Your config should work (qdisk is not strictly needed). Maybe there was a problem with fencing? Any hints in the logs?
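
For example (assuming the default Debian log locations on your nodes), something like

grep -E 'fenced|rgmanager' /var/log/syslog
grep -E 'fenced|rgmanager' /var/log/daemon.log

should show whether fencing was attempted and what the fence agent returned.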
 
Your config should work (qdisk is not strictly needed). Maybe there was a problem with fencing? Any hints in the logs?

Thanks, that is good news. Here is the log from the surviving machine:

Jan 18 08:32:33 almerimar corosync[1974]: [QUORUM] Members[1]: 2
Jan 18 08:32:33 almerimar corosync[1974]: [CLM ] CLM CONFIGURATION CHANGE
Jan 18 08:32:33 almerimar corosync[1974]: [CLM ] New Configuration:
Jan 18 08:32:33 almerimar corosync[1974]: [CLM ] #011r(0) ip(192.168.0.3)
Jan 18 08:32:33 almerimar corosync[1974]: [CLM ] Members Left:
Jan 18 08:32:33 almerimar corosync[1974]: [CLM ] Members Joined:
Jan 18 08:32:33 almerimar corosync[1974]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 18 08:32:33 almerimar rgmanager[2507]: State change: elejido DOWN
Jan 18 08:32:33 almerimar kernel: dlm: closing connection to node 1
Jan 18 08:32:33 almerimar corosync[1974]: [CPG ] chosen downlist: sender r(0) ip(192.168.0.3) ; members(old:2 left:1)
Jan 18 08:32:33 almerimar pmxcfs[1700]: [dcdb] members: 2/1700
Jan 18 08:32:33 almerimar pmxcfs[1700]: [status] members: 2/1700
Jan 18 08:32:33 almerimar corosync[1974]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 18 08:32:33 almerimar fenced[2314]: fencing node elejido
Jan 18 08:32:38 almerimar fenced[2314]: fence elejido dev 0.0 agent fence_ilo_mp result: error from agent
Jan 18 08:32:38 almerimar fenced[2314]: fence elejido failed
Jan 18 08:32:41 almerimar fenced[2314]: fencing node elejido
Jan 18 08:32:46 almerimar fenced[2314]: fence elejido dev 0.0 agent fence_ilo_mp result: error from agent
Jan 18 08:32:46 almerimar fenced[2314]: fence elejido failed
Jan 18 08:32:49 almerimar fenced[2314]: fencing node elejido
Jan 18 08:32:55 almerimar fenced[2314]: fence elejido dev 0.0 agent fence_ilo_mp result: error from agent
Jan 18 08:32:55 almerimar fenced[2314]: fence elejido failed

It does indeed seem to be failing on fencing... (the machine is powered off, but fencing should still work, right?). Any hints on how to dig deeper into the problem?
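
In the meantime, one way to dig deeper - assuming fence_ilo_mp takes the usual fence-agent options (-a address, -l login, -p password, -o action) - is to run the agent by hand against the iLO and to exercise the configured fencing path directly:

fence_ilo_mp -a 192.168.0.103 -l userilo -p XXXXXX -o status   # talk to the iLO directly (password redacted as above)
fence_node elejido                                             # fence the node exactly as cluster.conf would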

Thanks a lot for your valuable help!
 
But wait a moment... If I pull the power cord (or the power goes out), fencing is always going to fail... the iLO has no power to answer...
 
corosync is always active and needed, as it is used for pmxcfs. So yes, in any Proxmox VE 2.x cluster, HA or not, multicast is needed.
 
corosync is always active and needed, as it is used for pmxcfs. So yes, in any Proxmox VE 2.x cluster, HA or not, multicast is needed.

But corosync seems to be working. In fact, if I stop rgmanager on one server, machines are migrated to the other node as expected. In any other case they are not migrated. I desperately need to get HA working :-(... We would be delighted to help write the wiki on this once we get it working... is that possible?

Thanks in advance.
 
But wait a moment... If I pull the power cord (or the power goes out), fencing is always going to fail... the iLO has no power to answer...

That explains why it does not work - fencing needs to succeed - otherwise HA can't start the resource on another node.
 
That explains why it does not work - fencing needs to succeed - otherwise HA can't start the resource on another node.

Does that mean HA is not going to work under Proxmox in the event of a power outage? It has also failed when disconnecting the data Ethernet interface (the fencing one stays connected). The other node is detected as offline, but no fencing appears in the log then.
 
It worked!! We had a problem with the fence agent: we were using fence_ilo_mp instead of fence_ilo, and we did not have an "action" attribute in the config.

Hope this helps !!
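
For anyone hitting the same thing, the corrected fence device lines would look roughly like this (a sketch based on the config posted above - passwords redacted, and the right action value may differ for your hardware):

<fencedevice agent="fence_ilo" ipaddr="192.168.0.102" login="userilo" name="nodeaiLO" passwd="XXXXXX" action="reboot"/>
<fencedevice agent="fence_ilo" ipaddr="192.168.0.103" login="userilo" name="nodebiLO" passwd="XXXXXX" action="reboot"/>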
 
...
If a qdisk is a must... we don't have any shared storage other than a DRBD partition on both nodes. Is this setup suitable for creating a qdisk partition (DRBD in active-active mode)?
...
Hi Jose,
I assume that using a DRBD disk as a quorum disk is a bad idea. If the link between the nodes fails, both nodes see their own quorum disk and consider themselves the running part of the cluster - which results in a split cluster (and, with a fencing device, each tries to kill the other node).

Udo
 
Hi Jose,
I assume that using a DRBD disk as a quorum disk is a bad idea. If the link between the nodes fails, both nodes see their own quorum disk and consider themselves the running part of the cluster - which results in a split cluster (and, with a fencing device, each tries to kill the other node).

Udo


Udo - AFAIK DRBD would be more reliable than qdisk, as there is no single point of failure. (I have not used qdisk, but I have read the docs at Red Hat.)
DRBD has been very reliable for us, although it had a steep learning curve for me. However, the DRBD docs have evolved and are now much easier for me to understand.
 
Udo - AFAIK DRBD would be more reliable than qdisk, as there is no single point of failure. (I have not used qdisk, but I have read the docs at Red Hat.)
DRBD has been very reliable for us, although it had a steep learning curve for me. However, the DRBD docs have evolved and are now much easier for me to understand.
Hi,
of course DRBD is a good solution, but as a qdisk only if:
a) the nodes with the DRBD disks are not cluster members (like a NAS which also provides the cluster with storage), and
b) both cluster nodes look at the same qdisk node.

Otherwise you don't need a qdisk, because that defeats exactly what the qdisk is supposed to prevent: you can easily split the cluster (the nodes don't see each other) and both nodes keep running with a valid quorum, because each sees its own part of the qdisk!

I think if you use a qdisk, you must choose the right connection for the disk. AFAIK it's not useful to connect the cluster nodes via Ethernet and use an FC qdisk - if the Ethernet connection fails, the FC connection can still be working...

Udo
 
I've installed Proxmox 2.0 on 2 separate servers joined by a crossover Ethernet cable and it worked properly.
However, when installing Proxmox 2.0 on an HP BL460 blade I've experienced the "offline" issue mentioned by kankamuso in the first post.

I've run some multicast tests and it's working fine.
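
For reference, a simple way to check multicast on a setup like this (assuming omping is installed; hostnames are placeholders) is to run, simultaneously on both nodes:

omping nodea nodeb    # watch for multicast responses from the other node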

Any ideas?

Regards,

Juan Ignacio

 
Also tried with 2.1 and still getting second node offline :(


 
