Problem with HA

nsscorp · Jan 10, 2013

We have installed a HA cluster with 2 HP servers using iLO interfaces for fencing. The system is full updated and it's running version 2.2-31.

From time to time the first node is loosing the communication with the second node (I guess so). After that the cluster appears as broken and we cannot even change7 the setting of the VMs (we have Device or resource busy message).

The prox1 logs:

rgmanager.log
Jan 09 16:41:34 rgmanager State change: prox02 DOWN

fenced.log
Jan 09 16:41:34 fenced fencing node prox02
Jan 09 16:41:36 fenced fence prox02 success

pvecm nodes
Node Sts Inc Joined Name
1 M 692 2012-12-18 12:15:57 prox01
2 X 716 prox02

The prox2 logs:

rgmanager.log
Jan 09 16:41:37 rgmanager #67: Shutting down uncleanly

fenced.log
Jan 09 16:41:37 fenced cluster is down, exiting

pvecm nodes
cman_tool: Cannot open connection to cman, is it running ?

We also tried to restart cman to the second node and we have the following messages:
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Tuning DLM kernel config... [ OK ]
Starting GFS2 Control Daemon: gfs_controld.
Unfencing self... fence_node: cannot connect to cman
[FAILED]

and if we check the cman status it displays:

root@prox02:/var/run# /etc/init.d/cman status
Found stale pid file

Also if we check the fencing from prox01 it works always without a problem but from prox02:

fence_node: cannot connect to cman

If we reboot the two nodes the HA is working again without any problem until it broke again.

Is there any way to recover the cluster without rebooting the two nodes ?
How can we solve the cluster problem ?

tom · Jan 10, 2013

HA with two nodes is not really recommended and the basis for issues: see http://pve.proxmox.com/wiki/High_Availability_Cluster#System_requirements

post also your cluster.conf.

nsscorp · Jan 10, 2013

Hi Tom.

That's why we are using iLo for fencing. Here is my cluster.conf (the passwords are removed for security reasons).

<?xml version="1.0"?>
<cluster config_version="27" name="hfcloud">
<cman expected_votes="1" keyfile="/var/lib/pve-cluster/corosync.authkey" two_node="1"/>
<fencedevices>
<fencedevice agent="fence_ilo" ipaddr="10.10.50.200" login="hpcluster" name="fenceNodeA" passwd="REMOVED"/>
<fencedevice agent="fence_ilo" ipaddr="10.10.50.201" login="hpcluster" name="fenceNodeB" passwd="REMOVED"/>
</fencedevices>
<clusternodes>
<clusternode name="prox01" nodeid="1" votes="1">
<fence>
<method name="1">
<device action="status" name="fenceNodeA"/>
</method>
</fence>
</clusternode>
<clusternode name="prox02" nodeid="2" votes="1">
<fence>
<method name="1">
<device action="status" name="fenceNodeB"/>
</method>
</fence>
</clusternode>
</clusternodes>
<rm/>
<totem window_size="100"/>

dietmar · Jan 10, 2013

nsscorp said:
From time to time the first node is loosing the communication with the second node (I guess so). After that the cluster appears as broken and we cannot even change7 the setting of the VMs (we have Device or resource busy message).

Node prox01 will fence prox02 if communication is broken (if fact both will fence each other). Are you sure fencing works? In your case, a reboot of prox02 should clear the issue?

nsscorp · Jan 10, 2013

I know that if we reboot the prox02 node the issue will cleared. Is there any way to prevent rebooting (the system is productive) ?

In this case fencing works.

Regards,

Antonis

dietmar · Jan 10, 2013

nsscorp said:
In this case fencing works.

How can fencing over IP work if communication is broken?

dietmar · Jan 10, 2013

nsscorp said:
I know that if we reboot the prox02 node the issue will cleared. Is there any way to prevent rebooting (the system is productive) ?

This is just my opinion, but what you do here is extremely dangerous. You run with two-node=1 and use unreliable fencing. I would never ever do that or suggest someone to use such setup.

nsscorp · Jan 10, 2013

There is no permanent disconnection but instantly.

dietmar · Jan 10, 2013

nsscorp said:
There is no permanent disconnection but instantly.

That is enough to get into an inconsistent state, because you effectively disable quorum protection with 'two-node=1'.

Search

Search

Problem with HA

nsscorp

New Member

tom

Proxmox Staff Member

nsscorp

New Member

dietmar

Proxmox Staff Member

nsscorp

New Member

dietmar

Proxmox Staff Member

dietmar

Proxmox Staff Member

nsscorp

New Member

dietmar

Proxmox Staff Member