Recovery after manual fencing

Hello, guys

We are using a 3-node cluster with recent Proxmox VE (2.2): Node01, Node02, and Node-test for quorum, with manual fencing (for the time being).
Last night something strange happened and the cluster said: "Node01 is down" (in fact the node was alive and fully operational :( ).
We ran fence_ack_manual Node01 (on Node02) so that HA could proceed, and all the VMs with HA enabled successfully moved to Node02.
But when we restarted Node01, both Node01 and Node02 started rebooting continuously (one after the other).

Did we miss something? Do we need to run additional commands to tell the cluster that Node01 is back? Should the node leave the cluster/fence domain before it starts again?

Could you please provide an accurate guide for what one should do when a node fails and is recovered afterwards (in terms of cluster/fence domain management)?

Thanks in advance!
 
Sounds like you have an issue with your fencing config - what's your cluster.conf? Do you have any fencing configured at all?
 
Sounds like you have an issue with your fencing config - what's your cluster.conf? Do you have any fencing configured at all?

Here is my cluster config file:

Code:
<?xml version="1.0"?>
<cluster config_version="35" name="zim-cluster">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <fencedevice agent="fence_manual" name="human"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="pve-node01" nodeid="1" votes="1">
      <fence>
        <method name="single">
          <device name="human" nodename="pve-node01"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="pve-node02" nodeid="2" votes="1">
      <fence>
        <method name="single">
          <device name="human" nodename="pve-node02"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="pve-test" nodeid="3" votes="1">
      <fence>
        <method name="single">
          <device name="human" nodename="pve-test"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm/>
</cluster>

Well, here is the plan for how it should work by design (please correct me if I'm wrong):

1. Cluster: Node1, Node2, Node-X (X is just for quorum) + datastore (NFS) + manual (human) fencing
2. Node1 hosts VM1, Node2 hosts VM2 (VM1 and VM2 with HA enabled)
3. Node1 fails (for whatever reason)
4. Someone logs in to Node2 and types: fence_ack_manual Node1 (can this be any live node in the cluster, or must it be the fence domain master node?)
5. VM1 moves to Node2 (or Node-X), depending on the failover domain rules (if any)
6. Someone repairs Node1 and starts it up
7. Node1 rejoins the cluster automatically and DOES NOT try to start VM1

Is #7 correct, i.e. are no special commands/steps required to bring Node1 back into the cluster? (Sketched as commands below.)
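
For reference, a sketch of the same flow as commands (node names from the list above; fence_ack_manual, fence_tool and clustat used as elsewhere in this thread, so treat it as a rough outline rather than an exact procedure):
Code:
# on Node2, only after making sure Node1 is really powered off
fence_ack_manual Node1    # acknowledge the manual fence so HA can proceed
clustat                   # watch the HA-managed VM1 get started on another node

# later, after Node1 has been repaired and booted again
fence_tool ls             # on Node1: it should appear as a fence domain member again
clustat                   # VM1 should stay on the node it was relocated to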
 
4. Someone logs in to Node2 and types: fence_ack_manual Node1 (can this be any live node in the cluster, or must it be the fence domain master node?)

And first, you need to make sure the node is really offline (power off the node before running fence_ack_manual).
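
A surviving node's view can be checked as below, but keep in mind it only shows what the cluster believes - as the first post shows, a node can look dead to the cluster while it is actually still running, so physically powering it off is the only safe confirmation (a sketch; clustat output as shown later in this thread):
Code:
clustat    # the node to be fenced should be listed as Offline before you acknowledge the fence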
 
You need to run that on the fence domain master node.

Ok, let's suppose:
1) The failed node was the fence domain master. Will the master be changed by the cluster automatically?
2) Someone ran fence_ack_manual NOT on the fence domain master node. What should they do afterwards? (Nobody is perfect and everybody can make a mistake, so we need to know how to act in the worst case.)
 
And first, you need to make sure the node is really offline (power off the node before running fence_ack_manual).

Yep, I'm very clear on the fencing part (how important it is and what can happen if fencing is not done properly).

I'm mainly asking about the recovery steps (when the failed node comes back to life and we want/need to bring it back into normal cluster operation).

Thanks in advance!
 
In my cluster I've got:

Code:
root@pve-test:~# fence_tool -n ls
fence domain
member count  2
victim count  0
victim now    0
master nodeid 3
wait state    none
members       2 3
all nodes
nodeid 1 member 0 victim 0 last fence master 3 how override
nodeid 2 member 1 victim 0 last fence master 0 how none
nodeid 3 member 1 victim 0 last fence master 0 how none
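
To see which host that master nodeid corresponds to (nodeid 3 here), the membership list can be checked; a minimal sketch, assuming the standard cman tools shipped with PVE 2.x/3.x (cman_tool is not mentioned elsewhere in this thread):
Code:
cman_tool nodes   # lists nodeid, status and name for every cluster node
                  # (the nodeid/name mapping also lives in cluster.conf)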
 
Ok, let's suppose:
1) The failed node was the fence domain master. Will the master be changed by the cluster automatically?

sure.

2) Someone ran fence_ack_manual NOT on the fence domain master node. What should they do afterwards? (Nobody is perfect and everybody can make a mistake, so we need to know how to act in the worst case.)

Simply stop/kill the fence_ack_manual command.

But seriously, you should use a real fence device instead.
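
For example, if the nodes have IPMI management boards, an agent like fence_ipmilan can be tested by hand before wiring it into cluster.conf; a minimal sketch with made-up address and credentials (flag names as in the fence-agents package of that generation, so double-check against your installed version):
Code:
# hypothetical BMC address and credentials - replace with your own
fence_ipmilan -a 192.168.200.250 -l admin -p secret -o status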
 
Hi, I'm experiencing a similar problem to the one described here, so I've reused this thread.

I'm testing a cluster using Proxmox 'self-virtualization' (PVE nodes running inside VMs), as suggested by some user on this forum (thanks to him; I cannot find the thread where I read it, sorry).
I've created 3 virtual machines, each of them running PVE (version 3.1.24).
I've installed and tested the cluster, with the usual problems, all of them solved.
I could migrate a running machine from one node to another without problems.

So I thought, why not go a step further and try to test HA?

I used the 'fence_manual' pseudo-agent as described in this thread.

Code:
<?xml version="1.0"?>
<cluster config_version="28" name="cluster16x">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <fencedevice agent="fence_manual" name="humano"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="vmprox160" nodeid="1" votes="1">
      <fence>
        <method name="single">
          <device name="humano" nodename="vmprox160"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="vmprox161" nodeid="2" votes="1">
      <fence>
        <method name="single">
          <device name="humano" nodename="vmprox161"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="vmprox162" nodeid="3" votes="1">
      <fence>
        <method name="single">
          <device name="humano" nodename="vmprox162"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="3333"/>
  </rm>
</cluster>

Status of working cluster:
Code:
root@vmprox160:~# clustat
Cluster Status for cluster16x @ Sun Dec 22 12:18:32 2013
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 vmprox160                                   1 Online, Local, rgmanager
 vmprox161                                   2 Online, rgmanager
 vmprox162                                   3 Online, rgmanager

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 pvevm:3333                     vmprox160                      started
root@vmprox160:~#
After having it working, I performed this test (with one machine, 3333, declared as HA):
1.- STOP the virtual machine that was running Proxmox server 1 (vmprox160), so that the server and VM 3333 disappeared from the cluster.
2.- Run fence_ack_manual vmprox160 to tell the cluster that vmprox160 has been manually fenced, so that the other servers STOP trying to fence vmprox160 (as described in man fence_ack_manual).
3.- Then VM 3333 automatically restarted on another node, as expected... :p:p:p

The problem is that when I power on (start) vmprox160 again, I am not able to get it back into the fence domain.

It appears correctly in the cluster, it can be used from the web GUI, and it has its vote in the cluster:
Code:
root@vmprox160:~# pvecm s
Version: 6.2.0
Config Version: 28
Cluster Name: cluster16x
Cluster Id: 41800
Cluster Member: Yes
Cluster Generation: 268
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: vmprox160
Node ID: 1
Multicast addresses: 239.192.163.235
Node addresses: 192.168.200.160
root@vmprox160:~#

But it does not show up in the fence domain:

Code:
root@vmprox160:~# fence_tool -n ls
root@vmprox160:~#            <<<<<<<<<<<<  NO ANSWER  <<<<<<<<<<<

root@vmprox162:~# fence_tool -n ls
fence domain
member count  2
victim count  0
victim now    0
master nodeid 2
wait state    none
members       2 3
all nodes
nodeid 1 member 0 victim 0 last fence master 2 how override    <<<<<<<<<<<<<<  ?????
nodeid 2 member 1 victim 0 last fence master 0 how none
nodeid 3 member 1 victim 0 last fence master 0 how none

root@vmprox162:~#

rgmanager service appears as stopped.
Starting it from the web GUI or via /etc/init.d/rgmanager start gives an OK status, but it immediately shows as stopped again.
Code:
root@vmprox160:~# /etc/init.d/rgmanager start
Starting Cluster Service Manager: [  OK  ]
root@vmprox160:~# ps -ef | grep rgm
root        3699    2623  0 11:49 pts/0    00:00:00 grep rgm
root@vmprox160:~#
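
A first diagnostic step (just a sketch) would be to look at what rgmanager logs when it exits, e.g. in syslog:
Code:
grep rgmanager /var/log/syslog | tail -n 20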

Other tests performed:
Code:
root@vmprox160:~# fence_tool join
fence_tool: fenced not running, no lockfile
root@vmprox160:~#


Summary:
I cannot get the machine back into the fence domain.
Maybe something has been set in the cluster that does not allow vmprox160 to get back into the cluster?
What is the meaning of the "override" status at the end of the fence_tool ls output for this machine?

Basically this seems to be the same question asked in post #1 of this thread:

Did we miss something? Do we need to run additional commands to tell the cluster that Node01 is back? Should the node leave the cluster/fence domain before it starts again?

Could you please provide an accurate guide for what one should do when a node fails and is recovered afterwards (in terms of cluster/fence domain management)?

Any hints?

Regards
 
I could solve it with:
Code:
/etc/init.d/cman restart
sleep 5
/etc/init.d/pve-cluster restart
sleep 5
/etc/init.d/rgmanager restart
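After these restarts, a quick verification that the node is really back in the fence domain and running rgmanager (a sketch; same commands as used earlier in this thread):
Code:
fence_tool ls    # should now print the fence domain state, with this node as a member
clustat          # this node should be listed with rgmanager again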
Maybe it is because I am using a STOP on the virtual machine that is running PVE, to simulate a problem with a server?
Regards
 
