Hi, I'm experiencing a similar problem to the one described here, so I've reused this thread.
I'm testing a CLUSTER using Proxmox 'self-virtualization' (nested PVE), as suggested by a user on this forum (thanks to him, I can't find the thread where I read it, sorry).
I've created 3 virtual machines, each of them running PVE (version 3.1.24).
I've installed and tested the cluster, with the usual problems, all of them solved.
I could migrate a running machine from one node to another, again without problems.
So I thought, why not go a step further and try to test HA?
I used the 'fence_manual' pseudo-agent as described in this thread.
Code:
<?xml version="1.0"?>
<cluster config_version="28" name="cluster16x">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <fencedevice agent="fence_manual" name="humano"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="vmprox160" nodeid="1" votes="1">
      <fence>
        <method name="single">
          <device name="humano" nodename="vmprox160"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="vmprox161" nodeid="2" votes="1">
      <fence>
        <method name="single">
          <device name="humano" nodename="vmprox161"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="vmprox162" nodeid="3" votes="1">
      <fence>
        <method name="single">
          <device name="humano" nodename="vmprox162"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="3333"/>
  </rm>
</cluster>
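For completeness, this is roughly how I got that config in place (the usual PVE 3.x cluster.conf.new workflow, as far as I understand it; please correct me if there is a better way):
Code:
# copy the active config and edit the copy
# (remember to bump config_version, here to 28)
cp /etc/pve/cluster.conf /etc/pve/cluster.conf.new

# sanity-check the new file before activating it from the web GUI
# (Datacenter -> HA -> Activate)
ccs_config_validate -v -f /etc/pve/cluster.conf.new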
Status of the working cluster:
Code:
root@vmprox160:~# clustat
Cluster Status for cluster16x @ Sun Dec 22 12:18:32 2013
Member Status: Quorate
 Member Name                    ID   Status
 ------ ----                    ---- ------
 vmprox160                         1 Online, Local, rgmanager
 vmprox161                         2 Online, rgmanager
 vmprox162                         3 Online, rgmanager

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 pvevm:3333                     vmprox160                      started
root@vmprox160:~#
After having it working, I performed this test (with one machine, VM 3333, declared as HA); the exact commands are sketched just below the steps.
1.- STOP the virtual machine that was running Proxmox server 1 (vmprox160), so the server and VM 3333 disappeared from the cluster.
2.- Send a fence_ack_manual vmprox160 to signal the cluster that vmprox160 has been manually removed from the cluster, thus telling the other servers to STOP trying to fence vmprox160 (as seen in man fence_ack_manual).
3.- VM 3333 then automatically restarted on another node, as expected...
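For reference, the commands behind steps 2 and 3 were roughly these (the power-off of vmprox160 in step 1 is done on the physical host, so it is not shown):
Code:
# on a surviving node: tell the cluster that vmprox160 has been
# manually powered off, so fenced stops waiting for it to be fenced
fence_ack_manual vmprox160

# then watch the HA service fail over to another node
clustat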
The problem is that when I power on (start) vmprox160 again, I am not able to get it back into the fence domain.
It appears correctly in the cluster, it can be used from the web GUI, and it has its vote in the cluster:
Code:
root@vmprox160:~# pvecm s
Version: 6.2.0
Config Version: 28
Cluster Name: cluster16x
Cluster Id: 41800
Cluster Member: Yes
Cluster Generation: 268
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: vmprox160
Node ID: 1
Multicast addresses: 239.192.163.235
Node addresses: 192.168.200.160
root@vmprox160:~#
But it has not rejoined the fence domain:
Code:
root@vmprox160:~# fence_tool -n ls
root@vmprox160:~#                  <<<<<<<<<<<< NO ANSWER <<<<<<<<<<<
root@vmprox162:~# fence_tool -n ls
fence domain
member count  2
victim count  0
victim now    0
master nodeid 2
wait state    none
members       2 3

all nodes
nodeid 1 member 0 victim 0 last fence master 2 how override   <<<<<<<<<<<<<< ?????
nodeid 2 member 1 victim 0 last fence master 0 how none
nodeid 3 member 1 victim 0 last fence master 0 how none
root@vmprox162:~#
The rgmanager service appears as stopped.
Starting it from the web GUI or via /etc/init.d/rgmanager start gives an OK status, but it immediately appears as stopped again.
Code:
root@vmprox160:~# /etc/init.d/rgmanager start
Starting Cluster Service Manager: [ OK ]
root@vmprox160:~# ps -ef | grep rgm
root 3699 2623 0 11:49 pts/0 00:00:00 grep rgm
root@vmprox160:~#
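If logs are needed, I can pull the relevant lines with something like this (just the standard Debian log files on my nodes; tell me if there is a better place to look):
Code:
# look for rgmanager / fenced messages around the failed start
grep -iE 'rgmanager|fenced' /var/log/syslog | tail -n 40
grep -iE 'rgmanager|fenced' /var/log/daemon.log | tail -n 40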
Other tests performed:
Code:
root@vmprox160:~# fence_tool join
fence_tool: fenced not running, no lockfile
root@vmprox160:~#
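Unless someone tells me it is a bad idea, my next attempt will be a full restart of the cluster stack on vmprox160 (just a sketch of what I have in mind; I am not sure this is the recommended procedure):
Code:
# stop the resource manager first, then the cluster manager
/etc/init.d/rgmanager stop
/etc/init.d/cman stop

# bring them back up; cman should start fenced and rejoin the
# fence domain, and rgmanager should then stay running
/etc/init.d/cman start
/etc/init.d/rgmanager start

# verify
fence_tool ls
clustat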
Summary:
I cannot get the machine back into the fence domain.
Maybe something has been set in the cluster that does not allow vmprox160 to rejoin?
What is the meaning of the 'override' status at the end of the fence_tool ls output for this machine?
Basically this seems to be the same question asked in post #1 of this thread:
Did we miss something? Should we issue some more commands to the cluster saying that Node01 is back? Should we make the node leave the cluster/fence domain before it starts again?
Could you please provide an accurate guide on what one should do when a node fails and is recovered afterwards (in terms of cluster/fence domain management)?
Any hints?
Regards