HA errors and odd behaviour when adding servers to HA Cluster

webstaff

Apr 23, 2012

Hi everyone. First things first: I found Proxmox VE three weeks ago, and thumbs up :)

But I have some issues with the HA side of things.

Namely:

1. After a physical machine restart I have to stop cman -> cron, restart pve-cluster, then start cron -> cman before rgmanager will start. Not a big issue, but I'm wondering why, and I also suspect it might have something to do with issue 2.

This is the error when starting rgmanager straight after a restart if I don't do the above:

Starting Cluster Service Manager: [FAILED]
TASK ERROR: command '/etc/init.d/rgmanager start' failed: exit code 1
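
For reference, the manual sequence I end up running after a reboot looks roughly like this (I normally do it through the GUI's services panel, so this is just a sketch using the init scripts):

/etc/init.d/cman stop             # stop the cluster manager first
/etc/init.d/cron stop
/etc/init.d/pve-cluster restart   # restart the Proxmox cluster filesystem
/etc/init.d/cron start
/etc/init.d/cman start
/etc/init.d/rgmanager start       # now it starts without the FAILED error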


2. HA itself works fine and rgmanager moves KVM guests without issue, but if I create a new VM and then add it to HA management it just does nothing: I get 255, 254 and 250 errors on migration, HA migration and VM start. If I stop the guest after it has been added to HA, it will not restart. This leaves me two options: 1. take the VM back out of cluster.conf, after which it starts without issue and can be migrated (offline and online); or 2. shut down the whole HA cluster (all VMs and nodes), start everything again and go through the stop/start cycle from item 1, after which everything works as expected, i.e. the machine is HA managed and starts and stops as expected.

Error when trying to start the newly added VM:

Apr 23 17:48:39 VMS-BC-AM-003 pvedaemon[2462]: <root@pam> end task UPID:VMS-BC-AM-003:00003B95:0002CBD0:4F9587E6:hastart:106:root@pam: command 'clusvcadm -e pvevm:106 -m VMS-BC-AM-003' failed: exit code 254

And the HA migrate error:

Executing HA migrate for VM 106 to node VMS-BC-AM-001
Trying to migrate pvevm:106 to VMS-BC-AM-001...Temporary failure; try again
TASK ERROR: command 'clusvcadm -M pvevm:106 -m VMS-BC-AM-001' failed: exit code 250

2 x Openfiler cluster
iSCSI target, single LUN -> LVM on top
3 x servers as nodes (2 x G5 + 1 x Dell SC1435, all fenced; see cluster.conf)
Latest Proxmox VE, updated yesterday



Network config for server 1 (all are the same bar the IPs):
------------------------------------------------------------
# Network interface settings
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
        address 10.0.0.30
        netmask 255.255.255.0

auto eth0.8
iface eth0.8 inet static
        address 10.0.2.31
        netmask 255.255.255.0

auto eth0.10
iface eth0.10 inet manual

auto eth0.7
iface eth0.7 inet manual

auto eth1
iface eth1 inet static
        address 10.0.0.31
        netmask 255.255.255.0

auto eth1.8
iface eth1.8 inet static
        address 10.0.2.32
        netmask 255.255.255.0

auto eth1.10
iface eth1.10 inet manual

auto eth1.7
iface eth1.7 inet manual

auto vmbr0
iface vmbr0 inet static
        address 10.1.0.30
        netmask 255.255.255.0
        gateway 10.1.0.1
        bridge_ports eth0.10
        bridge_stp off
        bridge_fd 0

auto vmbr1
iface vmbr1 inet static
        address 10.1.0.31
        netmask 255.255.255.0
        bridge_ports eth1.10
        bridge_stp off
        bridge_fd 0

auto vmbr0.7
iface vmbr0.7 inet manual
        bridge_ports eth0.7
        bridge_stp off
        bridge_fd 0

auto vmbr1.7
iface vmbr1.7 inet manual
        bridge_ports eth1.7
        bridge_stp off
        bridge_fd 0
--------------------------------------------------
Cluster.conf

<?xml version="1.0"?>
<cluster config_version="48" name="bollicomp">
  <logging debug="on"/>
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <fencedevice agent="fence_ilo" ipaddr="10.0.2.30" login="root" name="node1" passwd="xxxx"/>
    <fencedevice agent="fence_ipmilan" ipaddr="10.0.2.40" lanplus="1" login="root" name="node2" passwd="xxxx" power_wait="5"/>
    <fencedevice agent="fence_ilo" ipaddr="10.0.2.50" login="root" name="node3" passwd="xxxx"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="VMS-BC-AM-001" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="node1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="VMS-BC-AM-002" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="node2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="VMS-BC-AM-003" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="node3"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <service autostart="1" exclusive="0" name="TestIP" recovery="relocate">
      <ip address="10.1.0.60"/>
    </service>
    <pvevm autostart="1" vmid="102"/>
    <pvevm autostart="1" vmid="105"/>
    <pvevm autostart="1" vmid="103"/>
    <pvevm autostart="1" vmid="104"/>
    <pvevm autostart="1" vmid="106"/>
  </rm>
</cluster>

-------------------------------------
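For what it's worth, whenever I change cluster.conf I bump config_version first; to sanity-check and push a change out by hand, the cman tools have (as far as I know) these two commands, which is just a sketch of what I believe the GUI activation step does:

ccs_config_validate     # validate the XML against the cluster schema
cman_tool version -r    # tell cman to pick up and distribute the new config_version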
The error is easily recreatable, i.e. every time I add a machine to HA.

Any ideas? A known bug maybe?

Regards

Dave Webster.

P.S. My Linux skills are not perfect, so this might simply be a lack of skill on my part; please bear this in mind :) If you need any further info, please let me know.
 
There should be a second log in the task list, including details on why the migrate fails.


As per the original post, the migration error:

Executing HA migrate for VM 106 to node VMS-BC-AM-002
Trying to migrate pvevm:106 to VMS-BC-AM-002...Temporary failure; try again
TASK ERROR: command 'clusvcadm -M pvevm:106 -m VMS-BC-AM-002' failed: exit code 250

The start error:

Executing HA start for VM 106
Member VMS-BC-AM-003 trying to enable pvevm:106...Aborted; service failed
TASK ERROR: command 'clusvcadm -e pvevm:106 -m VMS-BC-AM-003' failed: exit code 254

I'm presuming this is what you mean?

As per the original post, if the VM is taken out of cluster.conf I can move it to another node, start it without issue and live-migrate it without any issue or error, but once it is HA managed again I get the same errors as above.

If I want to get it to work I can shut down all the physical nodes and bring up the services one by one on each node, and then it works; but cycling the single node 003 doesn't do it. So what is the system doing differently when I take the whole cluster offline compared to a single node?

Regards

Dave
 
Bump! Page timed out ("please log back in") and I lost my nice reply with all the extra logs and info on quorum and the cluster log. Three times now; you'd think I'd learn, lol.

root@VMS-BC-AM-003:~# clustat
Cluster Status for bollicomp @ Wed Apr 25 12:58:58 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
VMS-BC-AM-001 1 Online, rgmanager
VMS-BC-AM-002 2 Online, rgmanager
VMS-BC-AM-003 3 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
pvevm:102 VMS-BC-AM-002 started
pvevm:103 VMS-BC-AM-002 started
pvevm:104 VMS-BC-AM-001 started
pvevm:105 VMS-BC-AM-002 started
pvevm:106 (VMS-BC-AM-003) failed
service:TestIP VMS-BC-AM-001 started

root@VMS-BC-AM-003:~# cman_tool status
Version: 6.2.0
Config Version: 48
Cluster Name: bollicomp
Cluster Id: 52910
Cluster Member: Yes
Cluster Generation: 320
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 6
Flags:
Ports Bound: 0 177
Node name: VMS-BC-AM-003
Node ID: 3
Multicast addresses: 239.192.206.125
Node addresses: 10.1.0.50

root@VMS-BC-AM-003:~# corosync-quorumtool -s
Version: 1.4.3
Nodes: 3
Ring ID: 320
Quorum type: quorum_cman
Quorate: Yes

root@VMS-BC-AM-003:~# corosync-quorumtool -l
Nodeid Name
1 10.1.0.30
2 10.1.0.40
3 VMS-BC-AM-003.bollicomp.com


Does that look right above? Should it not be DNS names for all the node IDs? Does it matter?
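
For reference, this is roughly what I'd expect the /etc/hosts entries on each node to look like if names were being used everywhere (IPs taken from the outputs above; 10.1.0.x is what cman reports as the node addresses):

10.1.0.30   VMS-BC-AM-001.bollicomp.com   VMS-BC-AM-001
10.1.0.40   VMS-BC-AM-002.bollicomp.com   VMS-BC-AM-002
10.1.0.50   VMS-BC-AM-003.bollicomp.com   VMS-BC-AM-003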

Dave
 
Seems service 'pvevm:106' is in failed state:

> pvevm:106 (VMS-BC-AM-003) failed

You need to disable that service with 'clusvcadm' first.

from 'man rgmanager':

disable - stop the service and place into the disabled state. This is
the only permissible operation when a service is in the failed state.

So you need to exec:

# clusvcadm -d pvevm:106

After that you should be able to start again.
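
So the full sequence would be something like this (node name taken from your clustat output):

# clusvcadm -d pvevm:106
# clusvcadm -e pvevm:106 -m VMS-BC-AM-003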
 
I had already tried this, and tried again just now: the task completes without error, and then it goes back to:

Executing HA start for VM 106
Member VMS-BC-AM-003 trying to enable pvevm:106...Aborted; service failed
TASK ERROR: command 'clusvcadm -e pvevm:106 -m VMS-BC-AM-003' failed: exit code 254

:(

Regards

Dave
 

OK, interesting: I just edited /etc/pve/cluster.conf to remove the debug line, as it was just filling my logs and not providing any info. After I saved it the VM started??? No other changes to the system...

Thoughts?

I'm just going to see if I can recreate the issue with another VM.

Dave
 

OK, I created a new Windows 2k8 R2 machine, then added it to HA while the machine was running and activated the changes; HA moved it from node 001 to node 003 and started it, as it always does.

I then went to migrate the guest back to node 001 and got an error again:

Executing HA migrate for VM 105 to node VMS-BC-AM-001
Trying to migrate pvevm:105 to VMS-BC-AM-001...Target node dead / nonexistent
TASK ERROR: command 'clusvcadm -M pvevm:105 -m VMS-BC-AM-001' failed: exit code 244
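
(The "Target node dead / nonexistent" message suggests clusvcadm could not see rgmanager on node 001 at that moment; the only quick check I know of is clustat, which should still list 001 as "Online, rgmanager" like in the output further up:)

# clustat | grep VMS-BC-AM-001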

I then checked node 001 and noticed straight away that its resource usage, which should have been nothing as it has no VMs listed on it, was showing exactly the usage of the machine that had just been moved by HA to node 003. So I now have two instances of the machine running: one visible to Proxmox and one that's not.
I can see both machines are live, which is most worrying. I'm just going to investigate some more now.

Can anyone suggest what I should be looking out for? I followed all the documentation to the letter, so to have such a fundamental issue like this is rather worrying.

OK, so this is just a little weird behaviour. Unlike the last 5 VMs I've created, this VM is sort of behaving as expected: it will migrate from node 003, where HA moved it, to node 002, which is the first time the system has let me do that without a lot of messing around. However, it won't migrate to node 001, which I'm 100% sure is running a copy of the machine somewhere on itself that isn't shown by the web GUI. I can't see this being behaviour Proxmox wants in their system, so before I reboot the box, is there anything the Proxmox team want me to do to see why this has occurred? Or shall I just bash on with testing some more VMs?
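
Before I do reboot it, the quickest check I can think of is to compare what Proxmox believes is running on node 001 with the raw process list there (a rough check; the grep just looks for any leftover kvm process):

# qm list
# ps aux | grep kvm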

Regards

Dave
 