Hi,
I have a very simple 2-node cluster using DRBD as network RAID1 under Proxmox 3.4.
Currently I have 5 VMs running in that cluster.
Everything is fine: I can manually migrate the VMs from srv3ve to srv4ve and vice versa without issue.
If the VMs are running on node srv4ve and I manually cause a failure on that machine (a power blackout), the VMs are automatically migrated to the remaining node srv3ve without issue in less than 1 minute, which is very good!
In /var/log/messages on srv3ve I can see all the activity working as expected (see the attached file var-log-messages-srv3),
and the same in /var/log/cluster/rgmanager.log (see the attached file var-log-cluster-rgmanager-srv3).
The issue: if the VMs are running on srv3ve and the same failure is induced, the VMs remain anchored to srv3ve and rgmanager does not start them on srv4ve as in the previous situation.
/var/log/cluster/rgmanager.log on srv4ve reports the srv3ve failure but then nothing happens; it just freezes and no more entries are logged.
/var/log/cluster/rgmanager.log on srv4ve after the srv3ve failure:
+++++++++++++++++++++++++++++++++++++++++++
Oct 31 16:44:47 rgmanager State change: srv3ve DOWN
No more logs .....
++++++++++++++++++++++++++++++++++++++++++++
/var/log/messages on srv4ve after the srv3ve failure (see the attached file var-log-cluster-rgmanager-srv4):
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Oct 31 16:44:37 srv4ve kernel: tg3 0000:0c:00.1: eth1: Link is down
Oct 31 16:44:47 srv4ve corosync[3137]: [CLM ] Members Left:
Oct 31 16:44:47 srv4ve corosync[3137]: [CLM ] Members Joined:
Oct 31 16:44:47 srv4ve corosync[3137]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 31 16:44:47 srv4ve rgmanager[3517]: State change: srv3ve DOWN
Oct 31 16:44:47 srv4ve corosync[3137]: [CPG ] chosen downlist: sender r(0) ip(130.107.1.229) ; members(old:2 left:1)
Oct 31 16:44:47 srv4ve corosync[3137]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 31 16:44:47 srv4ve fenced[3221]: fencing node srv3ve
Oct 31 16:44:48 srv4ve fence_ilo4: Parse error: Ignoring unknown option 'nodename=srv3ve
Oct 31 16:44:55 srv4ve kernel: block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Oct 31 16:44:55 srv4ve kernel: block drbd0: new current UUID AB381FADC047C11D:CE62294FE6D56FC3D66792D83296C6FD65792D83296C6F
Oct 31 16:45:34 srv4ve fenced[3221]: fencing node srv3ve
Oct 31 16:45:34 srv4ve fence_ilo4: Parse error: Ignoring unknown option 'nodename=srv3ve
Oct 31 16:45:57 srv4ve fence_ilo4: Parse error: Ignoring unknown option 'nodename=srv3ve
Oct 31 16:46:20 srv4ve fence_ilo4: Parse error: Ignoring unknown option 'nodename=srv3ve
Oct 31 16:46:44 srv4ve fence_ilo4: Parse error: Ignoring unknown option 'nodename=srv3ve
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
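The repeated "fencing node srv3ve" lines suggest that fenced keeps retrying and never reports a successful fence, so rgmanager may simply be waiting for fencing to complete before recovering the services. In case it is useful, this is roughly how the fence agent could be tested by hand against the srv3ve iLO, reusing the parameters from my cluster.conf (just a sketch; the option names are the usual fence_ilo4/fence_ipmilan ones and may differ between fence-agents versions):
+++++++++++++++++++++++++++++++++++++++++++
# query the power status of srv3ve's iLO (fenceNodeA in cluster.conf)
fence_ilo4 -a 139.107.1.226 -l root -p master123 --lanplus -o status

# same call the way fenced makes it, passing key=value options on stdin
printf 'ipaddr=139.107.1.226\nlogin=root\npasswd=master123\nlanplus=1\naction=status\n' | fence_ilo4
+++++++++++++++++++++++++++++++++++++++++++
If the iLO itself loses power during the blackout test, both calls would fail, which would match fencing never completing.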
The funny thing is that only when srv3ve becomes functional again (which can take a long time) are the VMs automatically relocated to srv4ve by rgmanager.
/var/log/cluster/rgmanager.log after srv3ve comes back (in this example, after 6 minutes):
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Oct 31 16:44:47 rgmanager State change: srv3ve DOWN
Oct 31 16:50:13 rgmanager Starting stopped service pvevm:202
Oct 31 16:50:13 rgmanager Starting stopped service pvevm:303
Oct 31 16:50:13 rgmanager Starting stopped service pvevm:203
Oct 31 16:50:13 rgmanager Starting stopped service pvevm:205
Oct 31 16:50:14 rgmanager Starting stopped service pvevm:301
Oct 31 16:50:14 rgmanager [pvevm] Move config for VM 202 to local node
Oct 31 16:50:14 rgmanager [pvevm] Move config for VM 303 to local node
Oct 31 16:50:14 rgmanager [pvevm] Move config for VM 203 to local node
Oct 31 16:50:14 rgmanager [pvevm] Move config for VM 205 to local node
Oct 31 16:50:14 rgmanager [pvevm] Move config for VM 301 to local node
Oct 31 16:50:15 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:15 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:15 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:15 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:15 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:15 rgmanager Service pvevm:202 started
Oct 31 16:50:16 rgmanager Service pvevm:303 started
Oct 31 16:50:16 rgmanager Service pvevm:203 started
Oct 31 16:50:16 rgmanager State change: srv3ve UP
Oct 31 16:50:16 rgmanager Service pvevm:205 started
Oct 31 16:50:17 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:18 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:19 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:20 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:21 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:21 rgmanager Service pvevm:301 started
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
/var/log/messages on srv4ve after srv3ve comes back (6-minute blackout; see the attached file var-log-messages-srv4):
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Oct 31 16:47:29 srv4ve kernel: tg3 0000:0c:00.1: eth1: Link is up at 1000 Mbps, full duplex
Oct 31 16:49:22 srv4ve kernel: block drbd0: Handshake successful: Agreed network protocol version 96
Oct 31 16:49:59 srv4ve corosync[3137]: [CPG ] chosen downlist: sender r(0) ip(130.107.1.228) ; members(old:1 left:0)
Oct 31 16:49:59 srv4ve corosync[3137]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 31 16:50:13 srv4ve rgmanager[3517]: Starting stopped service pvevm:202
Oct 31 16:50:13 srv4ve rgmanager[3517]: Starting stopped service pvevm:303
Oct 31 16:50:13 srv4ve rgmanager[3517]: Starting stopped service pvevm:203
Oct 31 16:50:13 srv4ve rgmanager[3517]: Starting stopped service pvevm:205
Oct 31 16:50:14 srv4ve rgmanager[3517]: Starting stopped service pvevm:301
Oct 31 16:50:14 srv4ve rgmanager[9228]: [pvevm] Move config for VM 202 to local node
Oct 31 16:50:14 srv4ve task UPID:srv4ve:00002421:0004A859:5817BC96:qmstart:202:root@pam:: start VM
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
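Just in case it matters, this is how the DRBD side can be checked on the surviving node while the peer is down (a minimal sketch; the resource name r0 is an assumption, it should be whatever resource backs drbd0):
+++++++++++++++++++++++++++++++++++++++++++
# overall DRBD status: connection state, roles, disk states
cat /proc/drbd

# per-resource queries (replace r0 with the real resource name)
drbdadm cstate r0
drbdadm role r0
+++++++++++++++++++++++++++++++++++++++++++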
My cluster.conf:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
<?xml version="1.0"?>
<cluster config_version="22" name="U-CLUSTER1">
<cman expected_votes="1" keyfile="/var/lib/pve-cluster/corosync.authkey" two_node="1"/>
<clusternodes>
<clusternode name="srv3ve" nodeid="1" votes="1">
<fence>
<method name="1">
<device action="reboot" name="fenceNodeA"/>
</method>
</fence>
</clusternode>
<clusternode name="srv4ve" nodeid="2" votes="1">
<fence>
<method name="1">
<device action="reboot" name="fenceNodeB"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_ilo4" cipher="3" ipaddr="139.107.1.226" lanplus="1" login="root" name="fenceNodeA" passwd="master123" power_wait="5"/>
<fencedevice agent="fence_ilo4" cipher="3" ipaddr="139.107.1.227" lanplus="1" login="root" name="fenceNodeB" passwd="master123" power_wait="5"/>
</fencedevices>
<rm>
<failoverdomains>
<failoverdomain name="cluster1" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="srv3ve" priority="1"/>
<failoverdomainnode name="srv4ve" priority="1"/>
</failoverdomain>
</failoverdomains>
<pvevm domain="cluster1" autostart="1" vmid="202" recovery="relocate"/>
<pvevm domain="cluster1" autostart="1" vmid="303" recovery="relocate"/>
<pvevm domain="cluster1" autostart="1" vmid="203" recovery="relocate"/>
<pvevm domain="cluster1" autostart="1" vmid="205" recovery="relocate"/>
<pvevm domain="cluster1" autostart="1" vmid="301" recovery="relocate"/>
</rm>
</cluster>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
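With the above cluster.conf, fencing can also be exercised through the cluster stack itself, which should use the same fenceNodeA/fenceNodeB definitions (again only a sketch of what could be run from srv4ve while srv3ve is safe to reboot):
+++++++++++++++++++++++++++++++++++++++++++
# ask the cluster to fence srv3ve using the configured fence device
fence_node -v srv3ve

# show the fence domain members and whether a fence operation is still pending
fence_tool ls
+++++++++++++++++++++++++++++++++++++++++++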
My clustat output:
++++++++++++++++++++++++++++++++++++++++++++++++++
# clustat
Cluster Status for U-CLUSTER1 @ Mon Oct 31 16:36:17 2016
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
srv3ve 1 Online, Local, rgmanager
srv4ve 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
pvevm:202 srv3ve started
pvevm:203 srv3ve started
pvevm:205 srv3ve started
pvevm:301 srv3ve started
pvevm:303 srv3ve started
+++++++++++++++++++++++++++++++++++++++++++++++++++
My pvecm status output:
+++++++++++++++++++++++++++++++++++++++++++++++
# pvecm status
Version: 6.2.0
Config Version: 22
Cluster Name: U-CLUSTER1
Cluster Id: 2009
Cluster Member: Yes
Cluster Generation: 272
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1
Active subsystems: 6
Flags: 2node
Ports Bound: 0 177
Node name: srv3ve
Node ID: 1
Multicast addresses: 239.192.7.224
Node addresses: 139.107.1.228
+++++++++++++++++++++++++++++++++++++++++++++
Can anybody help me with this strange behavior, please?
Thanks in advance,
Alfredo