Proxmox 3.4 - 2 Node Cluster: rgmanager doesn't automatically start the VM(s) after node failure

Alfredo Salas

Oct 31, 2016
Hi,

I have a very simple 2-node cluster using DRBD as network RAID1 under Proxmox 3.4.
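
For context, this is roughly how the DRBD side can be checked on either node (the resource name r0 is just an example; it may be named differently in your setup):

+++++++++++++++++++++++++++++++++++++++++++

# Overall DRBD state: connection, roles and disk states of the mirror
cat /proc/drbd

# Per-resource role and connection state (resource name is an example)
drbdadm role r0
drbdadm cstate r0

+++++++++++++++++++++++++++++++++++++++++++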

Currently I have 5 VMs running in that cluster.

Everything is fine: I can manually migrate the VMs from srv3ve to srv4ve and vice versa without issue.
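
For reference, relocating one of these services from the shell looks roughly like this (VM 202 picked as an example):

+++++++++++++++++++++++++++++++++++++++++++

# Relocate the HA-managed VM 202 to the other node via rgmanager
clusvcadm -r pvevm:202 -m srv4ve

# Verify where the service ended up
clustat

+++++++++++++++++++++++++++++++++++++++++++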

If the VMs are running on node srv4ve and I manually cause a failure on that machine (electrical blackout), the VMs are automatically migrated to the remaining node srv3ve in less than one minute, which is very good!

In /var/log/messages on srv3ve I can see all the activity working as expected (attached as var-log-messages-srv3), and likewise in /var/log/cluster/rgmanager.log (attached as var-log-cluster-rgmanager-srv3).

The issue: if the VMs are running on srv3ve and I induce the same failure there, the VMs remain anchored to srv3ve and rgmanager does not start them on srv4ve as it did in the previous scenario.

/var/log/cluster/rgmanager.log on srv4ve reports the srv3ve failure but then does nothing; it just freezes and no further log entries are written.

/var/log/cluster/rgmanager.log in srv4ve after srv3ve failure:

+++++++++++++++++++++++++++++++++++++++++++

Oct 31 16:44:47 rgmanager State change: srv3ve DOWN

No more logs .....

++++++++++++++++++++++++++++++++++++++++++++

/var/log/messages on srv4ve after the srv3ve failure (attached as var-log-cluster-rgmanager-srv4):

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Oct 31 16:44:37 srv4ve kernel: tg3 0000:0c:00.1: eth1: Link is down


Oct 31 16:44:47 srv4ve corosync[3137]: [CLM ] Members Left:
Oct 31 16:44:47 srv4ve corosync[3137]: [CLM ] Members Joined:
Oct 31 16:44:47 srv4ve corosync[3137]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 31 16:44:47 srv4ve rgmanager[3517]: State change: srv3ve DOWN
Oct 31 16:44:47 srv4ve corosync[3137]: [CPG ] chosen downlist: sender r(0) ip(130.107.1.229) ; members(old:2 left:1)
Oct 31 16:44:47 srv4ve corosync[3137]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 31 16:44:47 srv4ve fenced[3221]: fencing node srv3ve
Oct 31 16:44:48 srv4ve fence_ilo4: Parse error: Ignoring unknown option 'nodename=srv3ve
Oct 31 16:44:55 srv4ve kernel: block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Oct 31 16:44:55 srv4ve kernel: block drbd0: new current UUID AB381FADC047C11D:CE62294FE6D56FC3:DD66792D83296C6F:DD65792D83296C6F

Oct 31 16:45:34 srv4ve fenced[3221]: fencing node srv3ve
Oct 31 16:45:34 srv4ve fence_ilo4: Parse error: Ignoring unknown option 'nodename=srv3ve
Oct 31 16:45:57 srv4ve fence_ilo4: Parse error: Ignoring unknown option 'nodename=srv3ve
Oct 31 16:46:20 srv4ve fence_ilo4: Parse error: Ignoring unknown option 'nodename=srv3ve
Oct 31 16:46:44 srv4ve fence_ilo4: Parse error: Ignoring unknown option 'nodename=srv3ve

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
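
Given those repeated fence_ilo4 lines, this is roughly how the fence device for srv3ve can be exercised by hand, outside of fenced (address and credentials taken from my cluster.conf below; exact options may differ per fence-agents version):

+++++++++++++++++++++++++++++++++++++++++++

# Ask the srv3ve iLO for its power status directly through the fence agent
fence_ilo4 -a 139.107.1.226 -l root -p master123 --lanplus -o status

# Or drive a full fence attempt through the cluster stack
fence_node srv3ve

+++++++++++++++++++++++++++++++++++++++++++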



The funny thing is that only when srv3ve becomes functional again (which can take a long time) are the VMs automatically migrated to srv4ve by rgmanager.

/var/log/cluster/rgmanager.log on srv4ve after srv3ve comes back (in this example after 6 minutes):


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Oct 31 16:44:47 rgmanager State change: srv3ve DOWN
Oct 31 16:50:13 rgmanager Starting stopped service pvevm:202
Oct 31 16:50:13 rgmanager Starting stopped service pvevm:303
Oct 31 16:50:13 rgmanager Starting stopped service pvevm:203
Oct 31 16:50:13 rgmanager Starting stopped service pvevm:205
Oct 31 16:50:14 rgmanager Starting stopped service pvevm:301
Oct 31 16:50:14 rgmanager [pvevm] Move config for VM 202 to local node
Oct 31 16:50:14 rgmanager [pvevm] Move config for VM 303 to local node
Oct 31 16:50:14 rgmanager [pvevm] Move config for VM 203 to local node
Oct 31 16:50:14 rgmanager [pvevm] Move config for VM 205 to local node
Oct 31 16:50:14 rgmanager [pvevm] Move config for VM 301 to local node
Oct 31 16:50:15 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:15 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:15 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:15 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:15 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:15 rgmanager Service pvevm:202 started
Oct 31 16:50:16 rgmanager Service pvevm:303 started
Oct 31 16:50:16 rgmanager Service pvevm:203 started
Oct 31 16:50:16 rgmanager State change: srv3ve UP
Oct 31 16:50:16 rgmanager Service pvevm:205 started
Oct 31 16:50:17 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:18 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:19 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:20 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:21 rgmanager [pvevm] Task still active, waiting
Oct 31 16:50:21 rgmanager Service pvevm:301 started

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

/var/log/messages on srv4ve after srv3ve comes back (6-minute blackout; attached as var-log-messages-srv4):

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Oct 31 16:47:29 srv4ve kernel: tg3 0000:0c:00.1: eth1: Link is up at 1000 Mbps, full duplex
Oct 31 16:49:22 srv4ve kernel: block drbd0: Handshake successful: Agreed network protocol version 96
Oct 31 16:49:59 srv4ve corosync[3137]: [CPG ] chosen downlist: sender r(0) ip(130.107.1.228) ; members(old:1 left:0)
Oct 31 16:49:59 srv4ve corosync[3137]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 31 16:50:13 srv4ve rgmanager[3517]: Starting stopped service pvevm:202
Oct 31 16:50:13 srv4ve rgmanager[3517]: Starting stopped service pvevm:303
Oct 31 16:50:13 srv4ve rgmanager[3517]: Starting stopped service pvevm:203
Oct 31 16:50:13 srv4ve rgmanager[3517]: Starting stopped service pvevm:205
Oct 31 16:50:14 srv4ve rgmanager[3517]: Starting stopped service pvevm:301
Oct 31 16:50:14 srv4ve rgmanager[9228]: [pvevm] Move config for VM 202 to local node
Oct 31 16:50:14 srv4ve task UPID:srv4ve:00002421:0004A859:5817BC96:qmstart:202:root@pam:: start VM

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

My cluster.conf

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

<?xml version="1.0"?>
<cluster config_version="22" name="U-CLUSTER1">
  <cman expected_votes="1" keyfile="/var/lib/pve-cluster/corosync.authkey" two_node="1"/>
  <clusternodes>
    <clusternode name="srv3ve" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device action="reboot" name="fenceNodeA"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="srv4ve" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device action="reboot" name="fenceNodeB"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_ilo4" cipher="3" ipaddr="139.107.1.226" lanplus="1" login="root" name="fenceNodeA" passwd="master123" power_wait="5"/>
    <fencedevice agent="fence_ilo4" cipher="3" ipaddr="139.107.1.227" lanplus="1" login="root" name="fenceNodeB" passwd="master123" power_wait="5"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="cluster1" nofailback="1" ordered="0" restricted="1">
        <failoverdomainnode name="srv3ve" priority="1"/>
        <failoverdomainnode name="srv4ve" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <pvevm domain="cluster1" autostart="1" vmid="202" recovery="relocate"/>
    <pvevm domain="cluster1" autostart="1" vmid="303" recovery="relocate"/>
    <pvevm domain="cluster1" autostart="1" vmid="203" recovery="relocate"/>
    <pvevm domain="cluster1" autostart="1" vmid="205" recovery="relocate"/>
    <pvevm domain="cluster1" autostart="1" vmid="301" recovery="relocate"/>
  </rm>
</cluster>

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
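
For completeness (assuming the standard Proxmox 3.x workflow of editing /etc/pve/cluster.conf.new and activating it from the GUI), a rough way to sanity-check the XML before activation, if the stock redhat-cluster tools are installed, is:

+++++++++++++++++++++++++++++++++++++++++++

# Work on a copy of the live config
cp /etc/pve/cluster.conf /etc/pve/cluster.conf.new

# Validate the edited copy before activating it from the GUI (HA tab)
ccs_config_validate -f /etc/pve/cluster.conf.new

+++++++++++++++++++++++++++++++++++++++++++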

My clustat output:

++++++++++++++++++++++++++++++++++++++++++++++++++
# clustat
Cluster Status for U-CLUSTER1 @ Mon Oct 31 16:36:17 2016
Member Status: Quorate

Member Name                      ID   Status
------ ----                      ---- ------
srv3ve                              1 Online, Local, rgmanager
srv4ve                              2 Online, rgmanager

Service Name            Owner (Last)            State
------- ----            ----- ------            -----
pvevm:202               srv3ve                  started
pvevm:203               srv3ve                  started
pvevm:205               srv3ve                  started
pvevm:301               srv3ve                  started
pvevm:303               srv3ve                  started

+++++++++++++++++++++++++++++++++++++++++++++++++++

My pvecm status output:

+++++++++++++++++++++++++++++++++++++++++++++++

# pvecm status
Version: 6.2.0
Config Version: 22
Cluster Name: U-CLUSTER1
Cluster Id: 2009
Cluster Member: Yes
Cluster Generation: 272
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1
Active subsystems: 6
Flags: 2node
Ports Bound: 0 177
Node name: srv3ve
Node ID: 1
Multicast addresses: 239.192.7.224
Node addresses: 139.107.1.228
+++++++++++++++++++++++++++++++++++++++++++++

Can anybody help me with this strange behavior, please?

Thanks in advance,

Alfredo
 
