Auto-migrate VM when a node's network fails or the node goes down

megap

Oct 1, 2014
Good morning to all.

I just configured a two-node cluster with HA, but I have a problem.

I have a VM (100) running on node 1 (gestion1). If I manually restart or shut down node 1, this VM is migrated to node 2 without any problem. It works from node 2 to node 1, too.
The VM is stored on LVM storage on top of DRBD.

RGManager is running on both nodes.

The problem I have is:

If I have a VM running on node 1 (or node 2) and I unplug the LAN cable or cut the power to the node, the VM is not migrated to the other node in the cluster. The VM goes down with the node.

My cluster.conf is:

Code:
<?xml version="1.0"?>
<cluster config_version="7" name="gestioncluster">
  <cman expected_votes="1" keyfile="/var/lib/pve-cluster/corosync.authkey" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_ilo" ipaddr="192.168.130.34" login="ADMIN" name="fenceA" passwd="ADMI$
    <fencedevice agent="fence_ilo" ipaddr="192.168.130.44" login="ADMIN" name="fenceB" passwd="ADMI$
  </fencedevices>
  <clusternodes>
    <clusternode name="gestion1" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device action="reboot" name="fenceA"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="gestion2" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device action="reboot" name="fenceB"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="100" recovery="relocate"/>
  </rm>
</cluster>

Proxmox version on both nodes:

Code:
pveversion -v
proxmox-ve-2.6.32: 3.2-136 (running kernel: 2.6.32-32-pve)
pve-manager: 3.3-1 (running version: 3.3-1/a06c9f73)
pve-kernel-2.6.32-32-pve: 2.6.32-136
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-1
pve-cluster: 3.0-15
qemu-server: 3.1-34
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-15
libpve-storage-perl: 3.0-23
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-5
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

I hope you can help me with this problem. If you need more info, I can paste it.
 
Hi again.

I disconnected node gestion1 from the network, and I can see this error on node gestion2 in the syslog tab:

Code:
Oct  7 14:01:00 gestion2 fence_ilo: Parse error: Ignoring unknown option 'nodename=gestion1
Oct  7 14:01:00 gestion2 fence_ilo: The command was not found or was not executable: /usr/bin/gnutls-cli.

Does anyone have any idea, please?
 
Your fencing will never work (by design) if you cut the network cable or switch the power off.

Note: fence_ilo needs power and network

Thanks for the reply dietmar.

Is there any way to do this (migrate a VM when the network cable is cut or the power is switched off)?

Thanks again.
 
Yes, use a reasonable fence device (for example APC power fencing, ...)
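
As a rough illustration of what power fencing might look like in the cluster.conf posted above, here is a minimal sketch. It assumes a single APC switched PDU driven by the fence_apc agent, with gestion1 plugged into outlet 1 and gestion2 into outlet 2. The PDU address, login, password, device name "apc" and outlet numbers are all hypothetical placeholders, not values from this thread, and config_version is assumed to be bumped from 7 to 8.

```xml
<?xml version="1.0"?>
<!-- Sketch only: PDU address, credentials and outlet numbers are placeholders. -->
<cluster config_version="8" name="gestioncluster">
  <cman expected_votes="1" keyfile="/var/lib/pve-cluster/corosync.authkey" two_node="1"/>
  <fencedevices>
    <!-- One PDU that stays reachable even when a node loses power or network -->
    <fencedevice agent="fence_apc" ipaddr="192.168.130.50" login="FENCE_USER" name="apc" passwd="FENCE_PASSWORD"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="gestion1" nodeid="1" votes="1">
      <fence>
        <method name="power">
          <!-- port = PDU outlet feeding gestion1 (assumed to be outlet 1) -->
          <device action="reboot" name="apc" port="1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="gestion2" nodeid="2" votes="1">
      <fence>
        <method name="power">
          <!-- port = PDU outlet feeding gestion2 (assumed to be outlet 2) -->
          <device action="reboot" name="apc" port="2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="100" recovery="relocate"/>
  </rm>
</cluster>
```

The point of the design is that the PDU sits outside the node being fenced, so the surviving node can still cut the outlet when the failed node has no power or network. A node with redundant power supplies would need every outlet feeding it listed in the same method.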

Hi,
Sorry to dig up this old post, but I have this problem right now, and I'm not sure I understand your answer, Dietmar.
To my mind, the principle of HA is to provide automatic failover if a machine in a cluster dies. So cutting the network is, IMHO, the same as unplugging the power cable, which is the same as a machine crashing: the machine is no longer available on the network, neither on the local network nor for fencing. So what is the difference between unplugging the network cable and a machine crashing, if the VMs don't migrate automatically?
 
Your fence device on works when the network is available.
Sorry? Thank you for responding quickly, Dietmar, but can you elaborate? I don't understand what you mean by "your fence device on works".
 
fence_ilo uses the IP protocol to fence the other node. If the network is down, that will fail.
 
Ah, I did not mention that I am using fence_ipmi, but it is the same thing. I really don't understand what fencing has to do with VMs not migrating when a node crashes. Sure, fencing needs the network; its goal is to detect through the network that a node is down, isn't it? To try to reboot it, or to take part in the decision that it is definitively down, and so allow the VMs to migrate to a live host. Isn't that the purpose of HA? But when this node is down, are you saying that fencing prevents the migration? Either I haven't understood what "HA" means, or we are not talking about the same thing. I'm lost...
 
The cluster stack uses the fence device to detect whether a node is really down. So if it cannot connect to the fence device, the cluster does not know that the node is really down. It will therefore not make any decisions, and the VMs will not migrate.
 
Well. It seems so incongruous to me that a system designed to make "High Availability" decisions when a host is down does nothing because... the host is down o_O
But OK, I probably don't have the same understanding of HA. No matter. Now I am trying to imagine what can be done in this situation: a host dies (say the motherboard burns out, or all power supplies burn out, and so on), so fencing is out. The machine cannot be repaired in a reasonable time. How can the VMs be moved to another host when we can't migrate them even manually, via the GUI or console? Is there a way, a procedure somewhere?
 
If a host dies (say the motherboard burns out, or all power supplies burn out, and so on), the remaining nodes will fence this node to make sure that it is really dead. This is essential, because otherwise there is a risk that a VM or CT runs on two nodes in parallel, and you will get corrupted data immediately.

But if you do not configure an independent fence device (like power fencing), the remaining nodes cannot fence the node in trouble, and therefore they cannot start the VM or CT.

What you suggest, just starting without fencing, is impossible by design. Hope it's clearer now.
 
