Auto-migrate VM when a node loses network or goes down

megap · Oct 7, 2014

Good morning to all.

I just configured a two-node cluster with HA, but I have a problem.

I have a VM (100) running on node 1 (gestion1). If I manually restart or shut down node 1, this VM is migrated to node 2 without any problem; it works from node 2 to node 1, too.
The VM is configured on an LVM data storage backed by DRBD.

RGManager is running on both nodes.

The problem I have is:

If I have a VM running on node 1 (or node 2) and I unplug the LAN cable or cut the power to the node, the VM is not migrated to the other node of the cluster. The VM goes down with the node.

My cluster.conf is:

Code:

<?xml version="1.0"?>
<cluster config_version="7" name="gestioncluster">
  <cman expected_votes="1" keyfile="/var/lib/pve-cluster/corosync.authkey" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_ilo" ipaddr="192.168.130.34" login="ADMIN" name="fenceA" passwd="ADMI$
    <fencedevice agent="fence_ilo" ipaddr="192.168.130.44" login="ADMIN" name="fenceB" passwd="ADMI$
  </fencedevices>
  <clusternodes>
    <clusternode name="gestion1" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device action="reboot" name="fenceA"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="gestion2" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device action="reboot" name="fenceB"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="100" recovery="relocate"/>
  </rm>
</cluster>

Proxmox version on both nodes:

Code:

pveversion -v
proxmox-ve-2.6.32: 3.2-136 (running kernel: 2.6.32-32-pve)
pve-manager: 3.3-1 (running version: 3.3-1/a06c9f73)
pve-kernel-2.6.32-32-pve: 2.6.32-136
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-1
pve-cluster: 3.0-15
qemu-server: 3.1-34
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-15
libpve-storage-perl: 3.0-23
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-5
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

I hope you can help me with this problem. If you need more info, I can paste it.

megap · Oct 7, 2014

Hi again.

I disconnected node gestion1 from the network, and I can see this error on node gestion2 in the syslog tab:

Code:

Oct  7 14:01:00 gestion2 fence_ilo: Parse error: Ignoring unknown option 'nodename=gestion1
Oct  7 14:01:00 gestion2 fence_ilo: The command was not found or was not executable: /usr/bin/gnutls-cli.

Does anyone have any idea, please?

dietmar · Oct 7, 2014

megap said:
Does anyone have any idea, please?

Your fencing will never work (by design) if you cut network cable or switch power off.

Note: fence_ilo needs power and network

megap · Oct 7, 2014

dietmar said:
Your fencing will never work (by design) if you cut network cable or switch power off.

Note: fence_ilo needs power and network

Thanks for the reply dietmar.

And is there any way to do this? (Migrate a VM when the network cable is cut or the power is switched off.)

Thanks again.

dietmar · Oct 7, 2014

megap said:
And is there any way to do this? (Migrate a VM when the network cable is cut or the power is switched off.)

Yes, use a reasonable fence device (for example APC power fencing, ...)

fsoy · Dec 5, 2014

dietmar said:
Yes, use a reasonable fence device (for example APC power fencing, ...)

Hi,
sorry to dig up this post, but I currently have this problem, and I'm not sure I understand your answer, Dietmar.
In my mind, the principle of HA is to provide an automatic failover solution if a machine in a cluster dies. So cutting the network is, IMHO, the same as unplugging the power cable, which is the same as a machine crashing: it is no longer available on the network. Neither on the local network nor, by the way, for fencing. So: what is the difference between unplugging the network cable and a machine crashing, if the VMs don't migrate automatically?

dietmar · Dec 5, 2014

fsoy said:
So: what is the difference between unplugging the network cable and a machine crashing, if the VMs don't migrate automatically?

Your fence device on works when the network is available.

fsoy · Dec 5, 2014

dietmar said:
Your fence device on works when the network is available.

Sorry? Thank you for responding quickly, Dietmar, but can you elaborate? I don't understand what you mean by "your fence device on works".

dietmar · Dec 5, 2014

fsoy said:
Sorry? Thank you for responding quickly, Dietmar, but can you elaborate? I don't understand what you mean by "your fence device on works".

fence_ilo uses the IP protocol to fence the other node. If the network is down, that will fail.

fsoy · Dec 5, 2014

Ah, I did not mention that mine is fence_ipmi, but it is the same thing. I really don't understand what fencing has to do with VMs not migrating when a node crashes. Sure, fencing needs the network; its goal is to detect through the network that a node is down, isn't it? To try to reboot it, or to take part in the decision that it is effectively, definitively down. And allowing VMs to migrate to a live host, isn't that the purpose of HA? But when this node is down, are you saying that... fencing prevents the migration? Either I haven't understood what "HA" means, or we are not talking about the same thing... I'm lost...

dietmar · Dec 6, 2014

fsoy said:
But when this node is down, are you saying that... fencing prevents the migration? Either I haven't understood what "HA" means, or we are not talking about the same thing... I'm lost...

The cluster stack uses the fence device to determine whether a node is really down. So if it cannot connect to the fence device, the cluster does not know that the node is really down. So it will make no decisions, and the VMs will not migrate.
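
To make that rule concrete, here is a minimal sketch in Python. It is only an illustration of the decision described above, not the actual rgmanager or fenced code; the names `FenceResult` and `recovery_action` are hypothetical.

```python
from enum import Enum


class FenceResult(Enum):
    # Hypothetical outcomes of a fence attempt, for illustration only.
    SUCCESS = "success"          # fence device confirmed the node was reset
    UNREACHABLE = "unreachable"  # fence device could not be contacted


def recovery_action(node_responds: bool, fence_result: FenceResult) -> str:
    """Decide what a surviving node may do with a peer's HA VMs."""
    if node_responds:
        # The peer is still a healthy cluster member.
        return "nothing to do"
    if fence_result is FenceResult.SUCCESS:
        # The peer is known to be down, so its VMs can be started elsewhere.
        return "relocate VMs"
    # The peer is silent but its state is unknown: it may still be
    # running the VM, so starting a second copy is not safe.
    return "wait: node state unknown"


print(recovery_action(False, FenceResult.SUCCESS))
print(recovery_action(False, FenceResult.UNREACHABLE))
```

The last branch is the situation in this thread: the node is silent, the fence device cannot be reached, and so nothing is relocated.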

fsoy · Dec 7, 2014

Well. It seems so incongruous to me that a system designed to make decisions about "High Availability" when a host is down does nothing because... the host is down.

But OK, I probably don't have the same understanding of HA. No matter, but now I'm trying to imagine what can be done in this situation: a host dies, say the motherboard burns out, or all power supplies burn out, and so on: fencing is out. The machine cannot be repaired in a reasonable time, so how can the VMs be migrated to another host when we can't migrate them even manually, via GUI or console? Is there a way, a procedure somewhere?

tom · Dec 13, 2014

fsoy said:
Well. It seems so incongruous to me that a system designed to make decisions about "High Availability" when a host is down does nothing because... the host is down.
But OK, I probably don't have the same understanding of HA. No matter, but now I'm trying to imagine what can be done in this situation: a host dies, say the motherboard burns out, or all power supplies burn out, and so on: fencing is out. The machine cannot be repaired in a reasonable time, so how can the VMs be migrated to another host when we can't migrate them even manually, via GUI or console? Is there a way, a procedure somewhere?

If a host dies (say the motherboard burns out, or all power supplies burn out, and so on), the remaining nodes will fence this node to make sure that it is really dead. This is essential, because otherwise there is a risk that a VM or CT runs on two nodes in parallel, and you will get corrupted data immediately.

But if you do not configure an independent fence device (like power fencing), the remaining nodes cannot fence the node in trouble, and therefore they cannot start the VM or CT.

What you suggest, just starting without fencing, is impossible by design. Hope it's clearer now.
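
As a rough illustration of why an independent fence device matters, the following Python sketch models a fence device by whether it depends on the node's own power and network. The class and function names are hypothetical, and the dependency flags are assumptions based on the statement earlier in this thread that fence_ilo needs power and network; this is not how any real fence agent is implemented.

```python
from dataclasses import dataclass


@dataclass
class FenceDevice:
    # Hypothetical model: which of the node's resources the device relies on.
    name: str
    shares_power_with_node: bool
    shares_network_with_node: bool


def can_fence(device: FenceDevice, node_power_ok: bool, node_network_ok: bool) -> bool:
    """Return True if the surviving node can still reach the fence device."""
    if device.shares_power_with_node and not node_power_ok:
        return False
    if device.shares_network_with_node and not node_network_ok:
        return False
    return True


# Assumption from this thread: the iLO setup depends on the node's power and network.
ilo = FenceDevice("fence_ilo", shares_power_with_node=True, shares_network_with_node=True)
# Assumption: a separate power switch has its own power feed and network link.
power_switch = FenceDevice("power switch", shares_power_with_node=False, shares_network_with_node=False)

# Power cord pulled from the node:
print(can_fence(ilo, node_power_ok=False, node_network_ok=True))
print(can_fence(power_switch, node_power_ok=False, node_network_ok=True))
```

Under these assumptions, pulling the power or the network cable takes the iLO-style device down together with the node, while the independent power switch stays reachable and fencing can complete.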
