HA migration on node failure restarts VMs

thheo · Dec 22, 2013

Hello,

I am trying to make a setup of Two-Node HA ( https://pve.proxmox.com/wiki/Two-Node_High_Availability_Cluster ).
I have 2 identical machines (Dell R720 with idrac7) and I have setup a PVE Cluster with these 2 and a quorum disk via an iscsi target from a third-machine.
Although everything seems to work fine: I can do live migration with no packet loss, if I manually fence a node
or crash it on purpose the VM running on the "broken" node is getting moved to the operational one but I get this in the logs:

Code:

Dec 22 02:01:11 rgmanager State change: proxmox2 DOWN
Dec 22 02:01:34 rgmanager Marking service:gfs2-2 as stopped: Restricted domain unavailable
Dec 22 02:01:34 rgmanager Starting stopped service pvevm:101
Dec 22 02:01:34 rgmanager [pvevm] VM 100 is running
Dec 22 02:01:35 rgmanager [pvevm] Move config for VM 101 to local node
Dec 22 02:01:36 rgmanager Service pvevm:101 started
==
Dec 22 02:01:11 fenced fencing node proxmox2
Dec 22 02:01:33 fenced fence proxmox2 success
==

I have 2 VMs (100,101) both are CentOS 6 (actually 101 is a clone of 100 ) with which I ran these tests.
The setup consists in a drbd session between the 2 nodes on top of which I run gfs2 ( no lvm involved ). I had a hard time
mounting this resource at startup and my cluster.conf looks like this now:

Code:

<?xml version="1.0"?>
<cluster config_version="39" name="Cluster">
  <cman expected_votes="3" keyfile="/var/lib/pve-cluster/corosync.authkey" two_node="1"/>
  <quorumd allow_kill="0" interval="1" label="cluster_qdisk" tko="10" votes="1"/>
  <totem token="1000"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="192.168.162.90" login="fence" name="proxmox1-drac" passwd="123456" secure="1"/>
    <fencedevice agent="fence_ipmilan" ipaddr="192.168.162.91" login="fence" name="proxmox2-drac" passwd="123456" secure="1"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="proxmox1" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="proxmox1-drac"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="proxmox2" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="proxmox2-drac"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
   <failoverdomains>
        <failoverdomain name="node1" nofailback="0" ordered="0" restricted="1">
            <failoverdomainnode name="proxmox1"/>
        </failoverdomain>
        <failoverdomain name="node2" nofailback="0" ordered="0" restricted="1">
            <failoverdomainnode name="proxmox2"/>
        </failoverdomain>
    </failoverdomains>
   <resources>
    <clusterfs name="gfs2" mountpoint="/gfs2" device="/dev/drbd0" fstype="gfs2" force_unmount="1" options="noatime,nodiratime,noquota"/>
   </resources>
   <service autostart="1" name="gfs2-1" domain="node1" exclusive="0">
    <clusterfs ref="gfs2"/>
   </service>
   <service autostart="1" name="gfs2-2" domain="node2" exclusive="0">
    <clusterfs ref="gfs2"/>
   </service>
    <pvevm autostart="1" vmid="100"/>
    <pvevm autostart="1" vmid="101"/>
  </rm>
</cluster>

( all the mess with failoverdomains and 2 services was the only solution I found to use the cluster to mount drbd0 )
So the question is why does the VM get restarted? as I see in rgmanager.log it says it was stopped..

Thank you,

Teodor

mir · Dec 22, 2013

Try change

Code:

[COLOR=#333333]<failoverdomains>[/COLOR]
[COLOR=#333333]<failoverdomain name="node1" nofailback="0" ordered="0" restricted="1">[/COLOR]
[COLOR=#333333]<failoverdomainnode name="proxmox1"/>[/COLOR]
[COLOR=#333333]</failoverdomain>[/COLOR]
[COLOR=#333333]<failoverdomain name="node2" nofailback="0" ordered="0" restricted="1">[/COLOR]
[COLOR=#333333]<failoverdomainnode name="proxmox2"/>[/COLOR]
[COLOR=#333333]</failoverdomain>[/COLOR]
[COLOR=#333333]</failoverdomains>
[/COLOR][COLOR=#333333]<resources>[/COLOR]
[COLOR=#333333]<clusterfs name="gfs2" mountpoint="/gfs2" device="/dev/drbd0" fstype="gfs2" force_unmount="1" options="noatime,nodiratime,noquota"/>[/COLOR]
[COLOR=#333333]</resources>[/COLOR]
[COLOR=#333333]<service autostart="1" name="gfs2-1" domain="node1" exclusive="0">[/COLOR]
[COLOR=#333333]<clusterfs ref="gfs2"/>[/COLOR]
[COLOR=#333333]</service>[/COLOR]
[COLOR=#333333]<service autostart="1" name="gfs2-2" domain="node2" exclusive="0">[/COLOR]
[COLOR=#333333]<clusterfs ref="gfs2"/>[/COLOR]
[COLOR=#333333]</service>[/COLOR][COLOR=#333333]
[/COLOR]

to

Code:

[COLOR=#333333]<failoverdomains>[/COLOR]
[COLOR=#333333]    <failoverdomain name="nodefailover" nofailback="0" ordered="0" restricted="1">[/COLOR]
[COLOR=#333333]        <failoverdomainnode name="proxmox1"/>
[/COLOR][COLOR=#333333]        <failoverdomainnode name="proxmox2"/>[/COLOR]
[COLOR=#333333]    </failoverdomain>[/COLOR]
[COLOR=#333333]</failoverdomains>
[/COLOR][COLOR=#333333]<resources>[/COLOR]
[COLOR=#333333]    <clusterfs name="gfs2" mountpoint="/gfs2" device="/dev/drbd0" fstype="gfs2" force_unmount="1" options="noatime,nodiratime,noquota"/>[/COLOR]
[COLOR=#333333]</resources>[/COLOR]
[COLOR=#333333]<service autostart="1" name="gfs2fs" domain="[/COLOR][COLOR=#333333]nodefailover[/COLOR][COLOR=#333333]" [/COLOR]recovery="relocate"[COLOR=#333333]>[/COLOR]
[COLOR=#333333]    <clusterfs ref="gfs2"/>[/COLOR]
[COLOR=#333333]</service>[/COLOR]

thheo · Dec 22, 2013

The reason I defined 2 domains and 2 services is because it's the only way I found to mount the gfs2 on both nodes simultaneously. Without the 2 service definitions
it only gets mounted on one node:

Code:

Member Status: Quorate


 Member Name                                                ID   Status
 ------ ----                                                ---- ------
 proxmox1                                                       1 Online, Local, rgmanager
 proxmox2                                                       2 Online, rgmanager
 /dev/block/8:17                                                0 Online, Quorum Disk


 Service Name                                      Owner (Last)                                      State
 ------- ----                                      ----- ------                                      -----
 pvevm:100                                         proxmox2                                          started
 pvevm:101                                         proxmox2                                          started
 service:gfs2fs                                    proxmox2                                          started

Basically there is one service gfs2fs controlled by the cluster and after restart now it only shows up on proxmox2. Is there a way to have such a resource/service defined in the cluster
so I can get this shared storage mounted properly at the startup? If I revert to my config I get 2 services listed at clustat: one for each node.
I wonder if maybe this shouldn't be controlled at all by the cluster? And just have init scripts to mount gfs2 right after drbd?
I forgot to mention that I use qcow2 for each VM and the shared storage was presented to PVE as local directory with "shared" option.
So even if the shared storage really works ( with the current definition of the resource in drbd and gfs2 on top it resumes nicely after any crash I was able to simulate until now ) somehow
the VM running memory is not protected by the HA setup ( I guess rgmanager should take care of that from the pvevm definition in rm section ) and the VM cannot continue on the remaining node.
It wouldn't be a big problem with Linux VMs because even with this crash, they resume very fast and with no fs errors ( ext4 journaling helps I guess ), but I plan to run a few W2003 machines
and restarting will take several minutes..

Thank you,
Teodor

thheo · Dec 23, 2013

Maybe I got this HA wrong? Is this normal behaviour? on node failure the VM gets restarted on another node?

dietmar · Dec 23, 2013

thheo said:
Maybe I got this HA wrong? Is this normal behaviour? on node failure the VM gets restarted on another node?

yes, that is how HA works.

wahmed · Dec 26, 2013

dietmar said:
yes, that is how HA works.

I wasnt aware that VMs actually gets restarted after migration when they are in HA mode. I thought they only migrate and thats it.
I never used HA option. Like to keep it all manual labor.

But good to know.

dietmar · Dec 26, 2013

symmcom said:
I wasnt aware that VMs actually gets restarted after migration when they are in HA mode.

A VM is not restarted if you migrate it - that is wrong.

symmcom said:
I thought they only migrate and thats it.
I never used HA option. Like to keep it all manual labor. But good to know.

With HA, a VM gets restarted on the other node after the original node is dead (an thus the VM is no longer running there).

thheo · Jan 1, 2014

I would like to know why is it recommended to have CLVM on top of drbd and then gfs2? Is there something that I miss here? I am about to start
this system in production and I imagined that the simpler the config the better.. but everywhere online I see the standard setup using CLVM in the middle.

mir · Jan 1, 2014

IMHO you are mixing something. Either CLVM and drbd or only use gfs2. I would recommend gfs2 instead of drbd and CLVM since drbd only supports two nodes while gfs2 supports multiple nodes.

thheo · Jan 1, 2014

I only have 2 nodes ( as mentioned above ) and since I don't have a dedicated shared storage I need drbd to provide that along with redundancy. Did I get this wrong?

mir · Jan 1, 2014

gfs2 should be able to apply the same functionality. Whether drbd or gfs2 performs best is up to a test to show.

thheo · Jan 1, 2014

I couldn't find anything related to such a functionality for gfs2. Maybe you mean glusterfs? ( for which I don't know details )

mir · Jan 1, 2014

gfs2 is a shared disk filesystem which present the filesystem to clients as a shared block storage so you can access the storage through eg. iSCSI. So access to your two storage servers can be achieved through MPIO and plain LVM on top. See http://forum.proxmox.com/threads/2808-Multipath-iSCSI or http://pve.proxmox.com/wiki/ISCSI_Multipath. But this is not plain and simple so if you want failover a dedicated storage is highly preferred and especially if you are not having a deep knowledge into storage.

thheo · Jan 1, 2014

I don't have a shared storage, I simulate one with the help of drbd. I use a partition on the internal disk of the Dell server that I mirror it with the help of drbd on the other identical partition on the internal disk of the second server. Multipath either from iscsi or FC or other SAN means a dedicated storage which I don't have for the moment. For a moment I imagined that you'd export via iscsi the internal disks from each server to the other server, but then you'd still not be able to mirror them easily to have a shared storage ( except some weird config with softwareRaid on top ).

thheo · Jan 1, 2014

Regarding the question of whether CLVM is required for GFS2, I found in Redhat's documentation that you can use straight-forward GFS2,
however they don't offer support for this kind of setup in a cluster environment.. I am just afraid not to start this system and then
with live applications on it I would discover some scenario which I couldn't reproduce for testing and I'd have to rebuild the underlying storage layers..
For now I did all kind of crashes to the machines individually or simultaneously and only once in 50 crashes I had to manually recover from drbd
split-brain. GFS2 seems to recover nicely as I see from kernel messages ( journal recovery )..

e100 · Jan 3, 2014

I would suggest to not use GFS2 and simply use LVM on top of DRBD as suggested in the Proxmox wiki:
http://pve.proxmox.com/wiki/DRBD

Proxmox works out of the box with LVM on top of DRBD.
IMHO GFS2 is only adding un-needed complexity to your setup, I suspect it also has a negative performance impact too.

Lastly it is also my opinion that one should not use HA with DRBD, here is why:
You are sleeping, something goes wrong with DRBD and it split-brains (loose cable, flaky network gear, ram error, cosmic rays)
An hour later the node running your HA VM Fails.
Proxmox does what it is told and starts the VM on the other DRBD node.
Problem is the data on the other node is an hour old due to the split-brain.
Worst case you lost an hour of changes, best case you spend lots of labor time and are able to merge the changes from the two divergent systems.

Some might argue that this is unlikely to happen, I'd argue that it is best to plan for the worst.
This is the only scenario that I know of where HA + DRBD can lead to problems.

mir · Jan 4, 2014

e100 said:
Lastly it is also my opinion that one should not use HA with DRBD, here is why:
You are sleeping, something goes wrong with DRBD and it split-brains (loose cable, flaky network gear, ram error, cosmic rays)
An hour later the node running your HA VM Fails.
Proxmox does what it is told and starts the VM on the other DRBD node.
Problem is the data on the other node is an hour old due to the split-brain.
Worst case you lost an hour of changes, best case you spend lots of labor time and are able to merge the changes from the two divergent systems.

Some might argue that this is unlikely to happen, I'd argue that it is best to plan for the worst.
This is the only scenario that I know of where HA + DRBD can lead to problems.

I have seen this situation happen in production which costed me a good nights sleep!

e100 · Jan 4, 2014

There is another scenario.
Imagine DRBD is resyncing data because the other node has split brain or was down for maintenance.
Before the other node is fully synced the node running the HA VM fails.
Now your VM is started on an obviously inconsistent copy of the disk.

I use DRBD and fencing.
If I have a transient failure and the node running the VM fails it gets fenced. If it boots back up the VM auto starts and life goes on with only about 5 min of downtime.
If the node is completely dead my coworker or I will start the VM on the other node for a total of 10-30 min of downtime.

In either case fencing prevents a split brain from occurring. ( well it has so far )

cesarpk · Jan 4, 2014

e100 said:
There is another scenario.
Imagine DRBD is resyncing data because the other node has split brain or was down for maintenance.
Before the other node is fully synced the node running the HA VM fails.
Now your VM is started on an obviously inconsistent copy of the disk.

I use DRBD and fencing.
If I have a transient failure and the node running the VM fails it gets fenced. If it boots back up the VM auto starts and life goes on with only about 5 min of downtime.
If the node is completely dead my coworker or I will start the VM on the other node for a total of 10-30 min of downtime.

In either case fencing prevents a split brain from occurring. ( well it has so far )

Hi e100

i have lvm on top of DRBD in my PVE Nodes, manual fence for my PVE Nodes with VMs in HA, DRBD with email ad in case of replication problems, "but without fencing for DRBD".
Can you show your configuration about of fencing for DRBD?

Best regards
Cesar

thheo · Jan 4, 2014

I was hoping to have some time to perform some extended and extensive tests, but didn't.. thank you e100 and mir for pointing out something
I missed ( actually it's my first try-out of using clustering and drbd ).
I got some important pointers from you and I'll think about a proper drbd fencing solution..

HA migration on node failure restarts VMs

Renowned Member

Famous Member

Renowned Member

Renowned Member

Proxmox Staff Member

Famous Member

Proxmox Staff Member

Renowned Member

Famous Member

Renowned Member

Famous Member

Renowned Member

Famous Member

Renowned Member

Renowned Member

Renowned Member

Famous Member

Renowned Member

Well-Known Member

Renowned Member

We value your privacy