HA migration on node failure restarts VMs

I was hoping to have some time to perform some extended and extensive tests, but didn't. Thank you e100 and mir for pointing out something
I missed (this is actually my first try at clustering and DRBD).
I got some important pointers from you, and I'll think about a proper DRBD fencing solution.

Hi thheo

The best practices for VMs in HA with DRBD are:
1- Have two partitions for DRBD (one DRBD resource each)
2- Always put LVM on top of DRBD, for each DRBD partition
3- The first DRBD partition holds the VMs of the first PVE node
4- The second DRBD partition holds the VMs of the second PVE node

Why is this best?

For safety reasons:
1- If replication breaks on one partition, you can discard the data on the affected DRBD partition and re-sync it without affecting the other one.
2- This way you can re-sync a DRBD partition online, without shutting down the VMs on either PVE node and without taking backups first.
3- If you have GFS2 on your DRBD partition, with VMs of both PVE nodes running inside it, think: if the PVE nodes lose DRBD replication, how do you fix it while both nodes keep writing data into the same partition that has lost replication?

For speed reasons:
1- LVM is faster than CLVM
2- ext3 is faster than GFS2
So the VMs get faster access to the data on the partition, and technically the VMs don't need access to a cluster file system.

For practical reasons:
1- This way it is very easy to fix DRBD replication problems; in this case I only need to run a few simple commands, and in a very short time my DRBD partition is synchronized again (I don't use DRBD's discard option, to save time), all without shutting down any of the VMs on either PVE node.
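
To make the layout concrete, here is a minimal sketch of what such a setup could look like; the resource names (r0, r1), device paths, and volume group names are my own placeholders, not from Cesar's post:

```
# /etc/drbd.d/ sketch: two DRBD resources, one per PVE node's VMs
resource r0 {              # holds the VMs of the first PVE node
  device /dev/drbd0;
  disk   /dev/sdb1;
}
resource r1 {              # holds the VMs of the second PVE node
  device /dev/drbd1;
  disk   /dev/sdb2;
}

# Then one LVM volume group per resource, each added to Proxmox
# as a separate LVM storage:
#   pvcreate /dev/drbd0 && vgcreate drbdvg0 /dev/drbd0
#   pvcreate /dev/drbd1 && vgcreate drbdvg1 /dev/drbd1
```

A broken resource can then be discarded and re-synced while the VMs on the other resource keep running untouched.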

And finally, I have this configuration installed in many production environments, well tested and with excellent results.

The only thing I still need to know is how to configure fencing for DRBD; then my configuration will be complete, without any point of failure.
If e100 or anybody can help me with this matter, please do!

Hoping this information is useful for you, I say see you soon.

Best regards
Cesar

Re.Edit: It is highly recommended to practice DRBD tuning for performance reasons (follow the directions for your version of DRBD):
http://www.drbd.org/users-guide/s-throughput-tuning.html
http://www.drbd.org/users-guide/s-latency-tuning.html
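
The two guides above boil down to a handful of drbd.conf knobs. A sketch of the kind of settings they discuss (DRBD 8.3 syntax; the values below are illustrative only, not recommendations, so measure on your own hardware):

```
resource r0 {
  net {
    max-buffers    8000;    # more in-flight buffers for fast links
    max-epoch-size 8000;
    sndbuf-size    512k;    # larger TCP send buffer
  }
  syncer {
    rate       100M;        # cap background resync bandwidth
    al-extents 3389;        # bigger activity log, fewer metadata writes
  }
}
```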
 
Just configure fencing according to the Proxmox wiki: http://pve.proxmox.com/wiki/Fencing

Nothing else needs to be done.
Once you configure fencing, if a node fails it will get fenced.

But I thought you just mentioned that you can be in a situation where, from the cluster's perspective, everything is fine (the nodes inter-operate properly), but DRBD fails for some reason.
In that situation fencing will not get triggered, right? Since you gave me that perspective, I have some trouble sleeping :)
Now the only question remaining: do I set up fencing for DRBD independently of the cluster configuration (as in: http://www.drbd.org/users-guide-8.3/s-fence-peer.html),
or exactly the opposite: do I configure something in the cluster so that it takes care of this situation?
(I don't know my options yet, but there are some hints on drbd.org: http://www.drbd.org/users-guide-8.3/s-heartbeat-dopd.html)
Btw, in such a case (DRBD not working, but the cluster working) I am not even sure that fencing is the proper solution. How would you know which node to fence? The cluster decides on fencing by a vote of at least 2 out of 3.
What cesarpk is saying makes a lot of sense: it's like creating zones, so they will not overlap even if you get into such a situation.
In my case, since I only need the HA setup to make sure I recover quickly without manual intervention, I wonder if it wouldn't be better to have a primary/secondary setup for DRBD?
Ah, and btw cesarpk, I did some tests with fio using the iometer job example, and with drbd+gfs2+VM I get 97% of the host's performance, so from this perspective I think it's good enough (maybe the gap increases with more powerful hardware).
But strictly speaking about GFS2: I imagined that, being a filesystem made for clusters and using DLM, it would lock properly through that channel, so there would be no overlap even if the DRBD link is broken. This is something I wanted to test, deploying some fixed-length files and some variable-length ones, so I can see whether recovering from split-brain overwrites data from the "wrong" node.
But maybe this is why Red Hat only supports GFS2 on top of CLVM: to have the proper locking mechanisms so you don't have to care about the underlying storage, like we are trying to now.
I hope I explained my thoughts properly :)
 
Although I read this before, it seems I forgot some important things mentioned here:
https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial
See "Hooking DRBD Into The Cluster's Fencing", for example: "We will use a script, written by Lon Hohberger of Red Hat. This script will capture fence calls from DRBD and in turn calls the cluster's fence_node against the opposing node. In this way, DRBD will avoid split-brain without the need to maintain two separate fence configurations."

LE:
I have added this to my DRBD configuration and I will test it for the next 2 days:
In the handlers section: fence-peer "/usr/lib/drbd/rhcs_fence";
In the disk section: fencing resource-and-stonith;
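
Assembled into a resource stanza, the two lines above sit roughly like this (the resource name r0 is just a placeholder):

```
resource r0 {
  disk {
    fencing resource-and-stonith;          # suspend I/O and call the fence handler
  }
  handlers {
    fence-peer "/usr/lib/drbd/rhcs_fence"; # Lon Hohberger's script: invokes the
                                           # cluster's fence_node against the peer
  }
}
```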
 
I simply use Zabbix to monitor DRBD and alert me when it is broken.

What if I was live-migrating VMs to the other node when the DRBD issue happens? I'm not sure an automated solution would ever make the right choice.

I use DRBD to provide storage replication for my VMs. If it breaks, my VMs keep running; I only lose storage replication. If DRBD is configured to fence to prevent split-brain, it is now causing VMs to fail, which is not what I want to have happen.
 
Hi thheo

The best practices for VMs in HA with DRBD are:
1- Have two partitions for DRBD (one DRBD resource each)
2- Always put LVM on top of DRBD, for each DRBD partition
3- The first DRBD partition holds the VMs of the first PVE node
4- The second DRBD partition holds the VMs of the second PVE node

Why is this best?

For safety reasons:
1- If replication breaks on one partition, you can discard the data on the affected DRBD partition and re-sync it without affecting the other one.
2- This way you can re-sync a DRBD partition online, without shutting down the VMs on either PVE node and without taking backups first.
3- If you have GFS2 on your DRBD partition, with VMs of both PVE nodes running inside it, think: if the PVE nodes lose DRBD replication, how do you fix it while both nodes keep writing data into the same partition that has lost replication?

For speed reasons:
1- LVM is faster than CLVM
2- ext3 is faster than GFS2
So the VMs get faster access to the data on the partition, and technically the VMs don't need access to a cluster file system.

For practical reasons:
1- This way it is very easy to fix DRBD replication problems; in this case I only need to run a few simple commands, and in a very short time my DRBD partition is synchronized again (I don't use DRBD's discard option, to save time), all without shutting down any of the VMs on either PVE node.

It hit me now, while I was reconfiguring my setup to be much like yours, that you really meant LVM instead of CLVM. I am not sure I understand this properly: are you sure
you have no problems even though you are not using cluster locking? I understood the point that you have 2 separate resources, one per node,
so usually nothing goes wrong and you get your storage doubled. But what happens in case of an HA restart on the other node? No locking problems? Can you just start using the LVM
from the first node on the second node if the first node fails? Could you be in a situation where the first node is writing to its disconnected DRBD resource via LVM, and the second one, which somehow got disconnected from the cluster, also writes to its disconnected DRBD resource?
 
If you use Proxmox to manage things it is not a problem (i.e. don't mess with LVM manually unless you are sure it is OK).
Proxmox does its own locking.

I too had the same concern that we need cLVM. The Proxmox devs did implement cLVM back in 2.0; it caused more issues than it solved, so it was removed.
I've been using DRBD with LVM since Proxmox 1.4 and have never had a catastrophic issue.
 
It hit me now, while I was reconfiguring my setup to be much like yours, that you really meant LVM instead of CLVM. I am not sure I understand this properly: are you sure you have no problems even though you are not using cluster locking?

CLVM = Clustered Logical Volume Manager; please see:
https://access.redhat.com/site/docu...ager_Administration/LVM_Cluster_Overview.html

In short: CLVM is for shared storage (such as a SAN), to manage LVM correctly when a group of nodes accesses it simultaneously.
In my case with DRBD, I only use plain LVM and it works very well (remember that only one PVE node has access to each DRBD partition, never two nodes simultaneously).

But what happens in case of an HA restart on the other node? No locking problems? Can you just start using the LVM from the first node on the second node if the first node fails?

Yes and yes; LVM doesn't show any problem. Remember that LVM must be on top of DRBD.
A bonus: I can enlarge the virtual disk of a VM live, as long as the DRBD partition has free space (and also grow its partitions, if the virtualized OS allows it).

Could you be in a situation where the first node is writing to its disconnected DRBD resource via LVM, and the second one, which somehow got disconnected from the cluster, also writes to its disconnected DRBD resource?

Yes, for example if I brutally disconnect all the network cables used for replication of all DRBD resources...
... But in that case (I tested this a long time ago), all my VMs continue to operate without any problem; DRBD just stops replicating the volumes.

After such a situation, and without turning anything off, I reconnect all the DRBD replication cables, then run a few simple commands on the PVE nodes so that DRBD resynchronizes only the difference in data, and after a very short time all my DRBD resources are perfectly synchronized again.
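
Cesar doesn't say exactly which commands he runs, but a typical manual recovery on DRBD 8.3 after the link comes back looks like this (the resource name r0 is a placeholder; the discard step, which Cesar says he avoids, is only needed when DRBD actually reports a split-brain):

```
# Only if DRBD reports split-brain: on the node whose changes
# you are willing to discard:
drbdadm secondary r0
drbdadm -- --discard-my-data connect r0

# On the node whose data you keep:
drbdadm connect r0

# Watch the resync progress:
cat /proc/drbd
```

With the two-resource layout, each resource is only ever written by one node, so the discarded side of a given resource holds no live VM data and nothing of value is lost.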

Everything I've said above works very well in production environments and has been thoroughly tested!

Best regards
Cesar
 
Ok, but in this case you don't have an option for live migration, right?

Live migration of VMs (vMotion in VMware terms) works perfectly; I don't see any drawback (based on the configuration I described in my previous posts).
 
It works nicely, thank you! I am trying different worst-case scenarios now to see if the data gets corrupted.
 
...
After such a situation, and without turning anything off, I reconnect all the DRBD replication cables, then run a few simple commands on the PVE nodes so that DRBD resynchronizes only the difference in data, and after a very short time all my DRBD resources are perfectly synchronized again.

Everything I've said above works very well in production environments and has been thoroughly tested!

Best regards
Cesar

hi Cesar,

did you plug off one of the nodes?

I'm trying to switch from vserver (it runs very well with an active/passive DRBD setup), but I had some bad times. I have 2 nodes with a few VMs running on each; when I unplugged one node, the disaster started... :-(

I can't start the VMs on the other node, because the VMs are listed on the offline node and they can't be migrated manually as long as that node is offline! :-(

After the offline node restarted and came back, it couldn't find some qfiles (discs) of some VMs, and couldn't start some VMs on the second node either!

I don't know the exact problem, and I don't get the point, because I have a VMware cluster with 2 nodes and there is no problem with the sync there. Both nodes are in the virtual datacenter, and if one is switched off you can start the VM on the other.

I tried this Proxmox setup with 2 nodes and DRBD with LVM, but I did not set up a quorum disk. Maybe it would work perfectly with a quorum NFS mount from another server? Or do I need a third Proxmox node?
I want to try it again, because Proxmox has a better management tool than vserver, but what is most important is the absolute safety of the data!

Can somebody tell me something about an NFS quorum from a NAS?

Regards

Richard
 
hi Cesar,

did you plug off one of the nodes?

To test a DRBD failure, I unplugged all the NICs used for DRBD, and afterwards I resynchronized the DRBD partitions from the CLI without problems.

I'm trying to switch from vserver (it runs very well with an active/passive DRBD setup), but I had some bad times. I have 2 nodes with a few VMs running on each; when I unplugged one node, the disaster started... :-(

- For PVE, DRBD must always be in active/active mode; only then can you use live migration and HA.

An important note about quorum:
- If you have only two PVE nodes, and not three or more, in the same PVE cluster, then your cluster.conf must contain a line that says:
<cman keyfile="/var/lib/pve-cluster/corosync.authkey" two_node="1" expected_votes="1"/>
- With the line above you are telling the PVE nodes that a single quorum vote is enough to keep working; this way the cluster can always operate with only one PVE node.
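
Put together, a minimal two-node cluster.conf with manual fencing might look like the following sketch; the cluster name, node names, and version number are placeholders:

```
<?xml version="1.0"?>
<cluster name="pvecluster" config_version="2">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey" two_node="1" expected_votes="1"/>
  <fencedevices>
    <fencedevice agent="fence_manual" name="human"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="pve1" votes="1" nodeid="1">
      <fence>
        <method name="single">
          <device name="human" nodename="pve1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="pve2" votes="1" nodeid="2">
      <fence>
        <method name="single">
          <device name="human" nodename="pve2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
</cluster>
```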

To get HA working well with only two PVE nodes:
- You must have the PVE nodes in a cluster and with quorum; otherwise HA won't work.
- It is also best to add a "fence_manual" configuration for the PVE hosts in your cluster.conf file. With manual fencing, if a PVE node has a problem, you must cut the electrical power to the failing node, and "only after" that, on the PVE node that is still alive, run "fence_ack_manual <name_of_PVE_Node_that_was_disconnected>" from the CLI; then the VM(s) start on the surviving node (this topic has been discussed several times in this forum if you want to learn how to do it).
- Remember that rgmanager must be enabled and that your HA PVE nodes must have joined the fence domain (see the Proxmox wiki on this topic).

About the PVE cluster and the switch:
- If the switch in your virtual center supports multicast, the PVE cluster will work well; if not, you must configure it in unicast mode (see the Proxmox wiki on this topic).

I can't start the VMs on the other node, because the VMs are listed on the offline node and they can't be migrated manually as long as that node is offline! :-(

The network for DRBD must use different NIC(s) from the PVE network, and the PVE node that stays alive must have quorum (I believe this is your problem; please read the "important note about quorum" above in this post).

After the offline node restarted and came back, it couldn't find some qfiles (discs) of some VMs, and couldn't start some VMs on the second node either!

- What are qfiles? ... qcow2 files?
- To get HA in PVE, you must use LVM on top of DRBD (see the Proxmox wiki on DRBD). With this layout you can only use the raw format, not qcow2, for the virtual hard disks of your VMs, and the virtual disks of the VMs must be on the DRBD partition.
- You can't have qcow2 files on LVM on top of DRBD with HA enabled for VMs; only raw format.
- Optionally (and not necessarily), VMs that don't have HA enabled can have their virtual disks on another partition outside DRBD, and there you can have qcow2 files for your virtual disks.
- About not being able to find the virtual disk files:
First: you must know that the virtual disks on the DRBD partition are "IN RAW FORMAT", and each virtual disk is a logical volume.
Second: only the virtual disks that PVE has activated are visible, on the PVE node that activated them.
Third: if you want to find them, and the DRBD partition is in active/active mode, run "vgdisplay" from the CLI and you will see where they are.
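
The lookup described in "Third" might go like this on a PVE node; the volume group name drbdvg0 and the VM ID are my own placeholders:

```
# List the volume groups; the one on top of DRBD shows up once the
# resource is primary on this node
vgdisplay

# Each virtual disk is a raw logical volume named after its VM ID
lvs drbdvg0            # e.g. a volume like vm-101-disk-1
```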

I don't know the exact problem, and I don't get the point, because I have a VMware cluster with 2 nodes and there is no problem with the sync there. Both nodes are in the virtual datacenter, and if one is switched off you can start the VM on the other.

- Only two nodes and you get HA? That's strange; I believe you don't really have HA there. But this isn't relevant for this thread.

I tried this Proxmox setup with 2 nodes and DRBD with LVM, but I did not set up a quorum disk. Maybe it would work perfectly with a quorum NFS mount from another server? Or do I need a third Proxmox node?

- In a PVE cluster, quorum is "always necessary", and when you add a PVE node to a PVE cluster its quorum vote is added automatically.
- A "quorum NFS mount" doesn't exist in a PVE cluster. See my "important note about quorum" above in this post.

I want to try it again, because Proxmox has a better management tool than vserver, but what is most important is the absolute safety of the data!

- PVE is fantastic, but you will need to learn many things about it.
- About "absolute safety of the data": let me give you a fact. Did you know that DRBD has a command to verify the data of replicated volumes online, i.e. hot, without powering off anything?
Please see this link: http://www.drbd.org/users-guide-8.3/s-use-online-verify.html
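
Per the linked guide, online verification is enabled with a digest algorithm and then started by hand; the resource name r0 is a placeholder:

```
# In drbd.conf, inside the resource's net section:
#   verify-alg sha1;

# Start an online verification run on one node:
drbdadm verify r0

# Progress is visible in /proc/drbd, and out-of-sync blocks are logged
# by the kernel. If differences were found, resync them by reconnecting:
drbdadm disconnect r0
drbdadm connect r0
```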

Can somebody tell me something about an NFS quorum from a NAS?

- This question was answered above.

Best regards and good luck with your PVEs

Cesar
 
To test a DRBD failure, I unplugged all the NICs used for DRBD, and afterwards I resynchronized the DRBD partitions from the CLI without problems.

- For PVE, DRBD must always be in active/active mode; only then can you use live migration and HA.

I'm seeing this for the first time and was surprised that active/active can work! :)
I'll try to simulate the worst case and unplug the power cord from one node. I want to see what happens after a cold shutdown: DRBD sync, filesystem check, VM filesystem check, and how the VMs start. :)

An important note about quorum:
- If you have only two PVE nodes, and not three or more, in the same PVE cluster, then your cluster.conf must contain a line that says:
<cman keyfile="/var/lib/pve-cluster/corosync.authkey" two_node="1" expected_votes="1"/>
- With the line above you are telling the PVE nodes that a single quorum vote is enough to keep working; this way the cluster can always operate with only one PVE node.

Hmm, but I read somewhere (here?) that this is not a safe way to treat your data?
You mean you can migrate VMs from the offline node to the online one? But the offline node is not clickable, and all VMs on that node are grayed out. Does it have to be done with some terminal work? :confused:

- What are qfiles? ... qcow2 files?
- To get HA in PVE, you must use LVM on top of DRBD (see the Proxmox wiki on DRBD). With this layout you can only use the raw format, not qcow2, for the virtual hard disks of your VMs, and the virtual disks of the VMs must be on the DRBD partition.
- You can't have qcow2 files on LVM on top of DRBD with HA enabled for VMs; only raw format.
- Optionally (and not necessarily), VMs that don't have HA enabled can have their virtual disks on another partition outside DRBD, and there you can have qcow2 files for your virtual disks.
- About not being able to find the virtual disk files:
First: you must know that the virtual disks on the DRBD partition are "IN RAW FORMAT", and each virtual disk is a logical volume.
Second: only the virtual disks that PVE has activated are visible, on the PVE node that activated them.
Third: if you want to find them, and the DRBD partition is in active/active mode, run "vgdisplay" from the CLI and you will see where they are.

Ok, I think I mean qcow2 files and I'm wrong. After the node came back, Proxmox didn't want to start some of the VMs because some files were missing, but I'm not sure what type of files!
I hadn't really checked where the files or LVs were; DRBD was in sync and the second Proxmox node had started, so I don't know where the files or logical volumes went. Normally they should be on the running node, and after the second node comes back and DRBD is in sync, the files should be OK, right? I had only tested what happens when a node dies unexpectedly and how to bring the cluster back online. At first glance everything looked OK because DRBD was in sync and running primary/primary; the nightmare started when I tried to start some VMs: some worked, some didn't, and I don't know exactly why!

- In a PVE cluster, quorum is "always necessary", and when you add a PVE node to a PVE cluster its quorum vote is added automatically.
- A "quorum NFS mount" doesn't exist in a PVE cluster. See my "important note about quorum" above in this post.

I read mir's link before, but I only had 2 machines. Now I'm getting 2 new machines and a NAS, and this NAS can export iSCSI, so maybe I'll give it another try and put the quorum disk on the iSCSI mount?
I'll have to think about it.

- PVE is fantastic, but you will need to learn many things about it.
- About "absolute safety of the data": let me give you a fact. Did you know that DRBD has a command to verify the data of replicated volumes online, i.e. hot, without powering off anything?
Please see this link: http://www.drbd.org/users-guide-8.3/s-use-online-verify.html

Best regards and good luck with your PVEs

Cesar

You're absolutely right! It's great software and looks nice, but it also looks a little bit complex.
My vserver installations have run for years without any big problems, and after configuration they were easy to handle with the active/passive config and heartbeat. OK, it's a standby config and the second node only works if the first node fails, I know.

@mir
Thanks for the link; I read it a month ago already but didn't remember. :-(
 
Hmm, but I read somewhere (here?) that this is not a safe way to treat your data?

- It is correct that this isn't the best scenario, but if you have configured "manual fence", you control when to apply HA manually, so that the VMs start on the PVE node that is still alive. This gives you a lot of safety, because you can analyze the situation before applying the manual fence.

You mean you can migrate VMs from the offline node to the online one? But the offline node is not clickable, and all VMs on that node are grayed out. Does it have to be done with some terminal work? :confused:

- To apply live migration of a VM, it is necessary that:
1- The PVE nodes have quorum
2- The PVE nodes are in the same PVE cluster
3- The VM is running
4- DRBD works perfectly.

- To apply HA you must proceed in this order:
First: configure correctly the HA options I mentioned in my previous post (with much reading and understanding of the Proxmox wiki).
Second: the PVE node with problems must be powered off manually (from that moment on, live migration is no longer possible; you can only apply HA).
Third: on the PVE node that is still alive, run from the CLI: "/usr/sbin/fence_ack_manual <name_of_PVE_Node_that_power off>".
Note: if you want to apply HA without human interaction, you need a power switch (PDU) and the corresponding entries in your cluster.conf file, but with only two PVE nodes in the same PVE cluster this is very dangerous and is not recommended.

Normally they should be on the running node, and after the second node comes back and DRBD is in sync, the files should be OK, right? I had only tested what happens when a node dies unexpectedly and how to bring the cluster back online. At first glance everything looked OK because DRBD was in sync and running primary/primary; the nightmare started when I tried to start some VMs: some worked, some didn't, and I don't know exactly why!

If you configure HA correctly for each VM, then when a PVE node dies those VMs start on the second PVE node, but you need to configure all of this in the cluster.conf file (including the manual fence), enable the rgmanager service, and join the fence domain; the Proxmox wiki explains this, and old threads in this forum cover manual fencing (I don't remember which ones right now).

I read mir's link before, but I only had 2 machines. Now I'm getting 2 new machines and a NAS, and this NAS can export iSCSI, so maybe I'll give it another try and put the quorum disk on the iSCSI mount?
I'll have to think about it.

- If you have three or more PVE nodes in the same PVE cluster:
1- You must not have two_node="1" expected_votes="1" in your cluster.conf file (remove it before you add the third PVE node to the cluster)
2- When you have a small even number of PVE nodes in a cluster (for example 4, not 3), it is preferable to have an odd number of quorum votes, to avoid a tie in difficult cases where quorum votes must be compared (in that case adding an iSCSI quorum disk is a good idea). But if you have 3 PVE nodes in the same cluster, adding a quorum vote via an iSCSI disk is not recommended, because you would have 4 votes in total and it could become impossible to get a majority vote in case of a serious problem.
3- Very important: if you have 3 or more PVE nodes in the same PVE cluster with HA configured, your cluster.conf file needs an additional configuration: a "failover domain". With HA enabled, this ensures that after a problem with a PVE node the VMs start on the correct PVE node, i.e. one that also carries the corresponding DRBD resource (the Internet has a lot of information about failover domains in HA clusters).
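
A failover domain is a short fragment inside the <rm> section of cluster.conf. A sketch, with node names and the VM ID as placeholders (I'm assuming here that the pvevm resource accepts the same domain attribute as other rgmanager services):

```
<rm>
  <failoverdomains>
    <failoverdomain name="drbd-pair" restricted="1">
      <failoverdomainnode name="pve1"/>
      <failoverdomainnode name="pve2"/>
    </failoverdomain>
  </failoverdomains>
  <!-- Pin the HA-managed VM to the nodes that share its DRBD resource -->
  <pvevm autostart="1" vmid="101" domain="drbd-pair"/>
</rm>
```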

@mir
Thanks for the link, read already month ago but not remember. :-(

- Mir's suggestion is an excellent idea if all these conditions are met:
1- You have a small number of PVE nodes in the same PVE cluster: 2, 4, or 6, for example
2- The total number of PVE nodes in the cluster is an even number

Note: if you use manual fencing for your PVE nodes, you have full control and can fence a node manually whenever you want, so the quorum votes are not as relevant. It is different if you have a "PDU" or "power switch": in that case the PVE cluster evaluates the quorum votes (a majority of votes is necessary, which is why an odd total is necessary), and then the cluster tells the PDU to cut the electrical power to the failing PVE node. In this mode there is no human intervention when applying HA (provided the previous configuration is correct).

Good luck with your PVE Cluster

Best regards
Cesar
 
A VM is not restarted if you migrate it - that is wrong.



With HA, a VM gets restarted on the other node after the original node is dead (and thus the VM is no longer running there).

Sorry for maybe beating a dead horse, but I thought HA live-migrates the VM if one node goes down? Is that not true even with VE 3.3? I just want to know if a VM restart is expected, or if I did something wrong. Thank you.
 
Sorry for maybe beating a dead horse, but I thought HA live-migrates the VM if one node goes down? Is that not true even with VE 3.3? I just want to know if a VM restart is expected, or if I did something wrong. Thank you.

I was confused about the term "HA" at the beginning. I usually handle the networking part, and I am used to technologies from Cisco: HA with NSF and SSO, which means a full controller is actively replicated to the standby one and can do a stateful switchover if it crashes. In the Proxmox/RHCS case, HA is just about getting the VM operational on a "sane" node by restarting it from the shared storage. I have an up-and-running solution from Stratus Technologies in which one server consists of 2 identical server blades that are fully replicated in terms of memory and I/O operations, so you get this stateful switchover. It would be nice if KVM went in this direction: http://www.stratus.com/Products/Platforms/ftServerSystems/ftServerArchitecture
 
