Steps to bring a replicated VM online if a node fails

GuiltyNL
Jun 28, 2018
I have a 2 node cluster and I've replicated a VM from node A to node B.

I was expecting that the VM information itself would be replicated too. I'm used to that in my old Hyper-V solution. However it seems that only the disk is replicated and not the VM information?

I can see the replicated disk on node B. But I'm not sure how to convert that to a VM in case of disaster (like node A going down)?

Do I have to copy the VM .conf file to the folder of the still running node? (And first do a 'pvecm expected 1' to solve read-only issues?)

And what do I need to do when the failed node comes back online?
 
Hi,

For your problem, you can follow these steps:
- copy the VM's .conf file to the online node, but under a new VM ID
- rename the vHDD on the online node to match the new VM ID
- start the VM under the new VM ID
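The steps above can be sketched as shell commands. Everything here is an example under assumptions: the failed node is "nodeA", the surviving node is "nodeB", the original VM ID is 100, the new ID is 101, and the replicated disk lives in the ZFS dataset "rpool/data". Adjust all of these names to your environment before running anything.

```shell
# 1) Make the cluster filesystem writable on the lone surviving node
pvecm expected 1

# 2) Copy the VM config to the surviving node under a NEW VM ID (101)
cp /etc/pve/nodes/nodeA/qemu-server/100.conf \
   /etc/pve/nodes/nodeB/qemu-server/101.conf

# 3) Rename the replicated disk to match the new VM ID
zfs rename rpool/data/vm-100-disk-0 rpool/data/vm-101-disk-0

# 4) Point the new config at the renamed disk
sed -i 's/vm-100-disk-0/vm-101-disk-0/g' \
    /etc/pve/nodes/nodeB/qemu-server/101.conf

# 5) Start the VM under its new ID
qm start 101
```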

With these steps, even when the first node comes back online, replication will not overwrite the new VM ID that is running on the second node. The old VM will also not come up with the same IP as the VM now running on the second node.

By the way, all of these tasks could be automated with monit ;)
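As a rough illustration of the monit idea, a rule like the following on node B could ping node A and trigger a failover script when it stops responding. The address and the script path are hypothetical; the script itself would have to perform the steps listed above.

```
# Hypothetical monit rule on nodeB: watch nodeA and run a failover
# script (performing the copy/rename/start steps) if it goes down.
check host nodeA with address 192.168.1.10
    if failed ping then exec "/usr/local/bin/failover-vm100.sh"
```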
 
Hi there! Just for future reference, one way I've found to get behavior similar to Hyper-V's (much better, actually) is to combine the replication feature with HA. As of Proxmox VE 5.2 I have tested the following scenario:

- Enable replication for one particular VM.
- Create an HA Group containing your nodes.
- Add an HA resource and select your VM (I've chosen "started" as the request state).
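The same setup can also be done from the command line. The VM ID (100), group name ("ha-group"), node names, and schedule below are placeholders; adjust them to your cluster.

```shell
# Replicate VM 100 to nodeB every 15 minutes
pvesr create-local-job 100-0 nodeB --schedule "*/15"

# Create an HA group containing both nodes
ha-manager groupadd ha-group --nodes "nodeA,nodeB"

# Add the VM as an HA resource with requested state "started"
ha-manager add vm:100 --state started --group ha-group
```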

It appears that once High Availability is enabled for a VM, the .conf file for that particular VM is additionally kept somewhere available to all nodes in the cluster (or something along those lines). Once the node running the VM goes down, the HA feature will attempt to start the VM on another node.

I simulated a crash by simply powering off the node where the VM was running. After the "failure" was detected, the VM started on the node that had initially been chosen as the replication target.
When the original node came back up, the replication direction was reversed automatically and I didn't experience any problems. State of the art, I thought.

But keep in mind that it's a bit dangerous to run HA with just 2 nodes, because of the split-brain effect. In such a scenario you will have to run the "pvecm expected 1" command on your cluster, as mentioned by GuiltyNL above, otherwise "node B" will never come up if "node A" fails, because it won't be able to establish quorum. The best practice is to have at least 3 nodes. I believe you can configure replication to multiple nodes, with multiple schedules, which comes in handy for this setup.

If you have separate data disks/arrays, the Ceph implementation seems like a great option too.

Cheers,

Victor.
 
Victor,

I can also confirm that this works as you describe. Only one detail: you cannot move the VM back to the failed node once it is back online until the leftover replication snapshots are deleted manually.
 
