Steps to bring a replicated VM online if a node fails

GuiltyNL
Jun 28, 2018
I have a 2 node cluster and I've replicated a VM from node A to node B.

I was expecting that the VM information itself would be replicated too. I'm used to that in my old Hyper-V solution. However it seems that only the disk is replicated and not the VM information?

I can see the replicated disk on node B. But I'm not sure how to convert that to a VM in case of disaster (like node A going down)?

Do I have to copy the VM .conf file to the folder of the still running node? (And first do a 'pvecm expected 1' to solve read-only issues?)

And what do I need to do when the failed node comes back online?
 
Hi,

For your problem, you can follow these steps:
- copy the VM's .conf file to the online node, but under a new VM ID
- rename the vHDD on the online node to match the new VM ID
- start the VM under the new VM ID
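The steps above can be sketched as shell commands. Everything here is an example under assumptions: the failed node is "nodeA", the surviving node is "nodeB", the original VM ID is 100, the new ID is 101, and the replicated disk lives in the ZFS dataset "rpool/data". Adjust all of these names to your environment before running anything.

```shell
# 1) Make the cluster filesystem writable on the lone surviving node
pvecm expected 1

# 2) Copy the VM config to the surviving node under a NEW VM ID (101)
cp /etc/pve/nodes/nodeA/qemu-server/100.conf \
   /etc/pve/nodes/nodeB/qemu-server/101.conf

# 3) Rename the replicated disk to match the new VM ID
zfs rename rpool/data/vm-100-disk-0 rpool/data/vm-101-disk-0

# 4) Point the new config at the renamed disk
sed -i 's/vm-100-disk-0/vm-101-disk-0/g' \
    /etc/pve/nodes/nodeB/qemu-server/101.conf

# 5) Start the VM under its new ID
qm start 101
```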

With these steps, even when the first node comes back online, replication will not overwrite the new VM ID that is running on the second node. The old VM will also not come up with the same IP as the VM now running on the second node.

By the way, all of these tasks could be automated with monit ;)
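As a rough illustration of the monit idea, a rule like the following on node B could ping node A and trigger a failover script when it stops responding. The address and the script path are hypothetical; the script itself would have to perform the steps listed above.

```
# Hypothetical monit rule on nodeB: watch nodeA and run a failover
# script (performing the copy/rename/start steps) if it goes down.
check host nodeA with address 192.168.1.10
    if failed ping then exec "/usr/local/bin/failover-vm100.sh"
```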
 
Hi there! Just for future reference, one way I've found to get behavior similar to Hyper-V's (much better, actually) is to combine the replication feature with HA. As of Proxmox VE 5.2 I have tested the following scenario:

- Enable replication for one particular VM.
- Create an HA Group containing your nodes.
- Add an HA resource and select your VM (I've chosen "started" as the request state).
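The same setup can also be done from the command line. The VM ID (100), group name ("ha-group"), node names, and schedule below are placeholders; adjust them to your cluster.

```shell
# Replicate VM 100 to nodeB every 15 minutes
pvesr create-local-job 100-0 nodeB --schedule "*/15"

# Create an HA group containing both nodes
ha-manager groupadd ha-group --nodes "nodeA,nodeB"

# Add the VM as an HA resource with requested state "started"
ha-manager add vm:100 --state started --group ha-group
```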

It appears that once High Availability is enabled for a VM, the .conf file for that particular VM is additionally kept somewhere available to all nodes in the cluster (or something along those lines). Once the node running the VM goes down, the HA feature will attempt to start the VM on another node.

I simulated a crash by simply powering off the node where the VM was running. After the "failure" was detected, the VM started on the node that had initially been chosen as the replication target.
When the original node came back up, the replication direction was reversed automatically and I didn't experience any problems. State of the art, I thought.

But keep in mind that it's a bit dangerous to run HA with just 2 nodes, because of the split-brain effect. In such a scenario you will have to run the "pvecm expected 1" command on your cluster, as mentioned by GuiltyNL above, otherwise "node B" will never come up if "node A" fails, because it won't be able to establish quorum. The best practice is to have at least 3 nodes. I believe you can configure replication to multiple nodes, with multiple schedules, which comes in handy for this setup.

If you have separate data disks/arrays, the Ceph implementation seems like a great option too.

Cheers,

Victor.
 
Victor,

I can also confirm that this works as you describe. Only one detail: you cannot move the VM back to the failed node once it is back online until the leftover replication snapshots are deleted manually.
 
