Fully Failed Node A. Guest VM Was Replicated to Node B. How to Bring Up on B?

forbin

I am using PVE 8. I have three nodes, A, B, and C. Guest VMs on A were being replicated every 15 minutes to B. Node A failed.

Per the documentation here: https://pve.proxmox.com/wiki/Storage_Replication

To recover...
  • move both guest configuration files from the origin node A to node B:
    # mv /etc/pve/nodes/A/qemu-server/100.conf /etc/pve/nodes/B/qemu-server/100.conf
    # mv /etc/pve/nodes/A/lxc/200.conf /etc/pve/nodes/B/lxc/200.conf
However, as stated, node A is fully failed and does not boot. How can we bring up the replicated guest on B?
 
For future readers: you do the move of the config on the running node (node B here); the failed node does not need to be reachable.
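A minimal sketch of the recovery, run entirely on node B (assuming the same VMID 100 from the documentation example and that the replicated ZFS volumes are already present on B):

    # mv /etc/pve/nodes/A/qemu-server/100.conf /etc/pve/nodes/B/qemu-server/100.conf
    # qm start 100

This works with node A still offline because /etc/pve is the cluster filesystem (pmxcfs), which stays writable on the surviving nodes as long as they keep quorum.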
Is there a way to do it through the GUI now, or is it still just a file move from the command line as indicated in the documentation?
 
It depends on whether you use ZFS or not. With ZFS, yes: you can set up replication via the UI and put the VM in HA mode. Both pools need to have the same name.
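Roughly, the CLI equivalent of that setup would be something like the following (job ID 100-0, target node B and the 15-minute schedule are only examples; the same can be configured in the web UI under the guest's Replication tab and Datacenter > HA):

    # pvesr create-local-job 100-0 B --schedule "*/15"
    # ha-manager add vm:100 --state started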
 
There is a risk of data loss with ZFS replication and HA:
1. You will lose the data written between the last replication and HA starting the guest on the other node.
2. If ZFS replication was stopped, hanging, or crashed (for whatever reason), you might lose significantly more data: if your first node comes back up, ZFS replication might overwrite the newer data on the first node.

Not wanting to scare you, but if you go with HA plus ZFS replication, do monitoring to catch failing replication.
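For the monitoring part, a crude cron-able check could look like this; it assumes a healthy job reports "OK" in the State column of pvesr status (verify against your own output first) and that mail delivery is set up on the node:

    # pvesr status
    # pvesr status | awk 'NR>1 && $NF != "OK"' | grep -q . && echo "PVE replication job not OK" | mail -s "replication alert" root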
 
I can see why #1 would happen, and that is acceptable. However, #2 could be a problem. Let's explore that scenario.

Node A is HA primary and replicating to B.
Replication hangs, crashes, or whatever and nobody notices for days. The data on B gets very old.
Node A crashes.
Node B becomes HA primary with old data.
Node A comes back up.

Why would zfs replicate the old data from B back to A? Isn't there some kind of split-brain mechanism to prevent that?
 
15:00 ZFS replication crashes
16:00 Host A goes down, HA starts the VM on B with data from 15:00
17:00 Admin restarts A and erroneously restarts ZFS replication; B syncs to A
The data written between 15:00 and 16:00 is overwritten

I am not saying it will happen, but it can happen ;)
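If you want to guard against that specific failure mode, one option is to pause the replication jobs on the surviving node before node A rejoins the cluster, and re-enable them only after you have checked which side holds the newer data (100-0 is an example job ID; list yours first):

    # pvesr list
    # pvesr disable 100-0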
 
