Fully Failed Node A. Guest VM Was Replicated to Node B. How to Bring Up on B?

forbin

Member
Dec 16, 2021
I am using PVE 8. I have three nodes, A, B, and C. Guest VMs on A were being replicated every 15 minutes to B. Node A failed.

Per the documentation here: https://pve.proxmox.com/wiki/Storage_Replication

To recover...
  • move both guest configuration files from the origin node A to node B:
    # mv /etc/pve/nodes/A/qemu-server/100.conf /etc/pve/nodes/B/qemu-server/100.conf
    # mv /etc/pve/nodes/A/lxc/200.conf /etc/pve/nodes/B/lxc/200.conf
However, as stated, node A is fully failed and does not boot. How can we bring up the replicated guest on B?
 
For future readers, you do the move of the config on the running node
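For example, with the VMIDs from the documentation snippet above (VM 100, container 200), the recovery looks roughly like this, run on node B. /etc/pve is the cluster-wide pmxcfs, so node A's config directory is still visible from B even while A is down; this sketch assumes the replicated disks already exist on B's pool:
    # mv /etc/pve/nodes/A/qemu-server/100.conf /etc/pve/nodes/B/qemu-server/100.conf
    # mv /etc/pve/nodes/A/lxc/200.conf /etc/pve/nodes/B/lxc/200.conf
    # qm start 100
    # pct start 200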
 
Is there a way to do it through the GUI now, or is it still just a file move from the command line as indicated in the documentation?
 
It depends on whether you use ZFS or not. If ZFS, yes: you can set up replication via the UI and put the VM in HA mode; both pools need to have the same name on both nodes.
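On the CLI, the rough equivalent would be something like the following (job ID 100-0, target node B, and the 15-minute schedule are just example values matching this thread):
    # pvesr create-local-job 100-0 B --schedule "*/15"
    # ha-manager add vm:100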
 
There is a risk of data loss with ZFS replication and HA:
1. You will lose the data written between the last replication and the HA switch to the other node.
2. If ZFS replication was stopped, hanging, or crashed (for whatever reason), you might lose significantly more data: if your first node comes back up, ZFS replication might overwrite the newer data on the first node.

Not wanting to scare you, but keep this in mind when combining HA with ZFS replication, and do monitoring to catch failing replication.
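As a rough monitoring sketch, something like this could run from cron on the node that owns the guests and mail you when a job reports failures (the FailCount column position in the pvesr status output and the mail address are assumptions, adjust to your setup):
    # pvesr status | awk 'NR>1 && $7 != "0"' > /tmp/pvesr-failing
    # [ -s /tmp/pvesr-failing ] && mail -s "PVE replication failing" admin@example.com < /tmp/pvesr-failing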
 
I can see why #1 would happen, and that is acceptable. However, #2 could be a problem. Let's explore that scenario.

Node A is HA primary and replicating to B.
Replication hangs, crashes, or whatever and nobody notices for days. The data on B gets very old.
Node A crashes.
Node B becomes HA primary with old data.
Node A comes back up.

Why would zfs replicate the old data from B back to A? Isn't there some kind of split-brain mechanism to prevent that?
 
15:00 zfs repl crashes
16:00 host A goes down, ha starts vm on B with data from 15:00
17:00 Admin restarts A, erroneously restarts ZFS replication, B syncs to A
Data between 15:00 and 16:00 is overwritten

I am not saying it will happen, but it can happen ;)
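One way to guard against that overwrite: before letting a revived node A touch anything, disable the replication job and compare the ZFS snapshots on both sides so you know which copy is newer (the job ID and dataset name below are just illustrative examples, not from this thread):
    # pvesr disable 100-0
    # zfs list -t snapshot -o name,creation rpool/data/vm-100-disk-0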
 
