[Real Scenario] Risk of Data Loss After Node Reconnection with ZFS Replication Enabled (Proxmox Cluster)

Nov 19, 2024
Hi everyone,

I'd like to share a real scenario I’ve encountered on my Proxmox cluster and get your advice or feedback to improve my disaster recovery strategy.



Technical Setup

  • Proxmox VE cluster (latest stable version)
  • 2 nodes: pve01 and pve02
  • Each node uses local ZFS storage for VM/CT disks
  • ZFS replication enabled every 10 minutes using the built-in Proxmox replication tool, in both directions (pve01 → pve02 and pve02 → pve01); the jobs are sketched just below this list
  • Live migration between nodes is working fine
  • No shared storage, and manual HA/failover
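
For reference, I set those jobs up through the GUI; the CLI equivalent would look roughly like this (the VM IDs 100 and 200 and the job IDs are just examples for one guest per node, not my real IDs):

Code:
# A guest living on pve01 replicates its local ZFS disks to pve02 every 10 minutes
pvesr create-local-job 100-0 pve02 --schedule "*/10"
# A guest living on pve02 replicates to pve01 on the same schedule
pvesr create-local-job 200-0 pve01 --schedule "*/10"
# Check that the jobs run and when they last synced
pvesr status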



⚙️ Current Procedure When a Node Goes Down


If pve01 goes down, here’s how I currently handle the failover:
  1. I boot pve01 using a rescue system (e.g. SystemRescue), without rejoining it to the cluster right away.
  2. I manually replicate from pve01 to pve02 to transfer any remaining ZFS snapshots that weren't replicated before the crash (roughly the send/receive sketch after this list).
  3. I copy the VM/CT configuration files from pve01 to pve02 (/etc/pve/qemu-server/ or lxc/).
  4. I start the VMs/CTs on pve02, using the ZFS datasets that were replicated.
  5. The VMs are now running on pve02, and the most up-to-date data is on this node.
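
To make step 2 concrete, this is roughly what I run from the rescue system; a sketch only, assuming VM 100 on a pool called rpool, with @last-common and @newest standing in for the real replication snapshot names:

Code:
# Import the pool on the rescue system (hostid differs from the installed system, hence -f)
zpool import -f rpool
# List the snapshots that exist locally, newest last
zfs list -t snapshot -o name -s creation rpool/data/vm-100-disk-0
# Send everything newer than the last snapshot pve02 already received
zfs send -I rpool/data/vm-100-disk-0@last-common rpool/data/vm-100-disk-0@newest \
  | ssh root@pve02 zfs receive rpool/data/vm-100-disk-0
# (deliberately no -F on the receive: if pve02's copy had diverged I'd rather
# have the receive refuse than roll anything back)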



❓ My Concern


Before bringing pve01 back into the cluster, I want to avoid the following critical risk:

❗ When pve01 rejoins the cluster, the automatic ZFS replication might resume from pve01 to pve02, potentially overwriting recent data on pve02 with outdated snapshots from pve01.
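
What I'm tempted to do before letting pve01 rejoin is to pause or remove the stale jobs on the surviving node first, something like the following (the job ID is a placeholder, and I'm honestly not sure this is sufficient, hence the questions below):

Code:
# On pve02, before pve01 comes back online
pvesr list                    # see which replication jobs exist
pvesr disable 100-0           # pause the job that used to run pve01 -> pve02
# or drop it entirely and recreate it later in the correct direction
pvesr delete 100-0 --force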

So my questions are:
  • What is the best way to avoid this risk and make sure the correct data direction is preserved?
  • How can I safely reverse the replication direction, like what Proxmox does automatically during live VM migration?
  • Is there a recommended procedure or automation to handle this type of failover cleanly and protect the data?



Thanks in advance for your help and insights.
I believe this kind of issue could affect many Proxmox users running local ZFS with native replication and no shared storage.


Best regards,
 
As far as I know, ZFS can’t replicate incrementally unless the source and destination share a common snapshot, and it can’t do an active-active setup. So let’s say you’ve snapshotted and transferred snapshots 1-12, both sides disconnect, you make both sides ‘live’ and each side creates its own, different snapshot 13: you can’t then “merge” them from each other’s side, and you can’t send any subsequent snapshot to the other side either, because the snapshot chains no longer line up. If you do force the replication from one side to the other, any new snapshots/data differences on the destination will be rolled back to the last shared common state.

In all these scenarios, the receiving pool/volume is effectively ‘read only’ until something intervenes and makes it read/write. At that point you have two distinct pools/volumes that can no longer be replicated against each other: one of them must be selected to be destroyed and fully re-synced, or rolled back to the last common snapshot.
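
To make that concrete with a toy example (pool, host and snapshot names are made up, not taken from the setup above):

Code:
# Normal case: @snap1 exists on both sides after a replication
zfs snapshot tank/data@snap1
zfs send tank/data@snap1 | ssh backup zfs receive tank/data

# Split: the source keeps writing and takes @snap2, while the destination is
# made writable and gets its own changes
zfs snapshot tank/data@snap2
ssh backup "zfs set readonly=off tank/data && touch /tank/data/new-file"

# The next incremental send is refused, because the destination was modified
# since the last common snapshot
zfs send -i tank/data@snap1 tank/data@snap2 | ssh backup zfs receive tank/data

# Forcing it "works", but only by rolling the destination back to @snap1 first,
# discarding everything written there after the split
zfs send -i tank/data@snap1 tank/data@snap2 | ssh backup zfs receive -F tank/data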

You’re basically describing a split brain scenario, but with data. That is why any type of HA requires at least three nodes, so at least 2 can be in agreement on the state of the system. You should never ‘automatically’ do anything in a split brain scenario.
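
For a 2-node cluster specifically, the usual way to get that third vote without a third full node is an external QDevice; roughly, with a placeholder IP:

Code:
# On some small always-on machine outside the cluster
apt install corosync-qnetd
# On both cluster nodes
apt install corosync-qdevice
# From one of the cluster nodes, register the external vote
pvecm qdevice setup 192.0.2.10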
 
You’re basically describing a split brain scenario, but with data.
I only know of that kind of split brain. What would be one without data?

If pve01 goes down, here’s how I currently handle the failover:
  1. I boot pve01 using a rescue system (e.g. SystemRescue), without rejoining it to the cluster right away.
  2. I manually replicate from pve01 to pve02 to transfer any remaining ZFS snapshots that weren't replicated before the crash.
  3. I copy the VM/CT configuration files from pve01 to pve02 (/etc/pve/qemu-server/ or lxc/).
  4. I start the VMs/CTs on pve02, using the ZFS datasets that were replicated.
  5. The VMs are now running on pve02, and the most up-to-date data is on this node.
That's not a failover, that's more of a switch-over.

Normally a failover would be to activate the replicated VMs on pve02 without doing anything on pve01, and then try to fix pve01.
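
As a rough sketch of what I mean (VM ID, dataset and job names are examples, not taken from the post above):

Code:
# On pve02, while pve01 is down: a 2-node cluster has lost quorum, so make
# /etc/pve writable again before touching anything
pvecm expected 1

# Claim the guest: its config still lives under pve01's node directory
mv /etc/pve/nodes/pve01/qemu-server/100.conf /etc/pve/nodes/pve02/qemu-server/

# Start it from the last replicated state; anything written on pve01 after the
# last replication run is accepted as lost
qm start 100

# Only after pve01 is repaired and back in the cluster: make sure replication
# now runs pve02 -> pve01 and never the old direction
pvesr delete 100-0 --force
pvesr create-local-job 100-0 pve01 --schedule "*/10"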