Hi everyone,
I'd like to share a real scenario I’ve encountered on my Proxmox cluster and get your advice or feedback to improve my disaster recovery strategy.
Technical Setup
- Proxmox VE cluster (latest stable version)
- 2 nodes: pve01 and pve02
- Each node uses local ZFS storage for VM/CT disks
- ZFS replication every 10 minutes in both directions (pve01 → pve02 and pve02 → pve01), using the built-in Proxmox replication tool (example job commands after this list)
- Live migration between nodes is working fine
- No shared storage; HA/failover is handled manually
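For reference, the replication jobs were set up with the built-in tool, roughly along these lines (VMIDs 100/101 and the job IDs are placeholders, not my real values):

```bash
# On pve01: replicate VM 100 to pve02 every 10 minutes
pvesr create-local-job 100-0 pve02 --schedule "*/10"

# On pve02: replicate VM 101 back to pve01 on the same schedule
pvesr create-local-job 101-0 pve01 --schedule "*/10"

# Check job definitions and the last sync results
pvesr list
pvesr status
```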
Current Procedure When pve01 Goes Down
If pve01 goes down, here’s how I currently handle the failover:
- I boot pve01 using a rescue system (e.g. SystemRescue), without rejoining it to the cluster right away.
- I manually replicate from pve01 to pve02 to transfer any remaining ZFS snapshots that weren't replicated before the crash.
- I copy the VM/CT configuration files from pve01 to pve02 (/etc/pve/qemu-server/ and /etc/pve/lxc/).
- I start the VMs/CTs on pve02, using the ZFS datasets that were replicated.
- The VMs are now running on pve02, and the most up-to-date data is on this node (rough command sketch after this list).
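In shell terms, the manual part of the failover looks roughly like this (pool/dataset names, snapshot names and VMID 100 are placeholders; the replication snapshot name is just an example of the __replicate_* naming Proxmox uses):

```bash
# --- On pve01, booted from the rescue system ---
# Import the pool without mounting its datasets
zpool import -N rpool

# Snapshot the current (post-crash) state of the guest disk
zfs snapshot rpool/data/vm-100-disk-0@failover

# Find the newest snapshot that pve02 already has (the last __replicate_* one)
zfs list -t snapshot -o name,creation -s creation rpool/data/vm-100-disk-0

# Send everything between that snapshot and @failover to pve02
zfs send -I rpool/data/vm-100-disk-0@__replicate_100-0_1718000000__ \
            rpool/data/vm-100-disk-0@failover \
  | ssh root@pve02 zfs recv -F rpool/data/vm-100-disk-0

# --- On pve02 ---
# With pve01 down, the 2-node cluster has no quorum, so /etc/pve is read-only
# until the expected vote count is lowered
pvecm expected 1

# Move the config so the guest belongs to pve02, then start it
mv /etc/pve/nodes/pve01/qemu-server/100.conf /etc/pve/nodes/pve02/qemu-server/
qm start 100          # pct start 100 for containers
```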
My Concern
Before bringing pve01 back into the cluster, I want to avoid the following critical risk:
When pve01 rejoins the cluster, the automatic ZFS replication might resume from pve01 to pve02, potentially overwriting recent data on pve02 with outdated snapshots from pve01.
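So far the only mitigation I can think of is to disable (or delete) the old pve01 → pve02 jobs before pve01 comes back, and to double-check which side actually holds the newest snapshots, something like this (job ID and dataset are placeholders again):

```bash
# On pve02, while pve01 is still out of the cluster:
# stop the old pve01 -> pve02 replication job
pvesr disable 100-0       # or: pvesr delete 100-0

# Sanity check: confirm pve02 really holds the most recent snapshots
zfs list -t snapshot -o name,creation -s creation rpool/data/vm-100-disk-0
```

But I'm not sure this is enough, or whether it's the recommended approach.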
So my questions are:
- What is the best way to avoid this risk and make sure the correct data direction is preserved?
- How can I safely reverse the replication direction, like Proxmox does automatically during live VM migration? (See the sketch after these questions for what I mean.)
- Is there a recommended procedure or automation to handle this type of failover cleanly and protect the data?
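To make the second question concrete: by "reversing the direction" I mean something equivalent to the following (placeholder IDs again), which is what live migration seems to do for me automatically:

```bash
# Drop the old job that replicated VM 100 from pve01 to pve02...
pvesr delete 100-0

# ...and recreate it in the opposite direction, now that the VM lives on pve02
pvesr create-local-job 100-0 pve01 --schedule "*/10"
```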
Thanks in advance for your help and insights!
I believe this kind of issue could affect many Proxmox users working with local ZFS and native replication, without shared storage.
Best regards,