DR Options Using Proxmox

Tacioandrade

Sep 14, 2012
Vitória da Conquista, Brazil
Hi everyone,


I've been a Proxmox user since version 2.3, and it has been my go-to virtualization platform since version 3.0. I'm truly impressed with how far Proxmox has come in such a short time, especially from version 6.0 onward.

Over the years, my projects have grown — what used to be all single-node setups (sometimes with HA, sometimes without) have now evolved into full cluster environments. Lately, I've started experimenting with multi-datacenter projects for Disaster Recovery (DR).


Recently, I've been approached with a few potential DR projects using Proxmox — typically involving two datacenters (one at headquarters and another at a branch), or even small ISPs with multiple PoPs (Points of Presence), looking to keep their infrastructure running in case one site goes offline.


The general idea, based on client discussions, would be to use two separate clusters with no direct link between them, so that if one environment gets compromised (e.g., by an intrusion), the backup site remains unaffected.


I started doing some lab testing using the qm remote-migrate command, which works great for migration, but not so well for replication, for a few reasons:

  1. It always performs a full migration of the VM — in my tests, it doesn’t do incremental transfers like ZFS replication does;
  2. It doesn’t seem to leave the VM “on standby” at the destination — it migrates and immediately boots it on the target host;
  3. The destination VM ends up with Auto-Start enabled, meaning that after a power failure, it might automatically start on the other side, which could cause issues.
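For reference, the tests above used invocations along these lines (a sketch only; the host, API token, fingerprint, and storage names are placeholders, not real values):

```shell
# Migrate VM 100 to a remote, un-clustered PVE host over the API.
# The token secret and TLS fingerprint below are placeholder examples.
qm remote-migrate 100 100 \
  'host=203.0.113.10,apitoken=PVEAPIToken=root@pam!dr=aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee,fingerprint=AA:BB:CC:DD:EE:FF:00:11:22:33:44:55:66:77:88:99:AA:BB:CC:DD:EE:FF:00:11:22:33:44:55:66:77:88:99' \
  --target-bridge vmbr0 \
  --target-storage local-zfs \
  --online
```

As noted, each run transfers the full disk image and starts the VM on the target, which is what makes it a poor fit for recurring replication.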

I also looked into manual ZFS replication between clusters, but the main drawback is that it requires scripting and lacks an easy-to-manage interface for less experienced Linux/Shell administrators.
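To illustrate what that manual approach involves, here is a minimal sketch of an incremental ZFS send between sites (pool, dataset, snapshot names, and the remote host are all assumptions; a real script would also handle the initial full send, snapshot rotation, and error checking):

```shell
# Snapshot the VM disk, then ship only the delta since the last
# replicated snapshot to the DR site. "dr-prev" must already exist
# on both sides from the previous run.
zfs snapshot rpool/data/vm-100-disk-0@dr-new

zfs send -i rpool/data/vm-100-disk-0@dr-prev \
            rpool/data/vm-100-disk-0@dr-new \
  | ssh root@site-b zfs receive -F rpool/data/vm-100-disk-0

# Rotate so the next run can send an increment from "dr-new".
zfs destroy rpool/data/vm-100-disk-0@dr-prev
zfs rename  rpool/data/vm-100-disk-0@dr-new \
            rpool/data/vm-100-disk-0@dr-prev
ssh root@site-b "zfs destroy rpool/data/vm-100-disk-0@dr-prev; \
                 zfs rename rpool/data/vm-100-disk-0@dr-new \
                            rpool/data/vm-100-disk-0@dr-prev"
```

Even this simplified version shows why it is hard to hand off to less experienced admins: every VM needs its own job, and failures must be diagnosed on the command line.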


Finally, I noticed that Veeam offers a VMware replication feature that might be adaptable for this use case. However, I have no real experience with Veeam (I've always used Proxmox, and before that XenServer), so I’m not sure if it’s a practical or cost-effective option.


That’s why I’m opening this thread — to discuss with the community the best ways to approach DR using Proxmox, and to hear from anyone who’s implemented similar projects. I’d like to understand what worked for you and what didn’t, to see what might fit my own environment.


Thanks in advance to everyone who contributes to the discussion!
 
Perhaps the very new (unreleased!) Datacenter Manager can help you in some regard: https://forum.proxmox.com/threads/proxmox-datacenter-manager-0-9-beta-released.171742/
Thank you very much for participating in this discussion. In this case, I believe PDM isn't the best option for this at this time. I say this because I'm using it in the lab and have even used it for some production migrations, but it only performs full VM migrations.

There's no way to send only an incremental delta of the VM. Migrating an entire datacenter's VMs back and forth once a day would wear out the SSDs within a few months: imagine rewriting a pool of 1 TB or more of data daily. Furthermore, the bandwidth required between the two datacenters would also be very high, which makes it unfeasible for most projects.
 
Maybe utilize PBS in this case. If you have the backup jobs set up correctly, they only take a few seconds to run after the initial sync. It's not minute-by-minute sync or anything, though. Create a PBS on each site and either (A) back up the VMs directly to the other site's PBS, or (B) back up to the local PBS and have a pull or push job send it to the other site's PBS.

Then you can just pull the VM up from a backup on the PBS. If you use a local PBS server you can actually just do a live-restore, so you can run the VMs directly off the backup storage and it'll migrate silently in the background back to prod storage.

There's no automation for this, but it would be significantly simpler to automate compared to a home-grown ZFS send setup.
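A rough sketch of that setup, assuming a PBS instance at each site (remote name, datastore names, storage ID, VM ID, and the snapshot timestamp are placeholders):

```shell
# On the DR-site PBS: register the primary site's PBS as a remote
# and pull its backups over on a schedule.
proxmox-backup-manager remote create site-a \
  --host 203.0.113.20 --auth-id sync@pbs --password 'CHANGEME'
proxmox-backup-manager sync-job create pull-from-a \
  --remote site-a --remote-store main --store dr-store \
  --schedule hourly

# On a PVE node at the DR site: live-restore VM 100 straight from the
# local PBS storage, so it boots immediately while the data streams
# back to production storage in the background.
qmrestore pbs-dr:backup/vm/100/2024-01-01T00:00:00Z 100 \
  --live-restore 1 --storage local-zfs
```

The live-restore step is what shortens the recovery window compared to a conventional full restore, since the VM is usable before the transfer finishes.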
 
Yes, I currently work that way: I back up twice a day and sync the local PBS to the remote PBS. The problem is that it would take at least 4 to 6 hours (optimistically) to restore the main VMs on the other side so I could migrate the routes/DNS and continue working.

That's why I was thinking about something like the pve-zsync mentioned by the friend above, where I would incrementally clone the VM back and forth.
 
Hi @Tacioandrade ,

Take a look at pve-zsync. It can do ZFS replication (which is incremental) from host A to host B, for any VM/CT (apt install pve-zsync).

Then create on destination B the corresponding config/definition of your replicated VM/CT.

https://pve.proxmox.com/wiki/PVE-zsync
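A minimal sketch of setting up such a job (the destination host and pool/dataset path are assumptions for illustration):

```shell
# Replicate VM 100 to the DR host on pve-zsync's default 15-minute
# cron interval, keeping the last 7 snapshots on each side.
pve-zsync create --source 100 --dest 203.0.113.10:rpool/data \
  --name dr --maxsnap 7 --verbose

# Check the state of configured sync jobs:
pve-zsync list
```

After the first full transfer, subsequent runs send only ZFS deltas, which addresses both the SSD-wear and bandwidth concerns raised earlier in the thread.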

Good luck!
Yes, pve-zsync does seem like the best option; it's essentially the script-based ZFS replication I mentioned in the initial post.

The problem is that other users of the system would need deeper shell knowledge to perform these tasks. A level-2 (N2) support tech, for example, wouldn't be able to add a replication task for a newly created VM or analyze the logs for potential issues, leaving me to debug whenever problems occur.
 
Hi Tacioandrade,

if you're looking for a solution involving PVE clusters with Ceph enabled, a snapshot-based Ceph RBD mirror could be an option: https://pve.proxmox.com/wiki/Ceph_RBD_Mirroring

We are testing the journal-based mirror for a customer at the moment and it seems to work fine.
One thing which should also be synced is the <vmid>.conf files in /etc/pve/nodes/<clusternode_name>/qemu-server/ ;
otherwise you end up with just the disks synced and not the VM configuration.
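One possible way to handle the config side, sketched here under assumed hostnames and paths: copy the guest definitions to a staging directory on the mirror cluster rather than into /etc/pve directly, since placing them under /etc/pve/nodes/.../qemu-server/ would immediately register the VMs there.

```shell
# Run from a node in the primary cluster: stage the VM definitions
# on the DR cluster without activating them.
rsync -av /etc/pve/nodes/pve-a1/qemu-server/ \
  root@pve-b1:/root/dr-staged-configs/qemu-server/

# During an actual failover on the DR side, activate a guest by
# moving its definition into place:
#   mv /root/dr-staged-configs/qemu-server/100.conf \
#      /etc/pve/nodes/pve-b1/qemu-server/
```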

You can try to achieve active-passive behavior by disabling the autostart option in the VM configuration and configuring HA in the Datacenter section. As long as you are not syncing /etc/pve/ha/resources.cfg,
the guests in the target cluster should stay stopped.

BR, Lucas

PS: I am not sure if this is a proper way to counter a compromise, but that might also depend on the environment and use case. :)