Replication and HA

May 24, 2022
Switzerland
Hello!


I need help clarifying a situation that I’m not sure how to manage!


Site A (in a datacenter)
I have a Proxmox server with local storage (local-lvm / local).
On this server, I have two production VMs that are always running.


Site B (in another datacenter)
I have another Proxmox server with the same local storage setup (local-lvm / local).
It has the same configuration as Site A.


Both sites are connected via VPN with a dedicated 10Gb fiber link.


I want to replicate my two VMs from Site A to Site B every 30 minutes, but they should remain powered off on Site B.


If Site B detects that Site A is down (due to a network failure or hardware failure), it should immediately power on the replicated VMs.


How can I achieve this?


  • Can this be managed with only two servers?
  • Do I need a third server elsewhere to handle quorum?
  • Is there a heartbeat function to detect failures?

Thanks in advance for your insights!
 
Hi @andaga ,

  • PVE currently has only one native replication method, and it requires ZFS storage, not LVM.
  • For this replication to work, the source and the target must be in the same PVE cluster.
  • A PVE cluster has a recommended minimum of 3 nodes (an odd number).
  • If you stretch a PVE cluster across sites, you must ensure very low latency between the sites; otherwise the cluster will not be stable.
  • You should not have a majority of nodes in a single site, i.e. a 2/1 split. With 3 nodes dispersed across sites, the 3rd node has to go into a site that is independent of node1 and node2.
  • If you implement all of the above, the HA subsystem will fail the VMs over when the primary node is lost (i.e. node2 and node3 survive and can communicate).
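
To make the quorum points above concrete, here is a minimal sketch of the majority rule, assuming one vote per node and a strict-majority quorum. The site names and the helper functions are hypothetical and only illustrate the arithmetic; this is not how PVE itself is queried.

```python
def has_quorum(total_votes: int, reachable_votes: int) -> bool:
    """A partition is quorate when it holds a strict majority of all votes."""
    return reachable_votes > total_votes // 2


def survives_site_loss(site_votes: dict[str, int], failed_site: str) -> bool:
    """Return True if the remaining sites still hold quorum after one site fails."""
    total = sum(site_votes.values())
    remaining = total - site_votes[failed_site]
    return has_quorum(total, remaining)


# Hypothetical layouts (assumption: one vote per node).
two_nodes = {"site_a": 1, "site_b": 1}                  # the setup from the question
three_sites = {"site_a": 1, "site_b": 1, "site_c": 1}   # 3rd node in an independent site
two_one_split = {"site_a": 2, "site_b": 1}              # majority of nodes in one site

if __name__ == "__main__":
    for name, layout in [("two nodes", two_nodes),
                         ("three sites", three_sites),
                         ("2/1 split", two_one_split)]:
        ok = survives_site_loss(layout, "site_a")
        print(f"{name}: quorum after losing site_a -> {ok}")
```

With only two nodes, or with two of three nodes in Site A, losing Site A leaves the survivor without a majority, which is why the third vote has to live somewhere independent.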

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi @andaga, as @bbgeek17 has described very well, the 'classic' means of clustering usually require a latency-minimized environment, typically a LAN. Looking at the 'big' (expensive) storage vendors, e.g. with metro clusters, you can push 'clustering' to its limit, where a magic RTT threshold defines how far 'metro' can stretch. I also remember some 'magic' buffering tricks with HDS GAD to go to the real limits of physics. But such (expensive) scenarios usually need a really good business case behind them, and VPN is really not advised (as bbgeek17 already mentioned). So async replication is imo what this is about. (Think hard about your RPO and RTO goals.)
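
As a rough aid for thinking about RPO and RTO with async replication, here is a small sketch of the arithmetic. Only the 30-minute interval comes from the question; the transfer, detection and startup times are made-up placeholders, and the model assumes each replication run finishes before the next one starts.

```python
from dataclasses import dataclass


@dataclass
class DrPlan:
    replication_interval_min: float  # time between replication runs
    transfer_time_min: float         # time one run needs to reach Site B (assumed)
    detection_time_min: float        # time until Site A is declared down (assumed)
    startup_time_min: float          # time to boot the VMs on Site B (assumed)

    def worst_case_rpo_min(self) -> float:
        """Worst case: Site A fails just before the next run finishes transferring,
        so the newest usable copy is one interval plus one transfer old."""
        return self.replication_interval_min + self.transfer_time_min

    def estimated_rto_min(self) -> float:
        """Time from failure until the VMs are running again on Site B."""
        return self.detection_time_min + self.startup_time_min


if __name__ == "__main__":
    plan = DrPlan(replication_interval_min=30, transfer_time_min=5,
                  detection_time_min=3, startup_time_min=4)
    print("worst-case RPO (min):", plan.worst_case_rpo_min())
    print("estimated RTO (min):", plan.estimated_rto_min())
```

If the worst-case RPO from such a calculation is more data loss than the business can accept, a shorter interval or application-level replication is worth a look.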

A / I would describe your Site B as a "semi-cold DR site". I would think about what could be achieved with a HW-PBS on a 3rd site, with a HW-PBS on both Site A and Site B (replicating via PBS mechanisms), or, for the braver ones, with a Virtual-PBS residing on your PVE itself (on Site A, Site B, or both!). But this would only solve the data availability part. I really think PVE / PBS is absolutely awesome, but Virtzilla has put millions if not billions of R&D money over the last 20 years into solving the kind of feature you ask for automagically. So you could indeed try to script (or let someone script) this "observability" plus automatic ramp-up on failure (beware of network glitches with VPN!). But there are a lot of caveats, and you have to do a lot of testing so that you don't accidentally end up with both Site A and Site B up and, in the worst case, an RDBMS creating transactions on both simultaneously (been there) -> therefore think about B.
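
To illustrate what such a script has to get right, here is a minimal sketch of the decision logic only, not a tested failover solution. It assumes two hypothetical probes: one that checks Site A from Site B, and one that asks an independent witness (e.g. a host at a 3rd location) whether it can still see Site A. The class name, return values and thresholds are all made up for illustration.

```python
from typing import Callable


class FailoverMonitor:
    """Decide when Site B may start the replicated VMs.

    Assumptions (hypothetical): `primary_alive` probes Site A from Site B,
    `witness_sees_primary` asks an independent third location.
    """

    def __init__(self,
                 primary_alive: Callable[[], bool],
                 witness_sees_primary: Callable[[], bool],
                 failures_required: int = 3) -> None:
        self.primary_alive = primary_alive
        self.witness_sees_primary = witness_sees_primary
        self.failures_required = failures_required
        self.consecutive_failures = 0
        self.failed_over = False

    def poll(self) -> str:
        # Never fail over twice; failing back is a deliberate manual step.
        if self.failed_over:
            return "already_failed_over"
        if self.primary_alive():
            self.consecutive_failures = 0
            return "ok"
        # A single missed probe may just be a VPN glitch, so count them.
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failures_required:
            return "suspect"
        # If the witness still reaches Site A, only our link is broken:
        # starting the VMs now would run them on both sites at once.
        if self.witness_sees_primary():
            return "link_problem"
        self.failed_over = True
        # The caller would start the VMs here, e.g. with `qm start <vmid>`.
        return "start_vms"


if __name__ == "__main__":
    probes = iter([True, False, False, False])
    monitor = FailoverMonitor(lambda: next(probes), lambda: False)
    print([monitor.poll() for _ in range(4)])
```

Note that this does not fence Site A: if Site A is merely isolated from both Site B and the witness, its VMs keep running, which is exactly the double-active case warned about above.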

B / The other option would be to use COTS backup & restore software with (async) replication. VEEAM (no affiliation) has very powerful (async) replication, fail-over and fail-back capabilities, into which they have also put a lot of R&D money, but I am unsure whether this feature (replication) is available for PVE. Other COTS vendors, e.g. ZERTO (no affiliation), have similar async replication capabilities. Another option (the best one imo):

C / Check whether your application supports (in-VM) application replication (breaking the "cold" paradigm on Site B). An ERP could support such scenarios, and an RDBMS might have this capability as well. Because it's the application you want to be "HA", this is imo always the best way (and the "cheapest" in terms of IT resources) to achieve resilience. But you would need application expertise and/or an application landscape which supports this kind of capability, which is why this topic is so often delegated to "Storage & Infra".

[Virtualization Support for SME and NGOs | DLI Solutions ]