[SOLVED] How does Replication work in a HA environment?

Razva

Let's take a simple scenario. A VM is stored on Node1. A Replication rule is created, so the VM is replicated to Node2. The VM is HA protected. Replication runs once every 5 minutes.

Node1 goes offline, so the VM is moved to Node2 by HA. After 60 minutes Node1 comes back online and is accepted back into the cluster. The VM remains on Node2 and is not manually migrated back to Node1.

Here are some questions:
- after Node1 is back in the cluster, and is left there, will data from Node1 continue to be replicated to Node2 (replacing what is there)? In that case the VM would be "continuously overwritten" with old data from Node1 (from before the outage), which is a very bad idea;
- after Node1 is back in the cluster, is there any way for the VM to be automatically replicated from Node2 to Node1? Note that no rule was initially created to replicate the VM from Node2 to Node1 (obviously), only from Node1 to Node2. So is Proxmox "smart enough" to "speak" with the HA services and automatically "reverse direction" for the Replication rule?
 
Please check out the documentation, it may already answer your questions.
High-Availability is allowed in combination with storage replication, but it has the following implications:
  • redistributing services after a more preferred node comes online will lead to errors.
  • recovery works, but there may be some data loss between the last synced time and the time a node failed.
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_pvesr
 
Thanks for the link, already read that. It states:
For example: VM100 is currently on nodeA and gets replicated to nodeB. You migrate it to nodeB, so now it gets automatically replicated back from nodeB to nodeA.
This scenario describes a manual migration. Does it also apply to automatic failover/HA?
 
- after Node1 is back in the cluster, and is left there, will data from Node1 continue to be replicated to Node2 (replacing what is there)?
No, replication stops for as long as the VM is not moved back.
If you move the VM back to the source node, you get the overwrite effect.
So be careful with HA priority and failback.
But this behavior will be fixed.

- after Node1 is back in the cluster, is there any way for the VM to be automatically replicated from Node2 to Node1?
Not for now.
You have to destroy the job and create a new one.

Side note: HA does not work with 2 nodes. So if your scenario has only 2 nodes, you need to add a third member.
This third member can be a QDevice.
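
For illustration, the "destroy the job and make a new job" step could look roughly like the sketch below. It only builds the pvesr command lines and does not execute anything. The VM ID 100, the job IDs 100-0 and 100-1, the node name node1, and the helper name are hypothetical placeholders, and the sketch assumes the new job is created on the node where the VM now runs (Node2 in the scenario). Because `pvesr delete` only marks a job for removal, the sketch uses a different job ID for the new job; that is a cautious assumption, not a documented requirement.

```python
def reverse_replication_commands(old_job_id, new_job_id, new_target, schedule="*/5"):
    """Build (but do not run) pvesr commands that replace a replication job.

    All arguments are placeholders chosen by the caller. The commands are
    meant to be run on the node that currently hosts the VM.
    """
    return [
        # Remove the old job that pointed away from the original source node.
        ["pvesr", "delete", old_job_id],
        # Create a new local job that replicates towards the old source node.
        ["pvesr", "create-local-job", new_job_id, new_target, "--schedule", schedule],
    ]


# Hypothetical example: VM 100 now runs on Node2 and should replicate to node1.
for command in reverse_replication_commands("100-0", "100-1", "node1"):
    print(" ".join(command))
```

Check the result with `pvesr status` afterwards before relying on the new job.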
 
No, replication stops for as long as the VM is not moved back.
If you move the VM back to the source node, you get the overwrite effect.
So be careful with HA priority and failback.
But this behavior will be fixed.
Great! Any way to check the status of this issue?

Not for now.
You have to destroy the job and create a new one.
Got it, thanks! Maybe fix this along the way while working on the previous ^^^ issue?

Side note: HA does not work with 2 nodes. So if your scenario has only 2 nodes, you need to add a third member. This third member can be a QDevice.
Yes, the production cluster has 3 members. Thanks anyway for the heads-up!
 
Can you specify exactly what you mean with "too often".
Don't know...1-2 minutes? I'm curious what happens if a Replication task wasn't able to finish and new ones are coming. Like...Replication process takes longer than the period between replication. I'm obviously not going to do it on purpose, but who knows.
 
pvesr is started every minute.
But if the previous pvesr run is still in progress (job(s) not finished), the new run is skipped.
So there is never more than one run active at a time.
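
The skip rule described above can be pictured with a small simulation. This is only a toy model of "start every minute, skip if the previous run is still going", not pvesr's actual code; the function name and the run durations are made up for the example.

```python
def simulate_ticks(run_durations, total_minutes):
    """Simulate a timer that fires once per minute.

    run_durations: how many minutes each successive run takes (hypothetical).
    Returns a list of (minute, "start" | "skip") events.
    """
    events = []
    busy_until = 0  # minute at which the current run finishes
    durations = iter(run_durations)
    for minute in range(total_minutes):
        if minute < busy_until:
            # The previous run has not finished, so this tick does nothing.
            events.append((minute, "skip"))
            continue
        # Runs beyond the supplied list are assumed to take one minute.
        busy_until = minute + next(durations, 1)
        events.append((minute, "start"))
    return events


# A run that takes 3 minutes causes the next two ticks to be skipped.
print(simulate_ticks([3, 1], 5))
# [(0, 'start'), (1, 'skip'), (2, 'skip'), (3, 'start'), (4, 'start')]
```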
 
Understood. So there's no chance of data corruption because of two replication tasks running at the same time or stuff like that, right?
 
pvesr does serial replication.
The problem with parallel replication is that bandwidth limiting is very hard to handle.
 
This depends on your NIC bandwidth and how much you write (change) on the images.
A static web server VM can be replicated once a day, but a mail server should be replicated every minute.
 
I totally agree, but if I set 5 VMs to replicate every minute...how will all 5 VMs get replicated, given that there's no parallel replication?

Let's take two scenarios, 5 VMs replicating each minute.

Scenario 1:
- minute 1 => VM1 replicated, VM2/3/4/5 not replicating because VM1 is first
- minute 2 => VM1 replicated, VM2/3/4/5 not replicating because VM1 is first
- minute 3 => VM1 replicated, VM2/3/4/5 not replicating because VM1 is first
[etc...]
Result: if 5 VMs are all set to be replicated each minute, only one VM will be replicated.

Scenario 2:
- minute 1 => VM1 replicated, VM2/3/4/5 in queue
- minute 2 => VM2 replicated, VM1 not replicated because it was replicated last minute, VM3/4/5 in queue
- minute 3 => VM3 replicated, VM1/2 not replicated because they were replicated recently, VM4/5 in queue
- minute 4 => VM4 replicated, VM1/2/3 not replicated because they were replicated recently, VM5 in queue
- minute 5 => VM5 replicated, VM1/2/3/4 not replicated because they were replicated recently
- minute 6 => VM1 replicated, VM2/3/4/5 in queue
[etc...]
Result: if 5 VMs are all set to be replicated each minute, each VM will be replicated once every 5 minutes, even though it's set to once per minute.

Which scenario is the current one?
 
You misunderstood.
pvesr first calculates which VMs are due for replication now.
The run is finished when all of those VMs have been replicated.
Then the next round starts.

The problem is that if you have a large VM that needs 5 hours to sync, no other VM can be synced during that time.
This should normally happen only on the initial sync.
If it happens during regular operation, your network is too slow.
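
To make the round-based behavior concrete, here is a toy simulation of it. It is a sketch under simplifying assumptions, not pvesr's real scheduler: time is counted in seconds, every VM uses the same interval, a VM is "due" once the next interval boundary after its last sync start has passed, and the sync durations are invented numbers.

```python
def simulate_rounds(sync_seconds, interval=60, horizon=180, tick=60):
    """Simulate serial replication rounds.

    sync_seconds: dict mapping a VM name to how long one sync takes (made up).
    Returns a list of (start_second, vm) entries in execution order.
    """
    last_start = {vm: None for vm in sync_seconds}
    log = []
    clock = 0
    while clock < horizon:
        # Step 1: decide up front which VMs are due in this round.
        due = []
        for vm in sorted(sync_seconds):
            started = last_start[vm]
            if started is None or clock >= (started // interval + 1) * interval:
                due.append(vm)
        # Step 2: replicate the due VMs one after another (serially).
        for vm in due:
            last_start[vm] = clock
            log.append((clock, vm))
            clock += sync_seconds[vm]
        # Step 3: wait for the next timer tick; ticks that fell inside the
        # round (or exactly at its end) count as skipped.
        clock = (clock // tick + 1) * tick
    return log


# Five small VMs on a one-minute schedule all sync in every round.
small = simulate_rounds({f"VM{i}": 5 for i in range(1, 6)})
print(len(small))  # 15 syncs: 5 VMs x 3 rounds

# One slow VM delays everything queued behind it.
print(simulate_rounds({"VM1": 300, "VM2": 5}, horizon=360))
# [(0, 'VM1'), (300, 'VM2')]
```

In this model neither of the two scenarios from the question occurs: as long as a whole round fits into the interval, every VM is replicated every minute, and only a sync that takes longer than the interval pushes the other VMs back.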
 
