Cluster between DCs far away?

Deleted member 93625 (Guest)
Hi all,

I'd like to ask a question about a scenario I'm planning. Below is the connection diagram.

[Attachment: Diagram.png]

The physical distance between data centres is about 1000 km. In this environment,
  1. All PVE nodes are in one cluster.
  2. VMs will run across both data centres.
  3. All VMs' disk images will be stored on a single shared storage (Storage 1 in DC A).
  4. Storage 1 will be replicated to the other one (Storage 2 in DC B).
My plan is to use a storage network (red, a single 10G line) for VM migration and storage replication. All other traffic, e.g. the private network between VMs and the cluster traffic between PVE nodes, uses a different network (green, currently 100M). Both lines are dedicated leased lines.

Assuming the routing for WAN is properly done,
  1. Will my model work okay? Someone said it won't because of the geographical distance - it's too far and the latency will be huge. With a ping test, the RTT between the DCs is currently around 11-12 ms.
  2. I am thinking of building a ZFS system for storage (ZFS on Linux) and using ZFS over iSCSI. Will that be okay? I think the cluster doesn't need to know about Storage 2 - is this correct?
  3. Suppose ZFS is used for storage, is it possible to do incremental replication from DC A to DC B? It doesn't need to be continuous - say every 15 minutes? Will it saturate the link easily? We are going to run quite a large number of VMs (say, over 200).
  4. Let's say DC A gets hit by a bomb. If I remember correctly, as long as I have backed up all the VM configurations and the replicated storage's LUN is identical, I can attach Storage 2 with the same LUN to the cluster and run the VMs in DC B. Is this correct?
Hope I explained this well. Thanks very much.

Eoin
 
Will my model work okay? Someone said it won't because of the geographical distance - it's too far and the latency will be huge. With a ping test, the RTT between the DCs is currently around 11-12 ms.

You're really at the limit of the latency that can still work.
While we recommend LAN-like latencies of <= 2 ms, we know that a stable network (no latency spikes) can also run with somewhat higher latencies. Up to 8 ms can be OK, as long as there really are no spikes. I know of some people who run it at 10-12 ms, so in your range, and say it works for them, but those are often just two- or three-node clusters.
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_cluster_network_requirements
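If you want to check for yourself whether the latency is stable enough, a long-running check between one node in each DC already tells a lot. The hostnames below are placeholders, and omping is the tool the linked docs suggest for testing the cluster network (start it on all listed nodes at the same time):

    # watch for spikes and packet loss over a longer period, not just the average RTT
    ping -i 0.2 -c 3000 pve-b1 | tail -n 3

    # latency/loss test between cluster nodes, run simultaneously on pve-a1 and pve-b1
    omping -c 600 -i 1 -q pve-a1 pve-b1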

That said, if you want to evaluate it, I'd recommend the following:
  • Use multiple corosync knet links, with the one that has the most stable latencies as the "primary" one. That normally means any link that also carries IO/storage traffic is not the right one, as there will be latency spikes for sure.
  • You run an even node count; if the network between the two DCs is overloaded, both sides will lose quorum. So I'd add an external QDevice outside the two DCs - it acts as a vote arbitrator and can help if the link between the DCs is overloaded or dead.
    It's much simpler and has fewer requirements on the network, see: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_corosync_external_vote_support
  • You may go further by fine-tuning some corosync parameters; check the corosync.conf manpage and talk with the developers and us in the #kronosnet IRC channel or on the cluster-lab mailing list. A rough configuration sketch follows after this list.
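To make the link priorities and the QDevice part more concrete, here is a rough sketch of the relevant bits of /etc/pve/corosync.conf - node names, addresses and priority values are placeholders, not a recommendation:

    totem {
      cluster_name: dc-cluster
      config_version: 5
      version: 2
      ip_version: ipv4
      # link 0 = green network (stable latency, no storage IO) - higher priority wins
      interface {
        linknumber: 0
        knet_link_priority: 20
      }
      # link 1 = red 10G storage network, only used as fallback
      interface {
        linknumber: 1
        knet_link_priority: 10
      }
    }

    nodelist {
      node {
        name: pve-a1
        nodeid: 1
        quorum_votes: 1
        ring0_addr: 192.168.10.11   # green
        ring1_addr: 10.10.10.11     # red
      }
      # ... one entry per node in both DCs
    }

The external QDevice is then added with "pvecm qdevice setup <QDEVICE-IP>", after installing corosync-qnetd on the external host and corosync-qdevice on all cluster nodes.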
I am thinking of building a ZFS system for storage (ZFS on Linux) and using ZFS over iSCSI. Will that be okay? I think the cluster doesn't need to know about Storage 2 - is this correct?
Can be OK, it depends on your use case. And no, the cluster nodes do not need to directly access the ZFS on the other side.
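For reference, a minimal ZFS-over-iSCSI entry in /etc/pve/storage.cfg could look roughly like this - pool, portal, target and provider are placeholders for your Storage 1 in DC A, and Storage 2 gets no entry at all since it is only the replication target:

    zfs: san-dca
        pool tank/vmdata
        portal 10.10.10.100
        target iqn.2003-01.org.linux-iscsi.storage1:sn.abcdef123456
        iscsiprovider LIO
        lio_tpg tpg1
        blocksize 8k
        sparse 1
        content images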

Suppose ZFS is used for storage, is it possible to do incremental replication from DC A to DC B? It doesn't need to be continuous - say every 15 minutes? Will it saturate the link easily? We are going to run quite a large number of VMs (say, over 200).
It's possible. The initial replication may take a while, but the following ones are incremental and should not be an issue.
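As a rough sketch of what such a 15-minute cycle does under the hood - dataset and host names are placeholders, and in practice you'd rather use a tool like pve-zsync or syncoid than a hand-rolled cron job:

    # one-time full send of the dataset holding the VM disks (this is the slow part)
    zfs snapshot -r tank/vmdata@repl-0
    zfs send -R tank/vmdata@repl-0 | ssh storage2 zfs receive -F tank/vmdata

    # every 15 minutes: new snapshot, then send only the blocks changed since the previous one
    zfs snapshot -r tank/vmdata@repl-1
    zfs send -R -i @repl-0 tank/vmdata@repl-1 | ssh storage2 zfs receive -F tank/vmdata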
Note, it could be more interesting for you to run two Ceph clusters, one on each side, with the other one set up as replication target. This way you have a single unified shared storage per DC, which makes things easier there, and you can still replicate to the other DC. Plus, Ceph is a bit easier to expand and scale in all directions.
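If you go that route, the replication between the two Ceph clusters would typically be RBD mirroring; very roughly, and with pool and site names as placeholders (an rbd-mirror daemon has to run on the receiving cluster):

    # on both clusters: enable per-image mirroring on the VM pool
    rbd mirror pool enable vm-pool image

    # connect the clusters (snapshot-based mirroring, Ceph Octopus or later)
    rbd mirror pool peer bootstrap create --site-name dc-a vm-pool > peer-token    # on DC A
    rbd mirror pool peer bootstrap import --site-name dc-b vm-pool peer-token      # on DC B

    # per image: enable mirroring and check replication health
    rbd mirror image enable vm-pool/vm-100-disk-0 snapshot
    rbd mirror pool status vm-pool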

Another option for your setup could be our deduplicated, incremental and fast Proxmox Backup Server, which lets you do cheap remote syncs. https://pbs.proxmox.com/docs/introduction.html#main-features

It's still in beta, but we're working hard to get it released as stable and have already received quite some positive feedback.
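To give an idea of the remote sync: on a PBS instance in DC B you would define the DC A instance as a remote and pull from it on a schedule. Names, host and schedule below are placeholders, and option names may still change slightly while it is in beta:

    # define the PBS in DC A as a remote
    proxmox-backup-manager remote create pbs-dca --host pbs-a.example.com \
        --userid sync@pbs --password 'secret' --fingerprint '<cert-fingerprint>'

    # pull its datastore into the local one; only new/changed chunks go over the wire
    proxmox-backup-manager sync-job create pull-dca --remote pbs-dca \
        --remote-store backups --store backups --schedule hourly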

Let's say DC A gets hit by a bomb. If I remember correctly, as long as I have backed up all the VM configurations and the replicated storage's LUN is identical, I can attach Storage 2 with the same LUN to the cluster and run the VMs in DC B. Is this correct?
Yes, you should be able to do that. But you should also definitely test it before any explosion, so that you know the steps required and have a routine or document that helps in the (normally very stressful) event of a DC going down.
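Such a test or failover run could look roughly like this - storage name, node name and paths are placeholders, and it assumes a ZFS-over-iSCSI storage defined as in the sketch further up:

    # 1. check that the surviving side still has quorum (the QDevice helps here);
    #    only lower the expected votes manually if you must and understand the risk
    pvecm status
    pvecm expected 2

    # 2. restore the backed-up VM config files onto a DC B node
    cp /backup/etc-pve/qemu-server/*.conf /etc/pve/nodes/pve-b1/qemu-server/

    # 3. point the existing storage definition at Storage 2 (same pool/LUN layout)
    pvesm set san-dca --portal 10.20.10.100

    # 4. verify the volumes are visible, then start the VMs
    pvesm list san-dca
    qm start 100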
 
@t.lamprecht Thanks for your detailed response.

It sounds like the latency is really critical. I am not familiar with Ceph storage yet, so I may have to do some study on that.

So, if I am really concerned about latency, it's probably better to split this into two clusters, one per data centre, and handle the storage replication with Ceph somehow? Is this what you were saying?

Thanks again.

Eoin
 
So, if I am really concerned about latency, it's probably better to split this into two clusters, one per data centre, and handle the storage replication with Ceph somehow? Is this what you were saying?

That would lessen the coupling and make things a bit more stable on each side for sure, IMO.

The main downside is that you lose the unified management view, and with it live migration between the two DCs. For the rest, I'd say that having two browser tabs open isn't so bad compared to fine-tuning latency-sensitive network links :)
 
