[SOLVED] Cluster Planning: Stability if Half the Cluster Servers (Planned) Fail

Asano
I'm currently planning a new Proxmox installation for an organization with two sites in different countries. The sites are connected via a private backbone, and for disaster recovery I want to be able to fail over all services to one site and bring them up while the other site remains offline. If such a disaster causes a few minutes of data loss, and the failover takes some time (even a few days) and manual labor, that is no problem. So true HA, and thus HA storage, is not required, and ZFS with pve-zsync or storage replication would be sufficient (though we may still opt for Ceph, but that is irrelevant for this topic).

The question now is: is it better to have A) one Proxmox cluster with an equal number of servers on each site, or B) two entirely separate Proxmox clusters, one per site? And is there a best practice/recommended approach?

Personally I'd prefer a single cluster, since it means less maintenance and documentation. However, I'm not sure how well a Proxmox cluster would function without quorum in the event of a disaster, and what pitfalls there might be. With no true HA there is also no fencing, which should make things easier, but I remember, for example, that in earlier implementations of Proxmox 2FA you could no longer log in to a Proxmox server that was part of a cluster which had lost quorum. This would be very relevant for this installation, as SSH will be blocked as well. So is this still an issue, and are there other known issues of this kind?

Thanks for any insights!
 
Separate clusters, because a Proxmox cluster cannot deal with high latency between nodes (and you probably depend on other parties for the connection between countries).
Why not run PBS on both sites (it can back up running VMs locally very quickly) and let the two instances sync with each other? Then you can easily restore a relatively recent version of each VM on the other side if you need to. Or switch between sites each weekend, to make sure everything still works and DR is not something rare and hardly tested?
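A rough sketch of what that PBS-to-PBS sync could look like, run on the PBS at the second site (hostnames, datastore names, the auth id and the fingerprint are placeholders, and exact option names may differ between PBS versions):

# register the PBS at site A as a remote on the PBS at site B
proxmox-backup-manager remote create site-a-pbs \
    --host pbs.site-a.example \
    --auth-id sync@pbs \
    --password 'SECRET' \
    --fingerprint <site-a-certificate-fingerprint>

# pull its datastore into the local datastore on a schedule
proxmox-backup-manager sync-job create site-a-pull \
    --remote site-a-pbs \
    --remote-store datastore1 \
    --store datastore1 \
    --schedule hourly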

EDIT: I have no experience with this, so feel free to tell me why I'm wrong about this.
 
Hi @Asano, sounds like you have a good foundation to start working from.

The question now is: is it better to have A) one Proxmox cluster with an equal number of servers on each site, or B) two entirely separate Proxmox clusters, one per site? And is there a best practice/recommended approach?
If your choice of replication is ZFS, then the nodes have to be in a single cluster. There is no cross-cluster replication yet. There is remote-migrate (beta), but it's not quite what you want here.

If you split your nodes into equal halves, you guarantee a split-brain situation down the road. The "trick" with cross-site clusters is to place the deciding "vote" in a third location. In that case only a double failure will cause an outage (a site plus the link to the vote). Planning to survive a double failure is a difficult task.
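If someone did want the single stretched cluster, that third-site vote would typically be a small QDevice host rather than a full node. Roughly like this (the IP is a placeholder, and this is just a sketch, not a recommendation for this setup):

# on the small machine at the third site
apt install corosync-qnetd

# on every cluster node
apt install corosync-qdevice

# then, from any one cluster node, register the external vote
pvecm qdevice setup <third-site-ip>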

As @leesteken said, you should go with two isolated clusters. That removes the inter-dependency between the sites. Since you don't have a tight RTO or RPO, the PBS backup/replication approach seems to be the best fit.

Good luck


 
It's literally designed NOT TO WORK in that scenario:

https://forum.proxmox.com/threads/high-latency-clusters.141098/
Thanks for the link. I did know it was not designed for this, but what I didn't know were the actual pitfalls (which is what I asked for). The one @fabian mentioned in that thread, that the "amount of time [for /etc/pve syncs] is hard-coded everywhere" (ufff :p), is alone severe enough not to consider a cluster spanning two sites any further.

As said, storage is not the topic here, but regardless of whether the choice is Ceph, ZFS, or PBS, there are good tools and strategies for near-real-time cluster-to-cluster sync (though PBS would surely be the worst choice for that use case). So that is really not an issue.
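For reference, pve-zsync is one of those tools. A minimal job that keeps replicating a guest's disks to the other site could look roughly like this (VMID, target IP and pool are placeholders):

pve-zsync create --source 100 --dest 10.0.0.2:tank/replica --name dr-sync --maxsnap 7 --verbose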
 
Thanks for the link. I did know it was not designed for this, but what I didn't know were the actual pitfalls (which is what I asked for). The one @fabian mentioned in that thread, that the "amount of time [for /etc/pve syncs] is hard-coded everywhere" (ufff :p), is alone severe enough not to consider a cluster spanning two sites any further.

Now that you've name-dropped him, I just want to add a disclaimer: I have no problem with the design. It's just that when I asked that question myself, I was confused by "not designed to" as opposed to "designed not to" (which makes a difference to me). The other thing is, I like to be precise about which element actually relies on the low latency: it is not corosync per se, it is pmxcfs and the design choices made there. I have no opinion on what other choices might have brought, because I also understand that maintaining extended virtual synchrony between nodes whose IO is blocked for 30 seconds is ... not workable.
 
And what's wrong with your own zfs send | receive?

Just want to add here: if you were doing this, to make it workable it needs to be something like:

zfs send pool/dataset@snapshot | mbuffer -s 128k -m 512M | ssh remote_node zfs receive pool/dataset
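
And for repeated syncs you would normally send incrementally against the last snapshot already present on the receiving side, something like:

zfs send -i pool/dataset@previous pool/dataset@snapshot | mbuffer -s 128k -m 512M | ssh remote_node zfs receive pool/dataset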
 
