Proxmox & Ceph layout

Sarandyna

We have two data centers (stretched cluster) with 4 servers each in a VMware vSAN configuration with RAID 1 (mirroring), each with 20 disks / 60 TB total capacity (raw). We would like to convert this cluster to Proxmox with Ceph as the storage.

I am listing only the current vSAN storage layout, as we have enough power on the compute side.

(Figures shown in vCenter)
Total usable capacity in vSAN: 454 TB
Number of VMs: 150
Provisioned capacity for VMs (total VM disk size): 130 TB
FTT: RAID 1 (across data centers)

Our redundancy goal in Proxmox: even if we lose one data center, we still want to tolerate some disk failures in the remaining DC without DU/DL (data unavailability / data loss).

I am not an expert in Ceph and would like to get some opinions on what the best storage layout for this requirement would be.
 
Hi all,

also not a Ceph expert, but how about this:

You create an erasure-coded pool with a 2+4 scheme (k = 2 data chunks, m = 4 coding chunks) and distribute 3 chunks to each data center. Then you should survive the loss of one DC (3 chunks left), and you would still survive the loss of one more chunk: 2 chunks left, and because k = 2 there is still no data loss.
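In Ceph terms (I'm guessing at the exact commands here, so treat this as a sketch; the profile, rule and pool names are made up, and it assumes the hosts are already grouped under two 'datacenter' buckets in the CRUSH map) it could look roughly like this:

  # EC profile with k=2 data chunks and m=4 coding chunks
  ceph osd erasure-code-profile set ec-2-4 k=2 m=4 crush-failure-domain=host

  # Custom CRUSH rule (added by decompiling/recompiling the CRUSH map with crushtool):
  # pick 2 datacenters, then 3 hosts in each, i.e. 3 of the 6 chunks per DC
  rule ec-2dc {
      id 50
      type erasure
      step set_chooseleaf_tries 5
      step set_choose_tries 100
      step take default
      step choose indep 2 type datacenter
      step chooseleaf indep 3 type host
      step emit
  }

  ceph osd pool create ec-pool 128 128 erasure ec-2-4 ec-2dc
  ceph osd pool set ec-pool allow_ec_overwrites true   # needed if VM disks (RBD) live on the EC pool

Two caveats I'm aware of: the raw overhead of 2+4 is (2+4)/2 = 3x, so capacity-wise it is the same as 3-way replication but with the extra CPU and latency cost of erasure coding; and an EC pool's min_size defaults to k+1 = 3, so with only 2 chunks left I/O would pause until min_size is lowered to 2, which Ceph advises against keeping permanently.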

Could someone confirm whether this works?
 
What happens in a split-brain scenario? Are you looking at one cluster as the backup of the other (active/passive), or do you need active/active? Are you looking to do this at the application level, or let your hypervisor deal with HA? What is your RPO/RTO?

What would happen in your current situation? Seems like just RAID1 across two datacenters is a recipe for disaster. Have you tested/survived any outages?

Ceph has multi-site replication (https://docs.ceph.com/en/quincy/radosgw/multisite/), or you can make one big cluster, depending on the latency between the sites and the availability requirements above.

There are a lot of parameters that go into a proper design, and it seems your vSAN may be built on some questionable assumptions as well. For proper HA you need at least a third site as a witness.
 
In the current setup we have a VMware witness node (on a third site) to manage the stretched cluster. It's an active/active setup, and HA is managed at the hypervisor level (ESXi). The vSAN storage policy has PFTT set to none (stretched cluster) and SFTT set to RAID 1 (mirroring). It can currently survive a data center failure, but when moving to Proxmox we would like to improve on this by also tolerating a one- or two-disk failure in the surviving data center while the other data center is down.
 
Presuming your latency is low enough (or you don't care about the latency of disk writes), the equivalent setup plus extra redundancy would be to set up Ceph with a 4-way mirror while setting your failure domain at the data center level (not sure whether you have rack-level redundancy as well, but that can be encoded later). Obviously you need a witness node in a third data center (or the cloud); that way Ceph will always remain 'up and running'.
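As a rough sketch (the rule and pool names are invented for the example, and it assumes the hosts are already placed under two 'datacenter' buckets in the CRUSH map), the 4-way mirror would be a replicated pool with size 4 and a rule that puts 2 copies in each DC:

  # CRUSH rule: pick 2 datacenters, then 2 hosts in each
  rule rep-2dc {
      id 10
      type replicated
      step take default
      step choose firstn 2 type datacenter
      step chooseleaf firstn 2 type host
      step emit
  }

  ceph osd pool create vm-pool 256 256 replicated rep-2dc
  ceph osd pool set vm-pool size 4
  ceph osd pool set vm-pool min_size 2

This is essentially what Ceph's stretch mode automates (including the tiebreaker monitor on the third site), so that is worth reading up on as well.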

Once there is a failure of the other data center, the remaining 2 copies will rebuild back to a 4-way mirror, so until the rebuild completes you have room for up to 1 more node to go down in the remaining data center, and once the rebuild is done, 3 of the 4 nodes can go down.

Proxmox can span multiple datacenters provided your latencies don't exceed the timeout on the heartbeat.
 
In the current setup we have a VMware witness node (on a third site) to manage the stretched cluster. It's an active/active setup, and HA is managed at the hypervisor level (ESXi).
It is theoretically possible to create something similar with PVE, but it's not built in. Also, corosync is latency-sensitive, so unless you have a guaranteed low-latency inter-DC link to all three locations (including the witness node), this would cause more problems than it solves.

The way I would approach this problem is by moving the HA component further up the stack: are your applications multi-homeable using DNS load balancing? Could they be made so?
 
@alexskysilk: He currently has basically a cluster that spans a WAN with no redundancy: if the link between the DCs fails, he breaks all his mirrors and relies on a single disk.

You can manage HA at the hypervisor level; there is an option right in the GUI. You just need to specify where each node (or set of nodes) should be moved to and it will do that.

As long as the latency between the DCs is under 1 s it will work (which I'm assuming, otherwise his current setup would be glacially slow), and even that can be tuned. For his setup with 8 nodes, the final token timeout is 4900 ms (that is the theoretical maximum before your cluster is marked as 'failed'). I'm not sure what Ceph's default timeout is, but as long as you never exceed 1 s from a read on side one to a write and processing on side two, you should be safe. I've had issues with hardware in the past that caused 1-2 s delays before the token got processed, even on the same LAN.
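For reference, that 4900 ms is corosync's runtime token timeout with the stock defaults (token = 1000 ms, token_coefficient = 650 ms); if your corosync.conf overrides these, the number shifts accordingly, and you can read the effective value on a live node:

  # runtime token timeout = token + (nodes - 2) * token_coefficient
  #                       = 1000 ms + (8 - 2) * 650 ms = 4900 ms
  corosync-cmapctl | grep runtime.config.totem.token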

Note that, like in the current setup, that latency translates directly into your disk performance. Also be sure you have at least some dedicated bandwidth between the sites for the corosync traffic (QoS, VXLAN, etc.).
 
You can manage HA at the hypervisor level; there is an option right in the GUI for HA. You just need to specify where each node (or set of nodes) should be moved to and it will do that.
Are you referring to PVE? Can you link documentation? I have not seen any provision for layered clusters, but I'd be the first to admit I don't always read the documentation all the way through ;)
 
https://pve.proxmox.com/wiki/High_Availability

If you have two sets of data storage (active/passive setup), you can set it to replicate between two nodes, for example every 15 minutes. This can be done within the same cluster as well. For active/active, as the OP wants, they need to be in the same cluster/Ceph pool, which can be challenging depending on the bandwidth and latency between sites. Ceph itself has active/passive options too if the latency is too large.
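Roughly what that looks like on the CLI (the VM ID, node names and group name below are placeholders, and the built-in replication only works with local ZFS storage):

  # HA: keep a guest on a preferred set of nodes and let PVE fail it over
  ha-manager groupadd dc1-nodes --nodes "pve1,pve2,pve3,pve4"
  ha-manager add vm:100 --group dc1-nodes

  # Storage replication of VM 100 to another node every 15 minutes
  pvesr create-local-job 100-0 pve5 --schedule "*/15"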

Another option (the way we do it) would be to use Proxmox Backup Server in a secondary datacenter, because a complete datacenter failure is very unlikely. When the datacenter has been marked as failed by our monitoring system, we spin up "the cloud" and live-boot the recovery for the systems that can tolerate a 1 h delay between backup and recovery. For the others (databases) we use application-level site-to-site replication (e.g. PostgreSQL has Barman). It is a bit more complex to set up (writing Ansible playbooks etc.), but it does not require us to invest in 21+ nodes with GPUs for a once-in-a-lifetime disaster.
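For the live-boot part, the restore itself is a one-liner on the recovery side, assuming the PBS datastore is already configured as a storage there (the storage names, VM ID and snapshot timestamp below are placeholders, and the live-restore flag needs a reasonably recent PVE):

  # Start VM 100 directly from the backup while the restore continues in the background
  qmrestore pbs-dc2:backup/vm/100/2024-08-05T02:00:00Z 100 --live-restore 1 --storage local-zfs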
 
Oh I see what you mean.

I read the original request as asking for an active multi-site cluster with multiple failure domains. Yes, you can do what you suggest, but it would require manual intervention; that is fine for a disaster-recovery approach but not for high availability.
For active/active, as the OP wants, they need to be in the same cluster/Ceph pool, which can be challenging depending on the bandwidth and latency between sites.
That is only half the problem. On link disruption you have a split-brain situation: if you have an equal number of nodes at both facilities, both sides end up fenced, and if the node counts are unequal you may end up with the larger side down, causing a complete outage.
 
Correct, hence the point about having a witness node (which is a simple third node that just does the Ceph/corosync stuff).
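On the Proxmox side that witness is a corosync QDevice on the third site (the IP below is a placeholder):

  # On the third-site machine:
  apt install corosync-qnetd

  # On every cluster node, then register the QDevice from any one of them:
  apt install corosync-qdevice
  pvecm qdevice setup <third-site-ip>

For Ceph, the witness is simply an additional monitor on the third site, so the MON quorum survives the loss of a whole DC.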
 
As long as the latency between the DCs is under 1 s it will work (which I'm assuming, otherwise his current setup would be glacially slow), and even that can be tuned.
Yeah, the latency between the DCs is less than 5 ms; the sites are literally not that far apart.
 
