Proxmox & Ceph layout

Sarandyna

We have two data centers (stretched cluster) with 4 servers each in a VMware vSAN configuration with RAID 1 (mirroring), each with 20 disks / 60 TB total capacity (raw). We would like to convert this cluster to Proxmox with Ceph as the storage.

I am listing only the current vSAN storage layout, as we have enough power on the compute side.

(Figures shown in vCenter)
Total usable capacity in vSAN: 454 TB
Number of VMs: 150
Provisioned capacity for VMs (total VM disk size): 130 TB
FTT: RAID 1 (across data centers)

Our redundancy goal in Proxmox: even if we lose one data center, we still want to tolerate some disk failures in the remaining DC without DU/DL (data unavailability / data loss).

I am not an expert in Ceph and would like to get some opinions on what the best storage layout for this requirement would be.
 
Hi all,

also not a Ceph expert, but how about this:

You create an erasure-coded pool with a 2+4 scheme (k = 2 data chunks, m = 4 coding chunks) and distribute 3 chunks to each data center. Then you should survive the loss of one DC (3 chunks left), and you would still survive the loss of one more chunk: 2 chunks left, and because k = 2 there is still no data loss.
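In Ceph terms (I'm guessing at the exact commands here, so treat this as a sketch; the profile, rule and pool names are made up, and it assumes the hosts are already grouped under two 'datacenter' buckets in the CRUSH map) it could look roughly like this:

  # EC profile with k=2 data chunks and m=4 coding chunks
  ceph osd erasure-code-profile set ec-2-4 k=2 m=4 crush-failure-domain=host

  # Custom CRUSH rule (added by decompiling/recompiling the CRUSH map with crushtool):
  # pick 2 datacenters, then 3 hosts in each, i.e. 3 of the 6 chunks per DC
  rule ec-2dc {
      id 50
      type erasure
      step set_chooseleaf_tries 5
      step set_choose_tries 100
      step take default
      step choose indep 2 type datacenter
      step chooseleaf indep 3 type host
      step emit
  }

  ceph osd pool create ec-pool 128 128 erasure ec-2-4 ec-2dc
  ceph osd pool set ec-pool allow_ec_overwrites true   # needed if VM disks (RBD) live on the EC pool

Two caveats I'm aware of: the raw overhead of 2+4 is (2+4)/2 = 3x, so capacity-wise it is the same as 3-way replication but with the extra CPU and latency cost of erasure coding; and an EC pool's min_size defaults to k+1 = 3, so with only 2 chunks left I/O would pause until min_size is lowered to 2, which Ceph advises against keeping permanently.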

Could someone confirm whether this works?
 
What happens in a split-brain scenario? Are you looking at one cluster as the backup of the other (active/passive), or do you need active/active? Are you looking to do this at the application level, or let your hypervisor deal with HA? What is your RPO/RTO?

What would happen in your current situation? Seems like just RAID1 across two datacenters is a recipe for disaster. Have you tested/survived any outages?

Ceph has multi-site replication (https://docs.ceph.com/en/quincy/radosgw/multisite/), or you can make one big cluster, depending on the latency between the sites and the availability requirements above.

There are a lot of parameters that go into a proper design, and it seems your vSAN may be built on some questionable assumptions as well. For proper HA you need at least a third site as a witness.
 
In the current setup we have a VMware witness node (on a third site) to manage the stretched cluster. It's an active/active setup, and HA is managed at the hypervisor level (ESXi). The vSAN storage policy has PFTT set to none (stretched cluster) and SFTT set to RAID 1 (mirroring). It can currently survive a data center failure, but when moving to Proxmox we would like to improve on this by also tolerating a one- or two-disk failure in the surviving data center while the other data center is down.
 
Presuming your latency is low enough (or you don't care about the latency of disk writes), the equivalent setup plus extra redundancy would be to set up Ceph with a 4-way mirror while setting your failure domain at the data center level (not sure whether you have rack-level redundancy as well, but that can be encoded later). Obviously you need a witness node in a third data center (or the cloud); that way Ceph will always remain 'up and running'.
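As a rough sketch (the rule and pool names are invented for the example, and it assumes the hosts are already placed under two 'datacenter' buckets in the CRUSH map), the 4-way mirror would be a replicated pool with size 4 and a rule that puts 2 copies in each DC:

  # CRUSH rule: pick 2 datacenters, then 2 hosts in each
  rule rep-2dc {
      id 10
      type replicated
      step take default
      step choose firstn 2 type datacenter
      step chooseleaf firstn 2 type host
      step emit
  }

  ceph osd pool create vm-pool 256 256 replicated rep-2dc
  ceph osd pool set vm-pool size 4
  ceph osd pool set vm-pool min_size 2

This is essentially what Ceph's stretch mode automates (including the tiebreaker monitor on the third site), so that is worth reading up on as well.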

Once there is a failure of the other data center, the remaining 2 copies will rebuild back to a 4-way mirror, so until the rebuild completes you have room for up to 1 more node to go down in the remaining data center, and once the rebuild is done, 3 of the 4 nodes can go down.

Proxmox can span multiple datacenters provided your latencies don't exceed the timeout on the heartbeat.
 
In the current setup we have a VMware witness node (on a third site) to manage the stretched cluster. It's an active/active setup, and HA is managed at the hypervisor level (ESXi).
It is theoretically possible to create something similar with PVE, but it's not built in. Also, corosync is latency-sensitive, so unless you have a guaranteed low-latency inter-DC link to all three locations (including the witness node), this would cause more problems than it solves.

The way I would approach this problem is by moving the HA component further up the stack: are your applications multi-homeable using DNS load balancing? Could they be made so?
 
@alexskysilk: He currently has basically a cluster that spans a WAN with no redundancy: if the link between the DCs fails, he breaks all his mirrors and relies on a single disk.

You can manage HA at the hypervisor level; there is an option right in the GUI. You just need to specify where each node (or set of nodes) should be moved to and it will do that.

As long as the latency between the DCs is under 1 s it will work (which I'm assuming, otherwise his current setup would be glacially slow), and even that can be tuned. For his setup with 8 nodes, the final token timeout is 4900 ms (that is the theoretical maximum before your cluster is marked as 'failed'). I'm not sure what Ceph's default timeout is, but as long as you never exceed 1 s from a read on side one to a write and processing on side two, you should be safe. I've had issues with hardware in the past that caused 1-2 s delays before the token got processed, even on the same LAN.
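For reference, that 4900 ms is corosync's runtime token timeout with the stock defaults (token = 1000 ms, token_coefficient = 650 ms); if your corosync.conf overrides these, the number shifts accordingly, and you can read the effective value on a live node:

  # runtime token timeout = token + (nodes - 2) * token_coefficient
  #                       = 1000 ms + (8 - 2) * 650 ms = 4900 ms
  corosync-cmapctl | grep runtime.config.totem.token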

Note that, like in the current setup, that latency translates directly into your disk performance. Also be sure you have at least some dedicated bandwidth between the sites for the corosync traffic (QoS, VXLAN, etc.).
 
You can manage HA at the hypervisor level; there is an option right in the GUI for HA. You just need to specify where each node (or set of nodes) should be moved to and it will do that.
Are you referring to PVE? Can you link documentation? I have not seen any provision for layered clusters, but I'd be the first to admit I don't always read the documentation all the way through ;)
 
https://pve.proxmox.com/wiki/High_Availability

If you have two sets of data storage (active/passive setup), you can set it to replicate between two nodes, for example every 15 minutes. This can be done within the same cluster as well. For active/active, as the OP wants, they need to be in the same cluster/Ceph pool, which can be challenging depending on the bandwidth and latency between sites. Ceph itself has active/passive options too if the latency is too large.
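Roughly what that looks like on the CLI (the VM ID, node names and group name below are placeholders, and the built-in replication only works with local ZFS storage):

  # HA: keep a guest on a preferred set of nodes and let PVE fail it over
  ha-manager groupadd dc1-nodes --nodes "pve1,pve2,pve3,pve4"
  ha-manager add vm:100 --group dc1-nodes

  # Storage replication of VM 100 to another node every 15 minutes
  pvesr create-local-job 100-0 pve5 --schedule "*/15"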

Another option (the way we do it) would be to use Proxmox Backup Server in a secondary datacenter, because a complete datacenter failure is very unlikely. When the datacenter has been marked as failed by our monitoring system, we spin up "the cloud" and live-boot the recovery for the systems that can tolerate a 1 h delay between backup and recovery. For the others (databases) we use application-level site-to-site replication (e.g. PostgreSQL has Barman). It is a bit more complex to set up (writing Ansible playbooks etc.), but it does not require us to invest in 21+ nodes with GPUs for a once-in-a-lifetime disaster.
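For the live-boot part, the restore itself is a one-liner on the recovery side, assuming the PBS datastore is already configured as a storage there (the storage names, VM ID and snapshot timestamp below are placeholders, and the live-restore flag needs a reasonably recent PVE):

  # Start VM 100 directly from the backup while the restore continues in the background
  qmrestore pbs-dc2:backup/vm/100/2024-08-05T02:00:00Z 100 --live-restore 1 --storage local-zfs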
 
Oh I see what you mean.

I read the original request as asking for an active multi-site cluster with multiple failure domains. Yes, you can do what you suggest, but it would require manual intervention; that is fine for a disaster-recovery approach but not for high availability.
For active/active, as the OP wants, they need to be in the same cluster/Ceph pool, which can be challenging depending on the bandwidth and latency between sites.
That is only half the problem. On link disruption you have a split-brain situation: if you have an equal number of nodes at both facilities, both sides end up fenced, and if the node counts are unequal you may end up with the larger side down, causing a complete outage.
 
Correct, hence the point about having a witness node (which is a simple third node that just does the Ceph/corosync stuff).
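On the Proxmox side that witness is a corosync QDevice on the third site (the IP below is a placeholder):

  # On the third-site machine:
  apt install corosync-qnetd

  # On every cluster node, then register the QDevice from any one of them:
  apt install corosync-qdevice
  pvecm qdevice setup <third-site-ip>

For Ceph, the witness is simply an additional monitor on the third site, so the MON quorum survives the loss of a whole DC.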
 
As long as the latency between the DCs is under 1 s it will work (which I'm assuming, otherwise his current setup would be glacially slow), and even that can be tuned.
Yeah, the latency between the DCs is less than 5 ms; the sites are literally not that far apart.
 
