Ceph Nested fault domains?

TSAN

Jan 31, 2024
Broadcom is here. I began exploring Proxmox + Ceph yesterday. Got a nested 3-node cluster spun up and am now diving into Ceph.

Does Ceph offer a protection scheme similar to VMware vSAN 2-host mirroring (nested fault domains)?

If you're not familiar: with a vSAN 2/3-node cluster that has 3 disk groups, a storage policy can mirror data across 2 separate disks/groups plus 2 hosts. Usable space takes a 4x hit (100 GB of data consumes 400 GB raw), but it makes the data very available: it survives a total host failure plus a disk group failure on the remaining host. It's primarily for vSAN 2-node setups, but I know that Ceph wants a minimum of 3 hosts.

Watching a few videos so far, it appears the Ceph model / CRUSH rules use either OSD or host as the failure domain, but not both together.
I have not done much digging into Ceph yet but thought I'd ask if this is possible. Thanks.
 
Ceph is quite flexible in how it stores data redundantly, but doing so in a two-node cluster goes against its design principles.

If you are looking into getting some HCI storage into a 2-node cluster, look into local ZFS storage + Replication. It can be combined with HA.

We recommend ZFS pools built from mirrored VDEVs for VM workloads. You can add multiple mirror VDEVs to a pool to get more space and performance (zpool add {pool} mirror /dev/disk/by-id/{disk1} /dev/disk/by-id/{disk2}). The result is similar to a RAID-10.
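
For illustration, a minimal sketch of how such a pool could be created and later grown (the pool name and disk IDs are placeholders):

Code:
# create a pool from one mirrored pair
zpool create tank mirror /dev/disk/by-id/disk1 /dev/disk/by-id/disk2

# later, add a second mirrored pair -> striped mirrors, similar to RAID-10
zpool add tank mirror /dev/disk/by-id/disk3 /dev/disk/by-id/disk4

# verify the resulting layout
zpool status tank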

If you need more protection against disk failures and can stomach the additional cost, no one is saying that a mirror must be only 2 disks ;-)
You can do 3 or 4 way mirrors as well.

The only downside is that the VM replication is async. The shortest possible interval between replication runs is currently 1 minute. So in a worst-case scenario, should a node die and the HA guests be recovered by the second node, you might lose a little bit of data: everything written since the last successful replication.
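
The replication jobs can also be managed on the CLI if you want to script or inspect them; a small sketch (the VM ID and target node name are just examples):

Code:
# replicate VM 100 to node pve2 every minute (the current minimum interval)
pvesr create-local-job 100-0 pve2 --schedule '*/1'

# show the state of all replication jobs on this node
pvesr status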

Overall, for 2-node clusters, keep in mind that Proxmox VE clusters work by quorum (majority). Therefore, the smallest number of votes should be 3. The QDevice mechanism can be used to add another vote to 2-node clusters. The external part can serve multiple Proxmox VE clusters with a vote, and the latency requirements aren't as tight as with Corosync between the Proxmox VE nodes.
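
Setting up the QDevice only takes a few steps; roughly (the external host's IP is a placeholder):

Code:
# on both cluster nodes
apt install corosync-qdevice

# on the external host that provides the extra vote
apt install corosync-qnetd

# then, from one of the cluster nodes
pvecm qdevice setup 192.0.2.10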
 
Thanks for the info and links. Yes, I'm also late to the ZFS party. ZFS is on the agenda along with Proxmox + Ceph, and then I'm also visiting Hyper-V + StarWind.

I'll likely stick with a 3-node cluster minimum for Proxmox. 2-node vSAN with a witness became interesting to me after the introduction of nested fault domains. (Otherwise not ideal IMO unless combined with a replica solution like Veeam.) Thought maybe Ceph might support 3 hosts with 2-OSD redundancy, kind of like the nested fault domain.

Aside from the ZFS 1-minute async interval, my other concern would be that it's only an isolated replica for failover, so the cluster only gets the throughput of the primary ZFS pool.
With vSAN one could disable the "site read locality" setting, and regardless of which host a VM is running on, IO reads would happen across all disk groups / hosts. I don't know how Ceph does any load balancing yet, but that's why I figured I would explore Ceph as an HCI solution prior to ZFS (along with it being sync).

The other concern (learning topic) I have for Proxmox is vdisk resiliency against corruption during power / server failures when the HBA disk cache is enabled. VMFS6 / VMDKs have proven themselves to me over the years. Backup is a big topic as well; I have yet to explore the Proxmox backup solution: DB crash consistency / VSS / snapshots. I read that Veeam is, or is considering, investigating support for Proxmox. I hope this develops.
 
Thought maybe Ceph might support 3 hosts with 2-OSD redundancy, kind of like the nested fault domain.
I could imagine that building something like this could be done, but you would definitely be customizing your setup a lot. And that is something you probably want to avoid, as it makes it harder to predict the behavior when something fails. Also, keep in mind that Ceph will actively recover a lost replica if it has options to store it on the remaining nodes. Therefore, in a small 3-node cluster, consider having more but smaller OSDs; at least 4 would be good. Otherwise, the loss of a single disk/OSD could fill up the remaining ones.
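
For the curious, a rough and untested sketch of the kind of CRUSH customization that would be involved (the rule name and id are arbitrary, and the pool would need size=4):

Code:
# export and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# a rule along these lines (added to crushmap.txt) would pick 2 hosts and
# then 2 OSDs on each of them -> 4 replicas:
#
#   rule nested_host_osd {
#       id 2
#       type replicated
#       step take default
#       step choose firstn 2 type host
#       step choose firstn 2 type osd
#       step emit
#   }

# recompile and inject the modified map
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new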

From a power-loss resiliency perspective, ZFS is copy-on-write. The way data is written in ZFS means that the final commit is atomic, so the chances for corruption are very low. The slides in this blog post show how it works.
Ceph with the 3 replicas will let you know if one replica is problematic (checksum doesn't match).

Regarding performance, between Ceph & ZFS, keep in mind that in the default settings, Ceph might also read data from a different node -> higher latency (can be changed to read from the closest replica), while ZFS is local. With mirrored VDEVs, reading is a lot faster than writing, as reading can be spread over all the disks in the mirror.
I cannot say right now what will be faster for reading without benchmarking it. But my guess is, it will be ZFS, especially if you use mirrors with more than 2 disks.

when the HBA disk cache is enabled
Does it have a battery backup unit (BBU)? Sync writes should only be ACKed by the storage layer once the data is stored in a way where immediate power loss will not lead to data loss.
Ceph writes pretty much all data in sync mode. On ZFS it depends on how the application writes its data. Since DBs will issue sync writes most of the time, their data should be safe as well once ACKed.

For the sake of being thorough: Use good datacenter/enterprise SSDs with power loss protection (PLP) to get good performance with sync writes. No consumer SSDs :)
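
If you want to see what a disk actually delivers for sync writes, a fio run along these lines gives a rough idea (the test file path is a placeholder and it writes real data, so point it at scratch space). Consumer SSDs without PLP typically drop to a small fraction of their advertised IOPS here, while drives with PLP hold up much better.

Code:
fio --name=sync-write-test --filename=/mnt/scratch/fio-test --size=4G \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --direct=1 --fsync=1 \
    --runtime=60 --time_based --group_reporting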
 
Thanks. I agree, I have no interest in deviating from common/supported Ceph deployments.

I have yet to dig into how ZFS or Ceph work in regards to disk spans (+ Ceph fault domains), i.e. RAID 10 / RAID 60 equivalents.
And I don't know how ZFS replication works: whether the exact same disk layout needs to be present on the replica host, or if it's only a capacity requirement.

If it is the same disk layout on the ZFS replica, it's interesting you feel ZFS is the higher performer. Say RAID 10 with 6 mirror spans, 12 disks on both the ZFS primary and the replica: ZFS would perform better using only the local 12-disk pool, as opposed to Ceph having access to all 24 disks? It makes sense that Ceph writes would take a penalty, but I would (ignorantly) think reads would be better on Ceph, provided its public / cluster network is on dedicated 10+ Gbps links.

My understanding of most RAID cards / HBAs is that the BBU is typically used on the hardware RAID side: write-back caching on the virtual disk uses the RAID card's memory, which is protected by the BBU, while the disks' own cache is typically disabled (HDDs).

But when using an HBA / RAID card in passthrough mode, the BBU is not in use. One chooses to enable/disable the disk's own cache for HDDs, which is then really only protected by the servers' UPSes. SSDs having PLP changed some of this concern.

So, for example, a subject I want to dig into is ZFS + Ceph DB/WAL caching / journal disks on high-performing SSDs with PLP. With HDDs that have their disk cache enabled, on a power loss (which shouldn't happen) data integrity would still be protected: the journal disk hadn't yet received an ACK from the HDD when power was lost, and upon coming back online the PLP SSD journal disk confirms the HDD has the data before deleting/releasing it.

All these software-defined storage solutions only want direct disk access via an HBA and not HW RAID. So is an HBA BBU still valuable?

I've primarily been a hardware RAID, physical SAN, VMware vSAN guy, and for the last 15+ years exclusively VMware. So it's time to dig into other software-defined storage + hypervisors. It's all on the agenda to dive into head first this year, and I have a rather large VMware home environment to learn on with nested deployments.

The first thing I attempted in this nested lab was installing the Ceph dashboard to help expedite my understanding of Ceph disk / fault domain concepts, but it appears there is a fix in progress for the dashboard. Fingers crossed it lands soon.
"Module 'dashboard' has failed dependency: PyO3 modules may only be initialized once per interpreter process"
https://forum.proxmox.com/threads/ceph-warning-post-upgrade-to-v8.129371/page-6
 
I have yet to dig into how ZFS or Ceph work in regards to disk spans (+ Ceph fault domains), i.e. RAID 10 / RAID 60 equivalents.
Try not to map everything to classical RAID levels, especially for Ceph, since it works very differently in the way it provides redundancy for the stored data.

And I don't know how ZFS replication works: whether the exact same disk layout needs to be present on the replica host, or if it's only a capacity requirement.
The VM replication is using the ZFS send/recv functionality. The only requirement from Proxmox VE is that the storage and ZFS pool are called the same on the nodes. The underlying physical layout can be different, but I don't recommend it.
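
Under the hood it boils down to ZFS snapshots plus send/recv, roughly like this (the dataset, snapshot, and node names are just examples):

Code:
# initial run: full send of a snapshot to the other node
zfs send rpool/data/vm-100-disk-0@rep_1 | ssh node2 zfs recv rpool/data/vm-100-disk-0

# later runs: only the delta between the last and the new snapshot
zfs send -i @rep_1 rpool/data/vm-100-disk-0@rep_2 | ssh node2 zfs recv rpool/data/vm-100-disk-0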

For Ceph NICs, 10 Gbit can quickly become a bottleneck. A 25 Gbit (lower latency than 10 Gbit) or even a 100 Gbit network will prevent the network from being the bottleneck. If you haven't seen it, we did release a new benchmark whitepaper, investigating how different network setups affect the overall performance. https://forum.proxmox.com/threads/p...eds-in-a-proxmox-ve-ceph-reef-cluster.137964/

So, for example, a subject I want to dig into is ZFS + Ceph DB/WAL caching / journal disks on high-performing SSDs with PLP
From my experience, I would avoid a dedicated DB/WAL disk and use good SSDs with PLP for the OSD and be good with it. DB/WAL disks can be a bandaid if HDDs are used for the OSDs, but will not magically improve performance to great levels.
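
For reference, both variants are a single command when creating the OSD (the device paths are placeholders):

Code:
# plain OSD on a good SSD with PLP (what I would do)
pveceph osd create /dev/nvme1n1

# HDD OSD with DB/WAL offloaded to a faster device (the bandaid variant)
pveceph osd create /dev/sdb -db_dev /dev/nvme0n1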

All these software-defined storage solutions only want direct disk access via an HBA and not HW RAID. So is an HBA BBU still valuable?
With ZFS and Ceph, the disks should be made available to the host as directly as possible: ideally either directly connected (NVMe via PCIe) or via an HBA. If the HBA is flashed to IT mode, you are golden.

I've primarily been a hardware RAID, physical SAN, VMware vSAN guy, and for the last 15+ years exclusively VMware. So it's time to dig into other software-defined storage + hypervisors. It's all on the agenda to dive into head first this year, and I have a rather large VMware home environment to learn on with nested deployments.
In order to just test functionality, setting up virtual Proxmox VE nodes/clusters with many vdisks attached is a good start. With that you can test how things work and, more importantly, how they fail and recover :)

For performance comparisons, you would need real hardware to get meaningful results.

If you want to dive deeper into Ceph, take a look at the CLI tools and explore. The ceph tool manages the Ceph cluster itself at a higher level, the rados tool interacts with it on the object layer, and the rbd tool manages it on the RBD (block devices for disk images) layer.
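
A few starting points for that exploration (the pool and image names are examples):

Code:
# cluster level
ceph -s
ceph osd tree
ceph df

# object layer
rados df
rados -p mypool ls

# RBD layer
rbd ls mypool
rbd info mypool/vm-100-disk-0
rbd du mypool/vm-100-disk-0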

The following blog article (yes, shameless self plug ;) ) might be interesting as well to understand better how RBD is storing all the (meta)data in the rados object layer of Ceph.
 
The latter is news to me. How do you change that?
Ah, you called me out on it so I had to look into it in more detail as I only remembered reading about that config option in the docs some time ago. I did remember it somewhat wrong. It is not a setting that can be done in the Ceph config, but needs to be configured on the client when opening an RBD image. So this would need to be handled by the Proxmox VE tooling when starting up a guest with RBD images. Definitely something that could be looked into to make it an option to see if it works as expected and provides considerably faster reading speeds.

https://docs.ceph.com/en/reef/rbd/rbd-config-ref/#confval-rbd_read_from_replica_policy
 
Ah, you called me out on it so I had to look into it in more detail as I only remembered reading about that config option in the docs some time ago. I did remember it somewhat wrong. It is not a setting that can be done in the Ceph config, but needs to be configured on the client when opening an RBD image. So this would need to be handled by the Proxmox VE tooling when starting up a guest with RBD images. Definitely something that could be looked into to make it an option to see if it works as expected and provides considerably faster reading speeds.

https://docs.ceph.com/en/reef/rbd/rbd-config-ref/#confval-rbd_read_from_replica_policy
Any news on this? I've stumbled across this just now and wonder how to set this on PVE.
 
wonder how to set this on PVE.
to answer my own question ...

Code:
root@proxmox ~ > rbd config image list ceph/vm-100-disk-0 | grep rbd_read_from_replica_policy
rbd_read_from_replica_policy                 default      config

root@proxmox ~ > rbd config image set ceph/vm-100-disk-0 rbd_read_from_replica_policy localize

root@proxmox ~ > rbd config image list ceph/vm-100-disk-0 | grep rbd_read_from_replica_policy
rbd_read_from_replica_policy                 localize     image

Yet it does not make as much of a difference as I thought.
 
Any news on this? I've stumbled across this just now and wonder how to set this on PVE.
Thanks for bringing this up again. No, I haven't gotten around to look more into that setting.

Yet it does not make as much of a difference as I thought.
Do you have any rough numbers? Also, to put them into context, the disks and network speed would be nice to know :)
 
Thanks for bringing this up again. No, I haven't gotten around to look more into that setting.

Do you have any rough numbers? Also, to put them into context, the disks and network speed would be nice to know :)
We're still in the evaluation phase; it's not in its final setup and I don't have a good, systematic test approach yet, just "wild testing" and changing one parameter at a time.

It's a minimal (as in cost) Reef Ceph cluster on PVE 8.3 with 3 Dell R6615s, each with 2x Dell-branded 960 GB NVMe PCIe 4.0 drives, and a 10 GbE FRR mesh network for Ceph; we're operating at the limit of the network, yet not of the disks. Raw throughput per disk is 7.9 GB/s, as you would expect from PCIe 4.0 NVMe. We tested everything with CrystalDiskMark on Windows as a VM (the customer's wish, and sequential throughput is important, so any numbers refer to just that) and with single-disk ZFS (primarycache=metadata) and also LVM as a baseline. Both yield good values, as you would suspect. ZFS without ARC caching is about 3.5 GB/s, with ARC almost on par with LVM at 7.6 GB/s.

With the default option, one can see that the Ceph network carries about 1.6 GB/s (both links of the mesh), and with the localize option it's only 200 MB/s. The balance option shows higher network throughput of about 1.8 GB/s, and the disk throughput increases from 2.3 (default) to 2.6 (balance) to 3.1 GB/s (localize). The option seems to be changeable online: you see the network traffic change immediately and the disk throughput rise.

We are still trying to optimize the OSDs and the number of PGs, yet we do not really see a clear pattern as to what would be best. The best 4K random write has the worst sequential throughput and vice versa.

On another 3-node production cluster with local SATA SSDs, the increase is more visible. The numbers need to be taken with a grain of salt due to normal workloads interfering with the test, yet one can see that the sequential read numbers are 20% better (default top, localized bottom; balance was the same as default).

[Screenshot: sequential read benchmark, default (top) vs. localized (bottom)]

So this could be a big speed improvement for 3-node clusters.
 
