Minimal Ceph Cluster (2x Compute Nodes and 1x Witness -- possible?)

user73937393

New Member
Feb 2, 2024
For my testing/development environment, I am looking to configure a cluster of Proxmox servers using a Ceph storage pool so I can implement HA. My goals are high availability, the ability to bring down a host for maintenance and patching, and handling the occasional fault.

I promise I've spent an hour or more looking on the forums and the internet for an answer on this before posting. From what I found, there is a lot of information on the forums discussing a 3-node vs 2-node cluster. However, I was hoping we could delve one level deeper, as I'm less concerned with the number of nodes in the cluster and more interested in the function each node has to play in it. My ideal configuration would be to utilise my existing 2x high-powered compute nodes (e.g. 256GB RAM, AMD 24-core CPU and several TB of storage). Then, have a low-powered Lenovo mini PC as the third/"witness" PC to complete the Proxmox/Ceph cluster requirements. The Lenovo PC would not host any VMs or significant storage (other than what's onboard the PC). It would simply function as the third-party witness for quorum purposes for Proxmox and Ceph.

Does anyone have this sort of configuration running in their environment? Is this possible? I am desperately trying not to purchase a third similarly-sized compute node (e.g., 256GB RAM and AMD 24-core CPU) with matching storage.
 
With replica 2 you are putting your data at great risk. Replica 2 is basically like RAID 1: as long as both nodes are running, everything is fine. If a node is offline for maintenance, everything is still good, but beware if the other node then has a problem. If even one disk breaks, your data is irretrievably lost.
In this setup, Ceph can only handle the failure of one disk, and even then only while the other node is available and the cluster is healthy. That's why you usually use replica 3 with three nodes: one can be offline for maintenance and another can have a disk failure, and there is still one replica left, so Ceph can start healing immediately. With replica 2, Ceph has no options left: one copy is gone due to maintenance and the other copy dies with the disk, so there is no copy remaining and Ceph can no longer heal itself.

Especially if you install similar disks that were all bought from one dealer and all have the same running hours, it can happen that one disk per node fails in quick succession. Depending on the size of the OSD, recovery can take several hours or days, during which time you are no longer protected against another failure.

There is also scrubbing, for example, which tries to preserve the integrity of the data. If you have two copies of a block with different checksums, which one is the right one? With replica 3, there is a high probability that only the defective copy has a different checksum; Ceph then has a majority and can overwrite the defective copy.
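To make that concrete, this is roughly what the recommended pool settings look like on PVE (the pool name here is just an example):

```
# Create the pool with 3 replicas; min_size 2 stops I/O as soon as fewer
# than two healthy copies exist, rather than running on the last copy.
pveceph pool create vm-pool --size 3 --min_size 2

# The same settings can be applied to an existing pool with the Ceph tools:
ceph osd pool set vm-pool size 3
ceph osd pool set vm-pool min_size 2
```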

Basically, what you want works, but I simply can't recommend it at all, and I personally won't get involved in building such a setup. I feel it would be negligent to give you instructions for a disaster; I am committed to helping people do things properly and correctly.

The explanation above is very simplified and superficial, but it may give you a few starting points for understanding what the problem is here.
 
Thank you so much for the detailed response. This is excellent!

The main expense of the third node is purchasing matching CPU and RAM (e.g., 256GB RAM and an AMD 24-core CPU). What would your reaction be if someone proposed that the third node be less beefy but have a similar amount of storage? For example, an AMD 8+ core CPU and 16GB RAM, but storage comparable to the compute nodes?

That would effectively mean that the third node would not host VMs but would instead serve as the third node for Ceph and Proxmox cluster requirements.
 
If you don't host VMs on it, then you don't need as much CPU just for Ceph. But I think most people agree clusters are better when the nodes are all identical.
 
I can agree with @alyarb.
There is nothing wrong with using the third node only as a storage node and not providing any compute resources on it. You should just take this into account in your HA settings.
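For example (the group name, node names and VM ID below are only placeholders), restricting HA-managed guests to the two compute nodes would look roughly like this:

```
# HA group containing only the two compute nodes; "restricted" means the
# storage-only node is never used as a recovery target.
ha-manager groupadd compute-only --nodes "pve1,pve2" --restricted 1

# Put an HA-managed VM into that group.
ha-manager add vm:100 --group compute-only --state started
```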

Alternatively, you could also consider dividing the resources across all nodes and, for example, putting only 192 GB of RAM in each node.
 
Alright. Thank you, gentlemen. It's hard to explain, but I have space and electricity limitations that won't allow me to put in another high-powered compute server of a similar size in the Proxmox cluster. Perhaps that could change one day when we move into another office space. So, for now, I will go with the lower-powered storage-only node approach.

Once I'm set up, I'll share my experience for everyone's benefit.

This might be a silly idea, but perhaps Proxmox should consider embedding Ceph support in the Proxmox Backup Server? It would be ideal if the backup server could participate in the Ceph cluster but not host VMs. If you think about it, a backup server would have fewer compute resources but tons of storage. So, you could have one datastore (separate and apart from Ceph) as storage for your backups, and then a separate set of disks on the backup server would participate in the Ceph cluster's balancing and fault tolerance. Just an idea...
 
Tiny Ceph clusters are going to be slow to begin with, and if you are power-constrained and trying to operate permanently on such a cluster, I would reconsider Ceph entirely. Replicated ZFS may be better for a 2-server solution.

Too many entry-level Ceph users are drawn to the idea of running tiny clusters of top-heavy nodes. It does not perform well, and it will be very risky in the event of a failure.

Ceph is meant to be run over large numbers of well-balanced nodes. Not 2x 10 TB prod nodes and a single 50 TB PBS node.

PBS should remain as PBS, and not participate in any production services. Always keep prod and backup physically separate.
 
Perhaps @alyarb is right, after all.

It is so easy to be attracted to the prospect of a distributed file system that you forget the overhead and the intended use case the PVE developers had in mind when introducing support for Ceph.

For example, with the setup I proposed, I would be sacrificing 50% of my raw storage to fault tolerance, since the data would be replicated across both nodes. Under the recommended 3-node replica-3 configuration, the usable share is even smaller (a third of the raw storage), and the third node still needs storage comparable to the other two. So, if you think about it, there is little/no saving in underpowering the third node, and you're better off having 3x nodes with the same specs.

If you compare that to just using ZFS: while there is a lag with ZFS replication, you can selectively replicate only your most important VMs, using less storage, and potentially sacrifice less of your CPU/RAM to Ceph.

OK. I've decided this has been an exercise in vanity, and the most sensible move is to use ZFS.

I'm sorry if this was a waste of time for the participants, but maybe this thread will be useful to someone else in the future.
 
I'm sorry if this was a waste of time for the participants
With my help you made a decision; your decision just happens to be that you will not use Ceph. Ultimately, it doesn't matter what the decision is, because I still helped you.

And if I saw this as a waste of time, I wouldn't respond.
What bothers me more is when you have to ask questions several times before you get an answer, when previous answers are simply edited/deleted/changed or expanded later, or when you have to wait several days for a reply even though the creator opened the thread as highly critical.

Your thread wasn't one of those, so from my point of view it's all good and not a waste of time. Sometimes such threads are also helpful for recapitulating your own knowledge or for understanding and double-checking details yourself.
 
If you want to learn, I would always recommend going with a 3-node Ceph cluster and, let's say, a 2.5 Gbit NIC in every node just for Ceph communication. It will work okay-ish, and you will learn a lot from it.
You can always add a ZFS disk per node, in parallel to Ceph, if you want to learn ZFS sync/replication and snapshots.
Yes, Ceph loves more nodes, but I maintain a few 3-node clusters (prod) and I'm more than satisfied with how they work, with HDD/SSD/NVMe pools. Don't get discouraged; just learn with whatever you have, and always do backups :)
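In case it helps, the dedicated Ceph network is set when initialising Ceph on the first node, roughly like this (the subnet is just an example):

```
# Point Ceph at the dedicated 2.5 Gbit network, then create the first monitor.
pveceph init --network 10.10.10.0/24
pveceph mon create
```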
 
If you compare that to just using ZFS: while there is a lag with ZFS replication, you can selectively replicate only your most important VMs, using less storage, and potentially sacrifice less of your CPU/RAM to Ceph.
Ceph and ZFS serve different purposes. ZFS is single-host attached storage; Ceph is a scale-out filesystem. For your use case, the correct question is what your order of priorities is, and it doesn't sound like investing in a scale-out filesystem is high on that list.
 
@alexskysilk For some background information, I currently have a VMware cluster through my VMUG subscription. I want to move away from VMware. I've decided that Proxmox is the right VE for me.

Now, while setting up my new cluster, I also want to address some of the risky things I did in my quick and dirty VMware setup. My objectives are:
  • RAID the HDs in my data stores.
  • Replicate my most important VMs to more than one host (e.g. domain controllers)
  • Run containers side by side with VMs.
  • Scheduled VM backups.
Now that I've had a chance to discuss this with Proxmox experts like the people in this thread, I've decided to use ZFS + Storage Replication. Bear in mind that this is a development environment, so the VMs I am going to be replicating will be small and fairly dormant (e.g. domain controllers). For heavily used VMs like database servers, I'll use application-level HA (e.g. Microsoft SQL Server clustering, etc.).
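For anyone reading this later, my understanding is that the replication jobs are created per VM, roughly like this (the VM ID, target node and schedule are just examples):

```
# Replicate VM 100's ZFS disks to node pve2 every 15 minutes.
pvesr create-local-job 100-0 pve2 --schedule "*/15"

# Check the state of all replication jobs on this node.
pvesr status
```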

So, in a sense, you are correct. The scale-out system is not high on the list of priorities. I wanted to use the Proxmox cluster as an opportunity to learn about Ceph, but using it for the datastores does not sound like a good idea for the small cluster I am putting together. If I later grow the cluster to 3 or 5 nodes, then maybe I will change my mind on that.
 
  • RAID the HDs in my data stores.
That's wise. ZFS gives you the best possible combination of performance and features(!) in a PVE environment.
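A mirrored pool, plus registering it as PVE storage, is roughly this (device paths and names are placeholders; use /dev/disk/by-id paths in practice):

```
# Create a two-disk ZFS mirror.
zpool create -o ashift=12 tank mirror /dev/sdb /dev/sdc

# Add it to PVE as storage for VM disks and container volumes.
pvesm add zfspool tank-vm --pool tank --content images,rootdir
```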

  • Replicate my most important VMs to more than one host (e.g. domain controllers)
Never replicate domain controllers; instead, have two of them on separate hosts. A replicated controller that gets rolled back can end up with an inconsistent database state.

Run containers side by side with VMs.
check!
Scheduled VM backups.
You didn't mention backups up to now; good that you did. What do you intend to use for this? (And PLEASE don't say you're going to use one of the two servers for backup purposes...)
 
Excuse the domain controllers; that was a bad example of a VM I want to replicate between two hosts. I think you get the idea, though.

The plan was to run PBS on a physically separate node.
 
Another thing to consider with the Ceph configuration for fellow home lab users is that you need to have a UPS with enough juice to run all 3x of your servers (or at least two of your servers) in the event of a power outage.

This was another reason I decided to use ZFS.

My thinking is that if there is a power outage on a ZFS cluster, an automated script can shut down my non-critical VMs and consolidate onto a single Proxmox host to conserve the UPS. Whereas on a Ceph cluster, my most important VMs are exactly the ones I would want on the Ceph storage, so that automated script would first need to migrate those VMs to storage local to the primary Proxmox host before I could shut down the 2nd/3rd Proxmox host.
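A rough sketch of the kind of script I have in mind for the ZFS case (the VM IDs are placeholders, and it assumes it runs on the host being powered down, with critical VMs already replicated to the remaining node):

```
#!/bin/bash
# Gracefully shut down non-critical VMs on this host, then power it off so
# the UPS only has to carry the remaining Proxmox node.
NONCRITICAL_VMS="101 102 103"
for vmid in $NONCRITICAL_VMS; do
    qm shutdown "$vmid" --timeout 120
done
shutdown -h now
```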
 
ZFS can also lose data in some cases, but the most important thing, more than anything else, is to get SSDs with PLP (power-loss protection).
 
