Pardon my less-than-intelligent question, but is there a way to install Proxmox on a Ceph cluster?

1) The Ceph cluster is being evaluated to take over for my main "do-it-all" Proxmox server
Ok, let's touch on this. From my perspective, there are two types of storage (there are more, but these two are in scope): payload (think OS and application) storage and bulk storage. Bulk storage can most efficiently be served by a single device such as your 36-bay with slow spinning drives; accomplishing the same capacity efficiency with Ceph requires ~10 nodes in an 8+2 EC configuration. I don't think you're looking to make a data center in your house.

Since I imagine the "bulk" of your storage fits in the "bulk" category, figure out how much payload storage you actually need, and size the Ceph solution accordingly WITH FAST SSDs. Ceph hates hard drives, and your 100Gb backbone serves no purpose when the drives are good for 0.3. Speaking of... 100Gb IB?!?! WHY?! Do you already have switches that are free?

I am getting the sense that this is a hobby and not a business... WHERE are you putting all this equipment? Noise, heat, and power draw are real issues to resolve. You mentioned the wife not being happy with spending; I can only imagine how happy she'll be if you spend money on an always-on noisy space heater that doubles your electric bill :p
 
accomplishing the same capacity efficiency with Ceph requires ~10 nodes in an 8+2 EC configuration. I don't think you're looking to make a data center in your house.
I don't understand where you guys are getting this idea from.

As shown above, with the system that has four nodes where each node has six 3.5" HDD bays, as long as those four nodes present the six HDDs as OSDs per node, I get 24 OSDs across four nodes, which, with either a (6,2) or an (8,2) EC (as you mention, in your case), gets me 74.56% storage efficiency according to this erasure coding calculator.
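(For what it's worth, the raw efficiency figure is just k/(k+m). A quick sanity check in plain Python, using the profiles from this thread; the calculator's 74.56% presumably subtracts some metadata overhead from the raw 75%:)

```python
# EC storage efficiency = data chunks over total chunks.
def ec_efficiency(k: int, m: int) -> float:
    return k / (k + m)

print(f"(6,2): {ec_efficiency(6, 2):.2%}")   # 75.00%
print(f"(8,2): {ec_efficiency(8, 2):.2%}")   # 80.00%
```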

Where do you guys get the idea that with either (6,2) or (8,2) EC you need (k+m) nodes?

(Someone else on the Level1Techs forum said nearly exactly the same thing, and I have yet to see any justification for it. My thinking is that as long as the nodes can supply the OSDs, and I am running a minimum of three nodes for a quorum, then I can have (6,2) (or (8,2), as you mention) EC. Therefore, if I bought the aforementioned system, which has four nodes each with six 3.5" HDD bays, then each node can contribute six OSDs to the Ceph cluster, for a total of 24 OSDs split between said four nodes. I don't understand where you guys are getting the idea that I would need eight nodes (for (6,2) EC) or ten (for (8,2) EC).

If I am running (8,2) EC, then I would need at least 10 OSDs, supplied by a minimum of three nodes (for a quorum). Since 10 doesn't divide evenly by 3, I could have one node that supplies 3 OSDs, another that supplies 3 OSDs, and a third that supplies 4 OSDs.

But if I want it to divide evenly, then I can have five nodes, each supplying two OSDs, and it can still be an (8,2) EC. So I am not sure where you guys are getting this from.)

I dont think you're looking to make a data center in your house.
Technically, I already have somewhere between 12 and 13 "nodes" that I can use (two Z420s, two 5950X towers, a 7950X, a 6700K, two 3930Ks, four dual Xeon E5-2690 (v1) nodes, a 4930K, and my "do-it-all" Proxmox server (dual Xeon E5-2697A v4)), or whatever that works out to be.

The old towers can be repurposed for this, as some of my older tower systems can hold up to eight 3.5" HDD bays.

(This also doesn't include two 8-bay QNAP NASes (using the Annapurna Labs AL832 processor, I think), and my old 12-bay dual Xeon L5310 server as well.)

So I don't have to buy the aforementioned system, but I was looking at it because it is relatively inexpensive (vs. EPYC, for example), but I could also just repurpose older stuff that I already have if all it is going to be doing is running Ceph.

And all of this already fits in my office.

since I imagine the "bulk" of your storage fits in the "bulk" category, figure out how much payload storage you actually need, and size the ceph solution accordingly WITH FAST SSDs.
Yeah, it is too costly (especially now) to replace 288 TB of raw capacity with SSDs. Wayyyy too expensive to do that now.

Ceph hates hard drives
A lot of things technically hate HDDs, but we make them work with it anyways because $/GB, HDDs still rule the bulk storage world.

and your 100GB backbone serves no purpose when the drives are good for 0.3
Oh, I know. But it's there and I already have it, and it takes the traffic load off the GbE PHY layers, since smart TVs come with 100 Mbps RJ45 ports, not QSFP28 IB ports. (At least both my Samsung TVs do.)

speaking of... 100GB IB?!?! WHY?! do you already have switches that are free?
I used to own and run my own CAE company (HPC/CFD/FEA/CAE), and those applications work a lot better with 100 Gbps IB as the system interconnect rather than GbE. And on a $/Gbps basis (even back then), it was cheaper for me to go with 100 Gbps IB (a 36-port MSB-7890 switch back then was ~$2230 USD), whereas 10 GbE would be cheaper on an absolute-cost basis, but more expensive on a $/Gbps basis (either total switching capacity or per-port) than said 100 Gbps IB.

Therefore, I skipped 2.5G, 5G, 10G, 25G, 40G, and 50G entirely and went straight to 100 Gbps IB.

The only thing under evaluation now is swapping out the MSB-7890 for an MSB-7800 for ~$700 USD. (I originally had an MSB-7800, but that switch had issues; it was rebooting itself on the hour, every hour, so I returned it.)

But yes, from a $/Gbps perspective, it wasn't cost efficient.

I am getting the sense that this is a hobby and not a business.... WHERE are you putting all this equipment?
It's a hobby now. One that I am trying to keep expenses/costs contained.

My office.

It doesn't generate that much heat/noise. The air exhaust temp about 150 mm away from the back of my "do-it-all" Proxmox server is nominally only around 34.5 C. Something like that. Maybe 40 C. It only gets hotter when the system is actively working on something (e.g. running pixz), but that's the nominal air exhaust temp.

Noise is similar in the sense that it only gets louder when the system is actively working on something; otherwise it's a nominal hum. My wife can close the door to my office if it really bothers her, but if she just hangs out/stays upstairs, she can't even hear it, with the noise being confined to my office downstairs.

I can only imagine how happy she'll be if you spend money on a always on noisy space heater that doubles your electric bill :p
So... that's a bit ironic, in the sense that since we've moved to a bigger house, my systems are struggling to keep the house warm compared with our older, smaller house, where the heat from the systems was actually doing a fair bit of lifting in terms of keeping the house warm.

From a cost efficiency perspective, natural gas heating is cheaper on a $/MJ basis than electric heating via my systems.

On the other hand, if we didn't have my servers, then we would've been paying for a variety of streaming services, which, after a while, would add up to more than the cost of the systems and the electricity.

So, as such, the systems' ability to at least partially heat our newer, bigger house is a fringe benefit (in the winter) and a fringe detriment (in the summer, as that heat will have to be cooled).

But that's where the overall system power-efficiency calculations come into play: yes, I could spend money on newer-to-me old enterprise server systems (like said aforementioned server) that would do more computational work for a given power consumption, or, on the flip side (with what RAM and storage cost now), I can hold off on buying anything new-to-me and just reuse what I already have.

Older stuff is not as power efficient (a Z420 with its single E5-2690 (v1) idles at around 140 W, and my 3930K and 4930K idle at around 200 W), but it still means not needing to buy new-to-me stuff.

Either way, I'm going to be paying, whether it comes out of initial capex (and reduced electricity cost over time) or electricity costs over time (by using the older systems that I already have).

Unfortunately, a "leap" in efficiency for me would mean spending somewhere around $4000-5000 USD to switch over to a dual EPYC 7763 system, and my wife would definitely kill me for that.

So I am trying to keep the overall total system cost efficiency in check (because if I buy new-to-me, with RAM prices where they are now, the TARR on that looks really crappy right now). My old(er) systems can run Ceph, but the X9 platform/systems can't do IOMMU, for example, which means no GPU passthrough.

(I bought my 5950X systems and my 7950X system because my wife was complaining about the noise that my old quad half-width-node blade server (with the dual E5-2690 (v1)) was making, so she has effectively banned me from turning that thing back on again. But as a piece of equipment, I do have it, and it can be turned on to serve up Ceph. My wife just doesn't want me to because of how loud that thing is. She put up with it when I had my CAE services company, because at least I was making money with said noise. But now that I stopped doing that at the beginning of COVID (when companies were scaling back), I "replaced" it with two 5950X nodes and a 7950X compute node instead.)

The NH-D15 CPU HSF is a lot quieter. Bigger and less physically dense than the 2U four-node Supermicro setup, but less noise. And more computationally efficient and also just computationally faster.

But yeah....
 
If I am running (8,2) EC, then I would need at least 10 OSDs, supplied by a minimum of three nodes (for a quorum). Since 10 doesn't divide evenly by 3, I could have one node that supplies 3 OSDs, another that supplies 3 OSDs, and a third that supplies 4 OSDs.
In my understanding, the failure domain is usually "host". I need to be able to shut down/reboot one node for maintenance. And I want everything to stay alive when (not: if) one node has any kind of problem.

You will lose three or four OSDs if any node dies. That definitely implies data loss, right?

Personally I would want to be able to lose a node and still have at least single parity/redundancy being intact...

Disclaimer: I am not a Ceph specialist! (But I made some experiments a year ago, w/o EC though - https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/)

----
Spontaneous testing by an uninitiated user (me): I have 4 OSDs * 3 nodes = 12 OSDs in total. I ran pveceph pool create ec82 --erasure-coding k=8,m=2. Now the PVE GUI shows me "ec82-data" --> Size/min = 10/9. Nine chunks must be successfully written before "success" is signaled. This requires all three nodes to be available. And I am sure the "9" is there for a good reason ;-)
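To make that 10/9 concrete, here is a toy model in plain Python. It assumes chunks get spread as evenly as possible over the three nodes (i.e. an OSD-level failure domain, which is an assumption about how the placement worked out in this test):

```python
# Toy model of the test above: 3 nodes x 4 OSDs, pool k=8, m=2.
k, m = 8, 2
size, min_size = k + m, k + 1          # 10 chunks total, 9 required for writes
nodes = 3

# Spreading 10 chunks as evenly as possible over 3 nodes:
chunks_per_node = [size // nodes + (1 if i < size % nodes else 0)
                   for i in range(nodes)]
print(chunks_per_node)                  # [4, 3, 3]

# Losing the busiest node leaves at most this many chunks alive,
# which is below min_size, so the pool blocks I/O:
surviving = size - max(chunks_per_node)
print(surviving, ">=", min_size, "?", surviving >= min_size)  # 6 >= 9 ? False
```

Which matches the observation: with only three nodes, an 8+2 pool cannot tolerate losing any node at all.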
 
I don't understand where you guys are getting this idea from.

Where do you guys get the idea that with either (6,2) or (8,2) EC you need (k+m) nodes?
From understanding failure domains. Damn, @UdoB beat me to the punch. I won't "professor" you on this. You can either read and understand, or deploy your preconceived notions and learn in your flesh and blood. I would also note that if your expectation is that this 4-node Ceph+EC+HDDs setup will be faster than your original NAS, you have an unpleasant surprise waiting for you.

A note about IB: link speed had nothing to do with my surprise :) You have a different use case with virtualization than your previous HPC use case(s). IPoIB has definite shortcomings vs Ethernet as it's layer 3 only; you really lose out on layer 2 functionality. Consequently, it's a red-headed stepchild here. It CAN be made to work for specific purposes as long as you don't intend to use it for VM bridges (layer 2). I actually had IPoIB in deployment around 10 years ago, but not now, and would not recommend deploying it in 2026 for any reason.

Yeah, it is too costly (especially now) to replace 288 TB of raw capacity with SSDs
I don't think you got my point.

You don't NEED 288 TB of application storage. If your data breakdown is what I think it is, ~250 TB of it is effectively a "hoard" that doesn't need to be live at all and can be spun up for access. ~37 TB (and I'm likely overestimating this) is media and other large-block data which gets accessed at playback bitrates (so <25 Mbit). You're left with 1 TB or less of application storage.

Were I in your shoes, I'd leave that original NAS alone, and deploy 3 low power nodes with 3-4 SSDs each as your compute cluster.
 
You will lose three or four OSDs if any node dies. This implies definitively data loss, right?
Then how do people have like > 300 OSDs???

Surely they're not having 300+ nodes too.

Personally I would want to be able to lose a node and still have at least single parity/redundancy being intact...

Disclaimer: I am not a Ceph specialist! (But I made some experiments a year ago, w/o EC though - https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/)
Gotcha.

Therefore, with four nodes, the most that I would be able to do would be (3,1) EC then, correct?

(I've only been running a very tiny (2,1) EC with 3 nodes.)

I would also note that if your expectations that this 4 node ceph+EC+HDDs will be faster than your original NAS you have an unpleasant surprise waiting for you.
Hence why I am doing the research now.

(Why wouldn't it be more performant than my current ZFS setup? Load average last night went north of 2500. The storage subsystem was very busy as it was prepping to write data to LTO-8 tape, running four pixz compression jobs simultaneously, and also cloning a VM in PVE.)

I don't think you got my point.

You don't NEED 288 TB of application storage. If your data breakdown is what I think it is, ~250 TB of it is effectively a "hoard" that doesn't need to be live at all and can be spun up for access. ~37 TB (and I'm likely overestimating this) is media and other large-block data which gets accessed at playback bitrates (so <25 Mbit). You're left with 1 TB or less of application storage.

Were I in your shoes, I'd leave that original NAS alone, and deploy 3 low power nodes with 3-4 SSDs each as your compute cluster.
Sure.
 
Then how do people have like > 300 OSDs???
Surely they're not having 300+ nodes too.
First let me say this: I have zero experience with such large-scale setups.

It depends on "k+m", of course. It seems to be a good idea to have fewer than m OSDs on each single node. If you lose one node, you should not lose m OSDs, but one less.

Let's say I have ten servers with 30 OSDs each, to keep the "300 OSDs" idea, and I want to max out storage capacity while accepting the lowest performance. In my current understanding this would lead to k=269, m=31.

Of course that would be absolutely crazy ;-) Every single OSD is involved in writing each and every data block --> from 300 physical disks we get the IOPS of a single one. And that's the maximum, ignoring the implied vast amount of network traffic...

Let's say I want to allow "only" 80 OSDs to be required while those 300 are available. The problem here is that the physical presence of 30 OSDs on each host is still the same, at first glance. The 31 is still needed to allow one node to vanish completely. "k=49, m=31" is what comes to my mind.

Probably/hopefully the EC placement algorithm would make sure to store data on only eight OSDs distributed over the 10 nodes. If that is right, the "80" leads to "k=71, m=9". This allows losing one full node with eight OSDs while keeping one single checksum available.

Probably there are more additional pitfalls than I can think of... :-)
 
In my understanding usually the failure domain is "host". I need to be able to shutdown/reboot one node for maintenance. And I want everything to stay alive when (not: if) one node has any kind of problem.

You will lose three or four OSDs if any node dies. This implies definitively data loss, right?

Personally I would want to be able to lose a node and still have at least single parity/redundancy being intact...
The failure domain must never be the OSD.

With failure domain = host you only have one copy or one chunk of the erasure coded object in one host. All the other copies or chunks live on other hosts.
That is why you need at least three hosts for replication (better four, to be able to recover) and k+m+1 hosts for erasure coding (+1 for being able to recover).

Please do not do crazy things like k=71, m=9. This would split a RADOS object into 80 chunks. You would add 80 times your network latency to any write or read operation, and you would need 81 hosts in your cluster.
Replication usually goes with 3 or 4 copies (if data is really important) and erasure coding has something like k=5, m=2 or k=8, m=3 depending on initial cluster size.

There may be special cases when the cluster is not "flat" and you introduce racks or rooms into the topology. Last year I built a cluster that was spread over 4 rooms with k=4 and m=3. The CRUSH rule places at most 2 chunks in each room, on different hosts in each room. This way the setup is able to withstand the loss of a complete room and still has one chunk more than really needed (pool size=7, min_size=6).
 
Then how do people have like > 300 OSDs???
The number of OSDs isn't relevant to a pool as long as it is larger than the minimum required by the CRUSH rule. For example, if you have an EC profile of K=8,M=2, you need a minimum of 10 OSDs DISTRIBUTED ACROSS 10 NODES, so 1 OSD per node. You can have more OSDs and more nodes, and data will be deployed to them according to the 8+2 rule.

A word about K+M values: in general, the absolute minimum count for EC/replica is 2. The reason 2 is the minimum is that the design of the pool allows the dynamic reduction of a single node, which will leave the remaining OSD nodes to serve a fully coherent pool with redundancy/parity. This is not ZFS, and the default rules disallow write operations if there is no parity. As for K values: remember that, just as in a ZFS stripeset, the larger the K value the larger the granularity of the write payload. Writing small files to an 8-member-wide stripe (32k) would result in padding, poor performance, and capacity waste. 8 is the highest count of data blocks for the vast majority of applications, but for virtualization workloads it's horribly inefficient, which is why replication is how most people use it.
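To illustrate the padding point, here is a rough sketch in plain Python. The 4k per-chunk size is an assumption for the example, not exact Ceph internals:

```python
# Rough illustration of stripe padding with wide EC stripes.
CHUNK = 4096          # assumed per-chunk size for this example

def stripe_waste(k: int, payload: int) -> float:
    """Fraction of data-stripe capacity wasted writing `payload` bytes."""
    stripe = k * CHUNK                  # data portion of one full stripe
    stripes = -(-payload // stripe)     # ceiling division: stripes consumed
    used = stripes * stripe
    return (used - payload) / used

print(f"{stripe_waste(8, 4096):.0%}")   # 4k write into a 32k-wide stripe: 88% padding
print(f"{stripe_waste(2, 4096):.0%}")   # 4k write into an 8k-wide stripe: 50% padding
```

The wider the stripe, the worse the padding for small (virtualization-style) writes, which is the inefficiency being described.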

Why wouldn't it be more performant than my current ZFS setup?
Because Ceph has a different use case and design philosophy. An OSD is NOT a disk drive; it is a process requiring data in, algorithmic processing, data out. ZFS is designed for a single node and therefore scales in performance with vdev count on the node. Ceph is designed to serve clusters on multiple levels, and scales with the number of OSDs AND NODES, subject to the CRUSH rule used in the pool.

Moreover, EC code in Ceph has been the least favorite child, and the state of the code (certainly with regard to performance) has not been as well developed. There has been a substantial improvement with FastEC in Tentacle, but I would wager that while faster, it's still slow, at least in use cases outside the design criteria for that code, and I can promise it wasn't for virtualization workloads.
 
but for virtualization workloads its horribly inefficient which is why replication is how most people use it.
But doesn't replication yield lower storage efficiency?

I am currently using EC (2,1) for my "simple" three node Proxmox HA cluster which serves my DNS, Windows AD DC, and AdGuardHome and where the LXC/VM disks reside on the distributed EC Ceph cluster.

but I would wager that while faster its still slow, at least in usecases outside the design criteria for that code- and I can promise it wasnt for virtualization workloads.
The fastest that I have been able to achieve, using Intel 670p 1 TB NVMe SSDs (during some testing) and my 100 Gbps IB network, and because the LXCs and VMs live on the Ceph EC distributed pool/cluster, is live-migrating a VM at an effective 133 Gbps over the 100 Gbps link, thanks to compression.

So, it can be performant, but it kinda varies a little bit.

(Yesterday, I didn't realise that one of my LXCs was running out of RAM, so it started swapping. And on a GbE network (because I'm using N95 mini PCs for my test cluster), it was reading at upwards of 295 MiB/s (due to compression, probably) as it swapped to Ceph over said GbE. Granted, GbE is terrible for Ceph, but it worked well enough for what I needed it to do.)

@UdoB @gurubert
I read what both of you wrote, but I can't say that I fully understand it, or at least not yet.

Looks like I have more learning/research that I need to do before I can fully grasp these concepts beyond my little EC (2,1) Ceph cluster with N95 mini PCs.

Thank you for educating me.
 
But doesn't replication yield lower storage efficiency?
"lower" and "higher" are subjective. Ceph achieves HA using raw capacity.

I am currently using EC (2,1) for my "simple" three node Proxmox HA cluster which serves my DNS, Windows AD DC, and AdGuardHome and where the LXC/VM disks reside on the distributed EC Ceph cluster.
Suit yourself; this is not a recommended deployment. You are far better served by just having two SEPARATE VMs, each serving all those functions without any Ceph at all; you'll have better fault tolerance, better performance, and better storage utilization.

The fastest that I have been able to achieve, using Intel 670 p 1 TB NVMe SSDs (during some testing) and my 100 Gbps IB network, and because the LXCs and VMs live on the Ceph EC distributed pool/cluster, I've been able to live-migrate a VM at 133 Gbps out of 100 Gbps, thanks to compression.
I don't understand what you are reporting. No disk I/O moves on live migration of a shared disk, and memory only migrates for virtual machines, not LXCs. In any event, if you really want to check your read performance, you can just use fio in your guest.
 
"lower" and "higher" are subjective. Ceph achieves HA using raw capacity.
If you're using EC (6,2), that's about 75% storage efficiency. If you are using replicated rule with the same eight nodes, and assuming each node contributes 1 OSD, then you have only about 12.5% storage efficiency.

12.5% < 75% no?

You are far better served by just having two SEPERATE VMs each serving all those functions without any ceph at all- you'll have better fault tolerance, better performance, and better storage utilization.
Can't.

Said mini PCs only have a single 2242 M.2 NVMe SSD slot.

Each node by itself can't hold much, whereas Ceph is able to pool the storage together so that I can do/run more things on said cluster, given the limited physical space for additional or larger storage devices.

(It was cheap ($150 USD each when I bought them). And it was also my first foray into Proxmox clustering and I wanted something cheap and low power to experiment with.)

Given this physical limitation of a single 2242 M.2 slot: at the time I bought them, Microcenter had 512 GB 2242 M.2 NVMe SSDs available, of which 128 GB was partitioned off for Proxmox and the rest was for Ceph.

I'm not sure how I would be able to implement your proposed solution, given this physical limitation of the nodes themselves.

(Again, the budget for this cluster was $450 USD. With that, I have three N95 processors, a total of 48 GB of RAM, and a total of 1.5 TB of NVMe SSD storage space. If you can come up with an example of how I would implement your solution, for that price, and get the same or better performance, RAM, and total storage capacity, I'd gladly entertain your proposal if I can make my system better. But at $450 USD, I think that it would be quite a challenge. And on top of all of that, your idle power profile cannot exceed 21 W, because at idle, each node is only consuming between 4-7 W. My point is that there are reasons why and how I ended up with this deployment.)

I don't understand what you are reporting. No disk I/O moves on live migration of a shared disk, and memory only migrates for virtual machines, not LXCs.
I was running tests with both VMs and LXCs, and yes, VMs only have memory migration, but it didn't always hit 100 Gbps speeds on said 100 Gbps IB system interconnect.

You said that it wasn't designed for virtualisation workloads, but it seemed performant enough for said VM workloads, especially over 100 Gbps IB. (But even with my current mini PC cluster setup over GbE, EC still works for VMs as well.)
 
If you're using EC (6,2), that's about 75% storage efficiency. If you are using replicated rule with the same eight nodes, and assuming each node contributes 1 OSD, then you have only about 12.5% storage efficiency.
No no no. You got your math wrong.
To achieve the same availability as EC with k=6 and m=2 you need triple replication (three copies) meaning a storage efficiency of 33%. It is rarely necessary to go beyond 4 copies.
 
I am currently using EC (2,1) for my "simple" three node Proxmox HA cluster which serves my DNS, Windows AD DC, and AdGuardHome and where the LXC/VM disks reside on the distributed EC Ceph cluster.
This is not recommended and certainly not HA. With m=1 you cannot lose a single disk.
An erasure coded pool should have size=k+m and min_size=k+1 settings which would be size=3 and min_size=3 in your case.
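Spelling out those recommended settings for the (2,1) pool above, in plain Python (just arithmetic):

```python
# The (2,1) pool from this thread, with the recommended settings:
k, m = 2, 1
size = k + m        # 3: every object is stored as 3 chunks (2 data + 1 coding)
min_size = k + 1    # 3: writes need 3 chunks placeable to proceed

print(size, min_size)                                       # 3 3
# With failure domain = host and only 3 hosts, each host holds one chunk
# of every object, so losing ANY host drops availability below min_size:
print("can lose a host and keep writing:", size - 1 >= min_size)  # False
```

In other words, with size == min_size the pool stays consistent but blocks I/O the moment a single disk or host drops out, which is the point being made here.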
 
No no no. You got your math wrong.
To achieve the same availability as EC with k=6 and m=2 you need triple replication (three copies) meaning a storage efficiency of 33%. It is rarely necessary to go beyond 4 copies.
Can you show me your math so that I can learn from it?

Is the need for triple replication because it's m=2?

I would like to learn how you arrived at this conclusion, if you can expand a little further on the math that leads you to it.

(As I said, I still have a lot to learn.)
 
An erasure coded pool should have size=k+m and min_size=k+1 settings which would be size=3 and min_size=3 in your case.
But I've got that with my three nodes, no?

k = 2
m = 1

size=2+1 = 3 (which is what I have)

min_size = k + 1 = 2 + 1 = 3 (and I have three nodes).

So, I am struggling a little bit to understand how size = min_size = 3, in my case, wouldn't fit the bill.

(Like, I've had it where one of my nodes went down for whatever reason (I don't remember why it went down the last time, spontaneously), but when I brought it back up, Ceph checked everything out and everything went back to healthy status. Whilst it was down, it was still able to keep the pool limping along, in a degraded state, until I brought the 3rd node back up.)
 
Although it is about replicated pools (so no EC), the following read might serve as a hint why (outside of experiments/lab setups) it's not a good idea to go against the recommendations:
https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/

OP: I would carefully reconsider whether you actually need a cluster (live migration is also possible with the Proxmox Datacenter Manager between single nodes) or, if you want it for high availability, whether ZFS-based replication is a better fit for your needs. Another option would be to put all your disks in one dedicated NAS and use that as shared storage. If the NAS fails, of course, your whole cluster will lose access to the data, but it would still be a lot more sensible setup than your plan of running an EC pool in a non-recommended and rather dangerous way.
 
To achieve the same availability as EC with k=6 and m=2 you need triple replication (three copies) meaning a storage efficiency of 33%. It is rarely necessary to go beyond 4 copies.
K=6,M=2 results in 6 data strips per 8 total. 6/8 = 0.75
In replication you have 1 data strip per 3 total. 1/3 = 0.33

It's not exactly the "same" availability, because survivability in a replication group is much higher; you need one living OSD per PG to recover, whereas with EC 6+2 you need 6. In reality, survivability is typically fine both ways in a well-constructed pool (more than the minimum number of OSDs+nodes and at least 30% free space). It's just that EC pools are best suited for large-block, low-IOPS use cases, and replication is best suited for high IOPS, since there are so many more smaller PGs given the same quantity of hardware.
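The arithmetic above, checked in plain Python:

```python
# Storage efficiency: data fraction of raw capacity.
def ec_eff(k: int, m: int) -> float:
    return k / (k + m)

def repl_eff(copies: int) -> float:
    return 1 / copies

print(f"EC 6+2 : {ec_eff(6, 2):.2f}")    # 0.75
print(f"3x repl: {repl_eff(3):.2f}")     # 0.33

# Recovery requirements per placement group (chunks/copies that must survive):
print("replication needs to recover from:", 1, "copy")
print("EC 6+2 needs to recover from:", 6, "chunks")
```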
 