Pardon my less-than-intelligent question, but is there a way to install Proxmox on a Ceph cluster?

1) The Ceph cluster is being evaluated to take over for my main "do-it-all" Proxmox server
OK, let's touch on this. From my perspective, there are two types of storage that matter here (there are more, but these are the ones in scope): payload storage (think OS and applications) and bulk storage. Bulk storage is most efficiently served by a single device such as your 36-bay with slow spinning drives; accomplishing the same capacity efficiency with Ceph requires ~10 nodes in an 8+2 EC configuration. I don't think you're looking to build a data center in your house.

Since I imagine the "bulk" of your storage fits in the "bulk" category, figure out how much payload storage you actually need and size the Ceph solution accordingly WITH FAST SSDs. Ceph hates hard drives, and your 100Gb backbone serves no purpose when the drives are good for ~0.3. Speaking of which... 100Gb IB?! WHY? Do you already have switches for free?

I am getting the sense that this is a hobby and not a business.... WHERE are you putting all this equipment? Noise, heat, and power draw are real issues to resolve. You mentioned the wife isn't happy with the spending; I can only imagine how happy she'll be if you spend money on an always-on noisy space heater that doubles your electric bill :p
 
accomplishing the same capacity efficiency with Ceph requires ~10 nodes in an 8+2 EC configuration. I don't think you're looking to build a data center in your house.
I don't understand where you guys are getting this idea from.

As shown above, with the system that has four nodes where each node has six 3.5" HDD bays, as long as those four nodes present their six HDDs as OSDs, I get 24 OSDs across four nodes, which, with either a (6,2) (or (8,2), as you mention) EC, gets me 74.56% storage efficiency according to this erasure coding calculator.

Where do you guys get the idea that with either (6,2) or (8,2) EC you need (k+m) nodes?

(Someone else on the Level1Techs forum said nearly exactly the same thing, and I have yet to see any justification for it. My thinking is that as long as the nodes can supply the OSDs, and I am running a minimum of three nodes for a quorum, then I can have (6,2) (or (8,2), as you mention) EC. Therefore, if I bought the aforementioned system with four nodes and six 3.5" HDD bays per node, each node could contribute six OSDs to the Ceph cluster, for a total of 24 OSDs split between the four nodes. I don't understand where the idea that I would need eight nodes (for (6,2) EC) or ten (for (8,2) EC) comes from.

If I am running (8,2) EC, then I would need at least 10 OSDs, supplied by a minimum of three nodes (for a quorum). But since 10 doesn't divide evenly by 3, I can have one node that supplies 3 OSDs, another that supplies 3 OSDs, and a third that supplies 4 OSDs.

But if I want it to divide evenly, then I can have five nodes, each supplying two OSDs, and it can still be an (8,2) EC. So I am not sure where you guys are getting this from.)

I don't think you're looking to build a data center in your house.
Technically, I already have somewhere around 12 or 13 "nodes" that I can use (two Z420s, two 5950X towers, a 7950X, a 6700K, two 3930Ks, four dual Xeon E5-2690 (v1) systems, a 4930K, and my "do-it-all" Proxmox server (dual Xeon E5-2697A v4), whatever that works out to be).

The old towers can be repurposed for this, as some of them have up to eight 3.5" HDD bays.

(This also doesn't include two 8-bay QNAP NASes (which use the Annapurna Labs AL832 processor, I think), nor my old 12-bay dual Xeon L5310 server.)

So I don't have to buy the aforementioned system; I was looking at it because it is relatively inexpensive (vs. EPYC, for example), but I could also just repurpose older stuff that I already have if all it's going to do is run Ceph.

And all of this already fits in my office.

Since I imagine the "bulk" of your storage fits in the "bulk" category, figure out how much payload storage you actually need and size the Ceph solution accordingly WITH FAST SSDs.
Yeah, it is too costly (especially now) to replace 288 TB of raw capacity with SSDs. Wayyyy too expensive to do that now.

Ceph hates hard drives
A lot of things technically hate HDDs, but we make them work anyway because, on a $/GB basis, HDDs still rule the bulk storage world.

and your 100Gb backbone serves no purpose when the drives are good for ~0.3
Oh, I know. But it's there and I already have it, and it takes the traffic load off the GbE PHY layer, since smart TVs come with 100 Mbps RJ45 ports, not QSFP28 IB ports. (At least both my Samsung TVs do.)

Speaking of which... 100Gb IB?! WHY? Do you already have switches for free?
I used to run my own CAE company (HPC/CFD/FEA/CAE), and those applications work a lot better with 100 Gbps IB as the system interconnect than with GbE. And on a $/Gbps basis (even back then), it was cheaper for me to go with 100 Gbps IB (a 36-port MSB-7890 switch was ~$2230 USD at the time), whereas 10 GbE would have been cheaper on an absolute-cost basis but more expensive on a $/Gbps basis (whether total switching capacity or per-port) than said 100 Gbps IB.

Therefore, I skipped 2.5G, 5G, 10G, 25G, 40G, and 50G entirely and went straight to 100 Gbps IB.
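For what it's worth, the $/Gbps comparison described above is easy to sketch out. The IB price is the one quoted in this post; the 10 GbE switch price below is a made-up round number purely for illustration, not a quoted figure:

```python
# Rough $/Gbps comparison. Only the IB figure comes from the post above;
# the 10 GbE price is a hypothetical placeholder for comparison.
def cost_per_gbps(switch_cost_usd: float, ports: int, gbps_per_port: int) -> float:
    return switch_cost_usd / (ports * gbps_per_port)

# 36-port 100 Gbps IB switch (MSB-7890) at ~$2230 USD, per the post:
print(round(cost_per_gbps(2230, 36, 100), 2))  # 0.62 ($/Gbps)

# Hypothetical 24-port 10 GbE switch at $600 (assumed figure):
print(round(cost_per_gbps(600, 24, 10), 2))    # 2.5 ($/Gbps)
```

Cheaper per Gbps, even though the absolute outlay is higher; that's the whole argument in two lines.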

The only thing under evaluation now is swapping the MSB-7890 for an MSB-7800 at ~$700 USD. (I originally had an MSB-7800, but that unit had issues; it was rebooting itself on the hour, every hour, so I returned it.)

But yes, from a $/Gbps perspective, it wasn't cost efficient.

I am getting the sense that this is a hobby and not a business.... WHERE are you putting all this equipment?
It's a hobby now. One where I am trying to keep expenses/costs contained.

My office.

It doesn't generate that much heat/noise. The air exhaust temp about 150 mm behind my "do-it-all" Proxmox server is nominally only around 34.5 °C, maybe 40 °C. It only gets hotter when the system is actively working on something (e.g. running pixz), but that's the nominal exhaust temp.

Noise is similar in the sense that it only gets louder when the system is actively working on something; otherwise it's a nominal hum. My wife can close the door to my office if it really bothers her, but if she just hangs out upstairs, she can't even hear it, with the noise confined to my office downstairs.

I can only imagine how happy she'll be if you spend money on an always-on noisy space heater that doubles your electric bill :p
So.... that's a bit ironic, in the sense that since we've moved to a bigger house, my systems are struggling to keep the house warm, whereas in our older, smaller house the heat from the systems was actually doing a fair bit of the lifting in keeping the house warm.

From a cost efficiency perspective, natural gas heating is cheaper on a $/MJ basis than electric heating via my systems.

On the other hand, if we didn't have my servers, we would've been paying for a variety of streaming services, which, after a while, would add up to more than the cost of the systems plus the electricity.

So, as such, the systems' ability to at least partially heat our newer, bigger house is a fringe benefit (in the winter) and a fringe detriment (in the summer, when that heat has to be cooled away).

But that's where the overall system power efficiency calculations come into play: yes, I could spend money on newer-to-me old enterprise servers (like the aforementioned system) that would do more computational work for a given power consumption; or, on the flip side (with what RAM and storage cost now), I could hold off on buying anything new-to-me and just re-use what I already have.

Older stuff is not as power efficient (a Z420 with its single E5-2690 (v1) idles at around 140 W, and my 3930K and 4930K systems idle at around 200 W), but it means not needing to buy new-to-me stuff.

Either way, I'm going to be paying: either upfront capex with reduced electricity costs over time, or higher electricity costs over time by using the older systems I already have.

Unfortunately, a "leap" in efficiency for me would mean spending somewhere around $4000-5000 USD to switch over to a dual EPYC 7763 system, and my wife would definitely kill me for that.

So I am trying to keep the overall total system cost efficiency in check (because if I buy new-to-me, with RAM prices where they are now, the TARR on that looks really crappy), and my older systems can run Ceph, but the X9 platform can't do IOMMU, for example, which means no GPU passthrough.

(I bought my 5950X systems and my 7950X system because my wife was complaining about the noise my old quad half-width-node blade server (with the dual E5-2690 (v1)) was making, so she has effectively banned me from turning that thing back on. As a piece of equipment, I do still have it, and it could be turned on to serve up Ceph; my wife just doesn't want me to because of how loud it is. She put up with it when I had my CAE services company, because at least I was making money with said noise. But since I stopped doing that at the beginning of COVID (when companies were scaling back), I "replaced" it with the two 5950X and one 7950X compute nodes instead.)

The NH-D15 CPU HSF is a lot quieter. Bigger and less physically dense than the 2U four-node Supermicro setup, but less noise, more computationally efficient, and just computationally faster.

But yeah....
 
If I am running (8,2) EC, then I would need at least 10 OSDs, supplied by a minimum of three nodes (for a quorum). But since 10 doesn't divide evenly by 3, I can have one node that supplies 3 OSDs, another that supplies 3 OSDs, and a third that supplies 4 OSDs.
In my understanding the failure domain is usually "host". I need to be able to shut down/reboot one node for maintenance. And I want everything to stay alive when (not: if) one node has any kind of problem.

You will lose three or four OSDs if any node dies. That definitely implies data loss, right?

Personally, I would want to be able to lose a node and still have at least single parity/redundancy intact...

Disclaimer: I am not a Ceph specialist! (But I made some experiments a year ago, w/o EC though - https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/)

----
Spontaneous testing by an uninitiated user (me): I have 4 OSDs × 3 nodes = 12 OSDs in total. I ran pveceph pool create ec82 --erasure-coding k=8,m=2. Now the PVE GUI shows me "ec82-data" --> Size/min = 10/9. Nine chunks must be successfully written before "success" is signaled. This requires all three nodes to be available. And I am sure the "9" is there for a good reason ;-)
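Those numbers follow directly from the EC profile, assuming Ceph's default of min_size = k + 1 for erasure-coded pools (which matches the 10/9 shown). A quick sanity check:

```python
# Sanity check: how Ceph derives EC pool parameters from a k/m profile.
# size = k + m chunks per object; the default min_size for EC pools is
# k + 1, so one more chunk than the bare minimum must be writable.
def ec_pool_params(k: int, m: int):
    size = k + m           # total chunks written per object
    min_size = k + 1       # chunks that must be available for I/O
    efficiency = k / size  # usable fraction of raw capacity
    return size, min_size, efficiency

size, min_size, eff = ec_pool_params(k=8, m=2)
print(f"size/min = {size}/{min_size}, efficiency = {eff:.0%}")
# size/min = 10/9, efficiency = 80%
```

The same function gives 8/7 and 75% for a (6,2) profile, which is roughly the ~74.56% figure quoted earlier in the thread.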
 
I don't understand where you guys are getting this idea from.

Where do you guys get the idea that with either (6,2) or (8,2) EC you need (k+m) nodes?
From understanding failure domains. Damn, @UdoB beat me to the punch. I won't "professor" you on this; you can either read and understand, or deploy your preconceived notions and learn on your own flesh and blood. I would also note that if you expect this 4-node Ceph+EC+HDD setup to be faster than your original NAS, you have an unpleasant surprise waiting for you.

A note about IB: link speed had nothing to do with my surprise :) You have a different use case with virtualization than your previous HPC use case(s). IPoIB has definite shortcomings vs. Ethernet, as it's layer 3 only; you really lose out on layer 2 functionality. Consequently, it's a red-headed stepchild here. It CAN be made to work for specific purposes as long as you don't intend to use it for VM bridges (layer 2). I actually had IPoIB in deployment around 10 years ago, but not now, and would not recommend deploying it in 2026 for any reason.

Yeah, it is too costly (especially now) to replace 288 TB of raw capacity with SSDs
I don't think you got my point.

You don't NEED 288 TB of application storage. If your data breakdown is what I think it is, ~250 TB of it is effectively a "hoard" that doesn't need to be live at all and can be spun up for access. ~37 TB (and I'm likely overestimating this) is media and other large-block data that gets accessed at playback bitrates (so <25 Mbit). You're left with 1 TB or less of application storage.

Were I in your shoes, I'd leave that original NAS alone and deploy 3 low-power nodes with 3-4 SSDs each as your compute cluster.
 
You will lose three or four OSDs if any node dies. This implies definitively data loss, right?
Then how do people have like > 300 OSDs???

Surely they don't have 300+ nodes too.

Personally, I would want to be able to lose a node and still have at least single parity/redundancy intact...

Disclaimer: I am not a Ceph specialist! (But I made some experiments a year ago, w/o EC though - https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/)
Gotcha.

Therefore, with four nodes, the most I would be able to do is a (3,1) EC, correct?

(I've only been running a very tiny (2,1) EC with 3 nodes.)

I would also note that if you expect this 4-node Ceph+EC+HDD setup to be faster than your original NAS, you have an unpleasant surprise waiting for you.
Hence why I am doing the research now.

(Why wouldn't it be more performant than my current ZFS setup? The load average last night went north of 2500; the storage subsystem was very busy, prepping data to write to LTO-8 tape, running four pixz compression jobs simultaneously, and also cloning a VM in PVE.)

I don't think you got my point.

You don't NEED 288 TB of application storage. If your data breakdown is what I think it is, ~250 TB of it is effectively a "hoard" that doesn't need to be live at all and can be spun up for access. ~37 TB (and I'm likely overestimating this) is media and other large-block data that gets accessed at playback bitrates (so <25 Mbit). You're left with 1 TB or less of application storage.

Were I in your shoes, I'd leave that original NAS alone, and deploy 3 low power nodes with 3-4 SSDs each as your compute cluster.
Sure.
 
Then how do people have like > 300 OSDs???
Surely they don't have 300+ nodes too.
First let me say this: I have zero experience with such large-scale setups.

It depends on "k+m", of course. It seems to be a good idea to have fewer than m OSDs on each single node: if you lose one node, you should not lose m OSDs, but at most m-1.

Let's say I have ten servers with 30 OSDs each, to keep the "300 OSDs" idea, and I want to max out storage capacity while accepting the lowest performance. In my current understanding this would lead to k=269, m=31.

Of course, that would be absolutely crazy ;-) Every single OSD is involved in writing each and every data block --> from 300 physical disks we get the IOPS of a single one. And that's the maximum, ignoring the implied vast amount of network traffic...

Let's say I want "only" 80 OSDs to be required while those 300 are available. The problem here is that the physical presence of 30 OSDs on each host is still the same, at first glance. The 31 is still needed to allow one node to vanish completely. "k=49,m=31" is what comes to mind.

Probably/hopefully the EC placement algorithm would make sure to store data on only eight OSDs per node, distributed across the 10 nodes. If that is right, the "80" leads to "k=71,m=9". This allows losing one full node with eight OSDs while keeping one checksum chunk available.

Probably there are more additional pitfalls than I can think of... :-)
 
In my understanding usually the failure domain is "host". I need to be able to shutdown/reboot one node for maintenance. And I want everything to stay alive when (not: if) one node has any kind of problem.

You will lose three or four OSDs if any node dies. That definitely implies data loss, right?

Personally, I would want to be able to lose a node and still have at least single parity/redundancy intact...
The failure domain must never be the OSD.

With failure domain = host you only have one copy or one chunk of an erasure-coded object per host. All the other copies or chunks live on other hosts.
That is why you need at least three hosts for replication (better four, to be able to recover) and k+m+1 hosts for erasure coding (the +1 for being able to recover).
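That host-count rule can be written down as a trivial sketch (the function names are mine, just for illustration; the "+1" is the spare host that lets the cluster re-create lost copies/chunks after a host failure):

```python
# Minimum hosts with CRUSH failure domain = host: one copy/chunk per host,
# plus one spare host if the cluster should self-heal after a host failure.
def min_hosts_replicated(copies: int, self_heal: bool = True) -> int:
    return copies + (1 if self_heal else 0)

def min_hosts_ec(k: int, m: int, self_heal: bool = True) -> int:
    return k + m + (1 if self_heal else 0)

print(min_hosts_replicated(3))  # 4 hosts for 3-way replication
print(min_hosts_ec(8, 2))       # 11 hosts for an 8+2 EC pool
```

Note that the OSD count per host never appears: with failure domain = host, the node count, not the OSD count, is the binding constraint.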

Please do not do crazy things like k=71, m=9. That would split a RADOS object into 80 chunks; you would add 80 times your network latency to every write or read operation, and you would need 81 hosts in your cluster.
Replication usually goes with 3 or 4 copies (if the data is really important), and erasure coding with something like k=5, m=2 or k=8, m=3, depending on initial cluster size.

There may be special cases when the cluster is not "flat" and you introduce racks or rooms into the topology. Last year I built a cluster that was spread over 4 rooms with k=4 and m=3. The CRUSH rule places at most 2 chunks in each room, on different hosts within each room. This way the setup can withstand the loss of a complete room and still has one chunk more than strictly needed (pool size=7, min_size=6).
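The arithmetic behind that layout, as a quick check (all figures are from the post above: k=4 data chunks, m=3 coding chunks, at most 2 chunks per room):

```python
# Losing a whole room costs at most 2 of the 7 chunks (k=4 data + m=3 coding),
# since the CRUSH rule caps placement at 2 chunks per room.
k, m = 4, 3
max_chunks_per_room = 2

surviving = (k + m) - max_chunks_per_room
print(surviving)      # 5 chunks survive a full room outage
print(surviving - k)  # 1 chunk of margin beyond the k needed to read data
```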
 