Pardon my less-than-intelligent question, but is there a way to install Proxmox on a Ceph cluster?

1) The Ceph cluster is being evaluated to take over for my main "do-it-all" Proxmox server
Ok, let's touch on this. From my perspective, there are two types of storage (there are more, but these two are in scope): payload (think OS and application) storage and bulk storage. Bulk storage can most efficiently be served by a single device such as your 36-bay with slow spinning drives; accomplishing the same capacity efficiency with Ceph requires ~10 nodes in an 8+2 EC configuration. I don't think you're looking to make a data center in your house.

Since I imagine the "bulk" of your storage fits in the "bulk" category, figure out how much payload storage you actually need, and size the Ceph solution accordingly WITH FAST SSDs. Ceph hates hard drives, and your 100Gb backbone serves no purpose when the drives are good for 0.3. Speaking of... 100Gb IB?!?! WHY?! Do you already have switches that are free?

I am getting the sense that this is a hobby and not a business... WHERE are you putting all this equipment? Noise, heat, and power draw are real issues to resolve. You mentioned the wife not being happy with spending; I can only imagine how happy she'll be if you spend money on an always-on noisy space heater that doubles your electric bill :p
 
accomplishing the same capacity efficiency with Ceph requires ~10 nodes in an 8+2 EC configuration. I don't think you're looking to make a data center in your house.
I don't understand where you guys are getting this idea from.

As shown above, with the system that has four nodes where each node has six 3.5" HDD bays, as long as those four nodes present the six HDDs as OSDs per node, I get 24 OSDs across four nodes, which, with either a (6,2) or an (8,2) EC (as you mention, in your case), gets me 74.56% storage efficiency according to this erasure coding calculator.
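(For what it's worth, the raw efficiency figure is just k/(k+m). A quick sanity check in plain Python, using the profiles from this thread; the calculator's 74.56% presumably subtracts some metadata overhead from the raw 75%:)

```python
# EC storage efficiency = data chunks over total chunks.
def ec_efficiency(k: int, m: int) -> float:
    return k / (k + m)

print(f"(6,2): {ec_efficiency(6, 2):.2%}")   # 75.00%
print(f"(8,2): {ec_efficiency(8, 2):.2%}")   # 80.00%
```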

Where do you guys get the idea that with either (6,2) or (8,2) EC you need (k+m) nodes?

(Someone else on the Level1Techs forum said nearly exactly the same thing, and I have yet to see any justification for it. My thinking is that as long as the nodes can supply the OSDs, and I am running a minimum of three nodes for a quorum, then I can have (6,2) (or (8,2), as you mention) EC. Therefore, if I bought the aforementioned system, which has four nodes each with six 3.5" HDD bays, then each node can contribute six OSDs to the Ceph cluster, for a total of 24 OSDs split between said four nodes. I don't understand where you guys are getting the idea that I would need eight nodes (for (6,2) EC) or ten (for (8,2) EC).

If I am running (8,2) EC, then I would need at least 10 OSDs, supplied by a minimum of three nodes (for a quorum). Since 10 doesn't divide evenly by 3, I could have one node that supplies 3 OSDs, another that supplies 3 OSDs, and a third that supplies 4 OSDs.

But if I want it to divide evenly, then I can have five nodes, each supplying two OSDs, and it can still be an (8,2) EC. So I am not sure where you guys are getting this from.)

I dont think you're looking to make a data center in your house.
Technically, I already have somewhere between 12 and 13 "nodes" that I can use (two Z420s, two 5950X towers, a 7950X, a 6700K, two 3930Ks, four dual Xeon E5-2690 (v1) nodes, a 4930K, and my "do-it-all" Proxmox server (dual Xeon E5-2697A v4)), or whatever that works out to be.

The old towers can be repurposed for this, as some of my older tower systems can hold up to eight 3.5" HDD bays.

(This also doesn't include two 8-bay QNAP NASes (using the Annapurna Labs AL832 processor, I think), and my old 12-bay dual Xeon L5310 server as well.)

So I don't have to buy the aforementioned system, but I was looking at it because it is relatively inexpensive (vs. EPYC, for example), but I could also just repurpose older stuff that I already have if all it is going to be doing is running Ceph.

And all of this already fits in my office.

since I imagine the "bulk" of your storage fits in the "bulk" category, figure out how much payload storage you actually need, and size the ceph solution accordingly WITH FAST SSDs.
Yeah, it is too costly (especially now) to replace 288 TB of raw capacity with SSDs. Wayyyy too expensive to do that now.

Ceph hates hard drives
A lot of things technically hate HDDs, but we make them work with it anyways because $/GB, HDDs still rule the bulk storage world.

and your 100GB backbone serves no purpose when the drives are good for 0.3
Oh, I know. But it's there and I already have it, and it takes the traffic load off the GbE PHY layers, since smart TVs come with 100 Mbps RJ45 ports, not QSFP28 IB ports. (At least both my Samsung TVs do.)

speaking of... 100GB IB?!?! WHY?! do you already have switches that are free?
I used to own and run my own CAE company (HPC/CFD/FEA/CAE), and those applications work a lot better with 100 Gbps IB as the system interconnect rather than GbE. And on a $/Gbps basis (even back then), it was cheaper for me to go with 100 Gbps IB (a 36-port MSB-7890 switch back then was ~$2230 USD), whereas 10 GbE would be cheaper on an absolute-cost basis, but more expensive on a $/Gbps basis (either total switching capacity or per-port) than said 100 Gbps IB.

Therefore, I skipped 2.5G, 5G, 10G, 25G, 40G, and 50G entirely and went straight to 100 Gbps IB.

The only thing under evaluation now is swapping out the MSB-7890 for an MSB-7800 for ~$700 USD. (I originally had an MSB-7800, but that switch had issues; it was rebooting itself on the hour, every hour, so I returned it.)

But yes, from a $/Gbps perspective, it wasn't cost efficient.

I am getting the sense that this is a hobby and not a business.... WHERE are you putting all this equipment?
It's a hobby now. One that I am trying to keep expenses/costs contained.

My office.

It doesn't generate that much heat/noise. The air exhaust temp about 150 mm away from the back of my "do-it-all" Proxmox server is nominally only around 34.5 C. Something like that. Maybe 40 C. It only gets hotter when the system is actively working on something (e.g. running pixz), but that's the nominal air exhaust temp.

Noise is similar in the sense that it only gets louder when the system is actively working on something; otherwise it's a nominal hum. My wife can close the door to my office if it really bothers her, but if she just hangs out/stays upstairs, she can't even hear it, with the noise being confined to my office downstairs.

I can only imagine how happy she'll be if you spend money on a always on noisy space heater that doubles your electric bill :p
So... that's a bit ironic, in the sense that since we've moved to a bigger house, my systems are struggling to keep the house warm compared with our older, smaller house, where the heat from the systems was actually doing a fair bit of lifting in terms of keeping the house warm.

From a cost efficiency perspective, natural gas heating is cheaper on a $/MJ basis than electric heating via my systems.

On the other hand, if we didn't have my servers, then we would've been paying for a variety of streaming services, which, after a while, would add up to more than the cost of the systems and the electricity.

So, as such, the systems' ability to at least partially heat our newer, bigger house is a fringe benefit (in the winter) and a fringe detriment (in the summer, as that heat will have to be cooled).

But that's where the overall system power-efficiency calculations come into play: yes, I could spend money on newer-to-me old enterprise server systems (like said aforementioned server) that would do more computational work for a given power consumption, or, on the flip side (with what RAM and storage cost now), I can hold off on buying anything new-to-me and just reuse what I already have.

Older stuff is not as power efficient (a Z420 with its single E5-2690 (v1) idles at around 140 W, and my 3930K and 4930K idle at around 200 W), but it still means not needing to buy new-to-me stuff.

Either way, I'm going to be paying, whether it comes out of initial capex (and reduced electricity cost over time) or electricity costs over time (by using the older systems that I already have).

Unfortunately, a "leap" in efficiency for me would mean spending somewhere around $4000-5000 USD to switch over to a dual EPYC 7763 system, and my wife would definitely kill me for that.

So I am trying to keep the overall total system cost efficiency in check (because if I buy new-to-me, with RAM prices where they are now, the TARR on that looks really crappy right now). My old(er) systems can run Ceph, but the X9 platform/systems can't do IOMMU, for example, which means no GPU passthrough.

(I bought my 5950X systems and my 7950X system because my wife was complaining about the noise that my old quad half-width-node blade server (with the dual E5-2690 (v1)) was making, so she has effectively banned me from turning that thing back on again. But as a piece of equipment, I do have it, and it can be turned on to serve up Ceph. My wife just doesn't want me to because of how loud that thing is. She put up with it when I had my CAE services company, because at least I was making money with said noise. But now that I stopped doing that at the beginning of COVID (when companies were scaling back), I "replaced" it with two 5950X nodes and a 7950X compute node instead.)

The NH-D15 CPU HSF is a lot quieter. Bigger and less physically dense than the 2U four-node Supermicro setup, but less noise. And more computationally efficient and also just computationally faster.

But yeah....
 
If I am running (8,2) EC, then I would need at least 10 OSDs, supplied by a minimum of three nodes (for a quorum). Since 10 doesn't divide evenly by 3, I could have one node that supplies 3 OSDs, another that supplies 3 OSDs, and a third that supplies 4 OSDs.
In my understanding, the failure domain is usually "host". I need to be able to shut down/reboot one node for maintenance. And I want everything to stay alive when (not: if) one node has any kind of problem.

You will lose three or four OSDs if any node dies. That definitely implies data loss, right?

Personally I would want to be able to lose a node and still have at least single parity/redundancy being intact...

Disclaimer: I am not a Ceph specialist! (But I made some experiments a year ago, w/o EC though - https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/)

----
Spontaneous testing by an uninitiated user (me): I have 4 OSDs * 3 nodes = 12 OSDs in total. I ran pveceph pool create ec82 --erasure-coding k=8,m=2. Now the PVE GUI shows me "ec82-data" --> Size/min = 10/9. Nine chunks must be successfully written before "success" is signaled. This requires all three nodes to be available. And I am sure the "9" is there for a good reason ;-)
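To make that 10/9 concrete, here is a toy model in plain Python. It assumes chunks get spread as evenly as possible over the three nodes (i.e. an OSD-level failure domain, which is an assumption about how the placement worked out in this test):

```python
# Toy model of the test above: 3 nodes x 4 OSDs, pool k=8, m=2.
k, m = 8, 2
size, min_size = k + m, k + 1          # 10 chunks total, 9 required for writes
nodes = 3

# Spreading 10 chunks as evenly as possible over 3 nodes:
chunks_per_node = [size // nodes + (1 if i < size % nodes else 0)
                   for i in range(nodes)]
print(chunks_per_node)                  # [4, 3, 3]

# Losing the busiest node leaves at most this many chunks alive,
# which is below min_size, so the pool blocks I/O:
surviving = size - max(chunks_per_node)
print(surviving, ">=", min_size, "?", surviving >= min_size)  # 6 >= 9 ? False
```

Which matches the observation: with only three nodes, an 8+2 pool cannot tolerate losing any node at all.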
 
I don't understand where you guys are getting this idea from.

Where do you guys get the idea that with either (6,2) or (8,2) EC you need (k+m) nodes?
From understanding failure domains. Damn, @UdoB beat me to the punch. I won't "professor" you on this. You can either read and understand, or deploy your preconceived notions and learn in your flesh and blood. I would also note that if your expectation is that this 4-node Ceph+EC+HDDs setup will be faster than your original NAS, you have an unpleasant surprise waiting for you.

A note about IB: link speed had nothing to do with my surprise :) You have a different use case with virtualization than your previous HPC use case(s). IPoIB has definite shortcomings vs Ethernet as it's layer 3 only; you really lose out on layer 2 functionality. Consequently, it's a red-headed stepchild here. It CAN be made to work for specific purposes as long as you don't intend to use it for VM bridges (layer 2). I actually had IPoIB in deployment around 10 years ago, but not now, and would not recommend deploying it in 2026 for any reason.

Yeah, it is too costly (especially now) to replace 288 TB of raw capacity with SSDs
I don't think you got my point.

You don't NEED 288 TB of application storage. If your data breakdown is what I think it is, ~250 TB of it is effectively a "hoard" that doesn't need to be live at all and can be spun up for access. ~37 TB (and I'm likely overestimating this) is media and other large-block data which gets accessed at playback bitrates (so <25 Mbit). You're left with 1 TB or less of application storage.

Were I in your shoes, I'd leave that original NAS alone, and deploy 3 low power nodes with 3-4 SSDs each as your compute cluster.
 
You will lose three or four OSDs if any node dies. This implies definitively data loss, right?
Then how do people have like > 300 OSDs???

Surely they're not having 300+ nodes too.

Personally I would want to be able to lose a node and still have at least single parity/redundancy being intact...

Disclaimer: I am not a Ceph specialist! (But I made some experiments a year ago, w/o EC though - https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/)
Gotcha.

Therefore, with four nodes, the most that I would be able to do would be (3,1) EC then, correct?

(I've only been running a very tiny (2,1) EC with 3 nodes.)

I would also note that if your expectations that this 4 node ceph+EC+HDDs will be faster than your original NAS you have an unpleasant surprise waiting for you.
Hence why I am doing the research now.

(Why wouldn't it be more performant than my current ZFS setup? Load average last night went north of 2500. The storage subsystem was very busy as it was prepping to write data to LTO-8 tape, running four pixz compression jobs simultaneously, and also cloning a VM in PVE.)

I don't think you got my point.

You don't NEED 288 TB of application storage. If your data breakdown is what I think it is, ~250 TB of it is effectively a "hoard" that doesn't need to be live at all and can be spun up for access. ~37 TB (and I'm likely overestimating this) is media and other large-block data which gets accessed at playback bitrates (so <25 Mbit). You're left with 1 TB or less of application storage.

Were I in your shoes, I'd leave that original NAS alone, and deploy 3 low power nodes with 3-4 SSDs each as your compute cluster.
Sure.
 
Then how do people have like > 300 OSDs???
Surely they're not having 300+ nodes too.
First let me say this: I have zero experience with such large-scale setups.

It depends on "k+m", of course. It seems to be a good idea to have fewer than m OSDs on each single node. If you lose one node, you should not lose m OSDs, but one less.

Let's say I have ten servers with 30 OSDs each, to keep the "300 OSDs" idea, and I want to max out storage capacity while accepting the lowest performance. In my current understanding this would lead to k=269, m=31.

Of course that would be absolutely crazy ;-) Every single OSD is involved in writing each and every data block --> from 300 physical disks we get the IOPS of a single one. And that's the maximum, ignoring the implied vast amount of network traffic...

Let's say I want to allow "only" 80 OSDs to be required while those 300 are available. The problem here is that the physical presence of 30 OSDs on each host is still the same, at first glance. The 31 is still needed to allow one node to vanish completely. "k=49, m=31" is what comes to my mind.

Probably/hopefully the EC placement algorithm would make sure to store data on only eight OSDs distributed over the 10 nodes. If that is right, the "80" leads to "k=71, m=9". This allows losing one full node with eight OSDs while keeping one single checksum available.

Probably there are more additional pitfalls than I can think of... :-)
 
In my understanding usually the failure domain is "host". I need to be able to shutdown/reboot one node for maintenance. And I want everything to stay alive when (not: if) one node has any kind of problem.

You will lose three or four OSDs if any node dies. This implies definitively data loss, right?

Personally I would want to be able to lose a node and still have at least single parity/redundancy being intact...
The failure domain must never be the OSD.

With failure domain = host you only have one copy or one chunk of the erasure coded object in one host. All the other copies or chunks live on other hosts.
That is why you need at least three hosts for replication (better four, to be able to recover) and k+m+1 hosts for erasure coding (+1 for being able to recover).

Please do not do crazy things like k=71, m=9. This would split a RADOS object into 80 chunks. You would add 80 times your network latency to any write or read operation, and you would need 81 hosts in your cluster.
Replication usually goes with 3 or 4 copies (if data is really important) and erasure coding has something like k=5, m=2 or k=8, m=3 depending on initial cluster size.

There may be special cases when the cluster is not "flat" and you introduce racks or rooms into the topology. Last year I built a cluster that was spread over 4 rooms with k=4 and m=3. The CRUSH rule places at most 2 chunks in each room, on different hosts in each room. This way the setup is able to withstand the loss of a complete room and still has one chunk more than really needed (pool size=7, min_size=6).
 
Then how do people have like > 300 OSDs???
The number of OSDs isn't relevant to a pool as long as it is larger than the minimum required by the CRUSH rule. For example, if you have an EC profile of K=8,M=2, you need a minimum of 10 OSDs DISTRIBUTED ACROSS 10 NODES, so 1 OSD per node. You can have more OSDs and more nodes, and data will be deployed to them according to the 8+2 rule.

A word about K+M values: in general, the absolute minimum count for EC/replica is 2. The reason 2 is the minimum is that the design of the pool allows the dynamic reduction of a single node, which will leave the remaining OSD nodes to serve a fully coherent pool with redundancy/parity. This is not ZFS, and the default rules disallow write operations if there is no parity. As for K values: remember that, just as in a ZFS stripeset, the larger the K value the larger the granularity of the write payload. Writing small files to an 8-member-wide stripe (32k) would result in padding, poor performance, and capacity waste. 8 is the highest count of data blocks for the vast majority of applications, but for virtualization workloads it's horribly inefficient, which is why replication is how most people use it.
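To illustrate the padding point, here is a rough sketch in plain Python. The 4k per-chunk size is an assumption for the example, not exact Ceph internals:

```python
# Rough illustration of stripe padding with wide EC stripes.
CHUNK = 4096          # assumed per-chunk size for this example

def stripe_waste(k: int, payload: int) -> float:
    """Fraction of data-stripe capacity wasted writing `payload` bytes."""
    stripe = k * CHUNK                  # data portion of one full stripe
    stripes = -(-payload // stripe)     # ceiling division: stripes consumed
    used = stripes * stripe
    return (used - payload) / used

print(f"{stripe_waste(8, 4096):.0%}")   # 4k write into a 32k-wide stripe: 88% padding
print(f"{stripe_waste(2, 4096):.0%}")   # 4k write into an 8k-wide stripe: 50% padding
```

The wider the stripe, the worse the padding for small (virtualization-style) writes, which is the inefficiency being described.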

Why wouldn't it be more performant than my current ZFS setup?
Because Ceph has a different use case and design philosophy. An OSD is NOT a disk drive; it is a process requiring data in, algorithmic processing, data out. ZFS is designed for a single node and therefore scales in performance with vdev count on the node. Ceph is designed to serve clusters on multiple levels, and scales with the number of OSDs AND NODES, subject to the CRUSH rule used in the pool.

Moreover, EC code in Ceph has been the least favorite child, and the state of the code (certainly with regard to performance) has not been as well developed. There has been a substantial improvement with FastEC in Tentacle, but I would wager that while faster, it's still slow, at least in use cases outside the design criteria for that code, and I can promise it wasn't for virtualization workloads.
 
but for virtualization workloads its horribly inefficient which is why replication is how most people use it.
But doesn't replication yield lower storage efficiency?

I am currently using EC (2,1) for my "simple" three node Proxmox HA cluster which serves my DNS, Windows AD DC, and AdGuardHome and where the LXC/VM disks reside on the distributed EC Ceph cluster.

but I would wager that while faster its still slow, at least in usecases outside the design criteria for that code- and I can promise it wasnt for virtualization workloads.
The fastest that I have been able to achieve, using Intel 670p 1 TB NVMe SSDs (during some testing) and my 100 Gbps IB network, and because the LXCs and VMs live on the Ceph EC distributed pool/cluster, is live-migrating a VM at an effective 133 Gbps over the 100 Gbps link, thanks to compression.

So, it can be performant, but it kinda varies a little bit.

(Yesterday, I didn't realise that one of my LXCs was running out of RAM, so it started swapping. And on a GbE network (because I'm using N95 mini PCs for my test cluster), it was reading at upwards of 295 MiB/s (due to compression, probably) as it swapped to Ceph over said GbE. Granted, GbE is terrible for Ceph, but it worked well enough for what I needed it to do.)

@UdoB @gurubert
I read what both of you wrote, but I can't say that I fully understand it, or at least not yet.

Looks like I have more learning/research that I need to do before I can fully grasp these concepts beyond my little EC (2,1) Ceph cluster with N95 mini PCs.

Thank you for educating me.
 
But doesn't replication yield lower storage efficiency?
"lower" and "higher" are subjective. Ceph achieves HA using raw capacity.

I am currently using EC (2,1) for my "simple" three node Proxmox HA cluster which serves my DNS, Windows AD DC, and AdGuardHome and where the LXC/VM disks reside on the distributed EC Ceph cluster.
Suit yourself; this is not a recommended deployment. You are far better served by just having two SEPARATE VMs, each serving all those functions without any Ceph at all; you'll have better fault tolerance, better performance, and better storage utilization.

The fastest that I have been able to achieve, using Intel 670 p 1 TB NVMe SSDs (during some testing) and my 100 Gbps IB network, and because the LXCs and VMs live on the Ceph EC distributed pool/cluster, I've been able to live-migrate a VM at 133 Gbps out of 100 Gbps, thanks to compression.
I don't understand what you are reporting. No disk I/O moves on live migration of a shared disk, and memory only migrates for virtual machines, not LXCs. In any event, if you really want to check your read performance, you can just use fio in your guest.
 
"lower" and "higher" are subjective. Ceph achieves HA using raw capacity.
If you're using EC (6,2), that's about 75% storage efficiency. If you are using replicated rule with the same eight nodes, and assuming each node contributes 1 OSD, then you have only about 12.5% storage efficiency.

12.5% < 75% no?

You are far better served by just having two SEPERATE VMs each serving all those functions without any ceph at all- you'll have better fault tolerance, better performance, and better storage utilization.
Can't.

Said mini PCs only have a single 2242 M.2 NVMe SSD slot.

Each node by itself can't hold much, whereas Ceph is able to pool the storage together so that I can do/run more things on said cluster, given the limited physical space for additional or larger storage devices.

(It was cheap ($150 USD each when I bought them). And it was also my first foray into Proxmox clustering and I wanted something cheap and low power to experiment with.)

Given this physical limitation of a single 2242 M.2 slot: at the time I bought them, Microcenter had 512 GB 2242 M.2 NVMe SSDs available, of which 128 GB was partitioned off for Proxmox and the rest was for Ceph.

I'm not sure how I would be able to implement your proposed solution, given this physical limitation of the nodes themselves.

(Again, the budget for this cluster was $450 USD. With that, I have three N95 processors, a total of 48 GB of RAM, and a total of 1.5 TB of NVMe SSD storage space. If you can come up with an example of how I would implement your solution, for that price, and get the same or better performance, RAM, and total storage capacity, I'd gladly entertain your proposal if I can make my system better. But at $450 USD, I think that it would be quite a challenge. And on top of all of that, your idle power profile cannot exceed 21 W, because at idle, each node is only consuming between 4-7 W. My point is that there are reasons why and how I ended up with this deployment.)

I don't understand what you are reporting. No disk I/O moves on live migration of a shared disk, and memory only migrates for virtual machines, not LXCs.
I was running tests with both VMs and LXCs, and yes, VMs only have memory migration, but it didn't always hit 100 Gbps speeds on said 100 Gbps IB system interconnect.

You said that it wasn't designed for virtualisation workloads, but it seemed performant enough for said VM workloads, especially over 100 Gbps IB. (But even with my current mini PC cluster setup over GbE, EC still works for VMs as well.)
 
If you're using EC (6,2), that's about 75% storage efficiency. If you are using replicated rule with the same eight nodes, and assuming each node contributes 1 OSD, then you have only about 12.5% storage efficiency.
No no no. You got your math wrong.
To achieve the same availability as EC with k=6 and m=2 you need triple replication (three copies) meaning a storage efficiency of 33%. It is rarely necessary to go beyond 4 copies.
 
I am currently using EC (2,1) for my "simple" three node Proxmox HA cluster which serves my DNS, Windows AD DC, and AdGuardHome and where the LXC/VM disks reside on the distributed EC Ceph cluster.
This is not recommended and certainly not HA. With m=1 you cannot lose a single disk.
An erasure coded pool should have size=k+m and min_size=k+1 settings which would be size=3 and min_size=3 in your case.
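Spelling out those recommended settings for the (2,1) pool above, in plain Python (just arithmetic):

```python
# The (2,1) pool from this thread, with the recommended settings:
k, m = 2, 1
size = k + m        # 3: every object is stored as 3 chunks (2 data + 1 coding)
min_size = k + 1    # 3: writes need 3 chunks placeable to proceed

print(size, min_size)                                       # 3 3
# With failure domain = host and only 3 hosts, each host holds one chunk
# of every object, so losing ANY host drops availability below min_size:
print("can lose a host and keep writing:", size - 1 >= min_size)  # False
```

In other words, with size == min_size the pool stays consistent but blocks I/O the moment a single disk or host drops out, which is the point being made here.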
 
No no no. You got your math wrong.
To achieve the same availability as EC with k=6 and m=2 you need triple replication (three copies) meaning a storage efficiency of 33%. It is rarely necessary to go beyond 4 copies.
Can you show me your math so that I can learn from it?

Is the need for triple replication because it's m=2?

I would like to learn how you arrived at this conclusion, if you can expand a little further on the math that leads you to it.

(As I said, I still have a lot to learn.)
 
An erasure coded pool should have size=k+m and min_size=k+1 settings which would be size=3 and min_size=3 in your case.
But I've got that with my three nodes, no?

k = 2
m = 1

size=2+1 = 3 (which is what I have)

min_size = k + 1 = 2 + 1 = 3 (and I have three nodes).

So, I am struggling a little bit to understand how size = min_size = 3, in my case, wouldn't fit the bill.

(Like, I've had it where one of my nodes went down for whatever reason (I don't remember why it went down the last time, spontaneously), but when I brought it back up, Ceph checked everything out and everything went back to healthy status. Whilst it was down, it was still able to keep the pool limping along, in a degraded state, until I brought the 3rd node back up.)
 
Although it is about replicated pools (so no EC), the following read might serve as a hint why (outside of experiments/lab setups) it's not a good idea to go against the recommendations:
https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/

OP: I would carefully reconsider whether you actually need a cluster (live migration is also possible with the Proxmox Datacenter Manager between single nodes) or, if you want it for high availability, whether ZFS-based replication is a better fit for your needs. Another option would be to put all your disks in one dedicated NAS and use that as shared storage. If the NAS fails, of course, your whole cluster will lose access to the data, but it would still be a lot more sensible setup than your plan of running an EC pool in a non-recommended and rather dangerous way.
 
To achieve the same availability as EC with k=6 and m=2 you need triple replication (three copies) meaning a storage efficiency of 33%. It is rarely necessary to go beyond 4 copies.
K=6,M=2 results in 6 data strips per 8 total. 6/8 = 0.75
In replication you have 1 data strip per 3 total. 1/3 = 0.33

It's not exactly the "same" availability, because survivability in a replication group is much higher; you need one living OSD per PG to recover, whereas with EC 6+2 you need 6. In reality, survivability is typically fine both ways in a well-constructed pool (more than the minimum number of OSDs+nodes and at least 30% free space). It's just that EC pools are best suited for large-block, low-IOPS use cases, and replication is best suited for high IOPS, since there are so many more smaller PGs given the same quantity of hardware.
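The arithmetic above, checked in plain Python:

```python
# Storage efficiency: data fraction of raw capacity.
def ec_eff(k: int, m: int) -> float:
    return k / (k + m)

def repl_eff(copies: int) -> float:
    return 1 / copies

print(f"EC 6+2 : {ec_eff(6, 2):.2f}")    # 0.75
print(f"3x repl: {repl_eff(3):.2f}")     # 0.33

# Recovery requirements per placement group (chunks/copies that must survive):
print("replication needs to recover from:", 1, "copy")
print("EC 6+2 needs to recover from:", 6, "chunks")
```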
 