Ceph in 2026 HDD

Lost_My_Ones_and_Zeros · Jun 26, 2026

With the endless pain of only a few companies taking all inventory from the normal people... The hard question has to be asked in a day when HDDs are mostly archive or cold tier. I haven't used an HDD outside of backup targets in years... It's a bad time to be in a hardware refresh cycle.

Proxmox 9.2.x
Ceph - Current enterprise repos.

I've read the forums and most threads are about older versions of Ceph, largely based on Proxmox 7 or early 8 versions. I know considerable changes have taken place with Ceph and it has gotten a lot better over the years with smaller cluster setups. I have it on several small 3-node clusters and have no problems for the workload it delivers. Latency is a little higher than ZFS, which isn't a surprise as it's network-based, not local storage access.

Looking to do a Ceph cluster: 5 nodes, with a dedicated 25Gb or 100Gb network for Ceph alone.
2 SSD OSDs (enterprise drives)
6 HDD OSDs (SAS-based)
Erasure code for pool 8/3 should work.

Primary workload is massive file shares, low-utilization VMs, and a few DB servers. In the current environment we generally transfer a little over 200TB a day, with lots of people working on the same projects. Not a lot overall, but not a small number either. What are the general limits of HDDs with Ceph? Latency and throughput are large considerations, as our current aging vendor setup is a little lackluster. It hasn't had a problem delivering maxed-out 1Gb connections to several users for the past few years; it's just slow to start, which isn't ideal. We still run several database servers on-prem, and 3ms is generally considered the maximum latency for access, so I'm aiming to stick to that or go lower if possible. I'm not sure how Ceph handles a tiered approach, as I wouldn't have any DB or WAL drives if I wanted to use the SSDs as the main tier. Tiering with Ceph is like voodoo to me; it's something I just haven't done.

Not likely that I'll need 100Gb NICs with HDDs in the mix, and bonded 25Gb should be more than fine for 5 nodes. Math says I wouldn't hit the limit of a single 25Gb port in real-world use cases.
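The bandwidth math mentioned above can be sketched roughly as follows. This is only a back-of-the-envelope estimate: it assumes decimal terabytes, traffic spread evenly over the 5 nodes, and a purely hypothetical peak factor of 3 to account for load bunching up during working hours. It does not model the extra replication or erasure-coding traffic that Ceph generates between OSDs, so real numbers on the Ceph network would be higher.

```python
SECONDS_PER_DAY = 86_400


def avg_gbit_per_s(tb_per_day: float) -> float:
    """Average line rate in Gbit/s for a daily volume given in decimal TB."""
    bits_per_day = tb_per_day * 1e12 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e9


def per_node_gbit_per_s(tb_per_day: float, nodes: int, peak_factor: float = 1.0) -> float:
    """Per-node share, assuming traffic spreads evenly across all nodes."""
    return avg_gbit_per_s(tb_per_day) / nodes * peak_factor


# 200 TB/day and 5 nodes come from the post; peak_factor=3.0 is an assumption.
print(f"aggregate average: {avg_gbit_per_s(200):.1f} Gbit/s")
print(f"per node average:  {per_node_gbit_per_s(200, 5):.1f} Gbit/s")
print(f"per node at peak:  {per_node_gbit_per_s(200, 5, peak_factor=3.0):.1f} Gbit/s")
```

Under these assumptions the client-side load per node stays well below a single 25Gb port, which is consistent with the estimate above, but the margin shrinks once inter-OSD traffic is added.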

Looking for an honest discussion, as versions change over time. Last time I used HDDs with Ceph I swore off Proxmox and Ceph altogether for years because of just how bad it was. Using SSDs I don't have too many issues... Not none, just not many. With current pricing on SSDs I'm into the 7-figure mark to get the capacity I would need, which just a year ago when I did the same quote was sub-350k all in. This s*** has gotten out of hand, which is why HDDs are back in the mix.

cyruspy · Jun 26, 2026

Volume seems low, go with SAS/SSD or NVMe. Ceph per se doesn't do tiering. Unless you are able to do an educated placement of vdisks per pool per requirement, don't go with rotational disks.

Lost_My_Ones_and_Zeros · Jun 26, 2026

Problem is I'm looking to cut cost, so HDD in today's market is the only way to go. I have no interest in trying to spend around a million USD for SSDs today.

How bad is spinning rust with Ceph in the newer version? Quote from this morning was 69k per server with HDD and SSD. The SSD-only option ("preferred obviously") was over 190k per server. Obviously that's a hard one to justify...

cyruspy · Jun 26, 2026

Well, I've seen happy users purchasing second hand Enterprise SSD/SAS disks. Would that work?

Money is going to NVMe these days.

I've seen Ceph from other vendors with bcache to combine SSD and HDD. Haven't done it with PVE.

On the other hand, can you just create 2 pools and create disks per role in the correct pool?

Lost_My_Ones_and_Zeros · Jun 26, 2026

Negative, this is a production place that can't stand downtime. These will stay in place for 5 to 7 years. I personally don't mind second-hand as I use them for labs, but never production.

Yeah no kidding.

bcache seems a little like rbd cache? Wouldn't setting the rbd cache closer to the actual cache on the HDD be beneficial? It defaults to 32MB when most drives now have 256MB, so setting it to 192MB seems like a safe spot so as not to feed too much, but this is just a guess as I'm not really familiar with it.

RobFantini · Jun 26, 2026

Have you already purchased the motherboards?

If not, price out single-CPU systems with plenty of memory. The savings over dual CPU (assuming you were pricing that out) may leave cash to afford NVMe drives or data center grade SSDs.

cyruspy · Jun 26, 2026

RobFantini said:
Have you already purchased the motherboards?

If not, price out single-CPU systems with plenty of memory. The savings over dual CPU (assuming you were pricing that out) may leave cash to afford NVMe drives or data center grade SSDs.

Hmm, in the order of magnitude of things, that's not a thing anymore. AMD for example can give incentive to vendors for that to disappear.

The biggest issue is on memory and flash disks. Cost went x3-x6 depending on your volume and regular discount before the bubble.

alexskysilk · Jun 26, 2026

I think you're asking the wrong question.

what is the workload in vm count and minimum required iops/vm? how much capacity do you need?

ceph works fine with hard drives. it's just slow.

Lost_My_Ones_and_Zeros · Jun 26, 2026

CPU and board are trivial cost today.

These are prebuilt Supermicro servers. The set-and-forget build, not a desktop or workstation build. Looking for 600TB raw... While it's cheaper to roll your own, it's been a pita to source U.2 or E3 drives in quantity. I can normally find someone selling 5 or 6 at a time, then I end up with mismatched drives. Hence the 2026 HDD question. How bad is it really.

Currently I'm looking at replacing a Nutanix cluster, as they have gone the nickel-and-dime route while forcing Prism Central for everything, which eats more resources than it provides in benefit.

alexskysilk · Jun 26, 2026

Lost_My_Ones_and_Zeros said:
How bad is it really.

Ask a subjective question, get subjective answers. it's either terrible if you're looking for 500k IOPS, or great if you're looking for 100 IOPS.

Lost_My_Ones_and_Zeros · Jun 26, 2026

Yeah that's kind of expected. Guess I'll have to throw a few in a test cluster and see how the newer version performs. Everyone knows it will be dog **** but hey, it never hurts to ask, maybe some dev work went into it. My guess is with how things are now we might see some work head to the old spinning rust. I hate coding so I'm not going to volunteer.

Think it's better to go back to legacy server and SAN style, as it's easier to do ZFS with a pool of raidz2 and get decent performance out of it. Now to dust off the old "how the hell do I do active/active on ZFS again" lol. I know it can be done, it's just been a while. Cheaper to do a massive storage shelf than it is to do converged. At least with ZFS more memory is more fun.

alexskysilk · Jun 26, 2026

ceph and zfs serve different purposes. ceph is a cluster store. zfs is host attached. while you can make a zfs filer act as cluster storage, that still leaves it as a SPOF in an HA environment. If you actually want to shortcut the lab you might want to ask specific questions, e.g., what client performance have you achieved with 6 OSD nodes, each with 8 HDDs (or whatever your proposed configuration is)?

jdancer · Jun 28, 2026

Been migrating Dell VMware clusters running SAS HDDs over to Proxmox Ceph.

As you know, Ceph is a scale-out solution. More nodes/OSDs = more IOPS. These Dells are 2U 16-drive bay servers. Use 2 small SAS drives to mirror Proxmox using ZFS RAID-1. Rest of drives are OSDs. Servers are running homogeneous hardware, ie, same CPU, memory, firmware, storage controller, NICs. Clusters range from 3-nodes to 11-nodes. Obviously, the larger clusters are more performant than the smaller clusters.

Not hurting for IOPS. I use the following optimizations learned through trial-and-error. YMMV.

Code:

    Set SAS HDD Write Cache Enable (WCE) (sdparm -s WCE=1 -S /dev/sd[x])
    Set VM Disk Cache to None if clustered, Writeback if standalone
    Set VM Disk controller to VirtIO-Single SCSI controller and enable IO Thread & Discard option
    Set VM CPU Type for Linux to 'Host'
    Set VM CPU Type for Windows to 'x86-64-v2-AES' on older CPUs/'x86-64-v3' on newer CPUs/'nested-virt' on Proxmox 9.1+
    Set VM CPU NUMA
    Set VM Networking VirtIO Multiqueue to 1
    Set VM Qemu-Guest-Agent software installed and VirtIO drivers on Windows
    Set VM IO Scheduler to none/noop on Linux
    Set Ceph RBD pools to use 'krbd' option
    Set Ceph 'bluestore_prefer_deferred_size_hdd = 0' in osd stanza in /etc/pve/ceph.conf for SAS HDD
    Set Ceph 'bluestore_min_alloc_size_hdd = 65536' in osd stanza in /etc/pve/ceph.conf for SAS HDD
    Set Ceph Erasure Coding profiles to 'plugin=ISA' & 'technique=reed_sol_van'
    Set Ceph Erasure Coding profiles to 'stripe_unit=65536' for SAS HDD

J-Rod · Jun 30, 2026

Ceph caching/tiering has been deprecated.

I'm not sure why you wouldn't want RocksDB/WAL on SSDs when contemplating hard drives with Ceph, although even with that I think you could expect production workload performance to be highly variable. It could even come to a halt if Ceph needed to perform any kind of maintenance or repair.

HCI and/or Ceph is not the fix to everything. SANs are a great solution for many situations, and even budget models ship with an extensive list of features these days. When the Dell ME series arrays came out about seven years ago, they brought three-level tiering at a great price, for example, along with the usual features such as snapshots, replication, hot expansion, fast rebuilds, etc.

Nutanix AHV added external storage support last year, but only for NVMe over TCP, so of limited use in this case.

Proxmox VE supports iSCSI, but it's perhaps a double-edged sword due to the features you have to give up, such as thin provisioning. Probably still a valid solution in this scenario.

You could run your VMs in an SSD-based HCI cluster but then use iSCSI within the VMs to map to an external HDD SAN for bulk storage workloads.

guruevi · Jun 30, 2026

The main bottleneck with hard drives is that they’re relatively slow. So ~100 IOPS peak and maybe 5-10MB/s in random read write scenarios. You need to read/write 100TB/day -> 1200MB/sec -> you need 300 disks just to keep up multiplied by the desired redundancy (typically between x1.5 and x3). A Windows 11 VM expects about 500-1000 IOPS (SSD is a requirement), that is 5 spinning disks, per VM. Again, you get to the 300-1000 range just to operate a small cluster.
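The disk-count estimate above can be reproduced with a small sketch. It deliberately takes the pessimistic view from the post: all traffic is treated as random I/O, the figure is an aggregate across the whole cluster, and the redundancy multiplier is applied to the full volume. The function name and parameters are illustrative, not part of any tool.

```python
import math

SECONDS_PER_DAY = 86_400


def disks_needed(tb_per_day: float, mb_s_per_disk: float, redundancy: float = 1.0) -> int:
    """Spindles needed to sustain a daily volume at a given random-I/O rate per disk."""
    mb_per_s = tb_per_day * 1e6 / SECONDS_PER_DAY  # decimal TB/day -> MB/s
    return math.ceil(mb_per_s * redundancy / mb_s_per_disk)


# 100 TB/day is roughly 1157 MB/s on average.
print(disks_needed(100, mb_s_per_disk=10))                 # optimistic end of 5-10 MB/s
print(disks_needed(100, mb_s_per_disk=5))                  # pessimistic end
print(disks_needed(100, mb_s_per_disk=5, redundancy=3.0))  # with x3 redundancy
print(disks_needed(200, mb_s_per_disk=5))                  # the 200 TB/day from the first post
```

At the quoted 5-10MB/s per disk this works out to roughly 116 to 232 spindles before redundancy for 100TB/day; the 300 figure corresponds to about 4MB/s per disk. Sequential workloads such as large file copies would need far fewer disks than this worst case suggests.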

That is regardless of whatever solution you pick, you can’t get around the physics that disks are relatively slow. Disks are still used, but you need hundreds of them, at which point, you are talking about petabyte archival systems.

At this point your solution is irrelevant. ZFS won’t outperform Ceph on hard drives, given the same loads. Your total operating and ownership costs will likely be higher than with SSDs, depending on your capacity.

J-Rod · Jun 30, 2026

Actually, you can easily "get around the physics" of hard drives by using caching/tiering, which we've had for decades on RAID controllers and SANs alike. It's not something FOSS has made a priority in recent years, although there is the special VDEV in ZFS and aforementioned RocksDB/WAL for Ceph.

For RAID cards, we started with battery-backed DRAM that could act as a write cache for immediate write acknowledgement but also allow more writes to accumulate before going to disk, reducing total disk IOPs. Then later you could add dedicated SSDs for a read cache for popular data. An example of this is the LSI Logic MegaRAID CacheCade. Many server vendors such as Dell offered this feature under their own branded cards, but of course under the hood it was the LSI product.

For SANs, you also had a battery-backed DRAM in the controllers, and it was typically a lot larger, such as 8-16GB. Then you also have tiering using multiple HDD classes or mixed flash/HDD, such that you could write a lot of IOPs, but less total data to the top tiers, then later write fewer IOPs but more data to the lower tiers, as well as the obvious read caching.

Even in the late 2000s NetApp offered large DRAM-based caching, called PAM (up to 80GB), then not long after, flash-based "Flash Cache" (up to 4TB in 2010), as well as "Flex Cache" that could be a very large third level cache for reads. So even back then we were typically getting 10X more IOPs than the hard drives could theoretically provide at 10X less latency.

Some SANs like Compellent actually allowed each physical disk to host multiple RAID levels simultaneously to balance performance and capacity. So for example, writes would go to the SSD tier as a three-way mirror, then after some hours, colder data would be re-written as a much wider dual-parity RAID stripe, saving space. The hard drive tier below the SSDs was typically configured as dual-parity RAID, but could also be split in a similar manner to accept writes directly.

Also, many enterprise SAN vendors offered deduplication and compression, which further amplified the available HDD IOPs.

guruevi · Jun 30, 2026

Those were hacks and only accelerated specific workloads or patterns. Cache misses still gave the worst-case performance, and VMs and databases are both workloads that should be calculated accordingly (unlike filer NAS).

Ceph/Linux has a read cache and can be set to have a write cache too. But forcing async IO also increases risks. That is why those BBUs often became a SPOF: batteries fail, controllers fail, and RAM still scrambles in power outage situations. Beyond that, many solutions were more expensive and proprietary.

SSD and SDS improved durability and performance and reduced cost significantly. Once you get into spindle counts, BBUs, proprietary RAID controllers, and Fibre Channel, the purported cost savings go out the window.

Here are several quick quotes I made on my Dell account to get to 100TB and something that on paper will give you at least 50k IOPS:
Dell NVMe Ceph cluster (12x8T*3 node): $360k
Dell PowerStore hybrid, 20% cache, redundant (+5y license): $660k, add in support for another $40-50k. Note that from disk, you probably will get no more than ~3k IOPS.
Dell PowerVault (iSCSI hybrid w/tiering): $300k barebones (need 2 to avoid SPOF, add warranty and support)
Dell HDD (12x8T*20 node): $440k

alexskysilk · Jun 30, 2026

guruevi said:
You need to read/write 100TB/day -> 1200MB/sec -> you need 300 disks just to keep up multiplied by the desired redundancy (typically between x1.5 and x3).

that's aggregate, not to a single initiator. important distinction. and you don't fix latency with any OSD count.

J-Rod · Jul 1, 2026

Those weren't "hacks". Most storage vendors offered such features and they worked well. Battery-backed write-back caching alone provided a huge boost.

Reliability was excellent. Neglect any system or design a solution poorly and you can expect trouble. For SANs, you have dual controllers with cache mirroring between them providing redundancy, and if there was a battery or DRAM issue, the controller would switch to write-through mode temporarily to ensure data safety. Even budget solutions like the Dell PowerVault MD3000i offered this 20 years ago.

I configured a dual-controller 3-tier Dell PowerVault ME5224 with two ME512 disk shelves for a total capacity of ~300TB and the cost was $100k list. (Most Dell customers can expect a 10%-30% discount.) I'm not saying this is a solution for the OP. It's just a quick example of something for, say, a medium-sized business that needs a lot of mixed-workload storage.

Side note, a very basic tiering function is coming to TrueNAS Enterprise in v26 and v27: https://www.youtube.com/watch?v=fgEECn49MN8
Synology also recently added hot/cold tiering (it's very limited): https://kb.synology.com/en-us/DSM/help/Tiering/tiering_desc?version=7

gurubert · Jul 2, 2026

Lost_My_Ones_and_Zeros said:
Erasure code for pool 8/3 should work.

For erasure coding with k=8 and m=3 you need at least 12 nodes.
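The reasoning behind that number can be sketched as follows, assuming the CRUSH failure domain is the host (so each of the k+m chunks must land on a different node) and following the common practice of keeping one spare host so the cluster can re-create chunks after a node failure. The helper below is illustrative only; the 600TB raw figure and the 5 nodes come from earlier in the thread.

```python
def ec_summary(k: int, m: int, nodes: int, raw_tb: float) -> dict:
    """Host requirements and usable capacity for a k+m erasure-coded pool.

    Assumes failure domain = host, i.e. one chunk per node.
    """
    chunks = k + m
    return {
        "hosts_to_place": chunks,        # bare minimum to place every chunk
        "hosts_with_spare": chunks + 1,  # one extra host to allow recovery
        "fits": nodes >= chunks,
        "efficiency": k / chunks,        # usable fraction of raw capacity
        "usable_tb": raw_tb * k / chunks,
    }


# The proposed 8/3 profile on 5 nodes, plus two profiles small enough for 5 nodes.
for k, m in [(8, 3), (3, 2), (2, 2)]:
    s = ec_summary(k, m, nodes=5, raw_tb=600)
    print(f"k={k} m={m}: hosts {s['hosts_to_place']} (+1 spare = {s['hosts_with_spare']}), "
          f"fits 5 nodes: {s['fits']}, usable ~{s['usable_tb']:.0f} TB")
```

With 5 nodes, 8/3 cannot place its 11 chunks on distinct hosts, while a narrower profile fits at the cost of lower space efficiency.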
