Ceph in 2026 HDD

Jun 5, 2025
With the endless pain of only a few companies taking all inventory from the normal people... The hard question has to be asked in a day when HDDs are mostly archive or cold tier. I haven't used an HDD outside of backup targets in years... It's a bad time to be in a hardware refresh cycle.

Proxmox 9.2.x
Ceph - Current enterprise repos.

I've read the forums, and most threads are about older versions of Ceph, largely based on Proxmox 7 or early 8. I know considerable changes have taken place with Ceph and it has gotten a lot better over the years with smaller cluster setups. I have it on several small 3-node clusters and have no problems for the workload it delivers. Latency is a little higher than ZFS, which isn't a surprise as it's network-based, not local storage access.

Looking to do a Ceph cluster: 5 nodes, with a dedicated 25Gb or 100Gb network for Ceph alone.
2 enterprise SSD OSDs
6 SAS HDD OSDs
Erasure coding for the pool; 8/3 should work.
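
One thing worth checking before committing to that layout is whether an 8/3 profile can actually be placed on 5 nodes. The sketch below is only arithmetic, not a Ceph API call. It assumes "8/3" means k=8 data chunks plus m=3 coding chunks, and that the failure domain is the host; the function name and dictionary keys are made up for illustration.

```python
import math

def ec_placement(k: int, m: int, failure_domains: int) -> dict:
    """Summarise whether k+m chunks can each land in a separate failure domain."""
    chunks = k + m
    return {
        "chunks": chunks,
        # True only if every chunk can get a failure domain of its own.
        "one_chunk_per_domain": failure_domains >= chunks,
        # Pigeonhole bound: at least one domain must hold this many chunks.
        "min_chunks_in_fullest_domain": math.ceil(chunks / failure_domains),
    }

K, M = 8, 3   # assumption: "8/3" means k=8 data chunks, m=3 coding chunks
HOSTS = 5     # node count from the post

result = ec_placement(K, M, HOSTS)
print(result)
```

Under those assumptions the profile needs 11 separate hosts to keep one chunk per host, so on 5 nodes it would only place with a failure domain of OSD. In that case at least one host carries 3 or more chunks of the same object, which is already equal to m, so a single host outage could consume all of the redundancy.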

Primary workload is massive file shares, low-utilization VMs and a few DB servers. In the current environment we generally transfer a little over 200Tb a day, with lots of people working on the same projects. Not a lot overall, but not a small number either. What are the general limits with HDD with Ceph? Latency and throughput are large considerations, as our current ageing vendor setup is a little lackluster. It hasn't had a problem delivering maxed 1Gb connections to several users for the past few years; it's just slow to start, which isn't ideal. We still run several database servers on prem, and for those a 3ms latency target is generally considered the max latency for access; I'm aiming to stick to that or go lower if possible. Not sure how Ceph handles a tiered approach, as I wouldn't have any DB or WAL drives if I wanted to use the SSDs as the main tier. Tiering with Ceph is like voodoo to me; it's something I just haven't done.

Not likely that I'll need 100Gb NICs with HDDs in the mix, and bonded 25Gb should be more than fine for 5 nodes. Math says I wouldn't hit the limit of a single 25Gb port in real-world use cases.
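
For reference, that math can be sketched as below. This is a rough average only, and it rests on assumptions not stated in the post: that "200Tb a day" means 200 decimal terabytes of client traffic, and that the load is spread evenly over either 24 hours or an 8-hour working day. It ignores replication, erasure-coding and recovery traffic on the Ceph network, which would add to these figures.

```python
TB = 10**12  # decimal terabyte, in bytes

def avg_gbit_per_s(terabytes_per_day: float, active_hours: float) -> float:
    """Average line rate in Gbit/s if the daily volume moves within active_hours."""
    seconds = active_hours * 3600
    return terabytes_per_day * TB * 8 / seconds / 10**9

NODES = 5          # node count from the post
DAILY_TB = 200     # assumption: terabytes, not terabits

for hours in (24, 8):
    cluster = avg_gbit_per_s(DAILY_TB, hours)
    print(f"{hours:>2} h window: {cluster:.1f} Gbit/s cluster-wide, "
          f"{cluster / NODES:.1f} Gbit/s per node")
```

Spread over a full day this works out to roughly 18.5 Gbit/s across the cluster, or about 3.7 Gbit/s per node; compressed into 8 hours it is about 55.6 Gbit/s cluster-wide and 11.1 Gbit/s per node. Both sit under a single 25Gb port per node on average, though bursts and backend traffic are not modelled here.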

Honest discussion, as versions change over time. Last time I used HDD with Ceph I swore off Proxmox and Ceph altogether for years because of just how bad it was. Using SSD I don't have too many issues... Not none, just not many. With current pricing on SSDs I'm into the 7-figure mark to get the capacity I would need, which just a year ago, when I did the same quote, was sub 350k all in. This s*** has gotten out of hand, which is why HDDs are back in the mix.
 
Volume seems low, go with SAS/SSD or NVMe. Ceph per se doesn't do tiering. Unless you are able to do an educated placement of vdisks per pool per requirement, don't go with rotational disks.
 
Problem is I'm looking to cut cost, so HDD in today's market is the only way to go. I have no interest in trying to spend around a million USD for SSDs today.

How bad is spinning rust with Ceph in the newer versions? Quote from this morning was 69k per server with HDD and SSD. The SSD-only option (preferred, obviously) was over 190k per server. Obviously that's a hard one to justify...
 
Well, I've seen happy users purchasing second-hand enterprise SSD/SAS disks. Would that work?

Money is going to NVMe these days.

I've seen Ceph from other vendors with bcache to combine both SSD and HDD. Haven't done it with PVE.

On the other hand, can you just create 2 pools and create disks per role in the correct pool?
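
A minimal sketch of what the two-pool idea could look like, assuming the OSDs already carry the usual `ssd` and `hdd` device classes and that the pools already exist. The helper only builds the command strings and does not run anything. The pool names `vm-fast` and `files-bulk` and the rule naming scheme are hypothetical; `default` and `host` are the common CRUSH root and failure domain, assumed here rather than taken from the thread.

```python
def device_class_pool_commands(pool: str, device_class: str,
                               root: str = "default",
                               failure_domain: str = "host") -> list[str]:
    """Build the ceph CLI calls that pin an existing pool to one device class."""
    rule = f"replicated_{device_class}"  # hypothetical rule name
    return [
        # Create a replicated CRUSH rule that only selects OSDs of this class.
        f"ceph osd crush rule create-replicated {rule} {root} {failure_domain} {device_class}",
        # Point the pool at that rule so its data lands only on those OSDs.
        f"ceph osd pool set {pool} crush_rule {rule}",
    ]

# Hypothetical pool names: one for latency-sensitive disks, one for bulk shares.
for pool, device_class in [("vm-fast", "ssd"), ("files-bulk", "hdd")]:
    for command in device_class_pool_commands(pool, device_class):
        print(command)
```

Each VM disk would then be created on the storage backed by whichever pool matches its role, which is the manual placement described above rather than automatic tiering.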
 
Negative, this is a production place that can't stand downtime. These will stay in place for 5 to 7 years. I personally don't mind second hand, as I use them for labs, but never production.

Yeah no kidding.

bcache seems a little like rbd cache? Wouldn't setting the rbd cache closer to the actual cache on the HDD be beneficial? It defaults to 32MB when most drives now have 256MB. Seems like setting it to 192 would be a safe spot so as not to feed it too much, but this is just a guess as I'm not really familiar with it.
 
Have you already purchased the motherboards?

If not, price out single-CPU systems with plenty of memory. The savings over dual CPU (assuming you were pricing that out) may leave cash to afford NVMe or data-center-grade SSDs.
 
Hmm, in the order of magnitude of things, that's not a thing anymore. AMD for example can give incentive to vendors for that to disappear.

The biggest issue is on memory and flash disks. Cost went x3-x6 depending on your volume and regular discount before the bubble.
 
CPU and board are trivial cost today.

These are prebuilt Supermicro servers. A set-and-forget build, not a desktop or workstation build. Looking for 600Tb raw... While it's cheaper to roll your own, it's been a PITA to source U.2 or E3 drives in quantity. I can normally find someone selling 5 or 6 at a time, then I end up with mismatched drives. Hence the 2026 HDD question. How bad is it really?
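
To put the 600Tb raw figure in perspective, here is a rough capacity sketch. It assumes the figure means 600 TB, that all of it sits on the 6 HDD OSDs in each of the 5 nodes, and it reads the 8/3 mentioned earlier in the thread as an erasure-coded k=8, m=3 layout, compared against plain 3x replication. The 0.85 factor is Ceph's default nearfull warning ratio, used here only as a planning ceiling.

```python
RAW_TB = 600          # raw target from the post, read as terabytes
NODES = 5
HDDS_PER_NODE = 6     # assumption: all raw capacity is on the HDD OSDs
NEARFULL = 0.85       # Ceph's default nearfull warning ratio

def usable_tb(raw_tb: float, data_parts: int, total_parts: int,
              fill: float = 1.0) -> float:
    """Usable capacity when data_parts out of total_parts hold real data."""
    return raw_tb * data_parts / total_parts * fill

per_drive_tb = RAW_TB / (NODES * HDDS_PER_NODE)
print(f"drive size needed: {per_drive_tb:.0f} TB")

# (data parts, total parts) for each layout being compared
layouts = {"replica 3": (1, 3), "EC 8+3": (8, 11)}
for name, (data, total) in layouts.items():
    full = usable_tb(RAW_TB, data, total)
    planned = usable_tb(RAW_TB, data, total, NEARFULL)
    print(f"{name}: {full:.0f} TB usable, {planned:.0f} TB at the nearfull mark")
```

Under these assumptions that is 20 TB per drive, with about 200 TB usable on 3x replication (170 TB at the nearfull mark) versus about 436 TB on 8+3 (371 TB at the nearfull mark).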

Currently I'm looking at replacing a Nutanix cluster, as they have gone the nickel-and-dime route while forcing Prism Central for everything, which eats more resources than it delivers in benefit.
 
Yeah, that's kind of expected. Guess I'll have to throw a few in a test cluster and see how the newer version performs. Everyone knows it will be dog ****, but hey, it never hurts to ask; maybe some dev work went into it. My guess is that with how things are now we might see some work head toward the old spinning rust. I hate coding, so I'm not going to volunteer.

Think it's better to go back to the legacy server-and-SAN style, as it's easier to do ZFS with a pool of raidz2 and get decent performance out of it. Now to dust off the old "how the hell do I do active/active on ZFS again", lol. I know it can be done, it's just been a while. Cheaper to do a massive storage shelf than it is to do converged; at least with ZFS more memory is more fun.
 
ceph and zfs serve different purposes. ceph is a cluster store. zfs is host attached. while you can make a zfs filer act as cluster storage, that still leaves it as a SPOF in an HA environment. If you actually want to shortcut the lab you might want to ask specific questions, e.g., what client performance have you achieved with 6 OSD nodes, each with 8 HDDs (or whatever your proposed configuration is)?
 
Been migrating Dell VMware clusters running SAS HDDs over to Proxmox Ceph.

As you know, Ceph is a scale-out solution. More nodes/OSDs = more IOPS. These Dells are 2U 16-drive bay servers. I use 2 small SAS drives to mirror Proxmox using ZFS RAID-1. The rest of the drives are OSDs. Servers are running homogeneous hardware, i.e., same CPU, memory, firmware, storage controller, NICs. Clusters range from 3 nodes to 11 nodes. Obviously, the larger clusters are more performant than the smaller clusters.

Not hurting for IOPS. I use the following optimizations learned through trial-and-error. YMMV.

Code:
    Set SAS HDD Write Cache Enable (WCE) (sdparm -s WCE=1 -S /dev/sd[x])
    Set VM Disk Cache to None if clustered, Writeback if standalone
    Set VM Disk controller to VirtIO-Single SCSI controller and enable IO Thread & Discard option
    Set VM CPU Type for Linux to 'Host'
    Set VM CPU Type for Windows to 'x86-64-v2-AES' on older CPUs/'x86-64-v3' on newer CPUs/'nested-virt' on Proxmox 9.1+
    Set VM CPU NUMA
    Set VM Networking VirtIO Multiqueue to 1
    Set VM Qemu-Guest-Agent software installed and VirtIO drivers on Windows
    Set VM IO Scheduler to none/noop on Linux
    Set Ceph RBD pools to use 'krbd' option
    Set Ceph 'bluestore_prefer_deferred_size_hdd = 0' in osd stanza in /etc/pve/ceph.conf for SAS HDD
    Set Ceph 'bluestore_min_alloc_size_hdd = 65536' in osd stanza in /etc/pve/ceph.conf for SAS HDD
    Set Ceph Erasure Coding profiles to 'plugin=ISA' & 'technique=reed_sol_van'
    Set Ceph Erasure Coding profiles to 'stripe_unit=65536' for SAS HDD
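
As a rough illustration of the two `ceph.conf` entries in the list above, the sketch below renders them as an `[osd]` stanza using Python's standard `configparser`. It only prints text and does not touch `/etc/pve/ceph.conf`. The values are the ones quoted in the list; as far as I know the allocation size is fixed when an OSD is created, so it would only affect OSDs built after the change, which is an assumption to verify against your Ceph version.

```python
import configparser
import io

# Settings quoted in the list above, for SAS HDD OSDs.
osd_settings = {
    "bluestore_prefer_deferred_size_hdd": "0",
    "bluestore_min_alloc_size_hdd": "65536",   # 64 KiB
}

conf = configparser.ConfigParser()
conf["osd"] = osd_settings

# Render the stanza to a string instead of writing to the real config file.
buffer = io.StringIO()
conf.write(buffer)
stanza = buffer.getvalue()
print(stanza)
```

The printed stanza is what would be merged by hand into the existing `[osd]` section, followed by an OSD restart for the deferred-write setting to apply.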
 
Ceph caching/tiering has been deprecated.

I'm not sure why you wouldn't want RocksDB/WAL on SSDs when contemplating hard drives with Ceph, although even with that I think you could expect production workload performance to be highly variable. It could even come to a halt if Ceph needed to perform any kind of maintenance or repair.

HCI and/or Ceph is not the fix to everything. SANs are a great solution for many situations, and even budget models ship with an extensive list of features these days. When the Dell ME series arrays came out about seven years ago, they brought three-level tiering at a great price, for example, along with the usual features such as snapshots, replication, hot expansion, fast rebuilds, etc.

Nutanix AHV added external storage support last year, but only for NVMe over TCP, so of limited use in this case.

Proxmox VE supports iSCSI, but it's perhaps a double-edged sword due to the features you have to give up, such as thin provisioning. Probably still a valid solution in this scenario.

You could run your VMs in an SSD-based HCI cluster but then use iSCSI within the VMs to map to an external HDD SAN for bulk storage workloads.