Recommendation between 2 solutions

I'm using Proxmox VE version 9.

I'm using Proxmox VE with 3 nodes. Each node has four 16TB HDDs. I'd like to understand which solution is better, with lower IO delay and higher reliability.

The first solution is to use ZFS RAIDZ on each node with the four 16TB HDDs and add two SSDs, one as a log device (SLOG) and one as a cache device (L2ARC).
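Roughly what I have in mind per node (pool and device names are just placeholders):

Code:
# RAIDZ1 across the four 16TB HDDs
zpool create tank raidz /dev/sda /dev/sdb /dev/sdc /dev/sdd
# one SSD as a log device (SLOG), one as a cache device (L2ARC)
zpool add tank log /dev/sde
zpool add tank cache /dev/sdf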

The second solution is to use Ceph with four 16TB HDDs for each node and add two SSDs for DB/WAL.
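For the Ceph variant I was thinking of something along these lines on each node (device names are placeholders; two HDD OSDs would share each SSD for the DB/WAL):

Code:
# one OSD per HDD, with the RocksDB/WAL placed on an SSD
pveceph osd create /dev/sda --db_dev /dev/sde
pveceph osd create /dev/sdb --db_dev /dev/sde
pveceph osd create /dev/sdc --db_dev /dev/sdf
pveceph osd create /dev/sdd --db_dev /dev/sdf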
 
With Ceph you need fast networking for your storage: 10 Gbit should be the absolute minimum, better 25 or 40 Gbit.
Your data will be on all 3 nodes for redundancy. If one node fails you can still work; if 2 nodes fail, your Ceph is no longer writeable.
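That behaviour comes from the usual replication settings. As a rough sketch (pool name is just an example):

Code:
# replicated pool with 3 copies; at least 2 copies must be available for writes
pveceph pool create vm-storage --size 3 --min_size 2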

ZFS replication works great for 2 nodes.
You could do a 2-node PVE cluster, with the third node being PBS (Proxmox Backup Server).
People always forget Backup ;)
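To illustrate the replication part, per guest it is just one job, e.g. (VM ID, target node and schedule are examples):

Code:
# replicate VM 100 to the second node every 15 minutes
pvesr create-local-job 100-0 node2 --schedule "*/15"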
 
Why do you want to use ZFS RAIDZ? With your small number of disks, a mirrored setup (RAID10-like) is faster.
IMHO, if you do not need to migrate your VMs between the nodes, ZFS is fine; IO delay will be lower since the storage is local.
If you value availability more, then Ceph will be the better option. In other words: it depends.
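A RAID10-like pool out of your four disks would look like this (pool and device names are placeholders):

Code:
# two mirror vdevs, striped together = RAID10-like
zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd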
 
How long do you want to keep struggling with this topic before you finally realize that you need SSDs if you want better IOPS?
 
I'm building another cluster in a separate location, and since I have the same HDDs I wanted to understand whether to adopt ZFS again or try Ceph again and add the SSDs for DB/WAL. I think I'll wait a little longer, sell my 16TB HDDs and buy the SSDs directly, and use Ceph. With this new cluster I need to use HA.
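For the HA part, I assume that once shared storage (Ceph) is in place it is basically just this (group name and VM ID are placeholders):

Code:
# HA group over the three nodes, then put a VM under HA
ha-manager groupadd prod --nodes "node1,node2,node3"
ha-manager add vm:100 --group prod --state started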


I tried some tests with Samsung 870 EVO 1TB 2.5" SSDs, as they were recommended to me, but the situation hasn't changed much. What do you think of the Samsung PM1643a 7.68TB 12G SAS Read Intensive (1 DWPD) 2.5" SSD, or could you suggest which ones to get, if necessary (always 2.5" SSDs)?
 
I don't know who recommended that you use consumer SSDs with ZFS or Ceph, but performance will be horrible (because they lack power-loss protection, the fsyncs for the ZFS/Ceph journal can't be cached). The PM1643a (or any other enterprise SSD with PLP) should work fine.
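You can see the difference yourself with a quick sync-write test on the storage in question (file path and size are just examples):

Code:
# 4k writes with an fsync after every write - this is where SSDs without PLP collapse
fio --name=fsync-test --filename=/tank/fio-test --size=4G --bs=4k --rw=write --ioengine=libaio --fsync=1 --runtime=60 --time_based --group_reporting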
 
The first solution is to use ZFS RAIDZ on each node with the four 16TB HDDs and add two SSDs, one as a log device (SLOG) and one as a cache device (L2ARC).

If this is the only storage "for everything, but with an unknown use case": go for striped mirrors (aka RAID10) and add a fast "Special Device" consisting of two "Enterprise Class" SSDs - which may be small (below 1% of the pool; ~30 TB --> ~300 GB).
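As a sketch (pool and device names are placeholders):

Code:
# add a mirrored special vdev for metadata (and optionally small blocks)
zpool add tank special mirror /dev/sde /dev/sdf
# optionally let small blocks land on the SSDs too
zfs set special_small_blocks=64K tank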

Just my 2 €¢...
 
I also tried RAID10, but got 50% space loss with little benefit. I think I will buy the Samsung PM1643a 7.68TB 12G SAS Read Intensive (1 DWPD) 2.5" SSDs and use them with Ceph; I hope these will be fine.
 
I also tried RAID10, but got 50% space loss with little benefit
Are you telling me that on every OSD of 8TB or so, Ceph will only use 2.666TB or so?
This is a big pet peeve for me. You don't LOSE anything. You write things multiple times so that you can lose a disk and continue functioning. It is irrational to expect to use 100% of the available disks AND handle their failure.

All fault-tolerance techniques are a tradeoff: mirrors have the best random-access performance at a cost of capacity; parity RAID (RAIDZ, EC, etc.) yields better storage utilization but poorer random performance. When designing a storage system you need to take all of those aspects into consideration.
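For your numbers, roughly (assuming RAIDZ1, two-way mirrors, and Ceph with 3 replicas):

Code:
# 4x16TB per node, 3 nodes = 12x16TB = 192TB raw
# RAIDZ1 per node:            (4-1) x 16 = 48 TB usable per node
# striped mirrors per node:   (4/2) x 16 = 32 TB usable per node
# Ceph size=3, whole cluster: 192 / 3    = 64 TB usable across the cluster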
 
Yes, of course. I was saying that if I use RAIDZ with four 16TB HDDs, I lose one disk's worth of capacity and the total usable space is about 48TB.

If I use RAID10, however, I lose two disks' worth, and I'm left with a total usable space of about 32TB.

Doing these two tests, I saw little difference in performance between them, at the expense of usable space.
 
I saw little difference in performance between them, at the expense of usable space.
Well... a RAID10 (with two vdevs) will give you double the IOPS of a single RAIDZ vdev.

Whether this is relevant depends on your use case - and on the specific test you run to check the behaviour.
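If you want to compare the layouts yourself, a quick random-I/O run on each is the easiest check (path, size and job count are just examples):

Code:
# 4k random read/write against a test file on the pool
fio --name=randio --filename=/tank/fio-test --size=8G --bs=4k --rw=randrw --iodepth=32 --numjobs=4 --direct=1 --runtime=60 --time_based --group_reporting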