Ceph cluster planning

Byron

Member
Apr 2, 2019
Hi people!

I'm planning a Ceph cluster which will eventually go into production, but will first serve as a testing setup.
We need 125TB of usable storage initially, with a cap of about 2PB.
The cluster will feed 10 intensive users initially, up to 100 later on. The loads are generally read-heavy (large datasets with 500KB+ files).

Currently we are planning to get started with 4 nodes (replication factor 3):
-2U server with 12 SATA slots
-EPYC 7282
-8x 32GB memory
-2x 250GB SATA SSD (e.g. Samsung 860 EVO 250GB) for the OS
-6x 18TB HDD
-2x NVMe SSD
-dual 10G NIC SFP+

The NVMe SSDs would serve as RocksDB/WAL devices, 1 SSD per 6 drives.
My main questions are:
-Is it true that without an SSD for RocksDB/WAL we will be severely limited in write performance? Cost is a big consideration.
-What kind of SSDs should be considered (size/model)? I've been reading lots of advice, ranging from a fixed 30-40GB per OSD to 4% of the OSD's capacity. Generally people seem to recommend against consumer SSDs in favor of SSDs like the Samsung 983.
-Should we get 3 SSDs instead of 2 (we intend to install 12 drives later, i.e. one SSD per 4 drives instead of one per 6)?

This is to replace a RAID10 setup (5+5 10TB drives); we aim for similar performance or better.

Thanks in advance!
 
Hi,

If you do the calculation of how much space you can use, please keep in mind that the full limits for each OSD are the following:
near full warning: 85%
full: 95%

So calculating with a bit of extra space is always good, especially if you want to use snapshots for the VMs from time to time.
Right now the rough calculation gives you 144TB of usable space in the pool (4 x 6 x 18TB = 432TB raw, divided by the replication factor of 3), but 125TB is already 86.8% of that, and this does not include any differences in space calculation resulting from base 10 vs. base 2 counting. So I would suggest adding at least one more OSD per node. This would result in roughly 75% usage, which also does not give you a lot of leeway.
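
For reference, here is that rough calculation as a small Python snippet, so it is easy to re-run with different drive counts (the node/OSD numbers are the ones from your post, the 0.85 nearfull ratio is the Ceph default):

Code:
# Back-of-envelope capacity check: 4 nodes, 6x 18TB OSDs per node,
# replication size 3, default nearfull warning ratio 0.85.
nodes = 4
osds_per_node = 6
osd_tb = 18
size = 3            # replication factor
nearfull = 0.85

raw_tb = nodes * osds_per_node * osd_tb        # 432 TB raw
usable_tb = raw_tb / size                      # 144 TB usable
target_tb = 125

print(f"usable space:            {usable_tb:.0f} TB")
print(f"125 TB fills:            {target_tb / usable_tb:.1%}")      # ~86.8%
print(f"nearfull ceiling:        {usable_tb * nearfull:.0f} TB")    # ~122 TB

# With one extra 18 TB OSD per node (7 per node):
usable_7 = nodes * 7 * osd_tb / size           # 168 TB usable
print(f"with 7 OSDs per node:    {target_tb / usable_7:.1%} used")  # ~74%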

Also keep in mind that an OSD could fail at any time; if you have enough space available, the Ceph cluster can recreate the data from the lost OSD on the remaining ones.

-Is it true that without an SSD for RocksDB/WAL we will be severely limited in write performance? Cost is a big consideration.
Depending on how fast the HDDs are, you will definitely see quite a performance gain. But if cost really is such an issue, you can try to go without them first, and if the performance isn't good enough you can get the SSDs later and recreate the OSDs with them as WAL/DB devices.

-What kind of SSDs should be considered (size/model)? I've been reading lots of advice, ranging from a fixed 30-40GB per OSD to 4% of the OSD's capacity. Generally people seem to recommend against consumer SSDs in favor of SSDs like the Samsung 983.
Definitely no consumer-grade SSDs. Also make sure NOT to buy read-intensive SSDs, as their write performance and durability are not good. Regarding sizing I am a bit out of my depth; others will most likely be able to give you more info there.
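
Just to put the two sizing rules of thumb you mentioned into absolute numbers for 18TB OSDs (a quick sketch, not a recommendation):

Code:
# What the two common DB/WAL sizing rules of thumb mean for 18 TB OSDs
# with the planned layout of 1 NVMe per 6 HDDs.
osd_tb = 18
osds_per_ssd = 6

fixed_low_gb, fixed_high_gb = 30, 40     # "fixed 30-40 GB per OSD" rule
pct_rule = 0.04                          # "4% of the OSD" rule

pct_gb_per_osd = osd_tb * 1000 * pct_rule                 # 720 GB per OSD
print(f"fixed rule: {fixed_low_gb}-{fixed_high_gb} GB per OSD -> "
      f"{fixed_low_gb * osds_per_ssd}-{fixed_high_gb * osds_per_ssd} GB per NVMe")
print(f"4% rule:    {pct_gb_per_osd:.0f} GB per OSD -> "
      f"{pct_gb_per_osd * osds_per_ssd / 1000:.2f} TB per NVMe")

The two rules end up far apart (180-240GB vs. over 4TB per NVMe), which is probably why the advice you found is so inconsistent.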

-dual 10G NIC SFP+
This is to replace a RAID10 setup (5+5 10TB drives); we aim for similar performance or better.
You have to consider that with Ceph you not only have the local IO stack down to the disks, but the same stack on multiple machines connected via the network. Latency is therefore likely to be higher. The 10G NICs will most likely not be enough and will become a bottleneck.
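
A very rough back-of-envelope for why a single 10G link can become the limit (assuming around 250 MB/s sequential per large HDD, which is optimistic but makes the point):

Code:
# Compare a node's aggregate spindle bandwidth with a single 10G link.
hdd_mb_s = 250            # optimistic sequential throughput per HDD
hdds_per_node = 6
nic_gbit = 10

node_disk_mb_s = hdd_mb_s * hdds_per_node      # ~1500 MB/s from the spindles
nic_mb_s = nic_gbit * 1000 / 8                 # 1250 MB/s line rate, before overhead

print(f"aggregate HDD throughput per node: ~{node_disk_mb_s} MB/s")
print(f"10G line rate:                     ~{nic_mb_s:.0f} MB/s")
# Replicated writes and any recovery/backfill traffic also travel over the
# network, so the links carry more than just the client IO.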

If you haven't seen them yet, check out the Ceph benchmark papers:
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2020-09-hyper-converged-with-nvme.76516/
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

The one from 2018 has some benchmarks with different network speeds.
 
Thanks for your extended reply, greatly appreciated!

If you do the calculation of how much space you can use, please keep in mind that the full limits for each OSD are the following:
near full warning: 85%
full: 95%

So calculating with a bit of extra space is always good, especially if you want to use snapshots for the VMs from time to time.
Right now the rough calculation gives you 144TB of usable space in the pool (4 x 6 x 18TB = 432TB raw, divided by the replication factor of 3), but 125TB is already 86.8% of that, and this does not include any differences in space calculation resulting from base 10 vs. base 2 counting. So I would suggest adding at least one more OSD per node. This would result in roughly 75% usage, which also does not give you a lot of leeway.
Once we go over 50-60% usage we intend to start adding drives/servers.

Depending on how fast the HDDs are, you will definitely see quite a performance gain. But if cost really is such an issue, you can try to go without them first, and if the performance isn't good enough you can get the SSDs later and recreate the OSDs with them as WAL/DB devices.
Sequential write of a single disk would be around 200-250MB/s, and our drives (HC550) have fairly low IOPS/TB, so I would expect performance to suffer. I've read about a case where, without a WAL device, write performance ended up at a third of a single disk's write performance, so we definitely need to avoid that.
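
To put our target into perspective, here is the rough sequential-write ceiling I sketched for the planned setup (best case, ignoring metadata overhead; not a benchmark):

Code:
# Rough sequential-write ceiling for the planned 4-node cluster.
nodes = 4
hdds_per_node = 6
hdd_write_mb_s = 225        # middle of the 200-250 MB/s range
replication = 3

spindle_total = nodes * hdds_per_node * hdd_write_mb_s   # ~5400 MB/s across all disks
client_ceiling = spindle_total / replication             # ~1800 MB/s best case

print(f"spindle aggregate:      ~{spindle_total} MB/s")
print(f"after 3x replication:   ~{client_ceiling:.0f} MB/s (theoretical best case)")
# Without a separate DB/WAL device, RocksDB and WAL writes land on the same
# spindles as the data, which is why real-world numbers drop far below this.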

We're looking at 3x 2TB enterprise NVMe drives per server, which have a good track record as cache drives. Not quite 4% (2.9%), but it'll be better than nothing.

Definitely no consumer-grade SSDs. Also make sure NOT to buy read-intensive SSDs, as their write performance and durability are not good. Regarding sizing I am a bit out of my depth; others will most likely be able to give you more info there.
One of the most helpful resources I could find was this blog post: "Ceph: how to test if your SSD is suitable as a journal device?" by Sébastien Han. It's a bit outdated, but still helpful. It'd be nice if we could gather more up-to-date data on the Proxmox forum.
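
The blog post uses fio; for anyone without fio handy, below is a minimal Python stand-in that measures synchronous 4k write IOPS, which is the property that matters for a DB/WAL device. The file path is just a placeholder, and this is only a rough approximation of the fio test, not a replacement for it.

Code:
# Rough stand-in for the fio sync-write test from the blog post above:
# measure 4k O_DSYNC/O_DIRECT write IOPS on the SSD under test.
import mmap
import os
import time

path = "/mnt/ssd-under-test/syncwrite.bin"   # placeholder: a file on the SSD to test
bs = 4096
duration = 10                                # seconds

fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC | os.O_DIRECT)
buf = mmap.mmap(-1, bs)                      # page-aligned buffer, required for O_DIRECT
buf.write(b"\xab" * bs)

writes = 0
start = time.monotonic()
while time.monotonic() - start < duration:
    os.pwrite(fd, buf, 0)                    # rewrite one block, flushed on every write
    writes += 1
os.close(fd)

print(f"~{writes / duration:.0f} synchronous 4k write IOPS")

Drives with power-loss protection typically do far better in this kind of test than consumer drives, which is exactly the point the blog post makes.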

You have to consider that with Ceph you not only have the local IO stack down to the disks, but the same stack on multiple machines connected via the network. Latency is therefore likely to be higher. The 10G NICs will most likely not be enough and will become a bottleneck.
This is a spinning-rust cluster, not pure flash like the setup in the 2018 benchmark paper. We will not get close to the bandwidth of what the 10G NICs can do (and if we do, we'll rejoice and add another NIC if possible).

Thanks again for taking the time to reply in good detail.
 
