Ceph sizing recommendation

Dec 7, 2022
Ceph experts:

I plan to deploy 6 x R740 for the lab environment. Each R740 has 24 x 16 TB HDDs, a PERC H730P in HBA mode, a 2-port 40 GbE NIC, 2 x Intel Xeon 3204 CPUs, and 256 GB RAM. The OS drive is a 480 GB SSD. I also plan to install a write-intensive enterprise 3.2 TB NVMe drive in each R740.

I am running PVE 7.4.3.

I plan to have 3 x monitor and 1 x manager (one of the R740s will have both a monitor and a manager).

Question 1: I plan to create 24 OSDs on each server; each OSD will be a dedicated 16 TB HDD. Should I use the same HDD for the WAL, or should I use partitions on the write-intensive enterprise 3.2 TB NVMe?

Question 2: For the 3 x monitors, should I relocate their storage from the default location, which is the boot OS SSD, to the write-intensive enterprise 3.2 TB NVMe? What's the procedure to do so?

Question 3: For the 1 x manager, should I relocate its storage from the default location, which is the boot OS SSD, to the write-intensive enterprise 3.2 TB NVMe? What's the procedure to do so?

Question 4: What's the best way for me to get the most performance out of this cluster? (This cluster will mainly serve as a high-performance file system for large-block reads and writes; it will not be used for VMs or containers.)

Thank you!
 
I can give some answers to the monitor/manager questions. On DB/WAL I'm not sure, as Proxmox recommendations differ from Ceph recommendations, which in turn differ from the real-life numbers I read on the Ceph mailing lists.

Question 2: For the 3 x monitors, should I relocate their storage from the default location, which is the boot OS SSD, to the write-intensive enterprise 3.2 TB NVMe? What's the procedure to do so?
What would you like to relocate? If you're talking about the monitors, there's no need to put them on a different disk. Just make sure you have an enterprise SSD with power-loss protection. I would personally use a second disk and run a ZFS mirror.
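If you just want to see what would actually be relocated: the monitor store lives under /var/lib/ceph on the OS disk by default. A quick check (paths are the PVE defaults; the mon ID is normally the node name):

# default monitor store location on a PVE node
du -sh /var/lib/ceph/mon/ceph-$(hostname)/

It's normally only a few hundred MB, though it can grow during long recovery/rebalance, so just leave some headroom on the boot SSD.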

Question 3: For the 1 x manager, should I relocate its storage from the default location, which is the boot OS SSD, to the write-intensive enterprise 3.2 TB NVMe? What's the procedure to do so?

see above.

Question 4: What's the best way for me to get the most performance out of this cluster? (This cluster will mainly serve as a high-performance file system for large-block reads and writes; it will not be used for VMs or containers.)

Do you want to use CephFS only? If it's not for VMs/CTs, why not use Ceph upstream?
 
Question 1: I plan to create 24 OSDs on each server; each OSD will be a dedicated 16 TB HDD. Should I use the same HDD for the WAL, or should I use partitions on the write-intensive enterprise 3.2 TB NVMe?
You CAN use the same HDD for the WAL. It is the safest thing to do, but also the least performant. If you do want a separate DB/WAL device, the rule of thumb is 4% of the data size for DB/WAL, with no more than 4 DB/WAL partitions per NVMe. With 16 TB disks, that works out to 640 GB per OSD, so 4 OSDs per 3.2 TB DB NVMe; you'd need 6 NVMe drives per OSD node.
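If you go the NVMe route, it looks roughly like this per spinner on a PVE node. This is only a sketch: the device names are placeholders and the exact option names/units may differ on your pveceph version, so check man pveceph first.

# HDD as the data device, block.db (which also carries the WAL) carved
# out of the NVMe; size in GiB, from the 4% rule of thumb above
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1 --db_dev_size 640

# roughly the same thing with plain Ceph tooling, pointing block.db at a
# pre-created partition or LV on the NVMe:
# ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

Repeat per HDD; each run should carve another DB volume out of the NVMe, which is how you end up at 4 OSDs per 3.2 TB device.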

Understand that there are no easy tools to manage the partitions when an OSD fails; it's easy to screw up and lose all the OSDs associated with a DB drive during replacement. While not the end of the world, it is a consideration. If you're using it for CephFS, you're going to want an additional SSD pool for metadata; running metadata on spinners is PAINFULLY slow.
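For the SSD metadata pool, the usual trick is a CRUSH rule restricted to the ssd device class, roughly like this (rule and pool names are just examples, and your NVMe OSDs may have been auto-classed differently, so check first):

# see which device classes your OSDs landed in
ceph osd crush class ls
ceph osd tree

# replicated rule that only picks OSDs of class "ssd", one copy per host
ceph osd crush rule create-replicated replicated-ssd default host ssd

# pin the CephFS metadata pool to that rule (pool name depends on how you created the fs)
ceph osd pool set cephfs_metadata crush_rule replicated-ssd

You'd probably also want a matching hdd-only rule on the data pool so it doesn't spill onto the NVMe OSDs.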

Question 2: For the 3 x monitors, should I relocate their storage from the default location, which is the boot OS SSD, to the write-intensive enterprise 3.2 TB NVMe? What's the procedure to do so?
This may seem counterintuitive, but you don't really need to worry about this too much. The manager doesn't need permanent storage, and you have multiple monitors, so even if one fails no real harm is done. You might want to have a standby manager in any case :)
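Adding a standby manager is a one-liner with the stock pveceph tooling; run it on whichever node should host the standby:

# on the second node
pveceph mgr create
# ceph -s should then list it under "mgr: ... standbys: ..."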

Question 4: What's the best way for me to get the most performance out of this cluster? (This cluster will mainly serve as a high-performance file system for large-block reads and writes; it will not be used for VMs or containers.)
What's the use case? "High performance" is a very overused term and rarely means the same thing to two people. But GENERALLY speaking:

your performance will mostly be determined by:
1. the number of links and the latency of the private (cluster) network
2. the number of links and the latency of the public network
3. the number of aggregate client connections (e.g., 10 connections will yield much more performance in the aggregate than one)
4. replicated pools will yield better random performance and will scale better
5. erasure coding can yield reasonable performance for sequential I/O, especially writes, but it is much more sensitive to rebuild/rebalance storms and will perform poorly when those occur

Last comment: "hard drives" and "high performance" are almost never used in the same sentence.
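Whatever layout you pick, get a baseline number out of the raw cluster first. A rough sketch with the stock tools (pool name, PG count and runtimes are arbitrary; pool deletion usually has to be enabled before the cleanup step works):

# throwaway pool for benchmarking
ceph osd pool create bench 128 128

# 60 seconds of 4 MiB object writes from this client, 16 in flight
rados bench -p bench 60 write -b 4M -t 16 --no-cleanup
# then sequential and random reads against the objects left behind
rados bench -p bench 60 seq -t 16
rados bench -p bench 60 rand -t 16

# clean up afterwards
ceph osd pool delete bench bench --yes-i-really-really-mean-it

Run it from several clients in parallel if you want to see the aggregate effect mentioned in point 3 above.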
 
Thank you for the feedback @jsterr @alexskysilk!

We have been using Proxmox ZFS for the past year with good results. With a decent server and a large JBOD (106 x 16 TB), we are seeing 2 GB/s write throughput on a 40 GbE network. However, we also need better-performing storage than ZFS alone, which is why I am looking into Ceph-based solutions. The goal is to set up R740-based Ceph clusters that provide better-than-ZFS performance with horizontal growth and no single point of failure. Our application requires lots of client nodes writing to the back-end storage; we are writing to multiple ZFS servers at the moment, but it would be nice to have a CephFS-type solution to provide scale-out.

For the metadata pool, can I use a single 3.2 TB NVMe SSD from each of the 6 nodes, or do I need a dedicated server populated with SSDs for the metadata?

@jsterr what do you mean by ceph upstream?

I am new to Ceph; any feedback is greatly appreciated!
 
It's very unlikely that you'll beat ZFS performance with Ceph unless you have lots of nodes with lots of disks (and those disks are NVMe).
What performance values do you want to reach?

Edit: Ceph upstream means not using Proxmox for Ceph when you are not doing virtualization on it, i.e. Ceph without Proxmox.
 
We have a 40 GbE storage network. The goal is to create a Ceph pool with 3 GB/s for reads and writes. We have a couple dozen R740s with 24 x 16 TB each. How many nodes do I need to achieve that kind of performance? (Assume each server will have a 3.2 TB NVMe SSD for the metadata pool.)
 
