CEPH: How to best prepare for selecting hardware

guyman

Renowned Member
Jan 2, 2014
Hello everybody,


We are currently deciding what a possible new storage concept might look like. Unfortunately, we can only "rely" on the howtos and information about Ceph that we have found on the internet.


What do we want to achieve:


- I'd like to have a growing cluster for both virtual machines and storage

- I want reasonable throughput, both in MB/s for reading and writing, and of course good IOPS, especially for writes

- In addition, I would like to use CephFS to get a redundant file system that different VMs can use in parallel as a backend


What we want to start with:


For the experiment (which is all it is initially), I would like to start as follows:


- 3 servers, each with 2 x Xeon E5-2640 v4, 256 GB RAM, 6 x 10 GbE + 2 x 40 GbE, 2 x 6.4 TB Samsung PM1725b, booted from 2 x SS300 enterprise SSDs

- all 3 servers take on the roles of Ceph OSD, Ceph Manager and Ceph Monitor at the beginning - but only at the beginning; in addition I would like to run a few VMs on them, otherwise the hardware would be completely oversized

- That gives us 6 OSDs across 3 nodes; I would not like to (and cannot) install more than 2 NVMe drives via PCI Express per node, because otherwise it gets too warm / tight in the Supermicro chassis

- I want to build the network redundantly with LACP over 2 x Juniper QFX5100 switches, i.e. per server there are 2 x 40 GBit for the Ceph OSD/cluster network, 2 x 10 GBit for heartbeat and Corosync, and 2 x 10 GBit for the internal communication network (a rough sketch of the bond config I have in mind is below)
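
Roughly, I imagine the bond per node looking something like this in /etc/network/interfaces (interface names and the address are only placeholders, and the two Junipers would of course need a matching LACP / MC-LAG setup on their side):

auto bond0
iface bond0 inet static
        address 192.0.2.11
        netmask 255.255.255.0
        bond-slaves enp5s0f0 enp5s0f1
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer3+4
# dedicated Ceph OSD / cluster network over the two 40 GBit ports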


My questions:


- A replica count of 2 should allow the cluster to stay available, and remain safe and writable, even if one node fails completely. With 2 x 6.4 TB per node, 3 nodes and 2 replicas I would then have about 12.8 TB of space, of which I should fill at most about 80% - right?


- Are 2 x NVMe per node enough? 4 would certainly be better for distributing I/O, I think, but I'm afraid it will get too tight in the case, and the heat is not negligible either


Later I would like to move the 3 monitor/manager roles out to 3 small servers and attach these with only 2 x 10 GBit to the internal network, since 40 GBit+ should be reserved for the OSD network. In my opinion this should not be a problem? I want them separated because the monitor nodes seem very, very important to me ;-).


Thank you for your experience.


Greeting,

Ronny
 
In our docs [0] you can find a rough guide on what to look for in hardware. For comparison with our Ceph benchmark paper [1] and user-contributed results, see the corresponding thread [2].

- A replica count of 2 should allow the cluster to stay available, and remain safe and writable, even if one node fails completely. With 2 x 6.4 TB per node, 3 nodes and 2 replicas I would then have about 12.8 TB of space, of which I should fill at most about 80% - right?
Do not go with two replicas in a small cluster, use three, as any subsequent failure can result in data loss. For example, while a node is down and only one copy is left on another node, the disk holding that last copy might just die too. That last copy could also still be in flight and not yet written to any disk.
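
As a rough calculation: with 3 nodes x 2 x 6.4 TB and three replicas you end up with about (3 x 2 x 6.4 TB) / 3 = 12.8 TB of usable space, and staying below roughly 80-85% usage is a good idea anyway. For an existing pool the replica settings could be adjusted along these lines (the pool name is just an example):

# keep three copies, allow I/O as long as at least two are present
ceph osd pool set vm-pool size 3
ceph osd pool set vm-pool min_size 2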

[0] https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition
[1] https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
[2] https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
 
@Alwin Thanks for this, I have already read it, as far as I could understand it.
I plan to use 3 nodes, each with 2 x 6.4 TB NVMe ;-). Should I split one NVMe into two OSDs to get the best performance?
 
I plan to use 3 nodes, each with 2 x 6.4 TB NVMe ;-). Should I split one NVMe into two OSDs to get the best performance?
This depends on the type of NVMe, but it is certainly worth a try.
 
@Alwin Reading the Ceph docs on their site, they suggest splitting ONE NVMe into 4 OSDs. So I will try it out.
 
@Alwin Is there any hint on how to split one NVMe into 2 or 4 OSDs? I could not find anything on the web that helped me much; as far as I understood, working with partitions on the NVMe is not a good idea?
 
ceph-volume lvm batch -h has an option for multiple OSDs on one device.
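For example, something along these lines should create two OSDs per NVMe (the device paths are just examples; use --report first for a dry run):

# dry run, shows what would be created
ceph-volume lvm batch --report --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1
# create the OSDs
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1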
 
Ok, as written there, they suggest splitting into 2 pieces, and in some other document into 4 :) But I think starting with 2 OSDs on 1 NVMe, meaning 4 OSDs on 2 NVMe per node, should be ok. Setting the replica count to 3 is best, I suppose, and Ceph knows not to place all replicas on the OSDs of one node ;)?

However, is there already a feature to split drives in the GUI, or is it pure bash only ;)?
 
Ok, as written there, they suggest splitting into 2 pieces, and in some other document into 4 :) But I think starting with 2 OSDs on 1 NVMe, meaning 4 OSDs on 2 NVMe per node, should be ok. Setting the replica count to 3 is best, I suppose, and Ceph knows not to place all replicas on the OSDs of one node ;)?
In my tests, I couldn't gain more throughput beyond two OSDs on an Intel DC P3700. By default the failure domain is host, so only one copy resides on each host.
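You can check that yourself, for example like this (the rule name may differ in your setup):

# the chooseleaf step of the rule should show "type": "host"
ceph osd crush rule dump replicated_rule
# shows how the OSDs are grouped under the hosts
ceph osd tree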

However, is there already a feature to split drives in the GUI, or is it pure bash only ;)?
For now it will stay purely CLI.
 
I will try both one and two OSDs on one NVMe, and then I will do some benchmarks and publish them here for everyone.
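Probably with something along these lines (pool name, runtime and thread count are just what I have in mind, not a recommendation):

# 60 seconds of 4M writes, keep the objects for the read test
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup
# sequential reads of the objects written above
rados bench -p testpool 60 seq -t 16
# remove the benchmark objects afterwards
rados -p testpool cleanup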
 
Hi, we are still using the environment for some hard tests,
but I have run into a silly question:

Using Ceph as a storage backend, the usage stats show the assigned space, not the data actually in use.
Does it always reserve the complete assigned space?
Thanks
 
Using Ceph as a storage backend, the usage stats show the assigned space, not the data actually in use.
Does it always reserve the complete assigned space?
I don't understand. Can you post some output?
 
@Alwin
In the Ceph usage stats in the GUI it reports 2 TB used if, for example, I assign 2 TB to a VM,
but on this VM only 2 GB :) are actually in use. So this is a little confusing.
 
In the Ceph usage stats in the GUI it reports 2 TB used if, for example, I assign 2 TB to a VM,
but on this VM only 2 GB :) are actually in use. So this is a little confusing.
You need to use TRIM/Discard to reclaim the freed space.
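Roughly, that means enabling discard on the VM disk and trimming inside the guest, for example (VM ID, storage and volume names are placeholders):

# on the PVE host: set the disk option, ideally with a VirtIO SCSI controller
qm set 100 --scsi0 ceph-vm:vm-100-disk-0,discard=on
# inside the guest: release the unused blocks
fstrim -av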
 
