CEPH: How to best prepare for selecting hardware

guyman

Renowned Member
Jan 2, 2014
Hello everybody,


We are currently deciding what a possible new storage concept might look like. Unfortunately, we can only "rely" on the howtos and information about Ceph that we have found on the internet.


What do we want to achieve:


- I'd like a cluster that can grow, for both virtual machines and storage

- I want reasonable throughput, both in MB/s for reading and writing, and of course good IOPS, especially for writes

- In addition, I want to use CephFS as a redundant file system that several VMs can use in parallel as a backend, roughly as in the mount sketch below
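What I have in mind is mounting CephFS inside the guests with the kernel client, something like this (only a sketch; the monitor address, user name and secret file are placeholders):

# mount CephFS inside a VM via the kernel client (all values are placeholders)
mkdir -p /mnt/cephfs
mount -t ceph 10.10.10.11:6789:/ /mnt/cephfs \
    -o name=cephfs-user,secretfile=/etc/ceph/cephfs-user.secret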


What we want to start with:


For the experiment (because that is all it is initially), I would like to start as follows:


- 3 servers, each with 2 x Xeon E5-2640 v4, 256 GB RAM, 6 x 10 GBit + 2 x 40 GBit NICs, 2 x 6.4 TB Samsung 1725b NVMe each, booted from 2 x SS300 Enterprise SSDs

- all 3 servers take the roles of Ceph OSD, Ceph Manager and Ceph Monitor at the beginning - but only at the beginning; in addition, I would like to run a few VMs on them, otherwise the hardware would be complete overkill

- That gives us 6 x OSD across 3 nodes; I cannot (and do not want to) fit more than 2 x NVMe via PCI Express per server, because otherwise it gets too warm / too tight in the Supermicro chassis

- I want to build the network redundantly with LACP across 2 x Juniper QFX5100 switches, i.e. per server there are 2 x 40 GBit for the Ceph OSD cluster, 2 x 10 GBit for heartbeat and Corosync, and 2 x 10 GBit for the internal communication network (a rough bonding sketch for the Ceph network follows below)
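For the Ceph OSD network I picture the bonding roughly like the following /etc/network/interfaces stanza (only a sketch; interface names and the address are placeholders, and the switch side needs a matching LACP/LAG configuration):

# LACP bond over the two 40 GBit ports for the Ceph OSD network
# (enp5s0f0/enp5s0f1 and the address are placeholder values)
auto bond0
iface bond0 inet static
    address 10.10.10.11/24
    bond-slaves enp5s0f0 enp5s0f1
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer3+4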


My questions:


- A replica count of 2 should allow the cluster to remain available, and indeed safely writable, even if one node fails completely; so with 2 x 6.4 TB per node, 3 nodes and 2 replicas I would have about 12.8 TB of space, of which I should fill at most about 80% - right?


- 2 x NVMe per node? I think 4 would certainly be better for distributing I/O, but I'm afraid it would get too tight in the case, and the heat is not negligible either


Later I would like to move the 3 monitor/manager roles to 3 small separate servers and attach these to the internal network with only 2 x 10 GBit, since 40 GBit+ should only be needed for the OSD network. In my opinion this should not be a problem? I want them separated because the monitor nodes seem very, very important to me ;-).
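If I understand the pveceph tooling correctly, moving a monitor later should roughly come down to the following (only a sketch; node names are placeholders, subcommand names may differ between Proxmox versions, and the mon ID usually matches the node name):

# on the old node: remove its monitor and manager from the cluster
pveceph mon destroy pve1
pveceph mgr destroy pve1
# on the new small server (already joined to the PVE cluster, Ceph packages installed):
pveceph mon create
pveceph mgr create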


Thank you for sharing your experiences.


Greetings,

Ronny
 
In our docs [0] you can find a rough guide on what to look for in hardware. And for comparison, see our Ceph Benchmark Paper [1] and the corresponding thread with user-contributed results [2].

- A replica count of 2 should allow the cluster to remain available, and indeed safely writable, even if one node fails completely; so with 2 x 6.4 TB per node, 3 nodes and 2 replicas I would have about 12.8 TB of space, of which I should fill at most about 80% - right?
Do not go with two replicas in a small cluster, use three, as any subsequent failure can result in data loss. For example, if a node fails there is only one copy left on another node, and the disk holding that copy might just die too. That last copy could also still be in flight and not yet written to any disk.
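As a rough capacity estimate with three replicas, assuming your planned 3 nodes with 2 x 6.4 TB each:

# raw capacity:      3 nodes x 2 x 6.4 TB = 38.4 TB
# usable at size=3:  38.4 TB / 3          = 12.8 TB
# at ~80% fill:      12.8 TB x 0.8        ≈ 10.2 TB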

[0] https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition
[1] https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
[2] https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
 
@Alwin Thanks for this, I already read it as far as I could understand it.
I plan to use 3 nodes, each with 2 x 6.4 TB NVMe ;-). Should I split one NVMe into two OSDs to get the best performance?
 
I plan to use 3 nodes, each with 2 x 6.4 TB NVMe ;-). Should I split one NVMe into two OSDs to get the best performance?
This depends on the type of NVMe, but it is certainly worth trying.
 
@Alwin Reading the Ceph docs on their site, they suggest splitting ONE NVMe into 4 OSDs. So I will try it out.
 
@Alwin Is there any hint on how to split one NVMe into 2 or 4 OSDs? I could not find anything helpful on the web; as far as I understood, working with partitions on the NVMe is not a good idea?
 
ceph-volume lvm batch -h has an option for multiple OSDs on one device.
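A sketch of how that could look (the device paths are just examples; use --report first to see what would be created):

# create two OSDs per NVMe on the given devices (example device names)
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1
# add --report to only show the planned layout without creating anything
ceph-volume lvm batch --report --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1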
 
Ok, as written there, they suggest splitting into 2 pieces, and some other document says 4 :) But I think starting with 2 OSDs per NVMe, i.e. 4 OSDs on 2 NVMe per node, should be ok. A replica setting of 3 is best, I suppose, and Ceph knows not to place all copies on the OSDs of one node ;)?

However, is there already a feature to split drives in the GUI, or is it just pure bash ;)?
 
Ok, as written there, they suggest splitting into 2 pieces, and some other document says 4 :) But I think starting with 2 OSDs per NVMe, i.e. 4 OSDs on 2 NVMe per node, should be ok. A replica setting of 3 is best, I suppose, and Ceph knows not to place all copies on the OSDs of one node ;)?
In my tests, I couldn't gain more throughput beyond two OSDs on an Intel DC P3700. By default the failure domain is the host, so only one copy resides on each host.
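You can check both on the CLI, roughly like this (the pool name "rbd" is just an example):

# show the pool's replica count
ceph osd pool get rbd size
# set it to three replicas if needed
ceph osd pool set rbd size 3
# the default replicated CRUSH rule uses the host as failure domain ("type": "host")
ceph osd crush rule dump replicated_rule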

However, is there already a feature to split drives in the GUI, or is it just pure bash ;)?
For now it will stay purely CLI.
 
I will try both one and two OSDs per NVMe, then do some benchmarks and publish them here for everyone.
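Probably with something along these lines (the pool name is a placeholder):

# 60 second write benchmark with 4M objects against a test pool
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup
# sequential read benchmark over the objects written above
rados bench -p testpool 60 seq -t 16
# remove the benchmark objects afterwards
rados -p testpool cleanup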
 
Hi, we are still using this environment for some hard tests,
however, I've run into a silly question:

using Ceph as a storage backend, the usage stats only show the assigned space, not the data actually in use,
so does it always reserve the complete assigned space?
Thanks
 
using Ceph as a storage backend, the usage stats only show the assigned space, not the data actually in use,
so does it always reserve the complete assigned space?
I don't understand. Can you post some output?
 
@Alwin
In the Ceph usage stats on the GUI it reports, for example, 2 TB used if I assign 2 TB to a VM,
but on this VM only 2 GB :) are in use. So this is a little confusing.
 
In the Ceph usage stats on the GUI it reports, for example, 2 TB used if I assign 2 TB to a VM,
but on this VM only 2 GB :) are in use. So this is a little confusing.
You need to use TRIM/Discard to reclaim the freed space.
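Roughly, assuming an example VM ID of 100 and a storage named ceph-vm (both placeholders): enable the discard option on the VM disk, use a controller that passes it through (e.g. VirtIO SCSI), and then trim inside the guest.

# enable discard on the VM's disk so freed blocks are returned to Ceph
# (VM ID, storage name and disk name are example values)
qm set 100 --scsi0 ceph-vm:vm-100-disk-0,discard=on
# inside the guest, trim all mounted filesystems
fstrim -av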
 
