Ceph - question before first setup, one pool or two pools

ilia987 · Oct 24, 2019

Hi,
I just migrated our Proxmox cluster to v6, and now I am starting to plan and purchase hardware to replace/upgrade our storage to Ceph.
Until now we had QNAP filer servers, and it is time to go Ceph.

we have two major pools:

LXC/VM/DB containers (total size is small, a few TB and growing slowly)
very read-intensive file storage (1-10 GB) that will take most of the storage and should eventually grow to a few hundred TB

We are going to get (in the first stage) approx. 20 4 TB U.2 NVMe SSDs and split them across 5 dedicated servers.
My main question is how to split the storage.

The first pool, for containers, must be as redundant as possible, so 2 (or 3) replicas.
For the second pool, 2 replicas are more than enough.

Should I split it from the beginning into two pools, or make one larger pool?

In the future the next batch of SSDs might be different (probably not the same speed, but the same capacity). What should I do then?

Alwin · Oct 24, 2019

ilia987 said:
We are going to get (in the first stage) approx. 20 4 TB U.2 NVMe SSDs and split them across 5 dedicated servers.
My main question is how to split the storage.

Run it through pgcalc for this. https://ceph.com/pgcalc/
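
For a feel of what the calculator does, here is a rough Python sketch of the rule of thumb usually associated with pgcalc: roughly 100 PGs per OSD, divided by the replica count, rounded to a power of two. The function name, the default target of 100 and the rounding rule are assumptions written from memory, not taken from this thread; the calculator itself remains the reference.

```python
def suggested_pg_count(osd_count, replica_size, data_fraction=1.0,
                       target_pgs_per_osd=100):
    """Rule-of-thumb PG count for one pool (assumed formula, see lead-in)."""
    raw = osd_count * target_pgs_per_osd * data_fraction / replica_size

    # Find the largest power of two that does not exceed the raw value.
    power = 1
    while power * 2 <= raw:
        power *= 2

    # If that is more than 25% below the raw value, take the next power of two.
    if power < raw * 0.75:
        power *= 2
    return power


# 20 OSDs as in the first purchase stage, one pool holding all data.
for size in (2, 3):
    print(f"size={size}: {suggested_pg_count(20, size)} PGs")
```

With two pools, `data_fraction` would be the share of the data each pool is expected to hold, so the per-pool results add up to a sensible total per OSD.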

ilia987 said:
The first pool, for containers, must be as redundant as possible, so 2 (or 3) replicas.
For the second pool, 2 replicas are more than enough.

A size of 2 is never a good idea (especially in small setups); the chance is high that a subsequent failure kills the remaining PG and the data is lost.

ilia987 said:
Should I split it from the beginning into two pools, or make one larger pool?

This depends on the requirements, you can use class based rules to reflect the disk type and let Ceph distribute the data accordingly.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#pve_ceph_device_classes

In general, see our preconditions for Ceph.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition

ilia987 · Oct 24, 2019

Alwin said:
Run it through pgcalc for this. https://ceph.com/pgcalc/

ok thanks.

Alwin said:
A size of 2 is never a good idea (especially in small setups); the chance is high that a subsequent failure kills the remaining PG and the data is lost.

I know, but we still need to think about it because it makes a big difference in available size (we are based on SSD and not HDD, and the price of the storage already takes a major part of the entire system).
We can recover that data quite easily. This pool will be used mainly for simulation tasks, to enhance the read speed of our worker (grid) nodes.
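
To put a number on that difference in available size, here is a minimal Python sketch comparing replica size 2 and 3 for the 20 x 4 TB drives mentioned above. It only divides raw capacity by the replica count; the function name is made up for illustration, and real clusters also need free headroom for recovery and rebalancing, which this ignores.

```python
def usable_capacity_tb(drive_count, drive_size_tb, replica_size):
    """Raw capacity divided by the number of replicas (no overhead modeled)."""
    return drive_count * drive_size_tb / replica_size


# First purchase stage from the thread: 20 drives of 4 TB each.
for size in (2, 3):
    usable = usable_capacity_tb(20, 4, size)
    print(f"size={size}: {usable:.1f} TB usable out of 80 TB raw")
```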

Alwin said:
This depends on the requirements, you can use class based rules to reflect the disk type and let Ceph distribute the data accordingly.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#pve_ceph_device_classes

This is for a later stage. Hopefully someone with more experience will deal with it; if not, I'll have to learn it.

Alwin said:
In general, see our preconditions for Ceph.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition

i think our setup passes all the preconditions

Waiting for everything to arrive so we can start playing and testing (speed/IOPS/latency). We expect it to be much faster than our current QNAP and FreeNAS servers.

Thanks

Alwin · Oct 24, 2019

ilia987 said:
We can recover that data quite easily. This pool will be used mainly for simulation tasks, to enhance the read speed of our worker (grid) nodes.

If it is throwaway data, then the concern is of course different.

ilia987 said:
This is for a later stage. Hopefully someone with more experience will deal with it; if not, I'll have to learn it.

It is quite easy to do, and Ceph takes care of data distribution.

ilia987 said:
Waiting for everything to arrive so we can start playing and testing (speed/IOPS/latency). We expect it to be much faster than our current QNAP and FreeNAS servers.

For benchmarks, please see our Ceph benchmark paper and the corresponding forum thread.
https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

ilia987 · Oct 24, 2019

Alwin said:
If it is throwaway data, then the concern is of course different.

Nope, it is not throwaway data, but static data, each file 2-4 GB
(but also stored on another large, slow server, mainly for backup).

Alwin said:
For benchmarks, please see our Ceph benchmark paper and the corresponding forum thread.
https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

I saw them, but they are based on a different setup, and I did not see a multi-client benchmark.
In our case we will have 5 Ceph servers and a few hundred clients, each requesting a few GB every minute (the data is needed in bursts and is not spread evenly over time).
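
As a rough way to turn "a few hundred clients, a few GB every minute" into a bandwidth figure, here is a small Python sketch. The values 200 clients and 3 GB per minute are placeholders chosen purely for illustration, not numbers from our setup, and since the load comes in bursts the real peak will be higher than this average.

```python
def average_demand_gb_per_s(client_count, gb_per_client_per_minute):
    """Average aggregate read demand in GB/s, assuming evenly spread requests."""
    return client_count * gb_per_client_per_minute / 60


# Placeholder values standing in for "a few hundred" and "a few GB".
demand = average_demand_gb_per_s(client_count=200, gb_per_client_per_minute=3)
print(f"{demand:.1f} GB/s on average, about {demand * 8:.0f} Gbit/s on the wire")
```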

Ingo S · Oct 24, 2019

This is really interesting. Could you keep me up to date about your performance findings etc. and which HW you used (SSD type, controller type)?
Is this data read in large chunks (seq. reads), or is it large amounts of random data (rand. reads)?

I'm interested in building a separate SSD pool to enhance IO capability for IO-intensive workloads. Since our pool is HDD based, its IOPS capability is quite low.

ilia987 · Oct 24, 2019

Ingo S said:
This is really interesting. Could you keep me up to date about your performance findings etc. and which HW you used (SSD type, controller type)?
Is this data read in large chunks (seq. reads), or is it large amounts of random data (rand. reads)?

Sure.

Now it seems that our budget is lower than what I hoped for (SSD only), so I might have to go mixed (hopefully I am wrong).

We don't have much of an IO issue on the large pool (each job loads a large sequential file), but we do have a throughput issue.
I guess an average HDD can give 100 MB/s (the base is 200 MB/s, but let's assume we have some overhead and the data is not perfectly aligned). So at a peak of 50 simultaneous requests of 1 GB each, 50 GB of data needs to be passed from 5 filer servers to 50 clients. That's 5 GB/s, and for that I need 50 HDDs to keep the servers from being IO-bound (I guess I'll try to raise our budget). That is not possible (price/value/hardware we need), so back to the drawing board and the budget board.
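
The same back-of-the-envelope estimate as a small Python sketch. The 10-second delivery window is an assumption, chosen only because it makes the 50 GB burst come out at the 5 GB/s used above; the function name is made up, and replication, network limits and Ceph overhead are not modeled.

```python
import math


def hdds_needed(target_mb_per_s, per_hdd_mb_per_s):
    """Number of drives needed if sequential read throughput simply adds up."""
    return math.ceil(target_mb_per_s / per_hdd_mb_per_s)


burst_gb = 50 * 1   # 50 simultaneous requests of 1 GB each
window_s = 10       # assumed delivery window, so that the burst equals 5 GB/s
target_mb_per_s = burst_gb * 1000 / window_s

for per_hdd in (100, 200):  # pessimistic and nominal per-drive throughput
    count = hdds_needed(target_mb_per_s, per_hdd)
    print(f"{per_hdd} MB/s per HDD: {count} HDDs for {target_mb_per_s:.0f} MB/s")
```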

Anyway, I'll post a full update here when I receive the hardware and run some tests.

Ingo S · Oct 25, 2019

Yeah, that doesn't quite fit. Our Ceph cluster consists of 4 nodes with 8 HDDs each. With enough threads, we get an average throughput on sequential writes that saturates our 10G Ethernet link. Single-thread performance is much lower: ~138 MB/s read, ~96 MB/s write with a 16 GB file.

ilia987 · Nov 3, 2019

Hi,

before I give the green light and place a large hardware order, I have some questions that I need to understand.

Unfortunately our budget is lower than my expectations. We can't go for SSD (NVMe)-only storage, and we will probably get a mix of U.2 NVMe and 3.5" HDDs.

We are getting 5 nodes, each with something like:

2 x (12-16)-core CPU
12 x 32 GB RAM
2-4 U.2 NVMe each (for VM/LXC)
6-8 HDDs (6 TB each or more) (read-only data)
Intel Optane for Proxmox and Ceph monitor + journal - the Optane has lower read/write throughput but better latency and endurance

Intel Optane: is it possible to install Proxmox and the Ceph monitor + journal on it?
If I pick this setup, how should I configure the storage? One SSD pool and one HDD pool?

It does not meet our demands, but it is within our budget. Later we will add more nodes.

Alwin · Nov 4, 2019

ilia987 said:
Intel Optane: is it possible to install Proxmox and the Ceph monitor + journal on it?

yes.

ilia987 said:
If I pick this setup, how should I configure the storage? One SSD pool and one HDD pool?

Up to you. This is more about the required IOps and space.

ilia987 · Nov 4, 2019

Alwin said:
yes.

Up to you. This is more about the required IOps and space.

On the large pool (read-only) we don't need high IOPS; a few thousand should be enough. We care only about throughput.
Is it correct to assume that each HDD I add to the pool will scale read speed in proportion to its actual read speed? For example, if a drive gives 200 MB/s read, will each drive I add give me 150-200 MB/s?

Alwin · Nov 4, 2019

ilia987 said:
On the large pool (read-only) we don't need high IOPS; a few thousand should be enough. We care only about throughput.
Is it correct to assume that each HDD I add to the pool will scale read speed in proportion to its actual read speed? For example, if a drive gives 200 MB/s read, will each drive I add give me 150-200 MB/s?

Throughput grows with more nodes and OSDs.

ilia987 · Nov 4, 2019

yes, but by how much ?

Alwin · Nov 4, 2019

ilia987 said:
yes, but by how much ?

This depends on the hardware and its setup. Ceph has benchmark tools built in[0] to test your setup. Ceph reads in parallel. Writes go to a primary OSD first, which takes care of distributing the data to the other OSDs (replicas). Once all OSDs have written successfully, the ACK is returned to the client.

[0] https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

ilia987 · Nov 4, 2019

Alwin said:
This depends on the hardware and its setup. Ceph has benchmark tools built in[0] to test your setup. Ceph reads in parallel. Writes go to a primary OSD first, which takes care of distributing the data to the other OSDs (replicas). Once all OSDs have written successfully, the ACK is returned to the client.

[0] https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

We need the high throughput for multiple clients in parallel. We will probably have 5 servers (for Ceph) and a few hundred clients that perform reads. I am trying to estimate the read throughput to make sure it will be within our requirements.

Ingo S · Nov 5, 2019

This is a really difficult task. Ceph reads and writes in parallel and acks when all OSDs have written their copies of that block. That means, if you write a single large file, every single block will be written into an object and, assuming you have a 3/2 pool size, this block will get another 2 copies (written in parallel) on different OSDs spread out over the cluster. Once that write is acked, the next block will be written. So on single-file writes, even when they are sequential, you will get the performance of about 1 HDD (a little less because of Ceph overhead).
With every further write operation that runs in parallel to the first, you gain extra speed in total. The first write will not slow down, and every other write operation will be approx. as fast as a single drive, until either your Ceph network, the controller/PCIe bandwidth of your server, or your disks are saturated with writes, or until your IOPS budget is used up (if total IOPS is lower than needed to saturate your setup, e.g. if write blocks are small).

For write operations this means:

If you need huge throughput with somewhat low parallelism, you need very fast drives, e.g. if only a few users open very large files.
If you have lots of parallel reads/writes, your throughput depends on the IO size and the bandwidth of your hardware.
If IO size is large and sequential, you can get high throughput even with hard disks. Overall throughput grows linearly with parallelism and OSD count.
If IO size is small and random, throughput still grows with OSD count but relies heavily on the IOPS of the disks.

For Read operations it is somewhat the same, except that for reads there is no distribution of your block copies to other osds. So reads are generally more parallel than writes.
I did some tests and found that anyway, sequential read performance of a single thread is very slow and most of the time way below the performance of a single disk. As with writes, overall read speed scales with parallelism.

On our cluster, single-threaded seq. read is about 40 MB/s with 4 nodes and a total of 32 HDDs (SATA 7200 rpm); with two threads this grows to about 72 MB/s, and with 10 threads we get 360 MB/s. This grows pretty linearly until we reach 1 GB/s due to our network bandwidth of 10 Gbit/s.
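
To make the scaling concrete, here is a very simple Python model of those measurements: aggregate read throughput grows linearly with the number of parallel threads until the network caps it. The 36 MB/s per thread is just the 10-thread figure divided by 10 and the 1000 MB/s cap stands for the 10 Gbit/s link; both are rough assumptions for this sketch, not a prediction for other hardware.

```python
import math


def aggregate_read_mb_per_s(threads, per_thread_mb_per_s=36.0,
                            network_cap_mb_per_s=1000.0):
    """Linear scaling with thread count, capped by the network link."""
    return min(threads * per_thread_mb_per_s, network_cap_mb_per_s)


for threads in (1, 2, 10, 50):
    print(f"{threads:>2} threads: {aggregate_read_mb_per_s(threads):.0f} MB/s")

# Number of threads at which this model first hits the network cap.
print("threads to saturate the link:", math.ceil(1000.0 / 36.0))
```

In this model each reader still sees only the per-thread rate; only the total grows with parallelism.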

ilia987 · Nov 5, 2019

Ingo S said:
If IO size is large and sequential, you can get high throughput even with hard disks. Overall throughput grows linearly with parallelism and OSD count.

That is the response I was looking for.

I am leaning toward Ceph over other solutions, with the following specs (all spread over 5 nodes):

read/write pool for LXC hosts based on 10x high-end U.2 NVMe SSDs, over 3 GB/s r/w with over 500k/150k read/write IOPS
read pool (99.9% read, 0.1% write) based on 30 WD Red Pro 6 TB

and a small Optane for the journal and the Proxmox OS.

My current question, if I go for this setup:

What do I need to do in order to install Proxmox + journal + monitor on the Optane (to have the lowest latency and the highest IOPS)?
Because when I install Proxmox on an HDD/SSD, it takes 100% of the capacity.

Ingo S said:
Ceph reads and writes in parallel and acks when all OSDs have written their copies of that block. That means, if you write a single large file, every single block will be written into an object and, assuming you have a 3/2 pool size, this block will get another 2 copies (written in parallel) on different OSDs spread out over the cluster. Once that write is acked, the next block will be written. So on single-file writes, even when they are sequential, you will get the performance of about 1 HDD (a little less because of Ceph overhead).

Just to better understand how Ceph works:

If I write a large file, is the file split across multiple OSDs? And if so, should the write speed be more than that of a single HDD?

Ingo S · Nov 5, 2019

ilia987 said:
That is the response I was looking for.

But please be aware that only overall throughput will increase with the number of users who access data. Every user will still experience only a data rate that is about equal to single-thread performance. I only wanted to make this very clear.

ilia987 said:
My current question, if I go for this setup:

What do I need to do in order to install Proxmox + journal + monitor on the Optane (to have the lowest latency and the highest IOPS)?
Because when I install Proxmox on an HDD/SSD, it takes 100% of the capacity.

I don't know of an option that allows you to partition your disk space in the initial setup of PVE. You will need to install PVE on the NVMe drive, then boot some rescue system, resize the filesystems and the LVM, and then use the rest for the other purposes. If you know how to resize LVM and ext4, this might not be a big issue.

Edit: *spelling*

Alwin · Nov 5, 2019

To add to @Ingo S's post: don't install Proxmox VE on the NVMe when it is also used for the OSD DB+WAL. These are separate concerns and are hard to control together. If there are enough IOps left after hooking all the OSDs to the NVMe, you can move the MON DB (/var/lib/ceph/ceph-mon) to a partition on the NVMe to get better latency.

Ingo S · Nov 5, 2019

For installing the PVE OS it will be sufficient, if there is enough space in the case, to just put another small SSD in there. We use 32GB Intel Optane SSDs for the OS with an NVME-> PCIe adapter. If your Servers have an onboard M.2 Slot for such a drive, you could use that instead.

@Alwin
In discussing this topic a question has come to my mind:
Why is sequential single-threaded reading that slow? I would assume Ceph prepares reading of sequential blocks from other objects while the previous object is still being read. I'm not sure, but I think this is what is called read-ahead.
Might it be that this "slowness" is caused by the latency of looking up where the data is (on which OSD, which PG and which object)? Can moving the MON DB improve on this significantly? I'm not even sure where our MON DB is stored right now.

...
