Extending Proxmox cluster and moving to Ceph

ilia987

Active Member
Sep 9, 2019
We have a small cluster based on 10 servers and 2 storage nodes.
We are planning to add an 8-node Supermicro system (https://www.supermicro.com/en/products/system/4U/F618/SYS-F618R2-RTN_.cfm).
It will act as a Ceph cluster with 2 major pools:
  1. A fast pool based on PCIe NVMe for VMs and SQL DBs (based on 4 TB 2.5" SSDs)
  2. Another pool for a read-intensive file server (large files of 100 MB-4 GB each, 20-100 TB in total; we will add SSDs in stages)
The available, unused computation power (I estimate approx. 70% will be available) will be allocated to dedicated VMs for long computational tasks (CPU-intensive jobs).

each node will have:
  • 2x 2 TB PCIe NVMe
  • 2.5" SSDs (up to six per node; 48 in total at full capacity on 8 nodes)
  • 2x SATA DOMs for the Proxmox OS
  • 40Gb networking
  • 256 GB RAM, 2 CPUs with 10-12 cores each

my questions:

  1. Do you think this hardware is a good fit for Ceph, given our limited budget of $20k?
  2. Can I start with 4 nodes and then add more on demand? (We currently have only 4x 40G QSFP ports in the switch, 3 of them available. So I'll first install 3 nodes and configure the new storage, then deactivate the old storage and configure the 4th node. To work with more than 4 nodes we would have to buy another QSFP switch, which would put us over budget.)
  3. Will I be able to mix different sizes of SSDs? (Right now the best prices are for 4 TB, but I might get good deals on 8 TB SSDs later.)
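For a rough feel of what the fully populated SSD pool would yield, here is a quick back-of-the-envelope calculation. The 3-replica pool size and the ~80% fill target are assumptions (Ceph's common defaults/recommendations), not numbers from the thread:

```python
# Rough usable-capacity estimate for the planned 2.5" SSD pool.
# Assumes a replicated pool with size=3 (Ceph's default) and keeping
# ~20% free as recovery/backfill headroom.
ssd_tb = 4            # TB per 2.5" SSD
ssds_per_node = 6
nodes = 8
replicas = 3
fill_target = 0.8

raw_tb = ssd_tb * ssds_per_node * nodes        # 192 TB raw
usable_tb = raw_tb / replicas * fill_target    # ~51 TB practically usable
print(raw_tb, round(usable_tb, 1))
```

So the 48-SSD end state lands at roughly 51 TB of comfortably usable space with 3 replicas, which is worth checking against the 20-100 TB target for the file-server pool.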
 
The most accurate information is provided by the documentation of Ceph. Here are the hardware recommendations. Ceph OSD nodes can be added or removed at runtime (more information). Even though using same-sized drives is recommended, Ceph can operate with heterogeneous systems (more information).
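To illustrate what "heterogeneous" means in practice: by default CRUSH weights each OSD proportionally to its capacity, so a bigger drive simply receives a proportionally bigger share of the data (and of the IO). A small sketch, with a hypothetical mix of drive sizes:

```python
# CRUSH assigns each OSD a weight proportional to its capacity, so data
# (and therefore IO load) is distributed in that same ratio.
osd_sizes_tb = [4, 4, 4, 8]              # hypothetical mix of 4 TB and 8 TB SSDs
total = sum(osd_sizes_tb)
data_share = [size / total for size in osd_sizes_tb]
print(data_share)                        # the 8 TB drive takes twice each 4 TB drive's share
```

The practical consequence is that a lone 8 TB OSD also serves twice the requests of a 4 TB one, so mixing sizes works but can create per-drive hotspots.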
 
I know. I have read a lot, but I couldn't find a similar use case. We plan to buy ours soon and I would like to have some opinions on what to change in order to make the best out of the $$$.

Our budget has been increased to $30k.
 
I wouldn't recommend scale-out systems for Ceph; they very much limit the possibilities for extending/upgrading these systems. Ceph scales best with an increase in nodes and OSDs. Don't use DOM devices for PVE / Ceph: the MON DB will be located there and will kill the DOM very quickly.

See our Ceph preconditions for more information.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition
 
I thought to have 3 or 5 nodes at the beginning (and scale up to 8, and when 8 are not enough, get another 8).
What I thought of as scaling is to add OSDs until the nodes are full.

I can get some high-endurance SATA DOMs from Supermicro, the 64 or 128 GB models with 68/158 TB written endurance,
and put 2 of them in RAID; if that won't be good enough I'll have to look for some other hardware.
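For what it's worth, those TBW ratings can be turned into a sustained-write budget. Assuming a 5-year service life (my assumption, not a spec from the thread):

```python
# Convert the SATA DOM endurance ratings (68 TBW for the 64 GB model,
# 158 TBW for the 128 GB model) into an average GB-written-per-day
# budget over an assumed 5-year lifetime.
for model_gb, tbw in [(64, 68), (128, 158)]:
    gb_per_day = tbw * 1000 / (5 * 365)   # TB written -> GB/day
    print(model_gb, round(gb_per_day, 1))
```

That works out to roughly 37 GB/day (64 GB model) or 87 GB/day (128 GB model); the question is whether the MON DB plus OS logging stays under that on average.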
 
Well, that's what I am talking about. The chassis only has 6x 2.5" bays; for the amount of CPU/RAM power, it lacks storage capacity, so just adding more storage will be highly priced, as it means you will need more nodes. The same goes for the DOM devices: even if they will take 158 TB written, their latency and bandwidth are not on par with an SSD.

In general, with Ceph you want to scale horizontally as demand grows or shrinks. This means nodes with less CPU/memory power each, but more of them. This way you can always add extra compute nodes or storage (eg. NVMe pool) nodes later on, while not investing upfront in nodes that will not use up their performance.

I hope this made my point clearer. :)
 
You made it clear, but I think you did not understand me.

The CPU/RAM will not go to waste; it will be added to our Hadoop (Spark+YARN) cluster (and the storage is mainly there to support our growing cluster).
The main pool (SSDs of 4 TB each, up to 48 of them) will be read-heavy, to store the data. So 6 SSDs per node at 500 MB/s => 3000 MB/s read = 40Gb network; more than 6 SSD OSDs would be a potential bottleneck in the network (if I am not mistaken).
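A quick unit check on that estimate (MB/s on the drive side vs Gbit/s on the link side) suggests there is actually a bit more headroom than stated:

```python
# Aggregate per-node read throughput of the SSDs vs. a 40 Gbit/s link.
ssd_read_mb_s = 500
ssds = 6
node_read_mb_s = ssd_read_mb_s * ssds          # 3000 MB/s aggregate
node_read_gbit_s = node_read_mb_s * 8 / 1000   # 24.0 Gbit/s
link_gbit_s = 40
print(node_read_gbit_s, node_read_gbit_s < link_gbit_s)
```

So 6 SSDs come to about 24 Gbit/s; it would take roughly 10 such SSDs to saturate a 40 Gbit/s link (ignoring replication and protocol overhead).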

The second pool, based on 8 or 16x 2 TB PCIe NVMe, will host the OS (LXC containers).

The SATA DOM SSD has 500 MB/s read and 180 MB/s write. Is that not fast enough?

When the 8 nodes (with 6 SSDs each, 48 in total) are fully populated and we need faster storage or more storage, we can just add more nodes.

We could also go for dedicated servers just for storage and dedicated servers for computational tasks.
 
The CPU/RAM will not go to waste; it will be added to our Hadoop (Spark+YARN) cluster (and the storage is mainly there to support our growing cluster).
The main pool (SSDs of 4 TB each, up to 48 of them) will be read-heavy, to store the data. So 6 SSDs per node at 500 MB/s => 3000 MB/s read = 40Gb network; more than 6 SSD OSDs would be a potential bottleneck in the network (if I am not mistaken).
Yes, for writing it will be slower, as Ceph doesn't know locality and any write will only be acknowledged once all copies are written. Reads are done in parallel. This will need an extra network adapter and consumes the LP x16 slot (as Supermicro's micro-LP only goes up to 25GbE).

The second pool, based on 8 or 16x 2 TB PCIe NVMe, will host the OS (LXC containers).
No, these are shared: either 6x 2.5" hot-swap SATA, or 4x 2.5" hot-swap SATA + 2x SATA/NVMe hybrid.

The SATA DOM SSD has 500 MB/s read and 180 MB/s write. Is that not fast enough?
Compared to an SSD the numbers are not on par, but most likely it will be a latency issue, as the Ceph MON DB, the OS and other services will share this device.

We could also go for dedicated servers just for storage and dedicated servers for computational tasks.
This sounds to me more like the way to go, as I get the feeling that the application that will run on these nodes will need lots of CPU/memory. This way the concerns will be separated and the storage will not interfere with the application and vice versa.
 
Yes, for writing it will be slower, as Ceph doesn't know locality and any write will only be acknowledged once all copies are written. Reads are done in parallel. This will need an extra network adapter and consumes the LP x16 slot (as Supermicro's micro-LP only goes up to 25GbE).
I'll have 2x 40Gb network cards: one for read/write and one for sync.

No, these are shared: either 6x 2.5" hot-swap SATA, or 4x 2.5" hot-swap SATA + 2x SATA/NVMe hybrid.
I can't connect 6 SSDs and 2 NVMe drives simultaneously?

Compared to an SSD the numbers are not on par, but most likely it will be a latency issue, as the Ceph MON DB, the OS and other services will share this device.
I don't have a solution for that.

This sounds to me more like the way to go, as I get the feeling that the application that will run on these nodes will need lots of CPU/memory. This way the concerns will be separated and the storage will not interfere with the application and vice versa.
I try to avoid that, because it will cost more;
I don't think we have the budget to entirely separate storage from executors.

What do you think about the following? (I am trying to get a quote from Supermicro for 5 nodes at the initial stage, based on something like the 1U SYS-1028U-TR4T+.)
I'll reserve 1 core and 1 GB of RAM per OSD for Ceph; the rest will be allocated to the LXC containers.
  • 10x SSD (2 in RAID for Proxmox/MON, 8 for Ceph)
  • 1x PCIe NVMe as local scratch for the worker LXC (consumes CPU/RAM)
  • 2x PCIe NVMe for the Ceph pool
  • 2x CPU (12-18 cores each)
  • 256-384 GB RAM (depending on core count)
  • dual-port 40Gb network card
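A rough sketch of what that reservation leaves for the LXC workers. The 16-core CPUs and the 4 GB-per-OSD memory figure are assumptions (the latter is the BlueStore osd_memory_target default rather than the 1 GB rule of thumb above):

```python
# Resources left for LXC workers after reserving for Ceph on the
# proposed 1U node: 8 SATA SSD OSDs + 2 NVMe OSDs.
cores_total = 2 * 16          # assuming the 16-core option of the 12-18 core range
ram_gb_total = 384            # assuming the high end of 256-384 GB
osds = 8 + 2
core_per_osd = 1
ram_gb_per_osd = 4            # BlueStore osd_memory_target default, not 1 GB

cores_left = cores_total - osds * core_per_osd
ram_left = ram_gb_total - osds * ram_gb_per_osd
print(cores_left, ram_left)
```

With those assumptions, 22 cores and 344 GB of RAM remain for the compute containers; budgeting only 1 GB per OSD would overstate the leftover RAM by 30 GB.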
 
I'll have 2x 40Gb network cards: one for read/write and one for sync.
You mean the cluster/public network separation of Ceph?

I can't connect 6 SSDs and 2 NVMe drives simultaneously?
No, the last two slots of the chassis are shared.

I don't have a solution for that.
The new option seems more promising.

I try to avoid that, because it will cost more;
I don't think we have the budget to entirely separate storage from executors.
A good part of the money goes into the density of the system. 8 nodes in 4U surely cost more than 8 nodes in 8x 1U.

What do you think about the following? (I am trying to get a quote from Supermicro for 5 nodes at the initial stage, based on something like the 1U SYS-1028U-TR4T+.)
I'll reserve 1 core and 1 GB of RAM per OSD for Ceph; the rest will be allocated to the LXC containers.
This looks like a more flexible option. Ceph OSDs have a memory target of 4 GB, and 1 GB of memory per 1 TB of stored data is recommended; they might need more than 1 GB of RAM each. ;)

Please check out our Ceph preconditions; those requirements are written there.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition
 
You mean the cluster/public network separation of Ceph?
40Gb for Ceph sync
40Gb for public access (read/write)
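That split maps onto Ceph's public/cluster network separation; in ceph.conf it would look something like the fragment below (the subnets are placeholders, not values from the thread):

```ini
[global]
    # client-facing ("read/write") traffic
    public_network = 10.10.10.0/24
    # OSD replication / recovery ("sync") traffic
    cluster_network = 10.10.20.0/24
```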

No, the last two slots of the chassis are shared.
Good to know, I guess I missed it.


This looks like a more flexible option. Ceph OSDs have a memory target of 4 GB, and 1 GB of memory per 1 TB of stored data is recommended; they might need more than 1 GB of RAM each. ;)
Yep, my mistake. RAM is the easiest and cheapest thing to fix; for 8 SSDs of 4 TB each that's 32 GB or 64 GB more.

thanks
 
