Extending Proxmox cluster and moving to Ceph

ilia987

Active Member
Sep 9, 2019
We have a small cluster based on 10 servers and 2 storage nodes.
We are planning to add an 8-node Supermicro system (https://www.supermicro.com/en/products/system/4U/F618/SYS-F618R2-RTN_.cfm).
It will act as a Ceph cluster with 2 major pools:
  1. A fast pool based on PCIe NVMe for VMs and SQL DBs (based on 4 TB 2.5" SSDs)
  2. Another pool for a read-intensive file server (large files of 100 MB-4 GB each, 20-100 TB in total; we will add SSDs in stages)
The available, unused computation power (I estimate approx. 70% will be available) will be allocated to dedicated VMs for long computational tasks (CPU-intensive jobs).

each node will have:
  • 2x 2 TB PCIe NVMe
  • 2.5" SSDs (up to six per node; 48 in total at full capacity on 8 nodes)
  • 2x SATA DOMs for the Proxmox OS
  • 40Gb networking
  • 256 GB RAM, 2 CPUs with 10-12 cores each

my questions:

  1. Do you think this hardware is a good fit for Ceph, given our limited budget of $20k?
  2. Can I start with 4 nodes and then add more on demand? (We currently have only 4x 40G QSFP ports in the switch, 3 of them available. So I'll first install 3 nodes and configure the new storage, then deactivate the old storage and configure the 4th node. To work with more than 4 nodes we would have to buy another QSFP switch, which would put us over budget.)
  3. Will I be able to mix different sizes of SSDs? (Right now the best prices are for 4 TB, but I might get good deals on 8 TB SSDs later.)
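For a rough feel of what the fully populated SSD pool would yield, here is a quick back-of-the-envelope calculation. The 3-replica pool size and the ~80% fill target are assumptions (Ceph's common defaults/recommendations), not numbers from the thread:

```python
# Rough usable-capacity estimate for the planned 2.5" SSD pool.
# Assumes a replicated pool with size=3 (Ceph's default) and keeping
# ~20% free as recovery/backfill headroom.
ssd_tb = 4            # TB per 2.5" SSD
ssds_per_node = 6
nodes = 8
replicas = 3
fill_target = 0.8

raw_tb = ssd_tb * ssds_per_node * nodes        # 192 TB raw
usable_tb = raw_tb / replicas * fill_target    # ~51 TB practically usable
print(raw_tb, round(usable_tb, 1))
```

So the 48-SSD end state lands at roughly 51 TB of comfortably usable space with 3 replicas, which is worth checking against the 20-100 TB target for the file-server pool.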
 
The most accurate information is provided by the documentation of Ceph. Here are the hardware recommendations. Ceph OSD nodes can be added or removed at runtime (more information). Even though using same-sized drives is recommended, Ceph can operate with heterogeneous systems (more information).
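To illustrate what "heterogeneous" means in practice: by default CRUSH weights each OSD proportionally to its capacity, so a bigger drive simply receives a proportionally bigger share of the data (and of the IO). A small sketch, with a hypothetical mix of drive sizes:

```python
# CRUSH assigns each OSD a weight proportional to its capacity, so data
# (and therefore IO load) is distributed in that same ratio.
osd_sizes_tb = [4, 4, 4, 8]              # hypothetical mix of 4 TB and 8 TB SSDs
total = sum(osd_sizes_tb)
data_share = [size / total for size in osd_sizes_tb]
print(data_share)                        # the 8 TB drive takes twice each 4 TB drive's share
```

The practical consequence is that a lone 8 TB OSD also serves twice the requests of a 4 TB one, so mixing sizes works but can create per-drive hotspots.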
 
I know. I have read a lot, but I couldn't find a similar use case. We plan to buy ours soon and I would like to have some opinions on what to change in order to make the best out of the $$$.

Our budget has been increased to $30k.
 
I wouldn't recommend scale-out systems for Ceph; they very much limit the possibilities for extending/upgrading these systems. Ceph scales best with an increase in nodes and OSDs. Don't use DOM devices for PVE / Ceph: the MON DB will be located there and will kill the DOM very quickly.

See our Ceph preconditions for more information.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition
 
I thought to have 3 or 5 nodes at the beginning (and scale up to 8, and when 8 are not enough, get another 8).
What I thought of as scaling is to add OSDs until the nodes are full.

I can get some high-endurance SATA DOMs from Supermicro, the 64 or 128 GB models with 68/158 TB written endurance,
and put 2 of them in RAID; if that won't be good enough I'll have to look for some other hardware.
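For what it's worth, those TBW ratings can be turned into a sustained-write budget. Assuming a 5-year service life (my assumption, not a spec from the thread):

```python
# Convert the SATA DOM endurance ratings (68 TBW for the 64 GB model,
# 158 TBW for the 128 GB model) into an average GB-written-per-day
# budget over an assumed 5-year lifetime.
for model_gb, tbw in [(64, 68), (128, 158)]:
    gb_per_day = tbw * 1000 / (5 * 365)   # TB written -> GB/day
    print(model_gb, round(gb_per_day, 1))
```

That works out to roughly 37 GB/day (64 GB model) or 87 GB/day (128 GB model); the question is whether the MON DB plus OS logging stays under that on average.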
 
Well, that's what I am talking about. The chassis only has 6x 2.5" bays; for the amount of CPU/RAM power, it lacks storage capacity, so just adding more storage will be highly priced, as it means you will need more nodes. The same goes for the DOM devices: even if they will take 158 TB written, their latency and bandwidth are not on par with an SSD.

In general, with Ceph you want to scale horizontally as demand grows or shrinks. This means nodes with less CPU/memory power each, but more of them. This way you can always add extra compute nodes or storage (eg. NVMe pool) nodes later on, while not investing upfront in nodes that will not use up their performance.

I hope this made my point clearer. :)
 
You made it clear, but I think you did not understand me.

The CPU/RAM will not go to waste; it will be added to our Hadoop (Spark+YARN) cluster (and the storage is mainly there to support our growing cluster).
The main pool (SSDs of 4 TB each, up to 48 of them) will be read-heavy, to store the data. So 6 SSDs per node at 500 MB/s => 3000 MB/s read = 40Gb network; more than 6 SSD OSDs would be a potential bottleneck in the network (if I am not mistaken).
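A quick unit check on that estimate (MB/s on the drive side vs Gbit/s on the link side) suggests there is actually a bit more headroom than stated:

```python
# Aggregate per-node read throughput of the SSDs vs. a 40 Gbit/s link.
ssd_read_mb_s = 500
ssds = 6
node_read_mb_s = ssd_read_mb_s * ssds          # 3000 MB/s aggregate
node_read_gbit_s = node_read_mb_s * 8 / 1000   # 24.0 Gbit/s
link_gbit_s = 40
print(node_read_gbit_s, node_read_gbit_s < link_gbit_s)
```

So 6 SSDs come to about 24 Gbit/s; it would take roughly 10 such SSDs to saturate a 40 Gbit/s link (ignoring replication and protocol overhead).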

The second pool, based on 8 or 16x 2 TB PCIe NVMe, will host the OS (LXC containers).

The SATA DOM SSD has 500 MB/s read and 180 MB/s write. Is that not fast enough?

When the 8 nodes (with 6 SSDs each, 48 in total) are fully populated and we need faster storage or more storage, we can just add more nodes.

We could also go for dedicated servers just for storage and dedicated servers for computational tasks.
 
The CPU/RAM will not go to waste; it will be added to our Hadoop (Spark+YARN) cluster (and the storage is mainly there to support our growing cluster).
The main pool (SSDs of 4 TB each, up to 48 of them) will be read-heavy, to store the data. So 6 SSDs per node at 500 MB/s => 3000 MB/s read = 40Gb network; more than 6 SSD OSDs would be a potential bottleneck in the network (if I am not mistaken).
Yes, for writing it will be slower, as Ceph doesn't know locality and any write will only be acknowledged once all copies are written. Reads are done in parallel. This will need an extra network adapter and consumes the LP x16 slot (as Supermicro's micro-LP only goes up to 25GbE).

The second pool, based on 8 or 16x 2 TB PCIe NVMe, will host the OS (LXC containers).
No, these are shared: either 6x 2.5" hot-swap SATA, or 4x 2.5" hot-swap SATA + 2x SATA/NVMe hybrid.

The SATA DOM SSD has 500 MB/s read and 180 MB/s write. Is that not fast enough?
Compared to an SSD the numbers are not on par, but most likely it will be a latency issue, as the Ceph MON DB, the OS and other services will share this device.

We could also go for dedicated servers just for storage and dedicated servers for computational tasks.
This sounds to me more like the way to go, as I get the feeling that the application that will run on these nodes will need lots of CPU/memory. This way the concerns will be separated and the storage will not interfere with the application and vice versa.
 
Yes, for writing it will be slower, as Ceph doesn't know locality and any write will only be acknowledged once all copies are written. Reads are done in parallel. This will need an extra network adapter and consumes the LP x16 slot (as Supermicro's micro-LP only goes up to 25GbE).
I'll have 2x 40Gb network cards: one for read/write and one for sync.

No, these are shared: either 6x 2.5" hot-swap SATA, or 4x 2.5" hot-swap SATA + 2x SATA/NVMe hybrid.
I can't connect 6 SSDs and 2 NVMe drives simultaneously?

Compared to an SSD the numbers are not on par, but most likely it will be a latency issue, as the Ceph MON DB, the OS and other services will share this device.
I don't have a solution for that.

This sounds to me more like the way to go, as I get the feeling that the application that will run on these nodes will need lots of CPU/memory. This way the concerns will be separated and the storage will not interfere with the application and vice versa.
I try to avoid that, because it will cost more;
I don't think we have the budget to entirely separate storage from executors.

What do you think about the following? (I am trying to get a quote from Supermicro for 5 nodes at the initial stage, based on something like the 1U SYS-1028U-TR4T+.)
I'll reserve 1 core and 1 GB of RAM per OSD for Ceph; the rest will be allocated to the LXC containers.
  • 10x SSD (2 in RAID for Proxmox/MON, 8 for Ceph)
  • 1x PCIe NVMe as local scratch for the worker LXC (consumes CPU/RAM)
  • 2x PCIe NVMe for the Ceph pool
  • 2x CPU (12-18 cores each)
  • 256-384 GB RAM (depending on core count)
  • dual-port 40Gb network card
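A rough sketch of what that reservation leaves for the LXC workers. The 16-core CPUs and the 4 GB-per-OSD memory figure are assumptions (the latter is the BlueStore osd_memory_target default rather than the 1 GB rule of thumb above):

```python
# Resources left for LXC workers after reserving for Ceph on the
# proposed 1U node: 8 SATA SSD OSDs + 2 NVMe OSDs.
cores_total = 2 * 16          # assuming the 16-core option of the 12-18 core range
ram_gb_total = 384            # assuming the high end of 256-384 GB
osds = 8 + 2
core_per_osd = 1
ram_gb_per_osd = 4            # BlueStore osd_memory_target default, not 1 GB

cores_left = cores_total - osds * core_per_osd
ram_left = ram_gb_total - osds * ram_gb_per_osd
print(cores_left, ram_left)
```

With those assumptions, 22 cores and 344 GB of RAM remain for the compute containers; budgeting only 1 GB per OSD would overstate the leftover RAM by 30 GB.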
 
I'll have 2x 40Gb network cards: one for read/write and one for sync.
You mean the cluster/public network separation of Ceph?

I can't connect 6 SSDs and 2 NVMe drives simultaneously?
No, the last two slots of the chassis are shared.

I don't have a solution for that.
The new option seems more promising.

I try to avoid that, because it will cost more;
I don't think we have the budget to entirely separate storage from executors.
A good part of the money goes into the density of the system. 8 nodes in 4U surely cost more than 8 nodes in 8x 1U.

What do you think about the following? (I am trying to get a quote from Supermicro for 5 nodes at the initial stage, based on something like the 1U SYS-1028U-TR4T+.)
I'll reserve 1 core and 1 GB of RAM per OSD for Ceph; the rest will be allocated to the LXC containers.
This looks like a more flexible option. Ceph OSDs have a memory target of 4 GB, and 1 GB of memory per 1 TB of stored data is recommended; they might need more than 1 GB of RAM each. ;)

Please check out our Ceph preconditions; those requirements are written there.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition
 
You mean the cluster/public network separation of Ceph?
40Gb for Ceph sync
40Gb for public access (read/write)
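That split maps onto Ceph's public/cluster network separation; in ceph.conf it would look something like the fragment below (the subnets are placeholders, not values from the thread):

```ini
[global]
    # client-facing ("read/write") traffic
    public_network = 10.10.10.0/24
    # OSD replication / recovery ("sync") traffic
    cluster_network = 10.10.20.0/24
```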

No, the last two slots of the chassis are shared.
Good to know, I guess I missed it.


This looks like a more flexible option. Ceph OSDs have a memory target of 4 GB, and 1 GB of memory per 1 TB of stored data is recommended; they might need more than 1 GB of RAM each. ;)
Yep, my mistake. RAM is the easiest and cheapest thing to fix; for 8 SSDs of 4 TB each that's 32 GB or 64 GB more.

thanks
 
