Number of nodes recommended for a Proxmox Cluster with Ceph

Discussion in 'Proxmox VE: Installation and configuration' started by alessice, Jan 23, 2019.

  1. alessice

    alessice New Member
    Proxmox Subscriber

    Joined:
    Sep 18, 2015
    Messages:
    13
    Likes Received:
    1
    Hi,

    how many nodes do you recommend for a Proxmox cluster with Ceph (HCI mode)? We would like to start with 4 nodes with 6 SSDs each, so we would have 6 OSDs per node and the PVE OS on a SATA DOM.

    The other option is 8 nodes with the same 6 SSDs each.

    Is it fine to start with 4 nodes? Somebody suggested not using 4 nodes because of possible quorum problems.

    Thanks
     
  2. tim

    tim Member
    Staff Member

    Joined:
    Oct 1, 2018
    Messages:
    96
    Likes Received:
    9
    That depends on what you intend to do with it; more nodes simply give you more to work with.

    It is not really an issue to have 4 nodes, but as in all quorum-based systems you don't gain more redundancy with an even number of nodes. You need to keep a majority to make decisions, so with 4 nodes you can lose just 1 node, which is the same as with a 3-node cluster. With 5 nodes, on the other hand, you can lose 2 of them and still have a majority.

    3 nodes -> lose 1 node, still quorum -> lose 2 nodes, no quorum
    4 nodes -> lose 1 node, still quorum -> lose 2 nodes, no quorum
    5 nodes -> lose 1 or 2 nodes, still quorum -> lose 3 nodes, no quorum
    ...

    So as you can see, the redundancy only increases again at 5 nodes (the same applies to 7, 9 and so on); that's why they suggested not to use an even number.
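
    Here is that majority rule as a minimal Python sketch (just the arithmetic behind the table above, nothing Proxmox-specific):
    Code:
    # Majority-based quorum: the cluster stays quorate while the surviving
    # nodes still hold a strict majority of the votes.
    def has_quorum(total_nodes: int, failed_nodes: int) -> bool:
        return (total_nodes - failed_nodes) > total_nodes / 2

    for n in (3, 4, 5, 6, 7):
        tolerated = max(f for f in range(n) if has_quorum(n, f))
        print(f"{n} nodes -> can lose {tolerated} and keep quorum")

    # 3 nodes -> can lose 1 and keep quorum
    # 4 nodes -> can lose 1 and keep quorum
    # 5 nodes -> can lose 2 and keep quorum
    # 6 nodes -> can lose 2 and keep quorum
    # 7 nodes -> can lose 3 and keep quorum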
     
  3. alessice

    alessice New Member
    Proxmox Subscriber

    Joined:
    Sep 18, 2015
    Messages:
    13
    Likes Received:
    1
    Thanks Tim,

    I'm evaluating 4 or 8 nodes because the hardware will be SuperMicro Twin, where each 2U case contains 4 servers.

    With 8 nodes with 6 SSDs each we will have a total of 48 SSDs (so we will be able to set up 48 OSDs). Is this a good configuration for PVE 5.3 in HCI mode with Ceph?
     
  4. tim

    tim Member
    Staff Member

    Joined:
    Oct 1, 2018
    Messages:
    96
    Likes Received:
    9
    If you have the opportunity to get 8 of them and the workload stays the same, that would undoubtedly be better, as it is always better to have more. But as said in my first post, it all comes down to the intended use case. If someone asks me whether to get 4 or 8 I would recommend 8, no questions asked; that is not always possible, and sometimes not even necessary, but it's better to have it than to lack it.
    Maybe you can give us a hint what you are going to do with it?
     
  5. alessice

    alessice New Member
    Proxmox Subscriber

    Joined:
    Sep 18, 2015
    Messages:
    13
    Likes Received:
    1
    Currently we have VMs (40 for now, but the number will grow) hosted by a hosting provider. We need to migrate from a public cloud to a private infrastructure, so we are evaluating PVE. We are not interested in dedicated external storage via iSCSI or NFS, so we are looking at Ceph.

    Each PVE node will be (x8):

    2 x Intel Xeon, 10 cores each
    196GB of RAM
    6 x Intel SSD (480 or 960GB)
    2 x SATA DOM for the OS
    4 x 10 Gbit Ethernet
     
  6. tim

    tim Member
    Staff Member

    Joined:
    Oct 1, 2018
    Messages:
    96
    Likes Received:
    9
    With 6x 480GB you will end up with about 6.8TB of usable space considering a replica of 3. That's about 168GB per VM, which could be too little, but maybe it's enough for you. With 960GB you will end up, as you might guess, with about 13.5TB.
    I don't know the exact model of your SSDs, but if I assume a conservative write speed of 500MB/s and you have 6 of them, then with all of them writing at once you end up with 24Gb/s per node. Take this into account when thinking about your network setup; if your SSDs are faster, your network could become the bottleneck.
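
    If you want to adjust the assumptions yourself, here is that arithmetic as a small sketch (using the numbers from this thread: 8 nodes, 6 SSDs per node, 3x replication, ~500MB/s per SSD, 40 VMs; it divides raw capacity by the replica count in TiB, so it comes out slightly above the rounded estimates above):
    Code:
    NODES, SSDS_PER_NODE, REPLICAS, VMS = 8, 6, 3, 40
    GB_PER_TIB = 2**40 / 1e9            # ~1099.5 GB in one TiB

    # usable space = raw capacity / replica count (before any safety margin)
    for ssd_gb in (480, 960):
        usable_tib = NODES * SSDS_PER_NODE * ssd_gb / GB_PER_TIB / REPLICAS
        print(f"{ssd_gb} GB SSDs -> ~{usable_tib:.1f} TiB usable, ~{usable_tib * 1024 / VMS:.0f} GiB per VM")

    # aggregate write speed of one node's SSDs vs. its 10 Gbit links
    PER_SSD_MB_S = 500
    print(f"~{SSDS_PER_NODE * PER_SSD_MB_S * 8 / 1000:.0f} Gbit/s of writes per node")  # -> ~24 Gbit/s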
     
  7. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,845
    Likes Received:
    159
    Hi,
    this is a very optimistic calculation, because Ceph runs into trouble when an OSD gets nearly full.
    Even with enough OSDs you should not fill them more than about 70% (to leave space free for a failed OSD/node).
    With this calculation you can use about 9.8TB with 960GB SSDs (a 960GB SSD is about 0.87 TiB in real life).
    Code:
    6*8*0.873046875/3*0.7
    9.778125
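
    The same rule of thumb as a small helper, in case you want to plug in other disk sizes or replica counts (a sketch; the 0.873... above is roughly 960GB expressed in TiB):
    Code:
    # usable ~= raw capacity / replicas * max fill level (keep ~30% free for recovery)
    def usable_tib(nodes, osds_per_node, osd_tib, replicas=3, max_fill=0.7):
        return nodes * osds_per_node * osd_tib / replicas * max_fill

    print(usable_tib(8, 6, 960e9 / 2**40))   # ~9.78 TiB with 960GB SSDs
    print(usable_tib(8, 6, 480e9 / 2**40))   # ~4.89 TiB with 480GB SSDs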
    
    Udo
     
  8. tim

    tim Member
    Staff Member

    Joined:
    Oct 1, 2018
    Messages:
    96
    Likes Received:
    9
    With my calculation Ceph would recover if 1 node fails; in udo's example you have even more fault tolerance, thanks for pointing that out!
    In both examples, if something fails it is not meant to be ignored, so you have to act anyway (replacing OSDs, nodes, ...).
    The whole point is that you need that extra free space so Ceph can redistribute the data from the failed OSDs onto the remaining ones and get back to 3 replicas.
    You could fill even more than the 13.5TB, which I definitely don't recommend, because in an error case your Ceph pool won't recover since it can no longer replicate.
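
    To illustrate why that margin matters, here is a rough sketch (assuming 8 equal nodes and the failed node's data being spread evenly over the 7 survivors; Ceph's default nearfull/full warnings sit at 85%/95%):
    Code:
    nodes = 8
    for fill in (0.70, 0.80, 0.90):
        after_recovery = fill * nodes / (nodes - 1)
        print(f"{fill:.0%} full before a node failure -> ~{after_recovery:.0%} after re-replication")

    # 70% -> ~80%, 80% -> ~91%, 90% -> ~103%, i.e. recovery could not even complete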
     
  9. alessice

    alessice New Member
    Proxmox Subscriber

    Joined:
    Sep 18, 2015
    Messages:
    13
    Likes Received:
    1
    Thanks to all for the information. Our VMs are small, but after your examples I understand that 480GB SSDs are too small, so we will evaluate at least 960GB SSDs.
     
  10. alessice

    alessice New Member
    Proxmox Subscriber

    Joined:
    Sep 18, 2015
    Messages:
    13
    Likes Received:
    1
    Hi,

    based on my budget I have updated my configuration to 6 x 1.92TB Intel D3-S4510 SSDs on each of the 8 nodes, for a total of 48 SSDs, and 192GB of RAM per node.

    My question is: how much usable space can I count on for a safe environment with x3 replication?

    For the RAM, can I consider 64GB reserved for Ceph and Proxmox operations and 128GB available for the VMs?

    Thanks
     
    elmacus likes this.
  11. alexskysilk

    alexskysilk Active Member

    Joined:
    Oct 16, 2015
    Messages:
    572
    Likes Received:
    61
    Odds are that with this config, each system's SSDs will cost roughly half of the node cost. I would suggest that this may not be the best way to use your budget: not all of your VM data needs to be on Ceph SSDs, especially considering you have only about 40 VMs. Take a hard look at your data needs and deploy a NAS for the non-time-critical data/media/etc. This would allow you to deploy 4 Ceph OSD nodes plus a fifth, full-size node housing large HDDs. That node can serve your slow storage via NFS/iSCSI/SMB, PLUS act as a monitor; or, if you're really paranoid about sustaining a two-node failure, it could house a mix of OSDs and NAS HDDs simultaneously, but I wouldn't suggest that.
     
  12. alessice

    alessice New Member
    Proxmox Subscriber

    Joined:
    Sep 18, 2015
    Messages:
    13
    Likes Received:
    1
    Hi alexskysilk,

    thanks for your suggestions, which are probably right. The SSDs are half of my budget.

    I have updated my configuration like this for each of the 8 nodes:
    • CPU 2 x Intel Xeon 4114 10C/20T
    • RAM 12 x 16GB (192GB)
    • 6 x SSD Intel D3-S4510 960GB
    • 4 x 10Gbit SFP+
    • 2 x 128GB SATADOM for Proxmox
    and will create 64 VMs, each with about:
    • 4 vCPU
    • 16GB RAM
    • 150GB HDD
    with a replica of 3 for Ceph and the ability to keep working with 2 failed nodes.

    I evaluated external storage like NetApp, which we already use, but I prefer to keep all data on SSD, and NetApp is very expensive.

    What do you think?
    Thanks
     
  13. alexskysilk

    alexskysilk Active Member

    Joined:
    Oct 16, 2015
    Messages:
    572
    Likes Received:
    61
    Your VM projection doesn't address how much space you'll actually use, which means I can't project how much you can overprovision. My thoughts can't be relevant until I understand:
    1. How much space you will be using for ESSENTIAL data (read: boot OS and databases)
    2. How much space you need for NON-ESSENTIAL data (read: media, non-latency-critical data)
    3. What your projected growth rate is for both
    4. How easy/difficult it would be for you to add nodes in the future (is the cluster locally on-prem for you, is it in a colo 2000 miles away, etc.)

    edit: ignore the below :)
    [QUOTE="alessice, post: 239115, member: 34195"]with a replica of 3 for Ceph and the ability to keep working with 2 failed nodes.[/QUOTE]
    A 3/2 RG arrangement will not LOSE data with two nodes down; it just means you won't be back in production until you bring up at least one more node. To be able to sustain two node failures and remain in operation you would need a 4/2 RG arrangement, which means you'll only get 25% utilization from your disks. The only other way to achieve this level of fault tolerance is by using erasure coded groups, which Proxmox does not support for main disks, but that could be workable for multiple-disk VMs (and may be a proposed option depending on how you answer the above questions).
     
    #13 alexskysilk, Feb 12, 2019
    Last edited: Feb 12, 2019
  14. Ronny Aasen

    Ronny Aasen New Member

    Joined:
    Mar 15, 2018
    Messages:
    6
    Likes Received:
    0
    If you do need simple NAS-style storage, you can do what I do: 8 nodes with some SSD OSDs for the VM RBD images, and some large spinning OSDs for slow storage.

    Using the OSD device classes I can place pools on either SSD or spinning disk. I have these pools:
    rbd: 3x replication on SSD
    cephfs-metadata: 3x replication on SSD
    cephfs-ec-data: k=4, m=2 erasure coded pool on HDD, giving 66% of raw capacity for the data in this pool
    rbd-hdd: 3x replication on HDD, for secondary VM images where performance does not matter that much.

    The CephFS is mounted wherever needed on Linux, and mounted and re-exported with Samba for Windows clients. The Samba exporter is just a VM in Proxmox.

    Ceph is very flexible in this way, but that flexibility comes at the cost of more complexity. Handling 2 device classes is not that easy in the Proxmox GUI; you will need to use the CLI more (see the sketch below). It also adds monitoring overhead, since you need to watch the classes separately: the SSDs can fill up while the HDDs do not, so the "average" Ceph fill % looks normal while the SSD pools are actually about to burst...
    So if the budget is loose, you may want to keep it simple and use all flash.
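
    For reference, here is roughly how such a class-based layout can be created (a sketch only, driving the ceph CLI from Python; the rule and profile names are made-up examples and the PG counts depend on your cluster):
    Code:
    import subprocess

    def ceph(*args):
        """Run a ceph CLI command and fail loudly if it errors."""
        subprocess.run(["ceph", *args], check=True)

    # CRUSH rules bound to a device class (example rule names)
    ceph("osd", "crush", "rule", "create-replicated", "replicated-ssd", "default", "host", "ssd")
    ceph("osd", "crush", "rule", "create-replicated", "replicated-hdd", "default", "host", "hdd")

    # replicated pools on SSD
    ceph("osd", "pool", "create", "rbd", "128", "128", "replicated", "replicated-ssd")
    ceph("osd", "pool", "create", "cephfs-metadata", "64", "64", "replicated", "replicated-ssd")

    # k=4, m=2 erasure coded data pool on HDD (~66% of raw capacity)
    ceph("osd", "erasure-code-profile", "set", "ec42-hdd", "k=4", "m=2", "crush-device-class=hdd")
    ceph("osd", "pool", "create", "cephfs-ec-data", "128", "128", "erasure", "ec42-hdd")
    ceph("osd", "pool", "set", "cephfs-ec-data", "allow_ec_overwrites", "true")

    # slow replicated pool on HDD for secondary VM images
    ceph("osd", "pool", "create", "rbd-hdd", "128", "128", "replicated", "replicated-hdd")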
     