Number of nodes recommended for a Proxmox Cluster with Ceph

alessice

Hi,

how many nodes do you recommend for a Proxmox Cluster with Ceph (HCI mode)? We would like to start with 4 nodes with 6 SSD disks each, so we will have 6 OSDs per node and the PVE OS on a SATA DOM.

The other option is 8 nodes with the same 6 SSD disks on each.

Is it fine to start with 4 nodes? Somebody suggested not to use 4 nodes because of possible quorum problems.

Thanks
 
That depends on what you intend to do with it; more nodes simply mean you have more to work with.

It is not really an issue to have 4 nodes, but as in all quorum-based systems you don't gain more redundancy with an even number of nodes. You need to keep a majority to make decisions, so with 4 nodes you can lose just 1 node, which is the same as with a 3-node cluster. With 5 nodes, on the other hand, you can lose 2 of them and still have a majority.

3 nodes -> lose 1 node still quorum -> lose 2 nodes no quorum
4 nodes -> lose 1 node still quorum -> lose 2 nodes no quorum
5 nodes -> lose 1 or 2 nodes still quorum -> lose 3 nodes no quorum
...

So as you can see, the redundancy only increases again at 5 nodes (and the same holds for 7, 9 and so on); that's why they suggested not to use an even number.
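
To make the pattern explicit, here is a minimal sketch of the majority rule used by quorum-based clusters (purely illustrative, not actual Proxmox/corosync code):

Code:
# majority quorum: more than half the votes are needed to stay quorate
def tolerable_failures(nodes):
    majority = nodes // 2 + 1     # smallest strict majority
    return nodes - majority       # nodes you can lose and still decide

for n in range(3, 8):
    print(n, "nodes -> can lose", tolerable_failures(n))
# 3 -> 1, 4 -> 1, 5 -> 2, 6 -> 2, 7 -> 3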
 
Thanks Tim,

I'm evaluating 4 or 8 nodes because the hardware will be SuperMicro Twin, where each 2U case holds 4 servers.

With 8 nodes with 6 SSDs each we will have a total of 48 SSDs (so we will be able to set up 48 OSDs). Is this a good configuration for PVE 5.3 in HCI mode with Ceph?
 
If you have the opportunity to get 8 of them and the workload will remain the same, that would undoubtedly be better, as it is always better to have more. But as I said in my first post, it's all about the intended use case. If someone asks me whether he should get 4 or 8, I would recommend 8 no questions asked, but that's not a possibility most of the time and sometimes not necessary; still, better to have it than to lack it.
Maybe you can give us a hint what you are going to do with it?
 
Currently we have VMs (40 for now, but this will grow) hosted by a hosting provider. We need to migrate from a public cloud to a private infrastructure, so we are evaluating PVE. We are not interested in having dedicated external storage via iSCSI or NFS, so we are looking at Ceph.

Each PVE node (x8) will be:

2 x Intel Xeon 10 core
192GB of RAM
6 x Intel SSD (480 or 960GB)
2 x SATA DOM for OS
4 x 10Gbit Ethernet
 
With 6x480GB per node you will end up with about ~6.8TB of usable space considering a replica of 3. That's about 168GB per VM, which could be too little, but maybe it's enough for you. With 960GB drives you will end up, as you might guess, with about 13.5TB.
I don't know the exact model of your SSDs, but if I assume a modest write speed of 500MB/s and you have 6 of them per node, this means that if all of them write at once you will end up with 24Gbit/s. Take this into account when thinking about your network setup; if your SSDs are faster you could have a bottleneck in the network.
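
To put rough numbers on both points, here is a small sketch of the same back-of-the-envelope math (assumptions as above: 8 nodes, 6 OSDs per node, replica 3, 40 VMs, ~500MB/s write per SSD; the exact figures depend on how much formatted capacity each drive really exposes):

Code:
# usable capacity and per-node write bandwidth, using the assumptions above
nodes, osds_per_node, replica, vms = 8, 6, 3, 40

for raw_gb in (480, 960):
    tib_per_osd = raw_gb * 1e9 / 2**40               # vendor GB -> TiB
    usable_tib = nodes * osds_per_node * tib_per_osd / replica
    print(f"{raw_gb}GB SSDs: ~{usable_tib:.1f} TiB usable, ~{usable_tib * 1024 / vms:.0f} GiB per VM")

write_gbit_per_node = osds_per_node * 500 * 8 / 1000  # 6 SSDs x 500 MB/s -> Gbit/s
print(f"aggregate write speed per node: ~{write_gbit_per_node:.0f} Gbit/s")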
 
... With 960GB you will end up, as you might guess, with about 13.5TB.
Hi,
this is a very optimistic calculation, because Ceph runs into trouble if an OSD is nearly full.
Even with enough OSDs you should not fill them more than 70% (to leave space free for a failed OSD/node).
With this calculation you can use about 9.8TiB with 960GB SSDs (which are about 0.873TiB each in real life).
Code:
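# 8 nodes x 6 OSDs, 960GB ≈ 0.873 TiB each, replica 3, filled to at most 70%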
6*8*0.873046875/3*0.7
9.778125
Udo
 
With my calculation Ceph would recover if 1 node fails; in udo's example you have more fault tolerance. Thanks for pointing that out!
In both examples, if something fails it's not meant to be ignored, so you have to act anyway (replacing OSDs, nodes, ...).
The whole point of this is that you need that extra space so Ceph can redistribute the data from the faulty OSDs to the remaining ones and get back to 3 replicas.
You could even use more than 13.5TB, which I definitely don't recommend, because in an error case your Ceph pool won't recover since it can no longer replicate.
 
Thanks to all for the information. Our VMs are small, but after your examples I understand that 480GB SSDs are too small, so we will evaluate at least 960GB SSDs.
 
Hi,

based on my budget I have updated my configuration to 6 x 1.92TB Intel D3-S4510 SSDs on each of the 8 nodes, for a total of 48 SSDs, and 192GB of RAM per node.

My question is, how much usable space can I count on for a safe environment with 3x replica?

For RAM, can I consider 64GB reserved for Ceph and Proxmox operations and 128GB available for VMs?

Thanks
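
For reference, udo's 70% calculation from above applied to this layout would look roughly like this (a sketch only, assuming ~1.75TiB of real capacity per 1.92TB drive):

Code:
# udo's rule of thumb: OSDs x TiB per OSD / replica x 70% fill limit
osds = 8 * 6                          # 8 nodes x 6 SSDs
tib_per_osd = 1.92e12 / 2**40         # 1.92TB ≈ 1.746 TiB
print(osds * tib_per_osd / 3 * 0.7)   # ≈ 19.6 TiB safely usable at 3x replica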
 
based on my budget I have updated my configuration to 6 x 1.92TB Intel D3-S4510 SSDs on each of the 8 nodes, for a total of 48 SSDs, and 192GB of RAM per node.

Odds are that with this config, each system's SSDs will cost roughly half the system (node) cost. I would suggest that this may not be the best way to use your budget: not all your VM data needs to be on Ceph SSDs, especially considering you have 40 VMs or so. I would suggest taking a hard look at your data needs and deploying a NAS for the non-time-critical data/media/etc. This will allow you to deploy 4 Ceph OSD nodes plus a fifth, full-size node housing large HDDs. That node can serve your slow storage via NFS/iSCSI/SMB, PLUS act as a monitor; or, if you're really paranoid about sustaining a two-node failure, it can house a mix of OSDs and HDDs for NAS purposes simultaneously, but I wouldn't suggest it.
 
Hi alexskysilk,

thanks for your suggestions, which are probably right. SSDs are half of my budget.

I updated my configuration like this for each of the 8 nodes:
  • CPU 2 x Intel Xeon 4114 10C/20T
  • RAM 12 x 16GB (192GB)
  • 6 x SSD Intel D3-S4510 960GB
  • 4 x 10Gbit SFP+
  • 2 x 128GB SATADOM for Proxmox
and create 64 VMs each with about:
  • 4 vCPU
  • 16GB RAM
  • 150GB HDD
with a replica of 3 for Ceph and the ability to keep working with 2 nodes failed.

I evaluated external storage like NetApp, which we already use, but I prefer all data on SSD, and NetApp is very expensive.

What do you think?
Thanks
 
What do you think?

Your VM projection doesn't address how much you'll actually use, which means I can't project how much you can overprovision (see the rough sketch after the questions below). My thoughts can't be relevant until I understand:
1. How much space will you be using for ESSENTIAL (read: boot OS and database) data
2. How much space do you need for NON-ESSENTIAL (read: media, non-latency-critical) data
3. What is your projected growth rate for both
4. How easy/difficult would it be for you to add nodes in the future (is the cluster locally on prem for you, is it in a colo 2000 miles away, etc.)
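
For a rough sense of scale with the numbers already in this thread (48 x 960GB OSDs, 3x replica, udo's 70% guideline, 64 VMs with 150GB disks), a quick sketch:

Code:
# provisioned VM disk space vs. safely usable Ceph capacity (thread's numbers)
osds, tib_per_osd = 48, 960e9 / 2**40        # 960GB ≈ 0.873 TiB each
usable_tib = osds * tib_per_osd / 3 * 0.7    # ≈ 9.8 TiB (replica 3, 70% full)
provisioned_tib = 64 * 150e9 / 2**40         # ≈ 8.7 TiB of VM disks
print(f"usable ~{usable_tib:.1f} TiB vs provisioned ~{provisioned_tib:.1f} TiB")

Thinly provisioned that fits on paper, but only while the VM disks stay well below full, which is exactly why the actual usage matters.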

edit- ignore the below :)
with a replica of 3 for Ceph and the ability to keep working with 2 nodes failed.
A 3/2 RG arrangement will not LOSE data with two nodes down, it just means you won't be back in production until you bring up at least one more node. To be able to sustain two node failures and remain in operation you would need a 4/2 RG arrangement, which means you'll only get 25% utilization from your disks. The only other way to achieve this level of fault tolerance is by using erasure coded groups, which Proxmox does not support for main disks, but could be workable for multiple-disk VMs (and may be a proposed option depending on how you answer the above questions).
 
Last edited:
I updated my configuration like this for each of the 8 nodes: [...] with a replica of 3 for Ceph and the ability to keep working with 2 nodes failed.
I evaluated external storage like NetApp, which we already use, but I prefer all data on SSD, and NetApp is very expensive.

If you do need simple NAS-style storage you can do what I do: 8 nodes with some SSD OSDs for VM RBD images, and some large spinning OSDs for slow storage.

Using the OSD device classes I can place pools on either SSD or spinning disk.
I have these pools:
rbd: 3x replication on SSD
cephfs-metadata: 3x replication on SSD
cephfs-ec-data: k=4, m=2 erasure coded pool on HDD, giving you 66% of raw capacity for data on this pool
rbd-hdd: 3x replication on HDD for VM secondary images where performance does not matter that much.

The CephFS is mounted wherever needed on Linux, and mounted and re-exported with Samba for Windows clients. The Samba exporter is just a VM in Proxmox.

Ceph is very flexible in this way, but that flexibility has a cost in complexity. Handling 2 device classes is not that easy in the Proxmox GUI, so you will need to use the CLI more. It also adds monitoring complexity, since you need to monitor the classes separately: the SSDs can fill up while the HDDs do not, so the "average" Ceph fill % looks normal while the SSD pools are actually about to burst...
So if the budget allows, you may want to keep it simple and use all flash.
 
