How many nodes are required for a single cluster with Ceph?

cocobanana

Hi,

I have been using Proxmox for quite some time and the performance has been very impressive. I have decided to host 500 to 1000 VMs on a cluster with Ceph (Linux and Windows KVM only).

I plan to use Dell PowerEdge R630 servers and U.2 NVMe disks.

I would appreciate it if anyone could share the recommended requirements for this in terms of how many nodes, how many disks, the setup, etc.

Thanks!
 
It's an innocent enough question, but there are a lot of gotchas you need to consider.

Clusters are made up of three elements: compute, storage, and networking. Let's touch on each.

COMPUTE:
- The Dell R630 is a roughly 10-year-old platform; as such, it offers pretty poor performance per watt. Do you already know where you are deploying this solution? How much power and cooling are provided?
- Ignore the VM count for a moment. How much CPU load will the TOTAL cluster workload generate in terms of core-GHz? You need to account for typical and peak load, plus excess capacity for failover.
- Add a core and 4 GB of RAM for every OSD, since it appears you intend to run this hyperconverged (HCI).
- Once you add it all up, you'll have an idea of how many servers you will be deploying (see the rough sizing sketch after this list).
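To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python; every input (workload core-GHz, per-node specs, OSD count) is a placeholder assumption you would swap for your own numbers.

Code:
    import math

    # Rough HCI node-count estimate; all inputs are illustrative placeholders
    workload_core_ghz = 1500        # assumed total peak workload, in core-GHz
    failover_headroom = 1.25        # keep ~25% spare so the cluster survives a node failure

    cores_per_node = 2 * 22         # e.g. dual 22-core CPUs
    base_clock_ghz = 2.2
    osds_per_node = 4
    osd_core_ghz = osds_per_node * base_clock_ghz   # ~1 core per OSD reserved for Ceph

    usable_core_ghz = cores_per_node * base_clock_ghz - osd_core_ghz
    nodes = math.ceil(workload_core_ghz * failover_headroom / usable_core_ghz)
    print(f"Estimated node count: {nodes}")

    # RAM per node must also cover VM RAM + ~4 GB per OSD + OS/Ceph daemon overhead.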

STORAGE:
- What is the minimum required usable capacity? You should be prepared to provision roughly 4x that amount as RAW capacity. A smallish number of high-capacity OSDs could work, but you gain better performance with a higher OSD count (a quick worked example follows this list).
- Dell R630s support up to 4 NVMe drives, but only on the 10-drive models. Since you can't buy these new, be aware that most 10-drive models you will find in the wild don't actually ship with NVMe support, so you will have to buy and install the necessary NVMe hardware (extender card and cabling) separately.
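To illustrate the 4x RAW rule of thumb, here is a minimal sketch; the usable-capacity target and drive size are assumed placeholder values.

Code:
    import math

    # Usable-to-raw Ceph capacity estimate (illustrative numbers only)
    usable_needed_tb = 100          # assumed usable capacity target
    replication_size = 3            # default replicated pool keeps 3 copies
    fill_target = 0.75              # stay well below the near-full ratio

    raw_needed_tb = usable_needed_tb * replication_size / fill_target   # ~4x usable
    osd_size_tb = 3.84              # assumed U.2 drive size
    osd_count = math.ceil(raw_needed_tb / osd_size_tb)
    print(f"~{raw_needed_tb:.0f} TB raw -> about {osd_count} OSDs of {osd_size_tb} TB")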

NETWORKING:
- Ideally you need separate interfaces for the Ceph public, Ceph private, Proxmox cluster (corosync), and service networks, plus the BMC. Be aware that the R630 is a PCIe Gen3 platform, which means your maximum practical link speed is 100 GbE, and some of your PCIe lanes will be consumed by your NVMe drives (16 lanes total), so 4x25G is a good practical configuration for this generation of hardware (a rough lane and bandwidth check follows below). Work out your port count from your node count, and then provision two switches that can each accommodate half of it.
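A minimal sketch of that lane and bandwidth arithmetic, using approximate assumed figures (~8 Gbit/s usable per Gen3 lane, x4 lanes per U.2 drive):

Code:
    # PCIe Gen3 lane and bandwidth sanity check (approximate figures)
    GEN3_GBPS_PER_LANE = 8          # roughly 8 Gbit/s usable per Gen3 lane

    nvme_lanes = 4 * 4              # four U.2 drives at x4 each = 16 lanes
    nvme_bw_gbps = nvme_lanes * GEN3_GBPS_PER_LANE   # ~128 Gbit/s of disk bandwidth

    nic_bw_gbps = 4 * 25            # 4x25GbE = 100 Gbit/s of network bandwidth
    print(f"NVMe ~{nvme_bw_gbps} Gbit/s vs NICs {nic_bw_gbps} Gbit/s")
    # 4x25G roughly matches what four Gen3 NVMe drives can push, so faster
    # links would mostly sit idle on this platform.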

There is a lot more to consider, but this should give you a starting point.
 
Thank you for the advice.
COMPUTE:
- The Dell R630 is a roughly 10-year-old platform; as such, it offers pretty poor performance per watt. Do you already know where you are deploying this solution? How much power and cooling are provided?
Answer: All the nodes will be put in 52U racks in a Tier-3 datacenter with 5-10 kW of power.

- Ignore the VM count for a moment. How much CPU load will the TOTAL cluster workload generate in terms of core-GHz? You need to account for typical and peak load, plus excess capacity for failover.
Answer: I plan to use 2 x Intel Xeon E5-2699 v4 per node.

- Add a core and 4 GB of RAM for every OSD, since it appears you intend to run this hyperconverged (HCI).
Answer: This will be 384 GB to 512 GB of RAM on each node.

- Once you add it all up, you'll have an idea of how many servers you will be deploying.


STORAGE:
- What is the minimum required usable capacity? You should be prepared to provision roughly 4x that amount as RAW capacity. A smallish number of high-capacity OSDs could work, but you gain better performance with a higher OSD count.
Answer: Yes, this will be U.2 NVMe, with drive sizes from 1.92 TB to 15.85 TB. But I don't have any idea how many units are required.

- Dell R630s support up to 4 NVMe drives, but only on the 10-drive models. Since you can't buy these new, be aware that most 10-drive models you will find in the wild don't actually ship with NVMe support, so you will have to buy and install the necessary NVMe hardware (extender card and cabling) separately.
Answer: Yes, I am aware of this. The 10-bay chassis will be used with up to 4 x U.2 NVMe drives via the riser card and extender.


NETWORKING:
- Ideally you need separate interfaces for the Ceph public, Ceph private, Proxmox cluster (corosync), and service networks, plus the BMC. Be aware that the R630 is a PCIe Gen3 platform, which means your maximum practical link speed is 100 GbE, and some of your PCIe lanes will be consumed by your NVMe drives (16 lanes total), so 4x25G is a good practical configuration for this generation of hardware. Work out your port count from your node count, and then provision two switches that can each accommodate half of it.
Answer: Yes, 10 Gb, 25 Gb or 40 Gb is the plan. The (corosync) cluster network will be 1 Gb. Anything related to Ceph will use 10, 25 or 40 Gb.

The thing is, I am not sure whether to use pure HCI (meaning compute and Ceph on the same servers) or to separate the compute nodes and the Ceph nodes (meaning 7 compute nodes + 3 Ceph nodes).

What is your opinion?
 
First, a disclaimer: I have never used Ceph myself; my superficial knowledge comes only from reading the manuals and lurking in this forum. So take my ramblings with a grain of salt ;)
Steve already linked Udo's write-up on small clusters, but you should consider that although more nodes are preferred, they are not strictly needed if you plan accordingly and know the risks involved. Basically, if one of three nodes fails, no other node may fail, while with more nodes you can also tolerate more node failures. In practical terms this means that if you have three nodes and one of them is down (due to a reboot after updates, or because you are doing maintenance like replacing disks), no other node can fail. Depending on your use case that risk might be acceptable. I know that @LnxBil and @Falk R. mentioned that several of their SMB customers have a three-node cluster and are quite happy with it. The risk of downtime due to the outage of two nodes at the same time was considered but deemed acceptable for them. Of course, your own risk estimation might yield a different result.

Another thing to give Udo's piece (great and highly recommended reading) some perspective: he wrote it as a kind of case study of edge cases for homelabbers. His scenario assumes that you have only one OSD per node (to make things worse, maybe even just an HDD) and only NICs under 10 Gbit/s, or in extreme cases (think old mini-PCs) just one network card. With such a setup a three-node Ceph cluster might be fun to play around with, but it will obviously not be useful in an enterprise context. In an enterprise context you would use at least four OSDs per node (which should be enterprise SSDs with power-loss protection) and dedicated networks for Ceph and cluster communication:



Network
We recommend a network bandwidth of at least 10 Gbps, or more, to be used exclusively for Ceph traffic. A meshed network setup ( https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server ) is also an option for three to five node clusters, if there are no 10+ Gbps switches available.

Important: The volume of traffic, especially during recovery, will interfere with other services on the same network; especially the latency-sensitive Proxmox VE corosync cluster stack can be affected, resulting in possible loss of cluster quorum. Moving the Ceph traffic to dedicated and physically separated networks will avoid such interference, not only for corosync, but also for the networking services provided by any virtual guests.
For estimating your bandwidth needs, you need to take the performance of your disks into account. While a single HDD might not saturate a 1 Gbps link, multiple HDD OSDs per node can already saturate 10 Gbps. If modern NVMe-attached SSDs are used, a single one can already saturate 10 Gbps of bandwidth, or more. For such high-performance setups we recommend at least 25 Gbps, while 40 Gbps or even 100+ Gbps might be required to utilize the full performance potential of the underlying disks.

If unsure, we recommend using three (physical) separate networks for high-performance setups:

one very high bandwidth (25+ Gbps) network for Ceph (internal) cluster traffic.

one high bandwidth (10+ Gbps) network for Ceph (public) traffic between the Ceph server and Ceph client storage traffic. Depending on your needs, this can also be used to host the virtual guest traffic and the VM live-migration traffic.

one medium bandwidth (1 Gbps) exclusive for the latency sensitive corosync cluster communication.

https://pve.proxmox.com/wiki/Deploy...r#_recommendations_for_a_healthy_ceph_cluster
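To put rough numbers on the "take the performance of your disks into account" advice above, here is a minimal sketch with assumed per-device throughput figures:

Code:
    # How quickly do OSDs saturate a single link? (assumed per-device throughput)
    hdd_gbps = 1.6        # ~200 MB/s for a spinning disk
    nvme_gbps = 16.0      # ~2 GB/s for an enterprise U.2 NVMe

    for link_gbps in (10, 25, 40, 100):
        print(f"{link_gbps:>3} GbE link: ~{link_gbps / hdd_gbps:.0f} HDD OSDs "
              f"or ~{link_gbps / nvme_gbps:.1f} NVMe OSDs to saturate it")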


The story looks quite different if you have suitable storage and network hardware and plan accordingly; then even a three-node cluster might work fine, provided you are comfortable with the possible risk of a two-node failure. See the referenced links in the wiki not only for network but also for storage recommendations.

The thing is, I am not sure whether to use pure HCI (meaning compute and Ceph on the same servers) or to separate the compute nodes and the Ceph nodes (meaning 7 compute nodes + 3 Ceph nodes).

You could do this, but IMHO the better idea would be to distribute the disks across as many nodes as possible, as long as you have at least four OSDs per node and the same total capacity on each node (so one OSD of each size should be present on every node). If you don't have the hardware to put OSDs on all nodes, you could use some nodes as "compute-only" or "storage-only". In my humble opinion there is no benefit in splitting the Proxmox VE and Ceph clusters, although it's technically possible. But your mileage may vary, so take this with a grain of salt ;) In general: the more nodes and the more OSDs per node, the better. There are calculators to play with possible setups (a small sketch of that kind of calculation follows below):
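As a stand-in for such a calculator, here is a minimal sketch; node count, OSDs per node, drive size, and fill target are all placeholder assumptions.

Code:
    # Toy Ceph capacity calculator (illustrative numbers only)
    nodes = 7
    osds_per_node = 4
    osd_size_tb = 3.84
    replication_size = 3            # 3 copies of every object
    fill_target = 0.80              # stay below the near-full ratio

    raw_tb = nodes * osds_per_node * osd_size_tb
    usable_tb = raw_tb / replication_size * fill_target

    # If one node fails, Ceph re-replicates onto the survivors, so the
    # remaining raw capacity must still hold all copies.
    raw_surviving_tb = (nodes - 1) * osds_per_node * osd_size_tb
    usable_after_failure_tb = raw_surviving_tb / replication_size * fill_target

    print(f"Raw {raw_tb:.1f} TB, usable ~{usable_tb:.1f} TB, "
          f"~{usable_after_failure_tb:.1f} TB with one node down")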


Since you are still in the planning phase, you could also run some benchmarks with different setups; this is what I would do. If the setup is for use in a company, I would also try to get budget from your boss for a Proxmox VE training and for some consulting from a Proxmox partner.
 
I use Dell R630s in production in Proxmox Ceph clusters. These were converted from VMware/vSphere. They all have the same hardware (CPU: 2 x E5-2650 v4, storage: SAS 10K, storage controller: HBA330, RAM: 512 GB, NIC: Intel X550 10GbE) running the latest firmware. Ceph & corosync network traffic runs on isolated Arista switches.

The minimum cluster size is 5 nodes (you can lose 2 nodes and still have quorum); the largest is 11 nodes. As mentioned, Ceph is a scale-out solution: more nodes mean more IOPS, never fewer. Workloads range from databases to DHCP servers. A quick quorum check follows below.
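A minimal sketch of the majority-quorum arithmetic behind those node counts (the same rule corosync and the Ceph monitors follow):

Code:
    # Majority quorum: how many node failures can a cluster survive?
    for nodes in (3, 5, 7, 9, 11):
        quorum = nodes // 2 + 1
        print(f"{nodes} nodes: quorum {quorum}, tolerates {nodes - quorum} failed node(s)")
    # 5 nodes -> quorum of 3, tolerates 2 failures, matching the cluster above.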

For new workloads, I do NOT recommend R630s unless you are getting them for free.

For new builds, I would get single-socket servers with lots of cores, like an AMD EPYC 4005 or older EPYCs, of course with flash storage.

I use the following optimizations, learned through trial and error. YMMV.

Code:
    Set SAS HDD Write Cache Enable (WCE) (sdparm -s WCE=1 -S /dev/sd[x])
    Set VM Disk Cache to None if clustered, Writeback if standalone
    Set VM Disk controller to VirtIO SCSI single and enable the IO Thread & Discard options
    Set VM CPU Type to 'Host' for Linux and 'x86-64-v2-AES' on older CPUs/'x86-64-v3' on newer CPUs for Windows
    Set VM CPU NUMA
    Set VM Networking VirtIO Multiqueue to 1
    Set VM Qemu-Guest-Agent software installed and VirtIO drivers on Windows
    Set VM IO Scheduler to none/noop on Linux
    Set Ceph RBD pool to use 'krbd' option
 
Answer: All the nodes will be put in 52U racks in a Tier-3 datacenter with 5-10 kW of power.
That was assumed. I meant, how much power are you BUYING?

Answer: I plan to use 2 x Intel Xeon E5-2699 v4 per node.
That's well and good, but it doesn't really address the load question, which means we still don't know how many nodes you'd need. You should also be aware that at 2.2 GHz its single-thread performance is very low by today's standards, so be sure it fits your workload criteria.

...separate the compute nodes and the Ceph nodes (meaning 7 compute nodes + 3 Ceph nodes).
How are you sizing the cluster when you haven't defined the load or how much storage you need? In any event, 3 OSD nodes is not sufficient for production, and even with 4 nodes you only have 16 NVMe slots, which doesn't lead to a high-performing configuration.