Best Practices for Setting Up Ceph in a Proxmox Environment

sagara-k

Jun 12, 2024
Hello Proxmox Community,

I am currently managing a Proxmox cluster with three nodes and approximately 120 hosts (VMs).
I am planning to set up Ceph for storage and would like to understand the best practices for such a configuration.
Despite my research, I haven't been able to find clear guidance on this topic.

Could you please provide insights or recommendations on the following points?
1. Recommended hardware specifications for achieving optimal IOPS and latency.
2. The maximum number of hosts that can be efficiently managed in a Ceph cluster.
3. Best practices for configuring Ceph in a Proxmox environment, including any specific considerations for a three-node setup.
4. Any tips on scaling the cluster or potential pitfalls to avoid.
5. Examples of successful configurations or case studies.

Your assistance and any additional advice or resources would be greatly appreciated.

Thank you!
 
I manage several production 5- and 7-node Proxmox Ceph clusters. Why 5 or 7 nodes? With 3 nodes you can only tolerate a single node failure, and I believe that without quorum no data will be written. I strongly suggest 5 nodes at a minimum, so the cluster can tolerate 2 node failures; in general, the more nodes the better. And with that many VMs on 3 nodes, losing one node takes out a third of your VM fleet.

The optimal Ceph configuration is to have Corosync, the Ceph public network, and the Ceph private (cluster) network on separate switch infrastructure. In the clusters I manage, however, the Ceph public, Ceph private, and Corosync traffic runs over isolated 10GbE switches. Is this optimal? No, but it works. These clusters also use 10K SAS HDDs with no issues; there are plenty of IOPS for the VMs, which run everything from databases to DHCP/PXE services. This of course depends entirely on the workload, and more HDD spindles means more IOPS. With that being said:
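For reference, the public/cluster split ends up in /etc/pve/ceph.conf. A minimal sketch, with placeholder subnets rather than the actual networks from the clusters above:

    # /etc/pve/ceph.conf (excerpt) -- subnets are examples only
    [global]
        public_network  = 10.10.10.0/24   # client/VM-facing Ceph traffic
        cluster_network = 10.10.20.0/24   # OSD replication and heartbeat traffic

Corosync links are configured separately when the Proxmox cluster is created, ideally on their own NICs and switches.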

1. You'll want identical hardware: the same amount and type of storage, the same CPU family, the same amount of RAM, and the same networking infrastructure. The nodes I manage are all 13th-gen Dells.

2. Since Ceph is a scale-out and NOT a scale-up solution, there is theoretically no limit; I've heard of Ceph clusters with hundreds of nodes. If you're going to use flash storage, make sure it is enterprise grade and has PLP (power-loss protection). You'll also want to mirror the Proxmox OS on small dedicated drives; I use ZFS RAID-1 for this. For backing up the VMs, I use Proxmox Backup Server on stand-alone bare-metal servers.
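As a minimal sanity check, assuming the OS mirror was created by the Proxmox installer with the default ZFS pool name (rpool), its health can be verified with:

    zpool status rpool   # should show a mirror-0 vdev with both OS drives ONLINE
    zpool list rpool     # capacity and overall health summary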

3. You should really reconsider a 3-node setup because of the single-node-failure scenario. With that being said, I use the following optimizations, learned through trial and error; write IOPS are in the hundreds, while read IOPS are 3x-5x (sometimes higher) than write IOPS. Again, not hurting for IOPS for my production workloads.

Set SAS HDD Write Cache Enable (WCE) (sdparm -s WCE=1 -S /dev/sd[x])
Set VM Disk Cache to None if clustered, Writeback if standalone
Set VM Disk controller to VirtIO-Single SCSI controller and enable IO Thread & Discard option
Set VM CPU Type to 'Host'
Set VM CPU NUMA on servers with 2 or more physical CPU sockets
Set VM Networking VirtIO Multiqueue to number of Cores/vCPUs
Install the Qemu-Guest-Agent software in the VM and enable the agent option
Set VM IO Scheduler to none/noop on Linux
Set Ceph RBD pool to use 'krbd' option
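
As a rough sketch of how some of these settings map onto the Proxmox CLI (VM ID 100, the storage name ceph-vm, and the device paths are placeholders; verify against your own setup before applying anything):

    # SAS HDD write cache, per physical disk on each node
    sdparm -s WCE=1 -S /dev/sdb

    # VirtIO SCSI single controller; IO thread, discard and cache=none on the disk
    qm set 100 --scsihw virtio-scsi-single
    qm set 100 --scsi0 ceph-vm:vm-100-disk-0,iothread=1,discard=on,cache=none

    # Host CPU type, NUMA, guest agent, VirtIO NIC with multiqueue = vCPU count
    qm set 100 --cpu host --numa 1
    qm set 100 --agent enabled=1
    qm set 100 --net0 virtio,bridge=vmbr0,queues=4

    # krbd option on the RBD storage definition
    pvesm set ceph-vm --krbd 1

    # Inside a Linux guest: switch the IO scheduler to 'none' for the virtual disk
    echo none > /sys/block/sda/queue/scheduler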

4. See the caveat about the 3-node setup. Make sure the nodes are homogeneous in specification. While it's true that on 13th-gen Dells and beyond the storage controllers can be converted from RAID to HBA mode, I suggest using a pure IT-mode controller; for 13th-gen Dells, that is the HBA330.

5. My journey with Proxmox Ceph started when Dell/VMware dropped official support for 12th-gen Dells. I looked around for virtualization alternatives and started with Proxmox 6. Now, with the Broadcom acquisition of VMware and the attendant license cost increases, I've migrated the 13th-gen Dells to Proxmox Ceph. No issues besides the typical SAS HDD dying and needing replacement.

As always, test your configuration before putting it into production. Obviously, any newer server generation will outperform an older one. Again, it all depends on the workload.
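
If you want rough numbers before going live, here is a hedged sketch of how the storage could be benchmarked (the pool name and test parameters are placeholders, and rados bench creates test objects that must be cleaned up afterwards):

    # Raw Ceph pool write and random-read performance, run from a node
    rados bench -p ceph-vm 60 write -b 4096 -t 16 --no-cleanup
    rados bench -p ceph-vm 60 rand -t 16
    rados -p ceph-vm cleanup

    # 4K random-write test from inside a test VM against its virtual disk
    fio --name=randwrite --filename=/root/fio-test --size=4G --rw=randwrite \
        --bs=4k --ioengine=libaio --iodepth=32 --numjobs=4 --runtime=60 \
        --time_based --group_reporting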
 
Dear jdancer,

Thank you very much for your detailed and insightful response.
I truly appreciate the time you took to provide such comprehensive guidance.
I have some additional comments and questions based on your feedback.

Why Three Nodes?

The reason for starting with three nodes is that we are currently evaluating Proxmox as a potential migration destination.
Based on your recommendation, we will certainly consider expanding to five nodes.

Workload Dependency

I understand that performance is heavily dependent on the workload.
However, since testing in the production environment is not an option for us, we are struggling to determine the essential requirements for our setup.
Any advice on how to effectively estimate these requirements would be greatly appreciated.

HBA Cards

We currently do not have HBA cards available but will make arrangements to procure them as suggested.

IOPS Calculation

I am very interested in understanding the basis for your IOPS calculations.
Could you please share more details on how you measured the IOPS for your setup?

Once again, thank you for your invaluable assistance. Your experience and recommendations have been very helpful to us.

Best regards,
sagara
 
Hello,

With 3 nodes you can only tolerate a single node failure, and I believe that without quorum no data will be written.

With 3 nodes you should not see any such issue if one node goes down.

There are two things that can block IO:

- Running out of space
- Being below the minimum number of replicas for a single object

The default settings are size=3 and min_size=2, meaning that every object has one replica per node. If a node goes down, you will have objects with only two replicas, which in itself does not pose a risk as long as no other OSD or node goes down.
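
As a quick way to confirm these values on an existing pool (the pool name ceph-vm is a placeholder):

    ceph osd pool get ceph-vm size        # replicas kept per object
    ceph osd pool get ceph-vm min_size    # replicas required for IO to continue
    # only change min_size if you understand the availability/durability trade-off
    ceph osd pool set ceph-vm min_size 2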


On the other hand, if an OSD goes down, all the objects that were on it have to be re-replicated to the other OSDs in the same node. This is done so that the node still holds a replica of every object, which means the usage of the remaining OSDs goes up. This is where it is important to leave enough free space for Ceph to re-replicate and avoid blocked IO, which happens once an OSD gets to around 90% full. My personal recommendation is to use either one OSD per node or 4+ per node; with 2 or 3 OSDs per node it is very easy to run out of space if an OSD fails.
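
Per-OSD utilization and the configured full thresholds can be checked with standard Ceph commands, for example:

    ceph osd df tree             # per-OSD usage and how balanced the cluster is
    ceph osd dump | grep ratio   # full_ratio / backfillfull_ratio / nearfull_ratio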

From an operational point of view a 3-node Ceph cluster is a perfectly good solution.
 
