Questions for new CEPH cluster

prahn

Active Member
Dec 19, 2020
I will convert my 2-node-cluster with local storage to a 3-node-cluster with CEPH.

The 3 nodes will look like this:
Supermicro chassis with 8 HotSwap Bays
Xeon 4110 or 4215R
Supermicro Mainboard X11SPi-TF (with 2x 10 GBit onboard)
128 GB RAM
BROADCOM HBA 9400-8i
2x 1 GBit PCIe card
2x 480 GB SSD for Proxmox OS on ZFS-RAID1
3x 2 TB Intel D3-S4610 for CEPH

Now I have 2 main questions left:

1. I am currently using 1 GBit NIC for LAN and 1 GBit NIC for DMZ.
The 10 GBit ports should be used for meshed network for CEPH (without a switch).
Do I really need another separate network for CEPH?

2. Will 3x 2TB be enough for CEPH? This is 9 disks in total, but I read about a recommendation of minimum 12 disks?
Do I need any extra disks for DB/WAL or Bluestore??
 
Do I really need another separate network for CEPH?
This is going to be a nuanced answer depending on what you mean. Should you commingle non-Ceph traffic on a physical interface carrying your Ceph traffic? No, you really shouldn't, because saturating that interface will kill your cluster. Should you commingle your Ceph private and public interfaces? That's more a question of performance expectations, since doing so will halve your observed performance.

2. Will 3x 2TB be enough for CEPH? This is 9 disks in total, but I read about a recommendation of minimum 12 disks?
Do I need any extra disks for DB/WAL or Bluestore??
Again, that's more to do with performance expectations. This sounds innocent enough, but you really need to think through your use case and what you're expecting. I imagine it should work fine for a dozen or so VMs, as long as you're not expecting a gazillion IOPS or more than 6 TB usable.
 
I would suggest 4 OSDs per host on a 3-node cluster.

In a three-node cluster with size/min_size=3/2, each node has to hold exactly one replica of each object. If one OSD fails, the objects on that OSD need to be replicated onto the other OSDs of the same node in order to satisfy that requirement.

For example, in a cluster with two OSDs per node, if both OSDs are at ~45% usage and one fails, the other will need to take over its data and will end up at ~90% usage. When Ceph sees a single OSD at ~92% usage it will freeze all IO in the entire pool, which is undesirable. So if you take failure scenarios into account, 2 OSDs per node is arguably worse than a single OSD per node. The situation improves as you add more OSDs; 4 OSDs seems like a good balance, but 3 can certainly work if you account for possible failure scenarios.
 
The situation improves as you add more OSDs and 4 OSDs seems like a good balance
For example, in a cluster with two OSDs per node, if both OSDs are at ~45% usage and one fails, the other will need to take over its data and will end up at ~90% usage.

Seems to me like 4 is an arbitrary number; the risk of OSDs tripping the high watermark is a function of the used OSD capacity per node vs. the total OSD capacity per node. If the used ratio is 45%, it would be 67.5% after rebalancing onto two surviving OSDs (3 OSDs/node), or 60% after rebalancing with 4 OSDs/node. Neither will trip it (yet).

math:
0.45 * 3 = 1.35
1.35/2 = 0.675

0.45 * 4 = 1.8
1.8/3 = 0.60

That said, 4 OSDs/node will give you ~33% more potential IOPS than 3. It will also perform more predictably under rebalance, but that's equally true for 5, 6, 10, etc.
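The arithmetic above generalizes to any OSD count: after one OSD fails, its data spreads across the node's surviving OSDs. A quick sketch (illustrative only; it assumes perfectly uniform data distribution, which real placement groups won't give you):

```python
# Rough post-failure usage per surviving OSD on a node, assuming the
# failed OSD's data rebalances evenly onto the node's remaining OSDs.
def post_failure_usage(osds_per_node: int, used_ratio: float) -> float:
    return used_ratio * osds_per_node / (osds_per_node - 1)

for n in (2, 3, 4):
    u = post_failure_usage(n, 0.45)
    print(f"{n} OSDs/node at 45% used -> ~{u:.1%} after one OSD fails")
# 2 OSDs/node at 45% used -> ~90.0% after one OSD fails
# 3 OSDs/node at 45% used -> ~67.5% after one OSD fails
# 4 OSDs/node at 45% used -> ~60.0% after one OSD fails
```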
 
Ok, great, thank you for replying. At least I now know why 4 SSDs are recommended and what I need to watch out for.

This is going to be a nuanced answer depending on what you mean. Should you commingle non-Ceph traffic on a physical interface carrying your Ceph traffic? No, you really shouldn't, because saturating that interface will kill your cluster. Should you commingle your Ceph private and public interfaces? That's more a question of performance expectations, since doing so will halve your observed performance.
Yes, this is a small cluster with only 10 VMs and 4 LXC containers.
I separate public LAN traffic from DMZ traffic, and I also have a separate CEPH network.
What other traffic is needed? Is a fourth network really needed for Corosync or anything else?
We are a small company with not more than 15 people.
 
What other traffic is needed? Is a fourth network really needed for Corosync or anything else?
These are the types of traffic you will contend with on a hyperconverged Proxmox cluster:

1. corosync
2. ceph private
3. ceph public
4. vm traffic (in your case, private and dmz)

Corosync and Ceph are both sensitive to latency, so best practice is to keep them on their own separate interfaces so they aren't subject to contention. An interruption on the corosync interface can break your whole cluster, so when uptime is a concern, two corosync interfaces can be deployed. Ceph's public and private interfaces generate traffic at the same time during user access, and the private interface can also be busy with other functions (e.g. rebalancing), so they should be kept separate for best performance. But all of these are guidelines; it's possible to run all of this on one network, or on VLANs on the same interface, etc.

In an IDEAL world:

2 interfaces for corosync
2 interfaces for ceph public (active/passive or LACP lagg)
2 interfaces for ceph private (active/passive or LACP lagg)
2 interfaces for vm traffic (active/passive or LACP lagg)

each pair of interfaces to connect to a different switch.
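As a rough illustration of how such a split can be expressed, here is a hypothetical /etc/network/interfaces fragment for one node (a sketch only: interface names and addresses are made up, adjust to your hardware; the broadcast-mode bond is one common way to wire a switchless full-mesh Ceph network, with each 10G port cabled directly to one of the other two nodes):

```
# Hypothetical fragment -- NIC names and IPs are examples only.

# Dedicated corosync link (a second corosync ring not shown)
auto eno3
iface eno3 inet static
    address 10.0.10.11/24

# Ceph: two 10G ports bonded in broadcast mode for a switchless
# full-mesh setup between the three nodes
auto bond0
iface bond0 inet static
    address 10.0.20.11/24
    bond-slaves enp65s0f0 enp65s0f1
    bond-mode broadcast

# VM traffic: LACP bond behind a bridge
auto bond1
iface bond1 inet manual
    bond-slaves eno1 eno2
    bond-mode 802.3ad

auto vmbr0
iface vmbr0 inet manual
    bridge-ports bond1
    bridge-stp off
    bridge-fd 0
```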
 
Great, thank you for the explanation.
But what's best practice for a small 3-node cluster like this, when the hosts have "only" 4 network interfaces?
2 interfaces are used for the meshed CEPH network, so only 2 interfaces are left.
How should I distribute the network services over the interfaces?
Or should I rather invest in more network cards?
 
Just to be clear: with 4 OSDs you can use up to ~67% (rather than 45%) of each of them before the removal of a single one pushes the others over ~90%. Please keep in mind that these are very rough estimates; not all OSDs on a single node will have the same usage, as not all placement groups are the same size.
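That ~67% ceiling falls out of the same back-of-the-envelope math, just inverted. A small sketch (illustrative, with the same uniform-distribution caveat as above):

```python
# Rule-of-thumb ceiling: how full each of a node's OSDs can be so that
# losing one OSD doesn't push the survivors past the given threshold.
def max_safe_usage(osds_per_node: int, threshold: float = 0.90) -> float:
    return threshold * (osds_per_node - 1) / osds_per_node

for n in (2, 3, 4):
    print(f"{n} OSDs/node: keep each OSD below ~{max_safe_usage(n):.1%}")
# 2 OSDs/node: keep each OSD below ~45.0%
# 3 OSDs/node: keep each OSD below ~60.0%
# 4 OSDs/node: keep each OSD below ~67.5%
```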