First production cluster with CEPH

_KriS_

Jan 6, 2021
Hi,

We use a Proxmox server as a single node and it works perfectly fine. I would like to move all VMs to a 3-node cluster with Ceph and a subscription.
I'm looking at the SuperServer SYS-520P-WTR and my questions are:
  1. Can I use 2x 10Gb + 2x 25Gb NICs and build a mesh infrastructure without a switch?
  2. I'll have 2x SSD for the Proxmox OS. What is better for the OSDs: 1x 2TB NVMe SSD in each node, or more SSDs, e.g. 2+ 1TB NVMe SSDs?
  3. If I put 2x SSD in each node for OSDs, will I still have redundancy if one node goes down and one SSD fails in the remaining nodes?

--
KriS
 
1. Can I use 2x 10Gb + 2x 25Gb NICs and build a mesh infrastructure without a switch?
Yes, you could. It is explained quite well here:

https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server

But honestly I would not take that route, as you would have to use the 25Gb ports for the mesh and leave the 10Gb ones for communication from the cluster/VMs to the outside world, which would probably be a waste as they might end up connected to 1Gb switches (because, well, if you already had a 10Gb switch you wouldn't be thinking about a meshed cluster). Another reason is that it makes it hard to add nodes to that cluster in the future.

Stretch your budget and get a couple of 10Gb switches with proper LACP support. Use the 10Gb interfaces for VM/management traffic and the 25Gb interfaces for Ceph public/cluster traffic. Even if they only link at 10Gb you will have up to 2x10Gb of Ceph bandwidth per server, which should be enough given the low number of OSDs you plan to use. In the future you could get a 25Gb switch and simply connect the 25Gb NICs there.
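
Just to illustrate, here is a minimal sketch of what that bond could look like in /etc/network/interfaces on one node. The interface names (enp65s0f0/enp65s0f1 for the 25Gb ports) and the 10.10.10.0/24 Ceph network are assumptions for this example, adjust them to your hardware:

    # 25Gb ports bonded with LACP, dedicated to Ceph public/cluster traffic
    auto bond0
    iface bond0 inet static
            address 10.10.10.11/24
            bond-slaves enp65s0f0 enp65s0f1
            bond-mode 802.3ad
            bond-xmit-hash-policy layer3+4
            bond-miimon 100

The 10Gb ports would go into a similar bond attached to vmbr0 for VM and management traffic.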



2. I'll have 2x SSD for the Proxmox OS. What is better for the OSDs: 1x 2TB NVMe SSD in each node, or more SSDs, e.g. 2+ 1TB NVMe SSDs?
That is mostly a balance of price vs. future expansion, as a single server only fits so many disks. Using 2 SSDs takes up two disk bays instead of just one with a single 2TB disk.

With networks >= 10Gb I always tend to partition NVMe disks (approx. 500GB partitions) and set up an OSD on each partition, as that increases the chance of using all links in an LACP (layer3+4) bond. Keep in mind that you may need to adjust the Ceph CRUSH map, so in the beginning just stick to 1 disk = 1 OSD to make it easier.
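
For reference, a rough sketch of both options (the device name /dev/nvme1n1 is just a placeholder for this example):

    # Simple 1 disk = 1 OSD, via the Proxmox tooling
    pveceph osd create /dev/nvme1n1

    # Several OSDs on one NVMe device (here 2), via ceph-volume on the node itself
    ceph-volume lvm batch --osds-per-device 2 /dev/nvme1n1

As said above, the first option is the simplest and the one I'd start with.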


If I put 2x SSD in each node for OSDs, will I still have redundancy if one node goes down and one SSD fails in the remaining nodes?
Well, it depends on the pool configuration, on how much free space you have and on the sequence of events that produced that situation. I'm pretty sure that Ceph won't let you do I/O to the remaining OSDs in some of those cases, although that depends on some settings (mainly the pool's min_size).
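
If you want to check those settings on your pool, these are the relevant ones (the pool name vm-pool is just a placeholder):

    # Number of replicas kept, and the minimum replicas required to keep serving I/O
    ceph osd pool get vm-pool size
    ceph osd pool get vm-pool min_size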

Remember: Ceph is very good at healing itself, but it is designed to be used in much bigger scenarios, so a tiny 3-node cluster with two disks each is somewhat of a corner case and every detail matters.

Let's suppose we have the default pool configuration of 3 replicas, 2 OSDs per server, and events happen like this:

1- Server1 goes down: you lose 2 OSDs, 1 manager and 1 monitor. The cluster stays up as the monitors still have quorum (2 of 3). Both of Server1's OSDs are marked DOWN. All your PGs lose redundancy, as they can no longer comply with 3 replicas on 3 different servers now that only 2 are left. A warning tells you about all this.

2- 10 minutes later Server1 is still down and serv1.osd1 and serv1.osd2 get marked OUT (as set by mon_osd_down_out_interval). Ceph will try to comply with the 3 replicas of your pool by creating copies on the remaining OSDs, even if they are on the same host (thus making an exception to the default CRUSH map). If you have enough free space (the nearfull/backfillfull/full ratios set with ceph osd set-nearfull-ratio, ceph osd set-backfillfull-ratio and ceph osd set-full-ratio; the command sketch after this list shows how to check them) you will get 3 replicas of your data again. The warning about one manager and one monitor being down still remains, as do others regarding remapped PGs, of course.

3- After some time, you lose serv2.osd3. Ceph will mark it as DOWN and 10 minutes later as OUT, forcing a new rebalance among the surviving OSDs. If you have enough free space on those OSDs, Ceph might be able to recreate the replicas. Your pool will end up with 3 replicas on the 3 surviving OSDs, even if 2 of them are on the same host (serv3).

4- Eventually, you lose serv3.osd5. Ceph will mark it as DOWN and 10 minutes later as OUT. I'm mostly sure that Ceph won't be able to rebalance onto the remaining OSDs even if there were enough free space, because it will not place two replicas of the same PG on the same OSD. Your pool wants 3 replicas but you've got just 2 OSDs now.
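
When you reproduce a sequence like this in your lab, these standard Ceph commands let you watch each step (nothing here is specific to any particular setup):

    # Overall health, degraded/undersized PGs, recovery progress
    ceph -s
    ceph health detail

    # Which OSDs are up/down and in/out, grouped per host
    ceph osd tree

    # Free space per OSD and the nearfull/backfillfull/full ratios mentioned above
    ceph osd df
    ceph osd dump | grep ratio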

Notes:

- If any OSD fails during rebalancing, things will get ugly. For some time, some PGs will be left with just one replica and Ceph won't let you access them until there are at least two.
- If you fill your OSDs during rebalancing you'll have a hard time recovering the cluster unless you add new OSDs.


To sum it up:

- A 3-node Ceph cluster with that few OSDs is a corner case for Ceph and every component counts towards its availability. Ceph is resilient but can't work miracles just yet.
- Make sure you have enough OSDs to survive common failures.
- Get replacement NVMe disks if possible. Alternatively, find hardware providers who stock the components you use, so you can buy them and get them shipped ASAP.
- Watch out for free space. It is recommended not to go over 67% of total OSD space, to allow Ceph to rebalance if needed (in a 3-node cluster with 3-replica pools); a quick worked example follows this list. You don't want a full OSD, believe me.
- Do not skimp on network and disk quality, that's what Ceph builds on.
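
As a rough example of that space recommendation, assuming the 2x 1TB NVMe per node option from your question: 6x 1TB OSDs gives 6TB raw, which is 2TB usable with 3 replicas, and staying under ~67% means keeping roughly 1.3TB of actual VM data on the pool.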
 
Oh s*it! A lot to learn ahead of me.
Thanks for the detailed description. I need to read this a few times before I understand everything.

My main goal is HA for 3 VMs (SQL server, RDP server and Terminal GW server); even a 5-10 minute break is not a problem (the VM starts up on another node).
I won't add more nodes or more SSDs, because 2TB is enough for the systems I need; that's why I'm thinking about a simple 3-node setup.
Regarding what you wrote about dividing space on the NVMe SSDs, maybe instead of buying one big drive I should buy 4x 500GB?
I'm going to try this in VMs in the lab before buying any hardware, but I need a little help from someone to push me in the right direction. :)
 
