Advice for Ceph Network for NVMe-based Servers

Ingo S

Renowned Member
Oct 16, 2016
Hi Forum

We are planning to replace our current 6-node PVE/Ceph cluster with newer hardware. Most likely it will be based on full NVMe storage, so huge amounts of IOPS and bandwidth will be available. To make use of all that bandwidth, we are looking for a network setup that can keep up.
Right now I am trying to estimate costs for the network setup.

Planned structure is as follows:
- 2 locations with 3 servers each (we will take precautions against a split brain!)
- Locations are connected via 540m 48x OS2 fiber
- Ceph, Corosync and client network have to be physically separate, i.e. no VLANs. The same switch with a different port per network is okay.
- 16 ports per location might be suitable (3 ports per node), in case we add one or two additional nodes to the cluster.
- Maybe it would even be better to just use one switch at Location 1 and connect every node at Location 2 directly to the switch at Loc1. Would this eliminate bandwidth congestion on the uplink between Loc1 and Loc2?

My questions are:
  • What is important to look out for when buying a switch for this?
    I guess, besides bandwidth, latency becomes very important, especially with NVMe and high IOPS.

  • Does anyone have some advice on which type of switch to use?
    We mostly use Cisco, but we are open to other manufacturers as well.

  • ... any other ideas?
 
Planned structure is as follows:
- 2 locations with 3 servers each (we will take precautions against a split brain!)

You need 3 locations, or at least a monitor on a third site.
With 2 locations, if you lose one location (power failure, disaster, ...), the other location will be read-only.
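
A minimal sketch of the quorum arithmetic behind that (the monitor counts are just example numbers, not a recommendation):

# Ceph monitors need a strict majority (quorum) to keep the cluster writable.
def has_quorum(alive_monitors: int, total_monitors: int) -> bool:
    return 2 * alive_monitors > total_monitors

# 2 sites with 3 monitors each: losing either site leaves 3 of 6 -> no majority.
print(has_quorum(3, 6))   # False -> the surviving site blocks / goes read-only
# Add a tie-breaker monitor on a third site: 4 of 7 monitors survive.
print(has_quorum(4, 7))   # True -> the surviving site keeps quorum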

What is important to look out for when buying a switch for this?
I guess, besides bandwidth, latency becomes very important, especially with NVMe and high IOPS.
Switch latency is important (but you already have 540 m of distance; I don't know how much latency you get from that).
But more importantly, use the highest CPU frequency possible (both for Ceph nodes and Proxmox nodes), i.e. higher frequency, fewer cores.

- Maybe it would even be better to just use one switch at Location 1 and connect every node at Location 2 directly to the switch at Loc1. Would this eliminate bandwidth congestion on the uplink between Loc1 and Loc2?
So if Loc1 is down, Loc2 stops working too.


Why exactly do you need 2 locations? For disaster recovery?
 
You need 3 locations, or at least a monitor on a third site.
With 2 locations, if you lose one location (power failure, disaster, ...), the other location will be read-only.
Yep, I know this. We do not have a third location with a sufficient connection where we could deploy some servers. The rest of our locations are connected via quite old (>10 yr) OM2 fibre, sadly. Not much more than 1G is possible...

Switch latency is important (but you already have 540 m of distance; I don't know how much latency you get from that).
Hmm, based on the length and the speed of light in an optical fiber (~0.7c), the run contributes about 2.6 µs to the overall latency. Seems negligible to me.
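
For reference, a quick back-of-the-envelope calculation (the 0.7c velocity factor is an assumption; real fiber is usually specified around 0.67c):

# Rough one-way propagation delay over the 540 m fiber run.
C = 299_792_458          # speed of light in vacuum, m/s
FIBER_FACTOR = 0.7       # assumed velocity factor of the fiber
LENGTH_M = 540           # one-way fiber length between the sites

one_way_s = LENGTH_M / (C * FIBER_FACTOR)
print(f"one-way delay: {one_way_s * 1e6:.2f} us")        # ~2.6 us
print(f"round trip:    {2 * one_way_s * 1e6:.2f} us")    # ~5.1 us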

But more importantly, use the highest CPU frequency possible (both for Ceph nodes and Proxmox nodes), i.e. higher frequency, fewer cores.
We aim to get an AMD Threadripper with 8 cores at 3.7 GHz. That is about the highest clock we can buy on that platform, but it seems sufficiently fast to me.
BTW: PVE nodes = Ceph nodes in our setup.

So if Loc1 is down, Loc2 stops working too.
Why exactly do you need 2 locations? For disaster recovery?
Well, Loc1 is our city hall. It is the main location for most of our IT infrastructure. Currently everything is located there. Most of the other minor locations are connected to our city hall. If there were a fire or maybe a lightning strike, we would lose everything. Yes, we have a UPS with OVP and all that, but you never know; it's an old building.
Our idea was to distribute our servers, and more importantly our backup solution, across at least 2 sites to mitigate some of that risk.
We are still searching for a good solution, so these are first thoughts.
It might still be better to just keep everything at Loc1 and put a second backup solution offsite. I'm still trying to find a good concept...

Since budget planning for the upcoming year is due, I am mostly interested in finding some hardware to estimate prices for this project.

Currently we need 3 connections per host -> 18 connections:
6x Ceph (25G? 100G?)
6x Clients (10G will be plenty)
6x Corosync (1G)

Does anyone have good suggestions for some switches? 25G? 100G?
If we keep everything at Loc1, we just need two 16x 25/100G switches (redundancy and future expansion).
 
The rest of our locations are connected via quite old (>10 yr) OM2 fibre, sadly. Not much more than 1G is possible...
Well, it's more than enough for a third monitor.
You don't need to have storage (the OSD service) on the third node; you only need to run the monitor service (a few MB/s should be enough). Latency can be higher too.


If it's at all possible, you should create 2 separate Ceph clusters, one per location, and use the Ceph mirroring feature (not available in the Proxmox GUI, but you can manage it directly in the Ceph dashboard) to replicate them (so a primary active Ceph cluster and a secondary Ceph backup cluster).

For the switches, I personally use Mellanox switches; they are very good.
About speed: if you need a lot of IOPS, I think you don't push a lot of bandwidth? 25Gb should be enough.
Saturating 100Gb with one server requires a lot of tuning and CPU.
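
A rough sanity check of the "high IOPS does not mean high bandwidth" point (the IOPS figures and block sizes below are assumed example values, not measurements, and Ceph replication overhead is ignored):

# Wire speed needed to carry a given number of operations per second.
def throughput_gbit(iops: int, block_size_bytes: int) -> float:
    return iops * block_size_bytes * 8 / 1e9

# Typical VM workload: many small random writes.
print(throughput_gbit(100_000, 4 * 1024))     # ~3.3 Gbit/s, far below 25G
# Large sequential I/O (e.g. backups or migrations) is what fills the pipe.
print(throughput_gbit(10_000, 1024 * 1024))   # ~83.9 Gbit/s, needs 100G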
 
Well, it's more than enough for a third monitor.
You don't need to have storage (the OSD service) on the third node; you only need to run the monitor service (a few MB/s should be enough). Latency can be higher too.
Something like this was our plan. We will take appropriate measures against a split brain if we are at risk of running into such a situation. Nevertheless, thanks for the advice.

If it's at all possible, you should create 2 separate Ceph clusters, one per location, and use the Ceph mirroring feature (not available in the Proxmox GUI, but you can manage it directly in the Ceph dashboard) to replicate them (so a primary active Ceph cluster and a secondary Ceph backup cluster).
Seems like a nice solution for a disaster-proof setup. But this would require double the number of servers and storage space. Since nearly everything will be down anyway if we have a major outage at our main location, this might be a little too expensive. We will have a full offsite backup plan to prevent data loss in case of a fire etc.

For the switches, I personally use Mellanox switches; they are very good.
About speed: if you need a lot of IOPS, I think you don't push a lot of bandwidth? 25Gb should be enough.
Saturating 100Gb with one server requires a lot of tuning and CPU.
Thx. I will look into the Mellanox series of switches.
 
