Ceph network: public and cluster - some questions

Sycoriorz

Hi,

I want to set up a full mesh network for Ceph over a dual-port 100GbE NIC.
The documentation "Deploy Hyper-Converged Ceph Cluster" recommends splitting Ceph's public and cluster networks.
If I remember correctly, it is also recommended to have redundancy in the Ceph network.

So my question:
If both are recommended and I separate the public and cluster networks, and additionally want redundancy for the Ceph network,
can I achieve that with one dual-port NIC (1x2), or do I need two dual-port NICs (2x2)?

If I need 2x2:
Do public and cluster each need 100GbE?
Or would 10GbE be enough for one of them?

I want to do an all-NVMe setup with Micron 9300 MAX drives across 3 nodes.
Initially 4 NVMe drives per node are planned.
The maximum is 8 NVMe drives per node.

Many thanks for your help.
 
I have my Ceph cluster set up as follows:

3 servers, each with 2 RJ45 1GbE connections in a failover bond.
Each of the three servers also has a dual-port SFP+ network card.

I connected them without a switch and used bond-mode broadcast for that (maybe that's not such a good idea with 100GbE).
My cluster network is on the SFP+ ports and my public network on the RJ45 ports.

Works like a charm for me, although I don't really like the broadcast bond mode.

In terms of redundancy, this three-node setup should be fine.
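For reference, the broadcast bond on one of my nodes looks roughly like the sketch below; interface names and the subnet are placeholders (not my real values), following the approach from the Full Mesh Network for Ceph Server wiki.

Code:
# /etc/network/interfaces (excerpt, node 1) - broadcast bond over the two direct SFP+ links
# ens18/ens19 and 10.15.15.0/24 are placeholders, adapt to your hardware
auto bond0
iface bond0 inet static
        address 10.15.15.50/24
        bond-slaves ens18 ens19
        bond-miimon 100
        bond-mode broadcast
# the other two nodes get .51 and .52 on their own bond0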
 
3 servers, each with 2 RJ45 1GbE connections in a failover bond.
Is this for Corosync (pmxcfs)?

My cluster network is on the SFP+ ports and my public network on the RJ45 ports.
So SFP+ 10GbE for the cluster network.
Are you using NVMe?
If yes, how many OSDs?

So it seems the public network is not as hungry for high speed as the cluster network.
Can somebody else confirm that?
Is there a recommendation for the speed of the public network?

So if I am right, I need to separate the complete cluster into 3 dedicated networks for internal replication and communication between the nodes?

1. Corosync - 1GbE
2. Ceph public - 1GbE
3. Ceph cluster - 10GbE or maybe 100GbE

(maybe that's not such a good idea with 100GbE)
I plan to do it with 100GbE because of the Ceph benchmark, which was done with 100GbE.
There it is recommended to use 100GbE if the setup is made with NVMe.

https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2020-09-hyper-converged-mit-nvme.76517/

This is the passage where I read the recommendation, on page one:
"Hyper-converged setups can be deployed with Proxmox VE, using a cluster that contains a minimum of three
nodes, enterprise class NVMe SSDs, and a 100 gigabit network (10 gigabit network is the absolute minimum
requirement and already a bottleneck)"

I want to build the full mesh setup like in this documentation:
file:///D:/NextCloud/BMS/clusterbau/Proxmox/Mesh-Network-Ceph-Crossover-Proxmox.html
 
Is this for Corosync (pmxcfs)?
Yes.

So SFP+ 10GbE for the cluster network.
Are you using NVMe?
If yes, how many OSDs?
I'm using SSDs, 3 OSDs per server.

So it seems the public network is not as hungry for high speed as the cluster network.
Can somebody else confirm that?
Exactly. I don't see a lot of traffic on my public network.

file:///D:/NextCloud/BMS/clusterbau/Proxmox/Mesh-Network-Ceph-Crossover-Proxmox.html
This is probably this one: https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server
 
This is what the wiki says (https://pve.proxmox.com/wiki/Deploy_Hyper-Converged_Ceph_Cluster):
  • Public Network: You should setup a dedicated network for Ceph, this setting is required. Separating your Ceph traffic is highly recommended, because it could lead to troubles with other latency dependent services, e.g., cluster communication may decrease Ceph’s performance, if not done.
  • Cluster Network: As an optional step you can go even further and separate the OSD replication & heartbeat traffic as well. This will relieve the public network and could lead to significant performance improvements especially in big clusters.
Concerning the "Ceph Cluster Network", the Ceph project says (https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/):

It is possible to run a Ceph Storage Cluster with two networks: a public (client, front-side) network and a cluster (private, replication, back-side) network. However, this approach complicates network configuration (both hardware and software) and does not usually have a significant impact on overall performance. For this reason, we recommend that for resilience and capacity dual-NIC systems either active/active bond these interfaces or implement a layer 3 multipath strategy with e.g. FRR. If, despite the complexity, one still wishes to use two networks, each Ceph Node will need to have more than one network interface or VLAN. See Hardware Recommendations - Networks for additional details.

It can sometimes be a little confusing, because Proxmox also has a "cluster network". The "Proxmox Cluster Network" is Corosync, which is independent of Ceph. Ceph traffic and Corosync traffic should always be separated from each other, because latency problems on the Corosync connection can lead to trouble with your whole cluster.
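To make the naming concrete: in Ceph terms the split is just the public_network and cluster_network options in ceph.conf. A minimal sketch with example subnets (yours will differ):

Code:
# /etc/pve/ceph.conf (excerpt) - subnets are only examples
[global]
        public_network  = 10.10.10.0/24   # Ceph public: monitors and client traffic
        cluster_network = 10.10.20.0/24   # Ceph cluster: OSD replication and heartbeat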

Here is the recommended network structure from Proxmox, which could also be realized with a mesh:
Public is the "Proxmox Public Network" for your VMs, Cluster is the "Proxmox Cluster Network" (Corosync), and Storage is your "Ceph Network" (public and cluster for Ceph).

[Attachment: Proxmox Ceph_small.PNG - recommended network structure]
 
Many thanks.
I was confused about whether 3 dedicated networks are required.

Public is the "Proxmox Public Network" for your VMs, Cluster is the "Proxmox Cluster Network" (Corosync), and Storage is your "Ceph Network" (public and cluster for Ceph).
This means if I want to separate the Ceph cluster and Ceph public networks as recommended, I need 3 dedicated networks in total, including the Proxmox cluster network.

3 dedicated networks:

1. Ceph cluster network
2. Ceph public network
3. Corosync

4. would be the connection for clients to the backend.

-----------------------
The only open question from my side is the required NIC speed for the public network in an all-NVMe setup.

1. Ceph cluster network (100GbE)
2. Ceph public network (10GbE????)
3. Corosync (1GbE)

4. Connection for clients to the backend (1GbE)

Many thanks, best regards
 
If you have slots for 1GbE NICs, use them for Corosync.
Now split Ceph into 2 networks:
C1] Ceph cluster (OSDs etc.) - 100GbE
C2] Ceph public (monitors = client access) - 10GbE minimum

Now the Proxmox side:
P1] PVE cluster (Corosync) - 1GbE primary, 1GbE secondary (or use the Ceph backend or PVE frontend as secondary)
P2] PVE frontend (management, etc.) - 10GbE minimum

Because every VM connects to the Ceph monitors, and some external clients can connect too, C2 and P2 are usually the same subnet; splitting them has its pros and cons.

So if I ignore that you are doing a mesh (I usually think in VLANs):
1] VLAN for Corosync primary, 1G
2] VLAN for Corosync secondary, 1G or via 3]/4]
3] VLAN for Ceph cluster, 100G
4] VLAN for PVE frontend = Ceph public, 10G
5] VLANs for VMs via 4]

So in total 4 subnets/VLANs for a basic hyper-converged solution (a rough interface sketch follows below).

Most important: think twice about the subnets, especially because Ceph monitor IPs can't be changed easily.
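As a rough illustration of such a VLAN layout on a node, pretending everything rides on one bond just to show the syntax; VLAN IDs, interface names and subnets are made up:

Code:
# /etc/network/interfaces (excerpt) - illustrative only
auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094              # 5] guest VLANs are tagged on the VMs' virtual NICs

auto vmbr0.30
iface vmbr0.30 inet static
        address 10.30.30.11/24          # 1] VLAN for corosync primary

auto vmbr0.40
iface vmbr0.40 inet static
        address 10.40.40.11/24          # 3] VLAN for ceph cluster

auto vmbr0.50
iface vmbr0.50 inet static
        address 192.168.10.11/24        # 4] VLAN for pve frontend = ceph public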
 
This means if I want to separate the Ceph cluster and Ceph public networks as recommended, I need 3 dedicated networks in total, including the Proxmox cluster network.

Where is the recommendation to separate the Ceph cluster and Ceph public network? Proxmox says, "as an optional step you can go even further and separate the OSD replication & heartbeat traffic as well. This will relieve the public network and could lead to significant performance improvements especially in big clusters." If you separate your Ceph cluster network (the storage network in the picture) you separate just the OSD replication & heartbeat from the "normal" Ceph traffic. This can, as Proxmox says, be a performance improvement in big clusters (and I do not think it is a big cluster if you use a mesh network), while Ceph says it "does not usually have a significant impact on overall performance".

If you use your 100GBit link for Ceph (Ceph public and Ceph cluster) you are perfectly fine. Separating these networks is mainly important in bigger clusters, because on a rebalance (OSDs crashing or nodes going down) Ceph automatically redistributes your data to fulfill the configured size (the data exists on 2, 3 or even more different OSDs). The more OSDs you have, the higher the probability that one or more of them fail.

You should also keep in mind that your Proxmox public network (not Ceph) is normally used for backups (if not separated) and also for transferring VMs from one node to another (live migration). If you migrate VMs that are on Ceph storage, no data from the disks will be transferred, only the RAM of the running VMs. If you want to live migrate all VMs from one node to another (for an update without impact on the VMs, or for other maintenance reasons without downtime), that can be all the RAM in use on the node, sometimes hundreds of GBs. Therefore you should use a 10Gbit link if possible.
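If you want the live migration traffic to go over a faster network than the management one, it can be pinned in datacenter.cfg; a small example with a made-up subnet:

Code:
# /etc/pve/datacenter.cfg - send live migration traffic over the fast subnet
migration: secure,network=10.10.10.0/24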

For Corosync a 1GBit connection is fine if the latency is low enough (test it beforehand; it should be < 6-7 ms with all nodes communicating).
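A simple way to get a feeling for the latency before trusting a link for Corosync (the node address is an example):

Code:
# run from each node against the others; check the avg/max rtt values in the summary line
ping -c 100 -i 0.2 -q 10.30.30.12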

In general you should also think about redundancy of the links, especially for the Corosync network. Corosync should also be separated onto a dedicated link. A VLAN on another connection is not enough, because a higher load on the physical connection will lead to latency problems.

4] VLAN for PVE frontend = Ceph public, 10G

The PVE frontend is not Ceph public. These are totally different networks.
 
Thanks for the detailed response.
4] VLAN for PVE frontend = Ceph public, 10G
5] VLANs for VMs via 4]

I have some 10GbE NICs free which I want to use only in bridge mode for the guest VMs.
I believe that is what you mean in point 5].

4] In my scenario this would be my main network with all the users in the company.
Would this be a bad decision with regard to interrupting the Ceph public traffic?
Or would you say it doesn't matter?
Or should I administrate from an admin PC that is in the same VLAN as the public network?
For security it is not so important to do that.
But if it is required for performance, we must do it.

Thanks
 
If you separate your Ceph cluster network (the storage network in the picture) you separate just the OSD replication & heartbeat from the "normal" Ceph traffic. This can, as Proxmox says, be a performance improvement in big clusters (and I do not think it is a big cluster if you use a mesh network), while Ceph says it "does not usually have a significant impact on overall performance".
So you are right. I thought I had read that in the documentation. I don't have a big cluster, as you already said.

For Corosync a 1GBit connection is fine if the latency is low enough (test it beforehand; it should be < 6-7 ms with all nodes communicating).

In general you should also think about redundancy of the links, especially for the Corosync network. Corosync should also be separated onto a dedicated link. A VLAN on another connection is not enough, because a higher load on the physical connection will lead to latency problems.
So to avoid latency problems I will also put Corosync on a full mesh without a switch. Do I have redundancy when I do the full mesh as described in the documentation?

If you migrate VMs that are on Ceph storage, no data from the disks will be transferred, only the RAM of the running VMs. If you want to live migrate all VMs from one node to another (for an update without impact on the VMs, or for other maintenance reasons without downtime), that can be all the RAM in use on the node, sometimes hundreds of GBs. Therefore you should use a 10Gbit link if possible.
This makes complete sense. So in this case I need to set up 10Gbit.

____________________________________

OK, so I think with this information I have understood the setup:

1. Ceph network (cluster + public): 100Gbit, dual NIC, full mesh. Dedicated.
2. Corosync: 1Gbit, dual NIC, full mesh to reduce latency. Dedicated.
3. PVE connection: 10Gbit, single.

I hope this is now the best setup for my use case.
Thanks a lot.
 
Thanks for the detailed response.


I have some 10GbE NICs free which I want to use only in bridge mode for the guest VMs.
I believe that is what you mean in point 5].

4] In my scenario this would be my main network with all the users in the company.
Would this be a bad decision with regard to interrupting the Ceph public traffic?
Or would you say it doesn't matter?
Or should I administrate from an admin PC that is in the same VLAN as the public network?
For security it is not so important to do that.
But if it is required for performance, we must do it.

Thanks
I have a little problem deciphering "connection for clients to the backend". What is the client, what is the backend?

If you have the PVE frontend on the same subnet as all the users in the company - from my point of view it's a NO for both security and performance. Any broadcast can overwhelm this subnet. Create a management VLAN/subnet for the PVE frontend, accessible via a firewall.
 
2. Corosync: 1Gbit, dual NIC, full mesh to reduce latency. Dedicated.
3. PVE connection: 10Gbit, single.
2] What will you do when you connect another PVE host? You will need to rework the Corosync network. Think twice about it; 1G switches are cheap.
3] What if your connection/switch fails during a backup/VM migration/etc.? Use LACP and don't even think about a single connection.
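A minimal sketch of such an LACP bond in /etc/network/interfaces (interface names and address are placeholders, and the switch ports must be configured for LACP/802.3ad as well):

Code:
auto bond1
iface bond1 inet manual
        bond-slaves enp65s0f0 enp65s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
        address 192.168.10.21/24
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0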
 
2] What will you do when you connect another PVE host? You will need to rework the Corosync network. Think twice about it; 1G switches are cheap.
I thought that this is only for the Corosync communication, and as admin I don't really need to listen on this connection.
For my maintenance or status checks I can use the GUI or CLI of Proxmox itself.

3] What if your connection/switch fails during a backup/VM migration/etc.? Use LACP and don't even think about a single connection.
For sure, yes, you are right, it is an important point.
But let's say I do LACP: in my understanding the point of failure is then the switch itself, isn't it?
If the switch dies, LACP doesn't help.
Only if I use two NICs with 10GbE each could I aggregate over 2 NICs to two ports on the same switch.
That would not help if the switch dies, but it would if one NIC dies.

I have a little problem deciphering "connection for clients to the backend". What is the client, what is the backend?
By the backend I mean the GUI of PVE or the listening ports for SSH (CLI).
If you have the PVE frontend on the same subnet as all the users in the company - from my point of view it's a NO for both security and performance. Any broadcast can overwhelm this subnet. Create a management VLAN/subnet for the PVE frontend, accessible via a firewall.
OK, I must see how I can arrange that.
 
Of course it is easier, concerning scalability, to use switches for the Corosync link (for all links, actually).
But in the end there will always be a scalability problem with a mesh network. If you want to add nodes, you will also need a storage connection.
It is just a question of what you have planned. If you want to expand with more nodes in the next years, it is probably better to use switches from the beginning. If you just want to run this infrastructure as built, and build a new one when you need more capacity etc., it can be done with the mesh.
Concerning the Proxmox public/VM network, you can work with VLANs on your 10GBit connection: a VLAN for the management connection, a VLAN for the VM traffic, etc. In the best case there is also a redundant link for this (2 switches, 2x 10GBit on each node, LACP for the connections to the nodes and an MLAG between the switches; with the MLAG the LACP runs over the two switches and you have full redundancy).
In the end it is just a question of the money you can/want to spend.
 
Hello there,
I have a NAS which comes with 2x 10GbE NICs.
I have 3 servers with the following network capacity each:
1x dual-port 10GbE
1x quad-port 1GbE
According to what I read here, a network separation scenario could be:

  • 1x 10GbE for the NAS connection
  • 1x 10GbE for the Ceph connection (cluster and public)
  • 1x 1GbE for the PVE management connection
  • 1x 1GbE for the PVE cluster (Corosync)
  • 2x 1GbE (LACP bond) for the PVE VMs network

Also, a question I may ask is: where do I add the gateway setting?
  1. on the VMs network, or
  2. on the management network
 
  • 1x 10GbE for the NAS connection
  • 1x 10GbE for the Ceph connection (cluster and public)
  • 1x 1GbE for the PVE management connection
  • 1x 1GbE for the PVE cluster (Corosync)
  • 2x 1GbE (LACP bond) for the PVE VMs network
Looks okay. There is no redundancy though for the NAS and Ceph networks. Also, keep in mind that Corosync can handle up to 8 links. Therefore it is a good idea to configure additional links on the other networks, to give it options should the dedicated Corosync link be down.
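For example, a second Corosync link can live on one of the other networks as a fallback. Roughly how that looks in corosync.conf (addresses are examples; always edit a copy and bump config_version as described in the docs):

Code:
# /etc/pve/corosync.conf (excerpt) - link0 = dedicated 1GbE, link1 = fallback on another network
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.20.20.1
    ring1_addr: 10.10.10.1
  }
  # ... same pattern for the other nodes
}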

Also, a question I may ask is: where do I add the gateway setting?
  1. on the VMs network, or
  2. on the management network
On the network where the gateway is available and which you want Proxmox VE to use to access the internet and all the other networks it might not have a direct connection to.
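In other words, there is a single default gateway, and the other networks get only an address. A sketch with example addresses, assuming the gateway sits on the management bridge:

Code:
# /etc/network/interfaces (excerpt)
auto vmbr0
iface vmbr0 inet static
        address 192.168.1.21/24
        gateway 192.168.1.1          # the one default gateway, here on the management network
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
# the Ceph, Corosync and NAS interfaces get only an address, no gateway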
 
Thanks a lot for responding, Aaron.
Therefore it is a good idea to configure additional links on the other networks, to give it options should the dedicated Corosync link be down.
You mean to add more NICs?

Another question is whether Ceph is reliable enough for a production environment compared to LINSTOR (or maybe NFS to the NAS).
The Ceph minimum is 3 nodes (like my environment), but it can only tolerate a single node failure, while a failure of 2 nodes would be total destruction (???)
 
Ceph is both reliable and scalable. You can configure the number of replicas each object will have (set to 3 by default) and the minimum number of replicas each object can have (2 by default). If two nodes fail you will be below the minimum number of replicas for all objects, and the pool will block all write operations until the situation is resolved, which is very different from total destruction, considering that you still have an instance of your data.
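Those two values are the size/min_size settings of the pool, for example (the pool name is made up):

Code:
# create a pool with 3 replicas, minimum 2 to keep accepting writes
pveceph pool create vm-pool --size 3 --min_size 2
# or adjust an existing pool
ceph osd pool set vm-pool size 3
ceph osd pool set vm-pool min_size 2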

Ceph can recover from losing an arbitrary number of OSDs and/or nodes if you provide it with enough resources. If you want to maintain normal operations with two nodes failing, you would need 4 nodes in your cluster, in which case Ceph could run in a degraded state (having only 2 replicas for each object) until the nodes rejoin the cluster.
 
