Ceph public and cluster network: some questions

Mar 19, 2018
Hi all,

I want to set up a full mesh network for Ceph over dual-port 100GbE NICs.
In the documentation "Deploy Hyper-Converged Ceph Cluster" it is recommended to split Ceph's public and cluster networks.
If I remember right, I have also read that it is recommended to have redundancy in the Ceph network.

Now to my question:
If both are recommended, and I separate the public and cluster networks and additionally want redundancy for the Ceph network:
Can I achieve that with one dual-port NIC (1x2), or do I need two dual-port NICs (2x2)?

If I need 2x2:
Do public and cluster each need 100GbE?
Or would 10GbE be enough for one of them?

I want to do an all-NVMe setup with Micron 9300 MAX drives across 3 nodes.
Initially 4 NVMe drives per node are planned.
The maximum is 8 NVMe drives per node.

Many thanks for your help.
 
Oct 14, 2020
I have my Ceph cluster set up as follows:

3 servers, each with 2 RJ45 1GbE connections in bond-mode failover.
Each of the three servers also has a dual SFP+ network card.

I connected them directly without a switch and used bond-mode broadcast for that (maybe that's not such a good idea with 100GbE).
My cluster network is on the SFP+ NICs and my public network on the RJ45 ports.

Works like a charm for me, although I don't really like the broadcast bond mode.

In terms of redundancy, this three-node setup should be fine.
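
For reference, a setup like that could look roughly like this in /etc/network/interfaces on one node. This is only a sketch; the interface names, addresses and subnets are made up, not taken from this post:

# failover (active-backup) bond over the two RJ45 1GbE ports
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode active-backup
    bond-miimon 100

# Proxmox public network / corosync on top of the failover bond
auto vmbr0
iface vmbr0 inet static
    address 192.168.10.11/24
    gateway 192.168.10.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0

# Ceph network over the dual SFP+ ports, crossed directly to the two other nodes
auto bond1
iface bond1 inet static
    address 10.10.20.11/24
    bond-slaves ens1f0 ens1f1
    bond-mode broadcast
    bond-miimon 100

The broadcast bond is one of the switchless variants described in the Proxmox full mesh wiki article; it duplicates every frame onto both direct links, which costs bandwidth but survives the loss of either cable or peer.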
 
Mar 19, 2018
3 servers, each with 2 RJ45 1GbE connections in bond-mode failover.
This is for corosync (pmxcfs)?

My cluster network is on the SFP+ NICs and my public network on the RJ45 ports.
So SFP+ 10GbE for the cluster network.
Are you using NVMe?
If yes, how many OSDs?

So it seems the public network is not as hungry for high speed as the cluster network.
Can somebody else confirm that?
Is there a recommendation for the speed of the public network?

So if I am right, I need 3 separate networks for the whole cluster, handling the internal replication and the communication between nodes?

1. Corosync - 1GbE
2. Ceph public - 1GbE
3. Ceph cluster - 10GbE or maybe 100GbE

(maybe that's not such a good idea with 100GbE)
I am thinking of doing it with 100GbE because of the Ceph benchmark, which was done with 100GbE.
There it is recommended to use 100GbE if the setup uses NVMe.

https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2020-09-hyper-converged-mit-nvme.76517/

This is the passage where I read the recommendation.
Page one:
"Hyper-converged setups can be deployed with Proxmox VE, using a cluster that contains a minimum of three
nodes, enterprise class NVMe SSDs, and a 100 gigabit network (10 gigabit network is the absolute minimum
requirement and already a bottleneck)"

The full mesh setup I want to build is like in this documentation:
file:///D:/NextCloud/BMS/clusterbau/Proxmox/Mesh-Network-Ceph-Crossover-Proxmox.html
 
Oct 14, 2020
This is for corosync (pmxcfs)?
Yes.

So SFP+ 10GbE for the cluster network.
Are you using NVMe?
If yes, how many OSDs?
I'm using SSDs, 3 OSDs per server.

So it seems the public network is not as hungry for high speed as the cluster network.
Can somebody else confirm that?
Exactly. I don't see a lot of traffic on my public network.

file:///D:/NextCloud/BMS/clusterbau/Proxmox/Mesh-Network-Ceph-Crossover-Proxmox.html
That is probably this one: https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server
 
Sep 15, 2020
This is what the wiki says (https://pve.proxmox.com/wiki/Deploy_Hyper-Converged_Ceph_Cluster):
  • Public Network: You should setup a dedicated network for Ceph, this setting is required. Separating your Ceph traffic is highly recommended, because it could lead to troubles with other latency dependent services, e.g., cluster communication may decrease Ceph’s performance, if not done.
  • Cluster Network: As an optional step you can go even further and separate the OSD replication & heartbeat traffic as well. This will relieve the public network and could lead to significant performance improvements especially in big clusters.
Concerning the "Ceph Cluster Network" from the Ceph Project (https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/):

It is possible to run a Ceph Storage Cluster with two networks: a public (client, front-side) network and a cluster (private, replication, back-side) network. However, this approach complicates network configuration (both hardware and software) and does not usually have a significant impact on overall performance. For this reason, we recommend that for resilience and capacity dual-NIC systems either active/active bond these interfaces or implement a layer 3 multipath strategy with e.g. FRR. If, despite the complexity, one still wishes to use two networks, each Ceph Node will need to have more than one network interface or VLAN. See Hardware Recommendations - Networks for additional details.

It can sometimes be a little confusing, because Proxmox also has a cluster network. The "Proxmox cluster network" is Corosync, which is independent from Ceph. Ceph traffic and Corosync traffic should always be separated from each other, because latency problems on the Corosync connection can lead to trouble with your whole cluster.
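
To make the distinction concrete: the two Ceph networks are just two settings in ceph.conf, while Corosync has its own, completely separate configuration. A minimal sketch, with placeholder subnets:

# /etc/pve/ceph.conf (excerpt)
[global]
    public_network  = 10.10.10.0/24   # Ceph public: monitors, MGR, client/VM traffic
    cluster_network = 10.10.20.0/24   # Ceph cluster: OSD replication & heartbeat only

If cluster_network is left out, OSD replication simply runs over the public network, which is exactly the "optional step" the wiki describes.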

Here is the recommended network structure from Proxmox, which could also be realized with a mesh:
Public is the "Proxmox public network" for your VMs. Cluster is the "Proxmox cluster network" (Corosync), and Storage is your "Ceph network" (public and cluster for Ceph).

[Attached image: Proxmox Ceph_small.PNG, the recommended network structure]
 
Mar 19, 2018
Many thanks.
I was confused about whether 3 dedicated networks are required.

Public is the "Proxmox public network" for your VMs. Cluster is the "Proxmox cluster network" (Corosync), and Storage is your "Ceph network" (public and cluster for Ceph).
This means if I want to separate the Ceph cluster and Ceph public networks as recommended, I need 3 dedicated networks in total, including the Proxmox cluster network:

3 dedicated networks:

1. Ceph cluster network
2. Ceph public network
3. Corosync

4. is the connection for clients to the backend.

-----------------------
The only open question from my side is the required NIC speed for an all-NVMe setup on the public network.

1. Ceph cluster network (100GbE)
2. Ceph public network (10GbE???)
3. Corosync (1GbE)

4. is the connection for clients to the backend. (1GbE)
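
On the Proxmox side such a split can be passed to pveceph when initializing Ceph; a sketch with placeholder subnets (check man pveceph for the exact option names in your version):

pveceph init --network 10.10.10.0/24 --cluster-network 10.10.20.0/24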

Many thanks, best regards.
 

czechsys

Nov 18, 2015
If you have slots for 1GbE NICs, use them for corosync.
Now, split Ceph into 2 networks:
C1] Ceph cluster (OSDs etc.) - 100GbE
C2] Ceph public (monitors = client access) - 10GbE minimum

Now the Proxmox side:
P1] PVE cluster (corosync) - 1GbE primary, 1GbE secondary (or use the Ceph backend or PVE frontend as the second link)
P2] PVE frontend (management, etc.) - 10GbE minimum

Because every VM connects to the Ceph monitors, and some external clients can connect too, C2 and P2 are usually the same subnet; splitting them has its pros and cons.

So if I ignore that you are doing a mesh (I usually think in VLANs):
1] VLAN for corosync primary, 1G
2] VLAN for corosync secondary, 1G, or via 3]/4]
3] VLAN for Ceph cluster, 100G
4] VLAN for PVE frontend = Ceph public, 10G
5] VLANs for VMs, via 4]

So, 4 subnets/VLANs in total for a basic hyper-converged solution.
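
On the node side, 4] and 5] could share one physical (or LACP-bonded) 10G uplink via a VLAN-aware bridge. A rough sketch, with made-up VLAN IDs, names and addresses:

# /etc/network/interfaces (excerpt)
auto vmbr0
iface vmbr0 inet manual
    bridge-ports bond0            # the 10G uplink (single port or LACP bond)
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

# 4] PVE frontend = Ceph public VLAN
auto vmbr0.40
iface vmbr0.40 inet static
    address 10.40.0.11/24

# 5] VM traffic just uses tagged VLANs on vmbr0; the host needs no address there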

Most importantly: think twice about the subnets, especially because the Ceph monitor IPs can't be changed that easily.
 
Sep 15, 2020
This means if I want to separate the Ceph cluster and Ceph public networks as recommended, I need 3 dedicated networks in total, including the Proxmox cluster network

Where is the recommendation to separate the Ceph cluster and Ceph public network? Proxmox says, "as an optional step you can go even further and separate the OSD replication & heartbeat traffic as well. This will relieve the public network and could lead to significant performance improvements especially in big clusters." If you separate your Ceph cluster network (the storage network in the picture) you only separate the OSD replication & heartbeat from the "normal" Ceph traffic. This can, as Proxmox says, be a performance improvement in big clusters (I do not think it is a big cluster if you use a mesh network), while Ceph says it "does not usually have a significant impact on overall performance".

If you use your 100Gbit link for Ceph (Ceph public and Ceph cluster) you are perfectly fine. Separating these networks is mainly important in bigger clusters, because when there is a rebalance (OSDs crashing or nodes going down), Ceph automatically redistributes your data to fulfill the configured size (the number of copies, 2, 3 or even more, that must exist on different OSDs). If you have many OSDs, the probability that one or more fail is higher.

You should also consider that your Proxmox public network (not Ceph) is normally used for backups (if not separated) and also for transferring VMs from one node to another (live migration). If you migrate VMs that are on Ceph storage, no disk data is transferred, only the RAM of the running VMs. If you want to live migrate all VMs from one node to another (for an update without impact on the VMs, or for other maintenance without downtime), this can be all the RAM used on the node, sometimes hundreds of GBs. Therefore you should use a 10Gbit link if possible.
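
As a side note, live migration traffic does not have to go over the default management network; it can be pinned to a faster subnet in datacenter.cfg. A sketch with a placeholder subnet (see the Proxmox docs for the exact property syntax of your version):

# /etc/pve/datacenter.cfg
migration: secure,network=10.40.0.0/24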

For Corosync a 1Gbit connection is fine if the latency is low enough (test it beforehand; it should be < 6-7 ms while all nodes are communicating).
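
A quick way to check that before trusting a link with Corosync (the peer address is a placeholder):

ping -c 100 -i 0.2 10.50.0.12   # look at the avg/max values in the final rtt line

Once the cluster is running, corosync-cfgtool -s shows the state of each link.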

In general you should also think about redundancy of the links, especially for the Corosync network. Corosync should also be separated onto a dedicated link. A VLAN on another connection is not enough, because higher load on the physical link will lead to latency problems.
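
Corosync itself supports several redundant links (kNet). A sketch of how two independent links could be passed when creating the cluster; the addresses are placeholders and the exact options should be checked in man pvecm:

# on the first node
pvecm create mycluster --link0 10.50.0.11 --link1 10.51.0.11
# when joining the other nodes
pvecm add 10.50.0.11 --link0 10.50.0.12 --link1 10.51.0.12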

4] VLAN for PVE frontend = Ceph public, 10G

PVE frontend is not Ceph public. These are totally different networks.
 
Mar 19, 2018
Thanks for the detailed response.
4] VLAN for PVE frontend = Ceph public, 10G
5] VLANs for VMs, via 4]

I have some 10GbE NICs free which I want to use only in bridge mode for the guest VMs.
I believe that is what you mean in point 5].

4] In my scenario this would be in my main network with all the users in the company.
Would this be a bad decision regarding interruption of the Ceph public transfers?
Or would you say it doesn't matter?
Or should I administrate from an admin PC that is in the same VLAN as the frontend?
For security it is not so important to do that.
But if it is required for performance reasons, we will do it.

Thanks.
 
Mar 19, 2018
If you separate your Ceph cluster network (the storage network in the picture) you only separate the OSD replication & heartbeat from the "normal" Ceph traffic. This can, as Proxmox says, be a performance improvement in big clusters (I do not think it is a big cluster if you use a mesh network), while Ceph says it "does not usually have a significant impact on overall performance".
So you are right. I thought I had read that in the documentation. I don't have a big cluster, as you already said.

For Corosync a 1Gbit connection is fine if the latency is low enough (test it beforehand; it should be < 6-7 ms while all nodes are communicating).

In general you should also think about redundancy of the links, especially for the Corosync network. Corosync should also be separated onto a dedicated link. A VLAN on another connection is not enough, because higher load on the physical link will lead to latency problems.
So to avoid latency problems, I will also put corosync on a full mesh without a switch. Do I get redundancy when I do the full mesh as described in the documentation?

If you migrate VMs that are on Ceph storage, no disk data is transferred, only the RAM of the running VMs. If you want to live migrate all VMs from one node to another (for an update without impact on the VMs, or for other maintenance without downtime), this can be all the RAM used on the node, sometimes hundreds of GBs. Therefore you should use a 10Gbit link if possible.
This makes complete sense. So in this case I need to set up 10Gbit.

____________________________________

OK, so I think with this information I have understood the setup.

1. Ceph network (cluster + public): 100Gbit dual NIC full mesh, dedicated
2. Corosync: 1Gbit dual NIC full mesh to reduce latency, dedicated
3. PVE connection: 10Gbit, single link

I hope this is now the best approach for my use case.
Thanks a lot.
 

czechsys

Nov 18, 2015
Thanks for the detailed response.


I have some 10GbE NICs free which I want to use only in bridge mode for the guest VMs.
I believe that is what you mean in point 5].

4] In my scenario this would be in my main network with all the users in the company.
Would this be a bad decision regarding interruption of the Ceph public transfers?
Or would you say it doesn't matter?
Or should I administrate from an admin PC that is in the same VLAN as the frontend?
For security it is not so important to do that.
But if it is required for performance reasons, we will do it.

Thanks.
I have a little problem deciphering "connection for clients to the backend". What is the client, what is the backend?

If you have the PVE frontend on the same subnet as all the users in the company: from my point of view that is a security and performance NO. Any broadcast can overwhelm this subnet. Create a management VLAN/subnet for the PVE frontend, accessible via a firewall.
 

czechsys

Nov 18, 2015
2. Corosync: 1Gbit dual NIC full mesh to reduce latency, dedicated
3. PVE connection: 10Gbit, single link
2] What will you do when you connect another PVE host? You will need to rework the corosync network. Think twice about it. 1G switches are cheap.
3] What if your connection/switch fails during a backup/VM migration/etc.? Use LACP and don't even think about a single connection.
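
An LACP bond in /etc/network/interfaces could look roughly like this (a sketch; the port names are made up and the switch ports must be configured for LACP as well):

auto bond0
iface bond0 inet manual
    bond-slaves enp65s0f0 enp65s0f1
    bond-mode 802.3ad                 # LACP
    bond-xmit-hash-policy layer2+3
    bond-miimon 100

This protects against a NIC, cable or port failure; surviving a switch failure needs two switches with MLAG (or a stack) so the bond can span both switches.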
 
Mar 19, 2018
2] What will you do when you connect another PVE host? You will need to rework the corosync network. Think twice about it. 1G switches are cheap.
I thought that this is only for corosync communication and that as admin I don't really need to access this connection directly.
For maintenance or status I can use the GUI or CLI on Proxmox itself.

3] What if your connection/switch fails during a backup/VM migration/etc.? Use LACP and don't even think about a single connection.
For sure, yes, you are right; it is an important point.
But let's say I do LACP; in my understanding the single point of failure is then the switch itself, right?
If the switch dies, LACP doesn't help.
Only if I use two NICs with 10GbE each could I aggregate over 2 NICs to two ports on the same switch.
That would not help if the switch is dying, only if one NIC dies.

I have a little problem deciphering "connection for clients to the backend". What is the client, what is the backend?
By backend I mean the GUI of PVE or the listening ports for SSH (CLI).
If you have the PVE frontend on the same subnet as all the users in the company: from my point of view that is a security and performance NO. Any broadcast can overwhelm this subnet. Create a management VLAN/subnet for the PVE frontend, accessible via a firewall.
OK, I must see how I can arrange that.
 
Sep 15, 2020
Of course, concerning scalability it is easier to use switches for the corosync link (for all links, actually).
But in the end, with a mesh network there will always be a scalability problem. If you want to add nodes you will also need a storage connection.
It is just a question of what you have planned. If you want to expand with more nodes in the next years, it is probably better to use switches from the beginning. If you just want to run this infrastructure as built, and build a new one when you need more capacity etc., it can be done with the mesh.
Concerning the Proxmox public / VM network, you can work with VLANs on your 10Gbit connection: a VLAN for the management connection, a VLAN for the VM traffic, etc. In the best case there is also a redundant link for this (2 switches, 2x 10Gbit on each node, LACP for the connections to the nodes and an MLAG between the switches; with the MLAG the LACP runs over both switches and you get full redundancy).
In the end it is just a question of the money you can/want to spend.
 