Two separate full meshes for cluster / corosync in a 3-node cluster

mhert

Hello Guys!

I'm setting up our new cluster at the moment.

The cluster network is a 25 GBit Full-Mesh configuration between the nodes (up and running! ;-) )

To follow the KISS principle and reduce the point(s) of failure, I thought about a second mesh for corosync (with fallback over the public network).
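For the fallback part, my rough idea is to give corosync two links per node: link 0 on the corosync mesh and link 1 on the public network as fallback. A sketch of the relevant parts of /etc/pve/corosync.conf (only node 1 shown, prox02/prox03 analogous; all addresses are placeholders):

Code:
nodelist {
  node {
    name: prox01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.12.101
    ring1_addr: 192.168.1.101
  }
}

totem {
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}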

The configuration of the cluster network looks like this (node 1 as an example):

frr defaults traditional
hostname prox01
log syslog warning
ip forwarding
no ipv6 forwarding
service integrated-vtysh-config
!
interface lo
ip address 10.10.11.101/32
ip router openfabric 1
openfabric passive
!
interface enp65s0f0np0
ip router openfabric 1
openfabric csnp-interval 2
openfabric hello-interval 1
openfabric hello-multiplier 2
!
interface enp65s0f1np1
ip router openfabric 1
openfabric csnp-interval 2
openfabric hello-interval 1
openfabric hello-multiplier 2
!
line vty
!
router openfabric 1
net 49.0001.1111.1111.1111.00
lsp-gen-interval 1
max-lsp-lifetime 600
lsp-refresh-interval 180


Is it possible to set up a second full mesh on the same nodes for Corosync?

Thanks in advance.

itiser
 
I have no clue how to modify the config file I have posted above to create a second (separate) fabric for e.g.
IP 10.10.12.101/32...
 
Hi mhert/itiser
I am just sitting in front of almost the same problem:

2 openfabric mesh networks with 2 NICs in each
( based on https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server)

Maybe it is possible with a second "real" loopback device, e.g. lo0, NOT lo:0 (that does not work),
but I am not able to create a second real loopback device. Maybe you have an idea?

Using a Linux bridge device with no members does not work either.

The solution for me is right now to use the openfabric mesh for ceph and use a second RSTP mesh for pve cluster corosync.

Do you have any new ideas on how to set up 2 openfabric mesh networks?

IMHO:
For the second mesh network you need a new net ID, e.g.
net 49.0001.1111.1111.1111.00 ==> net 49.0002.1111.1111.1111.00
and a new openfabric router, e.g.
router openfabric 1 ==> router openfabric 2
but my problem is the lo interface (an interface which is always up).
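One workaround that might be worth a try (untested sketch; the name dummy1 and the address are just examples): a dummy interface behaves like an extra loopback and is always up, so it could carry the second fabric's /32:

Code:
# /etc/network/interfaces - untested sketch, dummy1 and the address are examples
auto dummy1
iface dummy1 inet static
        pre-up ip link add dummy1 type dummy 2>/dev/null || true
        address 10.10.12.101/32
        post-down ip link del dummy1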

BR
Martin
 
Maybe we can find a solution together :)

I've added a second configuration (openfabric) to the nodes. Now it looks like this (node 1):

root@prox01:~# cat /etc/frr/frr.conf
# default to using syslog. /etc/rsyslog.d/45-frr.conf places the log in
# /var/log/frr/frr.log
#
# Note:
# FRR's configuration shell, vtysh, dynamically edits the live, in-memory
# configuration while FRR is running. When instructed, vtysh will persist the
# live configuration to this file, overwriting its contents. If you want to
# avoid this, you can edit this file manually before starting FRR, or instruct
# vtysh to write configuration to a different file.

frr defaults traditional
hostname prox01
log syslog warning
ip forwarding
no ipv6 forwarding
service integrated-vtysh-config
!
interface lo
ip address 10.10.11.101/32
ip router openfabric 1
openfabric passive
!
interface enp65s0f0np0
ip router openfabric 1
openfabric csnp-interval 2
openfabric hello-interval 1
openfabric hello-multiplier 2
!
interface enp65s0f1np1
ip router openfabric 1
openfabric csnp-interval 2
openfabric hello-interval 1
openfabric hello-multiplier 2
!
line vty
!
router openfabric 1
net 49.0001.1111.1111.1111.00
lsp-gen-interval 1
max-lsp-lifetime 600
lsp-refresh-interval 180
!
interface lo1
ip address 10.10.12.101/32
ip router openfabric 2
openfabric passive
!
interface enp1s0f0
ip router openfabric 2
openfabric csnp-interval 2
openfabric hello-interval 1
openfabric hello-multiplier 2
!
interface enp1s0f1
ip router openfabric 2
openfabric csnp-interval 2
openfabric hello-interval 1
openfabric hello-multiplier 2
!
line vty
!
router openfabric 2
net 49.0002.1111.1111.1111.00
lsp-gen-interval 1
max-lsp-lifetime 600
lsp-refresh-interval 180


I can't test it until tomorrow because there are no cables on the NICs yet and I'm not on site...


Another question:

Do you get the same error on NIC restart (after modification) in the web GUI:

(screenshot: error message in the web GUI)

or in detail:

(screenshot: detailed error output)

Last 2 lines from /etc/network/interfaces (all others are NICs and bonds):

source /etc/network/interfaces.d/*
post-up /usr/bin/systemctl restart frr.service


Hey, did you bring up your new loopback interface (lo1 in my case) in /etc/network/interfaces? Maybe you must add these lines (I don't know
if it's necessary, but it sounds logical):

auto lo1
iface lo1 inet loopback

Best regards...

itiser
 
I've tested every possible variation but I can't get it to work...
 
I've tested every possible variation but I can't get it to work...
You have two options here. Unless you have a NEED for openfabric, just use Linux networking and the instructions provided by the docs (https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server). If you do want to use openfabric, consider posting on their forum/support resources.

But the bigger question is: to what end? You only have two NICs on your nodes, both of which are necessary for the mesh; what do you hope to gain? If you wish to avoid contention, use a different interface (even 1G) and call it a day.
 
@alexskysilk
we are using 4 NICs in our nodes and we want to set up 2 separate/independent mesh networks (one for PVE cluster/Corosync and one for Ceph)

Code:
Cluster/Corosync


                 VPVE01
                |---|---|
                | e | e |
                | n | n |
                | s | s |
                | 1 | 2 |
                | 9 | 0 |
                |---|---|
 VPVE02           |   |          VPVE03
|-----------|     |   |         |-----------|
| e n s 1 9 |-----|-| |---------| e n s 1 9 |
|-----------|     | |           |-----------|
| e n s 2 0 |-----| |-----------| e n s 2 0 |
|-----------|                   |-----------|





CEPH
                 VPVE01
                |---|---|
                | e | e |
                | n | n |
                | s | s |
                | 2 | 2 |
                | 1 | 2 |
                |---|---|
 VPVE02           |   |          VPVE03
|-----------|     |   |         |-----------|
| e n s 2 1 |-----|-| |---------| e n s 2 1 |
|-----------|     | |           |-----------|
| e n s 2 2 |-----| |-----------| e n s 2 2 |
|-----------|                   |-----------|

@mhert
No good news, but it was the same for me.
I got it "working" by deleting the lo device in the config and setting the corresponding IP addresses on every interface.
As I figured out, only one OpenFabric network is possible within this setup, so only one router.
If I set up all 4 NICs in the same fabric net and set the IP addresses on each NIC (the same IP for 2 NICs), then it works correctly, but
if e.g. one link is down (host A to B, NIC LAN A), the traffic goes over host A to B, NIC LAN B. From the "network" side this makes sense and is the best routing, but it is not what we want; here we have ONE mesh with 4 NICs in each node. Maybe we can adjust this with a metric/cost/priority parameter, but I do not think we will get this working here.
I am testing my setup in a nested PVE cluster and all NICs have the same speed there. So maybe with different speeds the automatic routing is different.
I do not think we will achieve success here.
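If someone wants to try the metric idea anyway, fabricd has a per-interface metric command; a rough, untested sketch (interface names taken from the earlier example, the value 500 is arbitrary) to make one NIC pair less attractive than the other would be:

Code:
! untested sketch - raise the metric on the links that should only be used as backup
interface enp1s0f0
 ip router openfabric 1
 openfabric metric 500
!
interface enp1s0f1
 ip router openfabric 1
 openfabric metric 500

But as said above, I doubt this gives a clean separation of the two meshes.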

In the fabricd docs:
https://docs.frrouting.org/en/latest/fabricd.html
you will see "... a routing protocol derived from IS-IS ...".

The IS-IS routing protocol is also implemented in FRR:
https://docs.frrouting.org/en/latest/isisd.html
and at a quick glance the parameters and structure are very similar.

My next step is to try building a fabric mesh with 2 NICs (as described in the Proxmox docs and working, 49.0001.XXXX...) and an additional config in frr.conf for IS-IS routing (with a different network, e.g. 49.0002.XXXX...).
In the first try I will implement a lo interface (btw,
Code:
auto lo1
iface lo1 inet loopback
is not working for me; no lo1 is coming up. Working with aliases, e.g. lo:1, is also not working because the interface shows down in vtysh "show interface"),
in both setups with an IP, and if this is not working,
no lo interface and setting up the IP address on every interface.
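A rough sketch of what such an additional isisd section could look like (completely untested; the tag CORO and the NIC names are placeholders, and isisd also has to be enabled in /etc/frr/daemons):

Code:
! untested sketch - second mesh handled by isisd instead of a second fabricd instance
interface enp1s0f0
 ip router isis CORO
!
interface enp1s0f1
 ip router isis CORO
!
router isis CORO
 net 49.0002.1111.1111.1111.00
 is-type level-2-only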

Unfortunately I won't have time to test until next week.

BR
Martin
 

@alexskysilk


I have 8 interfaces per node (2x 25G / 2x 10G / 2x 1G / 2x 1G) and I want to avoid the use of a switch for Ceph and cluster/corosync, as it reduces the points of failure (and there is no need for an external connection).

So I want 2 separate frr routers for Ceph (25G) and cluster/corosync (1G) to build a full mesh for each.

Is this possible with different namespaces (and loopback interfaces), or am I on the wrong path?
 
I can't speak to doing it with frr (never tried),

but it's relatively simple to do with Linux networking.

Here is a sample interfaces file (based on @admartinator's interfaces; IP ranges are arbitrary):
Code:
# Node 1

# Corosync-n2 connection
auto ens19
iface ens19 inet static
        address  10.15.15.50
        netmask  255.255.255.0
        up ip route add 10.15.15.51/32 dev ens19
        down ip route del 10.15.15.51/32

# Corosync-n3 connection
auto ens20
iface ens20 inet static
        address  10.15.15.50
        netmask  255.255.255.0
        up ip route add 10.15.15.52/32 dev ens20
        down ip route del 10.15.15.52/32

# Ceph-n2 connection
auto ens21
iface ens21 inet static
        address  10.15.16.50
        netmask  255.255.255.0
        up ip route add 10.15.16.51/32 dev ens21
        down ip route del 10.15.16.51/32

# Ceph-n3 connection
auto ens22
iface ens22 inet static
        address  10.15.16.50
        netmask  255.255.255.0
        up ip route add 10.15.16.52/32 dev ens22
        down ip route del 10.15.16.52/32

Repeat for the other nodes (adjust the interface names and addresses); for example, node 2's corosync stanzas could look like the sketch below.
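(Sketch only, following the same pattern; which NIC faces which neighbor depends on your cabling.)

Code:
# Node 2 - corosync connections (mirror of the node 1 layout)

# Corosync-n1 connection
auto ens19
iface ens19 inet static
        address  10.15.15.51
        netmask  255.255.255.0
        up ip route add 10.15.15.50/32 dev ens19
        down ip route del 10.15.15.50/32

# Corosync-n3 connection
auto ens20
iface ens20 inet static
        address  10.15.15.51
        netmask  255.255.255.0
        up ip route add 10.15.15.52/32 dev ens20
        down ip route del 10.15.15.52/32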
 
Thanks alexskysilk,
but this is the "simple routed" solution from https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server
This cannot handle the loss of one connection.
Here you can run into trouble (especially with Ceph, depending on where the active services are) if e.g.
the connection between A and B is broken:
C can see both A and B, so for C everything is OK, while
A cannot see B and B cannot see A,
although A and B can both still see C.

This is the reason why we want to set up 2 separate mesh networks with frr.

As already written above

"The solution for me is right now to use the openfabric mesh for ceph and use a second RSTP mesh for pve cluster corosync."

I think I will stick with this solution and live with the slower performance of the corosync/migration network with RSTP.
All other solutions may be good for a lab, but not for production use.
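For reference, the RSTP mesh mentioned above follows the "RSTP Loop Setup" with Open vSwitch from the same wiki page; stripped down it looks roughly like this (sketch only, NIC names and IPs are examples, the second mesh port is configured the same way as ens19):

Code:
auto ens19
iface ens19 inet manual
        ovs_type OVSPort
        ovs_bridge vmbr1
        ovs_options other_config:rstp-enable=true other_config:rstp-path-cost=150 other_config:rstp-port-admin-edge=false other_config:rstp-port-auto-edge=false other_config:rstp-port-mcheck=true

auto vmbr1
iface vmbr1 inet static
        address 10.15.15.50/24
        ovs_type OVSBridge
        ovs_ports ens19 ens20
        up ovs-vsctl set Bridge ${IFACE} rstp_enable=true other_config:rstp-priority=32768 other_config:rstp-forward-delay=4 other_config:rstp-max-age=6
        post-up sleep 10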

BR
Martin
 
@admartinator

Did you read my question above?


Do you get the same error on NIC restart (after modification) in the web GUI:

(screenshot: error message in the web GUI)

or in detail:

(screenshot: detailed error output)

Last 2 lines from /etc/network/interfaces (all others are NICs and bonds):

source /etc/network/interfaces.d/*
post-up /usr/bin/systemctl restart frr.service
 
@mhert
Yes, I did, but I thought you had solved it because you were able to test every possible config.

I do not get the error; I guess the problem here is your /etc/network/interfaces:

Code:
auto lo1
iface lo1 inet loopback

Try running ifreload -a manually via SSH; there you will see more details...
 
By the way:

Can someone tell me which traffic goes through which connection in a cluster?

Through which network does the traffic of (built-in) backup / corosync / cluster (same as corosync?) / migration go out of the box?

Is there a useful network diagram of a Proxmox cluster with Ceph?
 
I reverted the "lo1" thing. That could not be the problem.


As mentioned in the manual, you have to add the line

"post-up /usr/bin/systemctl restart frr.service"

to /etc/network/interfaces to reload the service after config changes in the GUI.

And this throws an error ("ifreload -a" is automatically executed by clicking "Apply configuration" after modifying the network config in the GUI).


https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#/etc/network/interfaces
 
When I run "ifreload -a" in the shell I get the same error as mentioned above (nothing more),

but when I execute "/usr/bin/systemctl restart frr.service" everything seems to be OK.

Didn't you add the line to your config?
 
During testing with faulty configs, too many reloads of the frr service can also cause the service to not start anymore because of too many restarts (systemd's start limit).

Are you able to restart the service via SSH?

Adding

Code:
post-up /usr/bin/systemctl reset-failed frr.service
post-up /usr/bin/systemctl restart frr.service

solved the problem for me.
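To check whether frr has hit that start limit (and to clear it by hand), something like this via SSH will show it:

Code:
# see why frr refuses to start, clear the start-limit counter, then start it again
systemctl status frr.service
systemctl reset-failed frr.service
systemctl restart frr.service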
 
I solved the problem by changing the order of the commands:

Not OK:

source /etc/network/interfaces.d/*
post-up /usr/bin/systemctl restart frr.service

OK:

post-up /usr/bin/systemctl restart frr.service
source /etc/network/interfaces.d/*


P.S. I didn't add the line "source ..."...
 
