[TUTORIAL] [Full mesh (routed setup) + EVPN] It is feasible, even using SDN!

vherrlein

Member
Feb 1, 2022
Dear Community,

I'd like to share with you my recent discoveries.

For a while, I had a few hardware components lying around to provide 10 GbE connectivity between the 3 Proxmox servers of my cluster.
Obviously, it was time to upgrade from 2.5 GbE to 10 GbE.
Unfortunately, I'm still waiting for an efficient and affordable 2.5/10 GbE switch to show up on the market.

In the meantime, let's build the cluster with full mesh (routed setup) connectivity and bridge all VMs within an EVPN/VXLAN managed by Proxmox SDN on top of it.

Note: The following guidelines require some advanced networking knowledge; I tried to simplify as much as possible.

Infrastructure​

Rich (BB code):
                 ┌────────────────────────┐
                 │          Node1         │
                 ├────────┬────────┬──────┤
                 │enp2s0f0│enp2s0f1│ vmbr0├───────────────┐
                 └─────┬──┴──┬─────┴──────┘               |
                       │     │                            |
┌───────┬─────┐        │     │        ┌─────┬───────┐     |
│       │ eno1├────────┘     └────────┤eno1 │       │     |
│ Node2 ├─────┤                       ├─────┤ Node3 │     |
│       │ eno2├───────────────────────┤eno2 │       │     |
|       ├─────┤                       ├─────┤       |     |
│       |vmbr0|                       |vmbr0|       |     |
└───────┴──┬──┘                       └──┬──┴───────┘     |
           |                             |                |
           |                             |                |
           └───────┐        ┌────────────┘                |
                   |        |                             |
                   |        |        ┌────────────────────┘
                   |        |        |
                ┌────────────────────────┐
                │           SW           │
                └────────────────────────┘

Node Name | Management IP | NIC 1 Name | NIC 2 Name | NIC 3 Name
Node 1    | 192.168.0.100 | vmbr0      | enp2s0f0   | enp2s0f1
Node 2    | 192.168.0.101 | vmbr0      | eno1       | eno2
Node 3    | 192.168.0.102 | vmbr0      | eno1       | eno2

Step 1: Prepare the underlying network with OpenFabric​

Follow the Proxmox guide Full Mesh Network for Ceph Server with a few adaptations described below.
OpenFabric extends the IS-IS protocol and provides an efficient link-state routing protocol between the nodes without flooding the network.

Depending on your Proxmox version, you may need to install FRR on each node with the following command.

Code:
apt install frr
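
Optionally, you can do a quick sanity check that FRR is installed and its service is running:
Bash:
systemctl status frr
vtysh -c 'show version'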

Update the FRR daemon settings within "/etc/frr/daemons" to enable the OpenFabric daemon.

Code:
[...]
fabricd=yes
[...]
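
Then restart the FRR service so the newly enabled fabricd daemon actually starts:
Bash:
systemctl restart frr.service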

Important note: "/etc/frr/frr.conf" gets overridden by Proxmox SDN, which is why a plain OpenFabric setup in that file is not compatible with Proxmox EVPN. However, it is possible to add local settings in "/etc/frr/frr.conf.local", which Proxmox SDN handles well.

Create the local FRR config file and update the PVE interface definitions on each node according to the table below.

Node Name | Loopback IP (<lo_IP>) | OpenFabric Network ID (<o_NID>) | NIC Name 1 (<NIC1>) | NIC Name 2 (<NIC2>) | NIC MTU (<MTU>)
Node 1    | 172.16.0.1/32         | 49.0001.1111.1111.1111.00       | enp2s0f0            | enp2s0f1            | 9000
Node 2    | 172.16.0.2/32         | 49.0001.2222.2222.2222.00       | eno1                | eno2                | 9000
Node 3    | 172.16.0.3/32         | 49.0001.3333.3333.3333.00       | eno1                | eno2                | 9000

Update "/etc/frr/frr.conf" and create "/etc/frr/frr.conf.local" based on the following template on all nodes:
Code:
interface lo
 ip address <lo_IP>
 ip router openfabric 1
 openfabric passive
!
interface <NIC1>
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
interface <NIC2>
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
line vty
!
router openfabric 1
 net <o_NID>
 lsp-gen-interval 1
 max-lsp-lifetime 600
 lsp-refresh-interval 180
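
For example, once filled in with the values from the table, the file on Node 1 looks like this (Node 2 and Node 3 follow the same pattern with their own values):
Code:
interface lo
 ip address 172.16.0.1/32
 ip router openfabric 1
 openfabric passive
!
interface enp2s0f0
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
interface enp2s0f1
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
line vty
!
router openfabric 1
 net 49.0001.1111.1111.1111.00
 lsp-gen-interval 1
 max-lsp-lifetime 600
 lsp-refresh-interval 180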

Update "/etc/network/interfaces" based on the following on all nodes:
Code:
[...]
auto <NIC1>
iface <NIC1> inet static
        mtu <MTU>

auto <NIC2>
iface <NIC2> inet static
        mtu <MTU>
[...]
post-up /usr/bin/systemctl restart frr.service

source /etc/network/interfaces.d/*

Note: Adjust the MTU to the lowest MTU supported by all of your interconnected NICs.

Apply all changes, without rebooting, by running the following command on all nodes.
Bash:
ifreload -a

Check the result on one of your nodes with the following FRR command.
Bash:
vtysh -c 'show openfabric route'
Code:
Area 1:
IS-IS L2 IPv4 routing table:

 Prefix         Metric  Interface  Nexthop     Label(s)
 --------------------------------------------------------
 172.16.0.1/32  0       -          -           -
 172.16.0.2/32  20      enp2s0f0   172.16.0.2  -
 172.16.0.3/32  20      enp2s0f1   172.16.0.3  -

Step 2: Set up your EVPN

2.1 - Create an EVPN Controller

In the background, an EVPN controller is a BGP instance which manages the routes within tunnelled networks (in this case, VXLAN networks).
  • Open your Proxmox Admin web UI
  • Open Datacenter > SDN > Options section
  • Add an EVPN Controller
    • ID: myEVPN
    • ASN: 65000
      (BGP ASN number must be within a private range not already used within your network)
    • Peers: 172.16.0.1, 172.16.0.2, 172.16.0.3
      (All node loopback IPs)
cf. Proxmox SDN documentation SDN Controllers - EVPN Controller
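
For reference, the controller ends up in the cluster-wide SDN configuration; the resulting entry in "/etc/pve/sdn/controllers.cfg" should look roughly like this (a sketch; exact key names may vary with your PVE version):
Code:
evpn: myEVPN
        asn 65000
        peers 172.16.0.1, 172.16.0.2, 172.16.0.3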

2.2 - Create an EVPN zone​

An EVPN zone is a VXLAN zone for which the routing is handled by an EVPN controller.
  • Open your Proxmox Admin web UI
  • Open Datacenter > SDN > Zones section
  • Add an EVPN zone
    • ID: evpnPRD
    • Controller: myEVPN
    • VRF-VXLAN Tag: 10000
    • MTU: 8950
cf. Proxmox SDN documentation SDN Controllers - EVPN Zone
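
For reference, the zone should show up in "/etc/pve/sdn/zones.cfg" roughly as follows (a sketch; the exit-node related keys are omitted here and the exact key names may vary with your PVE version):
Code:
evpn: evpnPRD
        controller myEVPN
        vrf-vxlan 10000
        mtu 8950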

Important Notes:
  • The EVPN "Primary Exit Node" seems to be required by the Web UI; select one of your nodes, which will carry the outgoing EVPN traffic.
    If you don't want PVE to handle outgoing traffic directly, make sure you do not configure any gateway on the related VNet's subnets.
  • Adjust the MTU to the NIC MTU defined previously minus 50 bytes for IPv4 (minus 70 bytes for IPv6); with the 9000-byte NIC MTU above, that gives 9000 - 50 = 8950.
    MTU Considerations for VXLAN

2.3 - Create a VxLan VNet​

  • Open your Proxmox Admin web UI
  • Open Datacenter > SDN > VNets section
  • Add a VNet
    • Name: vxnet1
    • Zone: evpnPRD
    • Tag: 10500 (VxLAN ID)
cf. Proxmox SDN documentation SDN Controllers - VNets
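
For reference, the VNet should show up in "/etc/pve/sdn/vnets.cfg" roughly as follows (a sketch; key names may vary with your PVE version):
Code:
vnet: vxnet1
        zone evpnPRD
        tag 10500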

2.4 - Add subnets within your VxLan VNet​

Follow the Proxmox SDN documentation SDN Controllers - Subnets

Important Note: If you don't want PVE to handle outgoing traffic directly, make sure you do not configure any gateway on the related VNet's subnets.
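
As an illustration only, a subnet entry in "/etc/pve/sdn/subnets.cfg" looks roughly like the following; the 10.10.0.0/24 prefix is just a placeholder, and the gateway line is exactly what you would leave out if PVE should not route outgoing traffic:
Code:
subnet: evpnPRD-10.10.0.0-24
        vnet vxnet1
        gateway 10.10.0.1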

2.5 - Apply SDN changes to all your nodes​

  • Open your Proxmox Admin web UI
  • Open Datacenter > SDN
  • Click on Apply
Wait until all changes are applied to your nodes.
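
Once applied, each node regenerates its FRR configuration and reloads FRR. You can inspect the resulting configuration on a node before moving on:
Bash:
vtysh -c 'show running-config'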

2.6 - Fixing up FRR config​

The Proxmox SDN EVPN plugin does not seem to resolve the loopback IPs provided in the EVPN controller properly, which ends up messing up the FRR config file.

On each node, update "/etc/frr/frr.conf" as follows, based on the Step 1 table.

  • bgp router-id XXXX.XXXX.XXXX => must be the IP of the loopback address defined for OpenFabric
  • neighbor XXXX.XXXX.XXXX => one line per remaining neighbor
Sample for Node 1:

Code:
[...]
interface lo
 ip address 172.16.0.1/32
 ip router openfabric 1
 openfabric passive
[...]
router bgp 65000
 bgp router-id 172.16.0.1
 no bgp hard-administrative-reset
 no bgp graceful-restart notification
 no bgp default ipv4-unicast
 coalesce-time 1000
 neighbor VTEP peer-group
 neighbor VTEP remote-as 65000
 neighbor VTEP bfd
 neighbor 172.16.0.2 peer-group VTEP
 neighbor 172.16.0.3 peer-group VTEP
[...]
router bgp 65000 vrf vrf_evpnPRD
 bgp router-id 172.16.0.1
 no bgp hard-administrative-reset
 no bgp graceful-restart notification
exit
[...]
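
After editing the file, restart FRR on the node so the corrected router-id and neighbor statements are taken into account:
Bash:
systemctl restart frr.service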

Step 3: Check connectivity​

  1. From one node, ping all the other nodes on their loopback IPs defined in Step 1
  2. Check EVPN Controller with the command: vtysh -c 'show bgp summary'
    You should see all neighbors, example from Node 1:
    Code:
    L2VPN EVPN Summary (VRF default):
    BGP router identifier 172.16.0.1, local AS number 65000 vrf-id 0
    BGP table version 0
    RIB entries 11, using 2112 bytes of memory
    Peers 2, using 1449 KiB of memory
    Peer groups 1, using 64 bytes of memory
    
    Neighbor           V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
    Node2(172.16.0.2) 4      65000     50922     50841        0    0    0 1d18h20m            5        4 N/A
    Node3(172.16.0.3) 4      65000     50886     50834        0    0    0 1d18h20m            8        4 N/A
    
    Total number of neighbors 2
  3. Create VMs (KVM or LXC) on each node
    1. Attach their NICs to vxnet1
    2. Manually assign an IP within the range defined in Step 2.4
    3. Try to ping each VM (see also the MTU check below)
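
To also validate that the VXLAN MTU is honoured end to end, you can send a full-size, non-fragmentable ping between two VMs. Assuming the guest NICs are set to the VNet MTU of 8950 from Step 2.2, the ICMP payload is 8950 - 20 (IP header) - 8 (ICMP header) = 8922 bytes; the target address below is just a placeholder from your own subnet:
Bash:
ping -M do -s 8922 10.10.0.12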

Side notes​

Packets lost​

In case of disappearing packets or wrong CRC checks within the virtualized machines, check your NIC hardware on each node. It's important to know that some cards integrate nice offloading features for tunnelled links, but it can be a nightmare to troubleshoot if they are not identical between nodes (e.g. Intel vs Mellanox vs Emulex).
In my case, I was playing with virtual routers (VyOS, Bird, Calico, Cilium) for a complex topology on top of the PVE VXLANs, which involves BGP, VRRP, LDP, eBPF, and so on.
I had 2 PVE hosts with an Intel X520 (chip: Intel 82599ES) and one with an HPE 557SFP+ (chip: Emulex Skyhawk).
The root cause of my problems was the HPE NIC on "Node 1", which implements the VXLAN UDP checksum offload feature (rx-udp_tunnel-port-offload), whereas the others don't.
To fix my issue, I had to disable it on the HPE NIC with the following commands.

Bash:
ethtool -K enp2s0f0 rx-udp_tunnel-port-offload off
ethtool -K enp2s0f1 rx-udp_tunnel-port-offload off
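
Note that ethtool settings are not persistent across reboots. One way to reapply them at boot is to add a post-up line to the corresponding NIC stanzas from Step 1 in "/etc/network/interfaces", for example (a sketch for the first NIC of Node 1, same idea for the second one):
Code:
auto enp2s0f0
iface enp2s0f0 inet static
        mtu 9000
        post-up ethtool -K enp2s0f0 rx-udp_tunnel-port-offload off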

Proxmox firewall & OpenVSwitch conflicts​

Attention: if you use the Proxmox firewall, the VM NICs will be handled by OpenVSwitch, and your VMs attached to EVPN VNets will suffer from weird behaviours.

OpenFabric warnings​

You will see some of the following messages from syslog.
Code:
fabricd[1234]: [QBAZ6-3YZR3] OpenFabric: Could not find two T0 routers

I didn't have enough time to troubleshoot that point.
The early draft-white-openfabric-06.txt protocol seems to require a spine/leaf network topology.
As we are using OpenFabric with directly attached nodes and single loopback IPs, it makes sense that this message shows up.

Have fun.
 
It was mainly to try OpenFabric as the IGP protocol.
Of course, I could have used IS-IS or even OSPF, with their drawbacks too.
In my case, OpenFabric is the most flexible protocol for future topology changes and, more importantly, it doesn't flood the network.
 
Question: Have you tested/used OpenFabric with more than 2 nodes on the same interface (i.e. a VLAN/Ethernet switch)? I ran into a problem which I suspect falls outside of OpenFabric's design criteria: as I implemented your setup, it worked, but then I found troubles on the links where there are 3 nodes on the same link.
 
Well, in terms of topology, if you have a look at the first diagram, that is the case, but with 2 links.
In your case, if I understand well, you have a single hardware link on each node connected to a switch; if so, your setup would be even simpler by assigning the IP to your NIC instead of the loopback one, and not using OpenFabric at all.
Then your switch will do the rest by broadcasting ARP requests.
As a reminder, the IGP is only there to keep reachability between the nodes within an "unusual" network infrastructure.
 
Hello,

Your tutorial is really interesting. I applied it.

About "2.6 - Fixing up FRR config", I did not have to do this, it seems it does now work.
About "OpenFabric: Could not find two T0 routers", you can get ride of this message by adding "fabric-tier 0" in router openfabric 1 section.
As your nodes are on the same tier, 0 is probably the good numbers. If you had nodes + upstream exit-nodes (it is my case), you can do fabric-tier 1 for nodes and fabric-tier 0 for exit-nodes.

Code:
router openfabric 1
[...]
 fabric-tier 0
 
@gpoudrel nice catch.
I did the change in the local conf "/etc/frr/frr.conf.local" of each node and applied Proxmox SDN again, and then it worked fine :)
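
For Node 1, the relevant part of "/etc/frr/frr.conf.local" then becomes the following (a sketch; the other nodes use their own NET from the Step 1 table):
Code:
router openfabric 1
 net 49.0001.1111.1111.1111.00
 fabric-tier 0
 lsp-gen-interval 1
 max-lsp-lifetime 600
 lsp-refresh-interval 180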

Unfortunately, I can't update the original post with your discovery.

According to the OpenFabric specs, it means nodes with tier 0 are at the edge of the network, which in our case of a full mesh topology is perfect.
Just a warning: the spec requires 2 tier-0 routers in order to properly calculate router locations and reduce flooding.
https://datatracker.ietf.org/doc/html/draft-white-openfabric-06#section-4

The Proxmox guide Full Mesh Network for Ceph Server should also be updated.
 
As a reminder, the IGP is only there to keep reachability between the nodes within an "unusual" network infrastructure.
True,
I'm looking at it from the perspective of a switch/network link failure, so that is an extra link...

Oh, and the OpenFabric routing protocol doesn't work with more than 2 nodes on the same link - design criteria.
 
As a reminder, the IGP is only there to keep reachability between the nodes within an "unusual" network infrastructure.

Yes 100% correct!
But having only a single switch with LACP etc. is still a SPOF on the switch, thus the idea is/was to have the directly linked interfaces too, as we are moving towards a leaf-spine/"CLOS" design. Even more "fun" is when the switch ports disconnect: you "lose" (interface down) the IP, etc.

So yes, it is more about understanding the use and design criteria of each IGP ;)

The one thing I only noticed the day before last is that the OpenFabric DRAFT specifically mentions the removal of IS-IS specifics, as its criterion is direct links only, no "broadcast links" ;) (I missed it on my first scan of the draft ;( )
 
The one thing I only noticed the day before last is that the OpenFabric DRAFT specifically mentions the removal of IS-IS specifics, as its criterion is direct links only, no "broadcast links" ;) (I missed it on my first scan of the draft ;( )
To be even more precise:
Data center network fabrics only contain point-to-point links; because of this, there is no reason to support any broadcast link types […]
https://datatracker.ietf.org/doc/html/draft-white-openfabric-06#section-2.2
 
