Dear Community,
I'd like to share with you my recent discoveries.
For a while, I've had a few hardware components lying around to provide 10GbE connectivity between the 3 Proxmox servers in my cluster.
Obviously, it was time to upgrade from 2.5GbE to 10GbE.
Unfortunately, I'm still waiting for an efficient and affordable 2.5/10GbE switch to appear on the market.
In the meantime, let's build the cluster with full mesh (routed setup) connectivity, with all VMs bridged within an EVPN/VxLan managed by Proxmox SDN on top of it.
Note: The following guidelines require some advanced networking knowledge; I tried to simplify as much as possible.
Infrastructure
Rich (BB code):
┌────────────────────────┐
│ Node1 │
├────────┬────────┬──────┤
│enp2s0f0│enp2s0f1│ vmbr0├───────────────┐
└─────┬──┴──┬─────┴──────┘ |
│ │ |
┌───────┬─────┐ │ │ ┌─────┬───────┐ |
│ │ eno1├────────┘ └────────┤eno1 │ │ |
│ Node2 ├─────┤ ├─────┤ Node3 │ |
│ │ eno2├───────────────────────┤eno2 │ │ |
| ├─────┤ ├─────┤ | |
│ |vmbr0| |vmbr0| | |
└───────┴──┬──┘ └──┬──┴───────┘ |
| | |
| | |
└───────┐ ┌────────────┘ |
| | |
| | ┌────────────────────┘
| | |
┌────────────────────────┐
│ SW │
└────────────────────────┘
Node Name | Management IP | Bridge | NIC 1 Name | NIC 2 Name |
---|---|---|---|---|
Node 1 | 192.168.0.100 | vmbr0 | enp2s0f0 | enp2s0f1 |
Node 2 | 192.168.0.101 | vmbr0 | eno1 | eno2 |
Node 3 | 192.168.0.102 | vmbr0 | eno1 | eno2 |
Step 1: Prepare the underlying network with OpenFabric
Follow the Proxmox guide Full Mesh Network for Ceph Server with a few adaptations below.
OpenFabric extends the IS-IS protocol and provides an efficient link-state routing protocol between nodes without flooding the network.
Depending on your Proxmox version, you may need to install FRR on each node with the following command.
Code:
apt install frr
Update the FRR daemon settings in "/etc/frr/daemons" to enable the OpenFabric daemon.
Code:
[...]
fabricd=yes
[...]
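After enabling the daemon, restart FRR so the change takes effect (assuming FRR is managed through the usual systemd "frr" service):
Bash:
systemctl restart frr.service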
Important note: the FRR settings in "/etc/frr/frr.conf" are overridden by Proxmox SDN, which is why they are "not" compatible with Proxmox EVPN.
However, it is possible to add local settings in "/etc/frr/frr.conf.local", which Proxmox SDN handles well.
Create the local FRR config file and update the PVE interface definitions on each node according to the table below.
Node Name | Loopback IP (<lo_IP>) | OpenFabric Network ID (<o_NID>) | NIC 1 Name (<NIC1>) | NIC 2 Name (<NIC2>) | NIC MTU (<MTU>) |
---|---|---|---|---|---|
Node 1 | 172.16.0.1/32 | 49.0001.1111.1111.1111.00 | enp2s0f0 | enp2s0f1 | 9000 |
Node 2 | 172.16.0.2/32 | 49.0001.2222.2222.2222.00 | eno1 | eno2 | 9000 |
Node 3 | 172.16.0.3/32 | 49.0001.3333.3333.3333.00 | eno1 | eno2 | 9000 |
Update "/etc/frr/frr.conf" and create "/etc/frr/frr.conf.local" based on the following template on all nodes:
Code:
interface lo
 ip address <lo_IP>
 ip router openfabric 1
 openfabric passive
!
interface <NIC1>
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
interface <NIC2>
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
line vty
!
router openfabric 1
 net <o_NID>
 lsp-gen-interval 1
 max-lsp-lifetime 600
 lsp-refresh-interval 180
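A simple way to create the local file is to copy the main one after filling it in (done on each node); Proxmox SDN should then merge the local settings back whenever it regenerates "/etc/frr/frr.conf":
Bash:
cp /etc/frr/frr.conf /etc/frr/frr.conf.local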
Update "/etc/network/interfaces" based on the following on all nodes:
Code:
[...]
auto <NIC1>
iface <NIC1> inet static
    mtu <MTU>

auto <NIC2>
iface <NIC2> inet static
    mtu <MTU>
[...]
post-up /usr/bin/systemctl restart frr.service
source /etc/network/interfaces.d/*
Note: Adjust the MTU according to the lowest capability among all your interconnected NICs.
Apply all changes on all nodes, without rebooting, by running the following command.
Bash:
ifreload -a
Check the results on one of your nodes with the following FRR command.
Bash:
vtysh -c 'show openfabric route'
Code:
Area 1:
IS-IS L2 IPv4 routing table:
Prefix Metric Interface Nexthop Label(s)
--------------------------------------------------------
172.16.0.1/32 0 - - -
172.16.0.2/32 20 enp2s0f0 172.16.0.2 -
172.16.0.3/32 20 enp2s0f1 172.16.0.3 -
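Optionally, verify that jumbo frames actually pass between the nodes: a non-fragmenting ping sized at the 9000-byte MTU minus 28 bytes of IP/ICMP headers should succeed (example from Node 1 towards Node 2's loopback; adjust the IP and size to your MTU):
Bash:
ping -M do -s 8972 -c 3 172.16.0.2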
Step 2: Set up your EVPN
2.1 - Create an EVPN Controller
Behind the scenes, an EVPN Controller is a BGP instance which manages routes within tunnelled networks (in this case, VxLan networks).
- Open your Proxmox Admin web UI
- Open Datacenter > SDN > Options section
- Add an EVPN Controller
- ID: myEVPN
- ASN: 65000
(The BGP ASN number must be within a private range not already used within your network)
- Peers: 172.16.0.1, 172.16.0.2, 172.16.0.3
(All node loopback IPs)
2.2 - Create an EVPN zone
An EVPN zone is a VxLan zone whose routing is handled by an EVPN Controller.
- Open your Proxmox Admin web UI
- Open Datacenter > SDN > Zones section
- Add an EVPN zone
- ID: evpnPRD
- Controller: myEVPN
- VRF-VXLAN Tag: 10000
- MTU: 8950
Important Notes:
- The EVPN "Primary Exit Node" seems to be required within the Web UI, select one of your nodes which will carry on outgoing EVPN traffic.
If you don't want PVE handles outgoing traffic directly, make sure you do not configure any related VNet's subnet gateway. - Adjust the MTU according to your NIC's MTU defined previously minus 50 bytes if under IPv4, minus 70 under IPv6.
MTU Considerations for VXLAN
2.3 - Create a VxLan VNet
- Open your Proxmox Admin web UI
- Open Datacenter > SDN > VNets section
- Add a VNet
- Name: vxnet1
- Zone: evpnPRD
- Tag: 10500 (VxLAN ID)
2.4 - Add subnets within your VxLan VNet
Follow the Proxmox SDN documentation SDN Controllers - Subnets.
Important Note: If you don't want PVE to handle outgoing traffic directly, make sure you do not configure a gateway on any related VNet's subnet.
2.5 - Apply SDN changes to all your nodes
- Open your Proxmox Admin web UI
- Open Datacenter > SDN
- Click on Apply
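After applying, Proxmox SDN regenerates "/etc/frr/frr.conf" on every node; you can quickly inspect the generated BGP section before moving on (plain grep, nothing Proxmox-specific):
Bash:
grep -A 15 'router bgp' /etc/frr/frr.conf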
2.6 - Fixing up FRR config
The Proxmox SDN EVPN plugin does not seem to resolve properly the loopback IPs provided in the EVPN Controller, which results in a messed-up FRR config file.
On each node, update "/etc/frr/frr.conf" as follows, based on the table in Step 1.
- bgp router-id XXXX.XXXX.XXXX => must be the IP of the loopback address defined for OpenFabric
- neighbor XXXX.XXXX.XXXX => one line per remaining neighbor
Code:
[...]
interface lo
 ip address 172.16.0.1/32
 ip router openfabric 1
 openfabric passive
[...]
router bgp 65000
 bgp router-id 172.16.0.1
 no bgp hard-administrative-reset
 no bgp graceful-restart notification
 no bgp default ipv4-unicast
 coalesce-time 1000
 neighbor VTEP peer-group
 neighbor VTEP remote-as 65000
 neighbor VTEP bfd
 neighbor 172.16.0.2 peer-group VTEP
 neighbor 172.16.0.3 peer-group VTEP
[...]
router bgp 65000 vrf vrf_evpnPRD
 bgp router-id 172.16.0.1
 no bgp hard-administrative-reset
 no bgp graceful-restart notification
exit
[...]
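Once the file is corrected, restart FRR on each node so it picks up the fixed configuration, then confirm the router-id and neighbors from the running config:
Bash:
systemctl restart frr.service
vtysh -c 'show running-config' | grep -E 'router-id|neighbor 172.16'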
Step 3: Check connectivity
- From one node, ping all other nodes using their loopback IPs defined in Step 1
- Check EVPN Controller with the command:
vtysh -c 'show bgp summary'
You should see all neighbors. Example from Node 1:
Code:
L2VPN EVPN Summary (VRF default):
BGP router identifier 172.16.0.1, local AS number 65000 vrf-id 0
BGP table version 0
RIB entries 11, using 2112 bytes of memory
Peers 2, using 1449 KiB of memory
Peer groups 1, using 64 bytes of memory

Neighbor            V    AS  MsgRcvd  MsgSent  TblVer  InQ OutQ  Up/Down State/PfxRcd  PfxSnt Desc
Node2(172.16.0.2)   4 65000    50922    50841       0    0    0 1d18h20m            5       4 N/A
Node3(172.16.0.3)   4 65000    50886    50834       0    0    0 1d18h20m            8       4 N/A

Total number of neighbors 2
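If you want to dig deeper, FRR can also list the learned EVPN routes and the local VNIs/VTEPs (standard vtysh commands; the exact entries depend on your VNets and running VMs):
Bash:
vtysh -c 'show bgp l2vpn evpn'
vtysh -c 'show evpn vni'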
- Create VMs (KVM or LXC) on each node
- Attach their NICs to vxnet1
- Manually assign an IP within the range defined in Step 2.4
- Try to ping each VM
Side notes
Packet loss
In case of disappearing packets or wrong CRC checks within virtual machines, check your NIC hardware on each node. It's important to know that some cards integrate nice features for tunneled links, but they can be a nightmare to troubleshoot if they are not identical between nodes (e.g. Intel vs Mellanox vs Emulex).
In my case, I was playing with virtual routers (VyOS, Bird, Calico, Cilium) for a complex topology on top of the PVE VxLan networks, involving BGP, VRRP, LDP, eBPF, and so on.
I had 2 PVE hosts with an Intel X520 (chip: Intel 82599ES) and one with an HPE 557SFP+ (chip: Emulex Skyhawk).
The root cause of my problems was the HPE NIC on "Node 1", which implements the VxLan UDP checksum offload feature (rx-udp_tunnel-port-offload), whereas the others don't.
To fix my issue, I had to disable it on the HPE NIC with the following commands.
Bash:
ethtool -K enp2s0f0 rx-udp_tunnel-port-offload off
ethtool -K enp2s0f1 rx-udp_tunnel-port-offload off
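To check whether a given NIC exposes this offload at all, list its offload features first (interface name is from my setup, adjust to yours):
Bash:
ethtool -k enp2s0f0 | grep -i udp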
Proxmox firewall & OpenVSwitch conflicts
Attention: if you use the Proxmox firewall, VM NICs will be handled by OpenVSwitch, and your VMs attached to EVPN VNets will suffer from weird behaviors.
OpenFabric warnings
You will see some of the following messages in syslog.
Code:
fabricd[1234]: [QBAZ6-3YZR3] OpenFabric: Could not find two T0 routers
I didn't have enough time to troubleshoot that point.
That early draft-white-openfabric-06.txt protocol seems to require a spine/leaf network topology.
As we are using OpenFabric with directly attached nodes and single loopback IPs, it makes sense that this message shows up.
Have fun.