I have a three-node Proxmox 9 cluster. Each node has a 1GbE NIC and a dual-25GbE NIC (ConnectX-4 Lx). I've been trying unsuccessfully for several days to use Proxmox's SDN capabilities to create a mesh network between the three nodes (by directly connecting the nodes with SFP28 cables) and get all of the following working:
1. Ceph storage running on the mesh network at 25GbE
2. Inter-VM communication running on the mesh network at 25GbE
3. VMs can reach the LAN + WAN
4. (e)BGP to my OPNsense router so that external clients can reach the VMs
5. (e)BGP with multiple BGP controllers / Exit Nodes (for HA)
6. VMs can talk to Ceph controller (e.g. for Kubernetes) at 25GbE
However, once I try to add multiple Exit Nodes and/or multiple BGP controllers, everything falls apart.
In the majority of my testing, I've been using a container 10.255.69.31 and VM 10.255.69.41 on prox01, and a container 10.255.69.34 and VM 10.255.69.44 on prox04 (all IPs within the VXLAN subnet).
The TL;DR is that no matter what I try, I will lose partial or all connectivity between my LAN and my VMs. The most consistent issue I run into is that e.g. I have VMs, 2x Exit Nodes and 2x BGP controllers running on prox01 and prox04, and on my OPNsense I'll receive routes that look something like:
| Valid | Best | Internal | Network | Next Hop | Metric | LocPrf | Weight | Path | Origin |
|---|---|---|---|---|---|---|---|---|---|
| y | y | n | 10.255.69.0/24 | 10.4.10.31 | 0 | | 0 | 65430 | ? |
| y | n | n | 10.255.69.0/24 | 10.4.10.34 | 0 | | 0 | 65430 | ? |
| y | y | n | 10.255.69.31/32 | 10.4.10.34 | 0 | | 0 | 65430 | IGP |
| y | y | n | 10.255.69.41/32 | 10.4.10.34 | 0 | | 0 | 65430 | IGP |
| y | y | n | 10.255.69.34/32 | 10.4.10.31 | 0 | | 0 | 65430 | IGP |
| y | y | n | 10.255.69.44/32 | 10.4.10.31 | 0 | | 0 | 65430 | IGP |
The /32 routes are all for the wrong nodes, and I can't ping any of these IPs from my desktop. Depending on the config, I'll be able to get a single ping off and then they'll stop responding. If I stop the ping and wait ~5 mins, I'll be able to do a single ping again.
I've tried literally hundreds of configurations (including changes to SDN (BGP, EVPN Zone etc.), host networking, Linux tunables etc.) to get this to work, including:
- EVPN controller + BGP controllers + OPNsense all same ASN
- EVPN controller + BGP controllers same ASN and OPNsense different
- EVPN controller + BGP controllers + OPNsense all different ASNs
- BGP controllers + OPNsense same ASN and EVPN controller different
- All BGP controllers unique ASNs, EVPN controller and OPNsense unique
- https://forum.proxmox.com/threads/bgp-controller-breaks-evpn.169177/post-789232
- https://forum.proxmox.com/threads/exitnode-local-routing-breaks-evpn-sdn.153139/
- https://forum.proxmox.com/threads/evpn-sdn-not-forwarding-traffic-to-host-with-ct.167174/post-797405
- https://forum.proxmox.com/threads/management-plan-vs-vm-on-overlay.162892/post-755005
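To make the /32 mismatch in the route table above concrete, here is a small sketch that compares each received /32 with the node that actually hosts the guest. The data is copied from this post. The names `GUEST_HOME`, `NODE_LAN_IP`, `RECEIVED` and `misrouted` are just made up for illustration, and "expected next hop = LAN IP of the hosting node" is my assumption about what correct routes would look like.

```python
# Guest IP -> node hosting it, taken from the test setup described above.
GUEST_HOME = {
    "10.255.69.31": "prox01",
    "10.255.69.41": "prox01",
    "10.255.69.34": "prox04",
    "10.255.69.44": "prox04",
}

# Node -> LAN address that OPNsense sees as the BGP next hop.
NODE_LAN_IP = {"prox01": "10.4.10.31", "prox04": "10.4.10.34"}

# (prefix, next hop) pairs copied from the OPNsense route table.
RECEIVED = [
    ("10.255.69.0/24", "10.4.10.31"),
    ("10.255.69.0/24", "10.4.10.34"),
    ("10.255.69.31/32", "10.4.10.34"),
    ("10.255.69.41/32", "10.4.10.34"),
    ("10.255.69.34/32", "10.4.10.31"),
    ("10.255.69.44/32", "10.4.10.31"),
]


def misrouted(received):
    """Return (prefix, actual, expected) for /32s not pointing at the hosting node."""
    problems = []
    for prefix, next_hop in received:
        ip, _, length = prefix.partition("/")
        if length != "32" or ip not in GUEST_HOME:
            continue  # only host routes for known guests are checked
        expected = NODE_LAN_IP[GUEST_HOME[ip]]  # assumption: hosting node should advertise it
        if next_hop != expected:
            problems.append((prefix, next_hop, expected))
    return problems


if __name__ == "__main__":
    for prefix, got, want in misrouted(RECEIVED):
        print(f"{prefix}: next hop {got}, expected {want}")
```

With the routes above, every one of the four /32s is flagged, i.e. each guest is being advertised by the exit node that does not host it.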
For reference, here's my current (stable) setup that hits goals 1 through 4 (but not 5 and 6), using a single BGP controller + exit node:
- OPNsense: 10.4.10.1
- prox01: 10.4.10.31
- prox03: 10.4.10.33
- prox04: 10.4.10.34
Code:
# On prox01
> cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface enx5847ca7b312c inet manual

iface enp1s0f0np0 inet manual
    mtu 9000

iface enp1s0f1np1 inet manual
    mtu 9000

auto vmbr0
iface vmbr0 inet static
    bridge-ports enx5847ca7b312c
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

auto vmbr0.10
iface vmbr0.10 inet static
    address 10.4.10.31/24
    gateway 10.4.10.1
Code:
# On prox01
> cat /etc/network/interfaces.d/sdn
auto vrf_evpnzone
iface vrf_evpnzone
    vrf-table auto
    post-up ip route del vrf vrf_evpnzone unreachable default metric 4278198272

auto vrfbr_evpnzone
iface vrfbr_evpnzone
    bridge-ports vrfvx_evpnzone
    bridge_stp off
    bridge_fd 0
    mtu 8950
    vrf vrf_evpnzone

auto vrfvx_evpnzone
iface vrfvx_evpnzone
    vxlan-id 10000
    vxlan-local-tunnelip 10.255.0.31
    bridge-learning off
    bridge-arp-nd-suppress on
    mtu 8950

auto vxlan_vxnet1
iface vxlan_vxnet1
    vxlan-id 10500
    vxlan-local-tunnelip 10.255.0.31
    bridge-learning off
    bridge-arp-nd-suppress on
    mtu 8950

auto vxnet1
iface vxnet1
    address 10.255.69.1/24
    post-up iptables -t nat -A POSTROUTING -s '10.255.69.0/24' -o vmbr0.10 -j SNAT --to-source 10.4.10.31
    post-down iptables -t nat -D POSTROUTING -s '10.255.69.0/24' -o vmbr0.10 -j SNAT --to-source 10.4.10.31
    post-up iptables -t raw -I PREROUTING -i fwbr+ -j CT --zone 1
    post-down iptables -t raw -D PREROUTING -i fwbr+ -j CT --zone 1
    hwaddress BC:24:11:D8:8F:70
    bridge_ports vxlan_vxnet1
    bridge_stp off
    bridge_fd 0
    mtu 8950
    ip-forward on
    arp-accept on
    vrf vrf_evpnzone

auto dummy_prox-of
iface dummy_prox-of inet static
    address 10.255.0.31/32
    link-type dummy
    ip-forward 1

auto dummy_prox-of
iface dummy_prox-of inet6 static
    address fc69:cefe:255::31/128
    link-type dummy
    ip-forward 1

auto enp1s0f0np0
iface enp1s0f0np0
    ip-forward 1

auto enp1s0f1np1
iface enp1s0f1np1
    ip-forward 1
Code:
> cat /etc/pve/sdn/*
evpn: proxevpn
    asn 65430
    fabric prox-of

bgp: bgpprox01
    asn 65430
    node prox01
    peers 10.4.10.1
    bgp-multipath-as-path-relax 1
    ebgp 1
    loopback dummy_prox-of

openfabric_fabric: prox-of
    csnp_interval 2
    hello_interval 1
    ip6_prefix fc69:cefe:255::/64
    ip_prefix 10.255.0.0/24

openfabric_node: prox-of_prox01
    interfaces name=enp1s0f0np0
    interfaces name=enp1s0f1np1
    ip 10.255.0.31
    ip6 fc69:cefe:255::31

openfabric_node: prox-of_prox03
    interfaces name=enp1s0f0np0
    interfaces name=enp1s0f1np1
    ip 10.255.0.33
    ip6 fc69:cefe:255::33

openfabric_node: prox-of_prox04
    interfaces name=enp65s0f0np0
    interfaces name=enp65s0f1np1
    ip 10.255.0.34
    ip6 fc69:cefe:255::34

{"zones":{"evpnzone":{"subnets":{"10.255.69.0/24":{"ips":{"10.255.69.1":{"gateway":1}}}}},"evpnPRD":{"subnets":{}},"epvnzone":{"subnets":{}}}}

subnet: evpnzone-10.255.69.0-24
    vnet vxnet1
    gateway 10.255.69.1
    snat 1

vnet: vxnet1
    zone evpnzone
    tag 10500

evpn: evpnzone
    controller proxevpn
    vrf-vxlan 10000
    exitnodes prox01
    ipam pve
    mac BC:24:11:D8:8F:70
    mtu 8950
Code:
# On all nodes
> cat /etc/sysctl.d/zzz-network.conf
net.ipv4.ip_forward=1
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.all.rp_filter=0
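One caveat with the rp_filter settings above: as far as I understand the kernel docs, the value actually used for an interface is the maximum of `conf/all/rp_filter` and `conf/<iface>/rp_filter`, and `default` only affects interfaces created after it was set. So it seems worth checking the per-interface values as well. A rough sketch for that (the function names are my own, nothing Proxmox-specific):

```python
from pathlib import Path


def effective_rp_filter(all_value: int, iface_value: int) -> int:
    """The kernel validates sources using max(conf/all, conf/<iface>)."""
    return max(all_value, iface_value)


def read_rp_filter(base: Path = Path("/proc/sys/net/ipv4/conf")) -> dict[str, int]:
    """Map interface name -> effective rp_filter value on this host."""
    if not base.is_dir():
        return {}  # not a Linux host (or /proc is not mounted)
    all_value = int((base / "all" / "rp_filter").read_text())
    result = {}
    for entry in sorted(base.iterdir()):
        if entry.name in ("all", "default"):
            continue  # pseudo-entries, not real interfaces
        iface_value = int((entry / "rp_filter").read_text())
        result[entry.name] = effective_rp_filter(all_value, iface_value)
    return result


if __name__ == "__main__":
    for iface, value in read_rp_filter().items():
        note = "" if value == 0 else "  <-- reverse-path filtering still active"
        print(f"{iface}: {value}{note}")
```

Any interface printed with a non-zero value would still drop asymmetric traffic even though `all` and `default` are 0 in the file.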
Code:
# On prox01
> cat /etc/frr/frr.conf
frr version 10.3.1
frr defaults datacenter
hostname prox01
log syslog informational
service integrated-vtysh-config
!
!
vrf vrf_evpnzone
 vni 10000
exit-vrf
!
router bgp 65430
 bgp router-id 10.4.10.31
 no bgp default ipv4-unicast
 coalesce-time 1000
 bgp disable-ebgp-connected-route-check
 bgp bestpath as-path multipath-relax
 neighbor BGP peer-group
 neighbor BGP remote-as external
 neighbor BGP bfd
 neighbor 10.4.10.1 peer-group BGP
 neighbor VTEP peer-group
 neighbor VTEP remote-as 65430
 neighbor VTEP bfd
 neighbor VTEP update-source dummy_prox-of
 neighbor 10.255.0.33 peer-group VTEP
 neighbor 10.255.0.34 peer-group VTEP
 !
 address-family ipv4 unicast
  network 10.4.10.31/32
  neighbor BGP activate
  neighbor BGP soft-reconfiguration inbound
  import vrf vrf_evpnzone
 exit-address-family
 !
 address-family ipv6 unicast
  import vrf vrf_evpnzone
 exit-address-family
 !
 address-family l2vpn evpn
  neighbor VTEP activate
  neighbor VTEP route-map MAP_VTEP_IN in
  neighbor VTEP route-map MAP_VTEP_OUT out
  advertise-all-vni
 exit-address-family
exit
!
router bgp 65430 vrf vrf_evpnzone
 bgp router-id 10.255.0.31
 no bgp hard-administrative-reset
 no bgp graceful-restart notification
 !
 address-family ipv4 unicast
  redistribute connected
 exit-address-family
 !
 address-family ipv6 unicast
  redistribute connected
 exit-address-family
 !
 address-family l2vpn evpn
  default-originate ipv4
  default-originate ipv6
 exit-address-family
exit
!
ip prefix-list loopbacks_ips seq 10 permit 0.0.0.0/0 le 32
ip prefix-list only_default seq 1 permit 0.0.0.0/0
!
ipv6 prefix-list only_default_v6 seq 1 permit ::/0
!
route-map MAP_VTEP_IN deny 1
 match ip address prefix-list only_default
exit
!
route-map MAP_VTEP_IN deny 2
 match ipv6 address prefix-list only_default_v6
exit
!
route-map MAP_VTEP_IN permit 3
exit
!
route-map MAP_VTEP_OUT permit 1
exit
!
route-map correct_src permit 1
 match ip address prefix-list loopbacks_ips
 set src 10.4.10.31
exit
!
ip protocol bgp route-map correct_src
router openfabric prox-of
 net 49.0001.0102.5500.0031.00
exit
!
interface dummy_prox-of
 ipv6 router openfabric prox-of
 ip router openfabric prox-of
 openfabric passive
exit
!
interface enp1s0f0np0
 ipv6 router openfabric prox-of
 ip router openfabric prox-of
 openfabric hello-interval 1
 openfabric csnp-interval 2
exit
!
interface enp1s0f1np1
 ipv6 router openfabric prox-of
 ip router openfabric prox-of
 openfabric hello-interval 1
 openfabric csnp-interval 2
exit
!
access-list pve_openfabric_prox-of_ips permit 10.255.0.0/24
!
ipv6 access-list pve_openfabric_prox-of_ip6s permit fc69:cefe:255::/64
!
route-map pve_openfabric permit 100
 match ip address pve_openfabric_prox-of_ips
 set src 10.255.0.31
exit
!
route-map pve_openfabric6 permit 110
 match ipv6 address pve_openfabric_prox-of_ip6s
 set src fc69:cefe:255::31
exit
!
ip protocol openfabric route-map pve_openfabric
!
ipv6 protocol openfabric route-map pve_openfabric6
!
!
line vty
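Side note on the OpenFabric NET in the config above: `49.0001.0102.5500.0031.00` looks like it is derived from the fabric IP 10.255.0.31 by zero-padding each octet to three digits and regrouping the twelve digits into blocks of four. That is my reading of the generated value, not something I've confirmed in the Proxmox source. A sketch of that mapping (`ipv4_to_net` is a name I made up, and the `49.0001` area prefix is simply copied from my config):

```python
import ipaddress


def ipv4_to_net(ip: str, area: str = "49.0001") -> str:
    """Build an IS-IS/OpenFabric style NET from an IPv4 address.

    Assumed scheme: pad each octet to 3 digits, join them into 12 digits,
    split into three groups of 4, then append the "00" selector.
    """
    octets = ipaddress.IPv4Address(ip).packed  # validates the address
    digits = "".join(f"{octet:03d}" for octet in octets)
    groups = [digits[i:i + 4] for i in range(0, 12, 4)]
    return ".".join([area, *groups, "00"])


if __name__ == "__main__":
    print(ipv4_to_net("10.255.0.31"))
```

This is handy for checking that each node ends up with a unique system ID when comparing the frr.conf files across nodes.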