SDN / EVPN - can we use VRFs to keep EVPN/BGP away from Hypervisor Management Routing?

rendrag

Dec 10, 2024
Hello,

I'm currently labbing up Proxmox to see if we can replace our VMware/NSX-T deployments with it. The initial test looked really promising. Then I went to do a closer-to-production deployment, with separated management and routing networks, and it's all fallen apart. Here's the basic networking overview for what I'm trying to do.

[Attached image: basic networking overview diagram]
With the cluster formed and management configured, it was working great. Then I set up the EVPN and BGP controllers, and it just wasn't working. I SSH'd into a hypervisor, looked at the routing table, and realised it isn't creating a VRF/route table to do all the base EVPN routing in - it's just inserting the default routes from BGP into the base routing table. So now there are two separate sets of default routes: one out the management network, one out the public network (i.e. via the IPs of 10G-P1 and 10G-P2 on VLAN2 in the above diagram).

The basic premise we're aiming for is that the hypervisors must only be reachable on the MGMT network, and must only be able to talk outbound via the MGMT network. VMs behind EVPN must only be able to talk outbound via VLAN2 networking (or on a trunked VLAN, but I'm not testing that right now, as I figure that's 'normal' functionality).

Did I miss a tickbox somewhere to tell it that the EVPN routing must be separate from the hypervisor routing? Or is this not possible with the Proxmox SDN as currently implemented? Would I be better off just using VXLAN vnets and running a couple of VyOS VMs inside the cluster to do the BGP+EVPN part of the equation?

Thank you!
 
SDN should create a separate VRF for each zone. So if you create an EVPN controller and attach it to a zone, it should insert the learned routes into the VRF of the zone, not the default VRF. So EVPN routing should already be separate from the hypervisor routing. Which routes are you seeing in the default routing table that shouldn't be there? The routes for the underlay network? The BGP controller is for creating an underlay network, which gets handled by the default routing table.
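A quick way to check where the routes actually ended up (a sketch, assuming a zone named evx1 so the VRF is called vrf_evx1 - substitute your zone name) is to compare the default table with the zone VRF table:

Code:
# routes in the default table (underlay / hypervisor routing)
vtysh -c 'show ip route'
ip route show

# routes inside the zone VRF (overlay); SDN names the VRF vrf_<zone>
vtysh -c 'show ip route vrf vrf_evx1'
ip route show vrf vrf_evx1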
 
Thanks Stefan,

Yeah, I was expecting a separate VRF for each zone, but the BGP controllers seem to just be putting the routing for the EVPN zones into the default routing table.
I feel like I've missed something important here that I'm not quite putting my finger on. Are the BGP controllers not the routing link between the EVPN zone(s) and external upstream routers?

Edit: No, you're kind of right - the routes for the interfaces in the EVPN zones are in VRFs - but the routes being imported by the BGP controllers (i.e. default routes) are going into the main routing table, which is affecting the hypervisor connectivity. And outbound routing from the EVPN VRFs is using the main routing table, which is using a mix of BGP and static routes.

i.e. the routing table looks like the following.
Inside of FRR's vtysh:
Bash:
proxmox-hv-01# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

B>* 0.0.0.0/0 [20/0] via 100.127.254.113, bond0.2279, weight 1, 00:17:59
  *                  via 100.127.254.114, bond0.2279, weight 1, 00:17:59
C>* 10.68.70.0/24 is directly connected, vmbr0, 00:18:05
B>* 27.50.65.0/28 [20/0] is directly connected, vnet1 (vrf vrf_evx1), weight 1, 00:18:04
C>* 100.126.9.16/28 is directly connected, bond0.2278, 00:18:05
C>* 100.127.254.112/29 is directly connected, bond0.2279, 00:18:05

And back out in the shell:
Bash:
root@proxmox-hv-01:~# ip route
default via 10.68.70.1 dev vmbr0 proto kernel onlink
default nhid 28 proto bgp metric 20
        nexthop via 100.127.254.114 dev bond0.2279 weight 1
        nexthop via 100.127.254.113 dev bond0.2279 weight 1
10.68.70.0/24 dev vmbr0 proto kernel scope link src 10.68.70.121
27.50.65.0/28 nhid 22 dev vnet1 proto bgp metric 20
100.126.9.16/28 dev bond0.2278 proto kernel scope link src 100.126.9.20
100.127.254.112/29 dev bond0.2279 proto kernel scope link src 100.127.254.115

I shouldn't technically even see that 27.50.65.0/28 route in the main routing table, but I guess because FRR is leaking it into the main table for the BGP controller, it's ending up there?

Should I not be creating VLAN 2279 (what BGP connects over) as a simple VLAN, and instead create it as an SDN VLAN so that it's in a VRF? Or would the BGP controller still end up pushing the routing into the main routing table?
 
but the routes being imported by the BGP controllers (i.e. default routes) are going into the main routing table, which is affecting the hypervisor connectivity. And outbound routing from the EVPN VRFs is using the main routing table, which is using a mix of BGP and static routes.

I shouldn't technically even see that 27.50.65.0/28 route in the main routing table, but I guess because FRR is leaking it into the main table for the BGP controller, it's ending up there?

There are different address families in play: IPv4/6 and L2VPN EVPN. The BGP controller is for IPv4/6 routes and imports them into the default routing table. The EVPN controller is for L2VPN EVPN routes and imports them into the respective VRF (depending on the RT). If you want to externally announce routes into your EVPN network, you need to do this via the L2VPN EVPN address family, not the IPv4/6 families. It seems like the problem is that you are announcing the routes in the IPv4/6 address family instead, which causes them to be imported into the default routing table. You need to set up your external peer to announce the routes in the correct address family.

The BGP controller is for when you want to use BGP as the IGP, not for announcing routes for the overlay network (= EVPN). So if you don't want to use BGP as your IGP, you don't need a BGP controller at all.
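To illustrate the difference, here is a rough sketch in FRR syntax of what the external peer side could look like (hypothetical ASNs and peer address, not a config generated by Proxmox): anything exchanged under ipv4 unicast lands in the default table, while the overlay routes have to travel in the l2vpn evpn address family.

Code:
router bgp 65001
 neighbor 192.0.2.1 remote-as 65000
 ! anything exchanged in ipv4 unicast ends up in the default routing table
 address-family l2vpn evpn
  ! overlay (EVPN type 2/5) routes are exchanged here instead
  neighbor 192.0.2.1 activate
  advertise-all-vni
 exit-address-family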
 
Thanks Stefan,

So it sounds like no, it's not possible to have the Proxmox EVPN peer with an external BGP peer without mixing with the hypervisor management routing table at this time? I did try adding the BGP peers to the EVPN controller peer list, but this caused the FRR config to add them as peers in the VTEP peer group and add them to the l2vpn evpn address-family, which of course caused the switches to reject the BGP connections, as they were expecting ipv4/ipv6 unicast address-family sessions.

I'll have a look at the option of running a pair of HA VyOS VMs in-stack per cluster as EVPN peers to do the EVPN-to-BGP routing, although I really wanted something UI-based, as that leaves it in SysEng control without needing the NOC, which makes it more likely to be accepted as a replacement for VMware.

Thanks,

Damien
 
So it sounds like no, it's not possible to have the Proxmox EVPN peer with an external BGP peer without mixing with the hypervisor management routing table at this time? I did try adding the BGP peers to the EVPN controller peer list, but this caused the FRR config to add them as peers in the VTEP peer group and add them to the l2vpn evpn address-family, which of course caused the switches to reject the BGP connections, as they were expecting ipv4/ipv6 unicast address-family sessions.
It is possible, you just need to announce the routes in the correct address family (l2vpn evpn). It depends on what your switches can do. Sadly, EVPN functionality is usually quite costly and only available on high-end switches / higher license tiers.
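If you want to double-check what a session actually negotiated, the standard FRR show commands should tell you (a sketch; adjust to your setup):

Code:
# shows the capabilities/address families negotiated per neighbor
vtysh -c 'show bgp neighbors'
# summary of the l2vpn evpn sessions only
vtysh -c 'show bgp l2vpn evpn summary'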
 
It is possible, you just need to announce the routes in the correct address family (l2vpn evpn). It depends on what your switches can do. Sadly, EVPN functionality is usually quite costly and only available on high-end switches / higher license tiers.
I feel like that's really just missing functionality on the Proxmox SDN side. In other places where we have EVPN, we happily peer BGP IPv4 unicast to the ToR, keeping the routing table inside a VRF on the EVPN side and not mixing with the management routing table; it's just a matter of having the correct configuration in the FRR BGP instances. Saying that you need to go and buy extra hardware to do something that is extremely simple to do in FRR is really quite the cop-out. :(

Could I ask that you update the SDN documentation to make it very clear that it is not possible to use the SDN VRFs quite the way the documentation makes it look, and that management networking will always be mixed with EVPN networking as soon as you enable a BGP controller, no matter the state of 'exit nodes local routing' on the zone configuration page?

I was really hoping Proxmox was going to be the 'silver bullet' for migrating all our hypervisors away from VMware, but I'm starting to see why our competitors are going and rolling their own platforms. I really don't want to roll our own though, I'm so tired of rolling our own. :(
 
I feel like that's really just missing functionality on the Proxmox SDN side. In other places where we have EVPN, we happily peer BGP IPv4 unicast to the ToR, keeping the routing table inside a VRF on the EVPN side and not mixing with the management routing table; it's just a matter of having the correct configuration in the FRR BGP instances. Saying that you need to go and buy extra hardware to do something that is extremely simple to do in FRR is really quite the cop-out. :(
I'm not sure we're 100% on the same page here, so for clarification: you are talking about the routes in the overlay network, right (the routes for the guests in the VXLAN networks)? Because EVPN uses the address family specifically created for it in MP-BGP. If you could point me to the specific solution you are talking about, I'd happily look at it and see what we can do. Do you maybe mean the Route Server mode in NSX-T?

Could I ask that you update the SDN documentation to make it very clear that it is not possible to use the SDN VRFs quite the way the documentation makes it look, and that management networking will always be mixed with EVPN networking as soon as you enable a BGP controller, no matter the state of 'exit nodes local routing' on the zone configuration page?
This is not always the case; I have several clusters with EVPN and BGP controllers and the routes do not get mixed. As I said before, the BGP controller is for IPv4-unicast routes in the default routing table, while the EVPN controller is for l2vpn-evpn routes in VRFs. Can you post the output of the following commands, so I can see your configuration? That should help clear it up a bit.

Do you have an example configuration for what you're trying to do in FRR? Then please attach it as well. Is it simply creating a new VRF and then announcing IP prefix routes (which would correspond to EVPN type 5 routes)?

Code:
vtysh -c 'show bgp summary'
vtysh -c 'show bgp neighbors'
vtysh -c 'show bgp l2vpn evpn route'
vtysh -c 'show bgp ipv4'

cat /etc/pve/sdn/zone.cfg
cat /etc/pve/sdn/controller.cfg
cat /etc/pve/sdn/vnet.cfg
cat /etc/pve/sdn/subnets.cfg

cat /etc/frr/frr.conf
 
One more thing that came to mind: you can always do full-mesh peering between the PVE nodes instead of using an RR, and then you would not need a switch that is able to speak l2vpn-evpn BGP. But you lose the upsides of using an RR.
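As a rough sketch (the addresses and ASN are just examples), a full mesh is simply the EVPN controller with every node's underlay address listed as a peer in /etc/pve/sdn/controller.cfg:

Code:
evpn: evpnctl
        asn 65000
        peers 100.127.254.115,100.127.254.116,100.127.254.117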
 
Hey Stefan,

So I spent the last three days labbing this up so I could give you some sample FRR configs, and I can now see WHY the Proxmox SDN doesn't offer this functionality. While I can get Huawei, Cisco, and Extreme all happily doing exactly what I want, Linux (plain Debian+FRR and VyOS) just will not establish a BGP session with source/destination addresses inside a VRF.

I think we'll just have to trunk VLANs and do the routing externally, which is continuing old grossness, but it will meet the 'hypervisors must not be publicly accessible' requirement I've been given. :\

Thanks for your help

Damien
 
You certainly caught my interest with your setup ;). I am currently working on improving the existing EVPN features, so I'm always interested in what kind of setups people want to run. If I understand correctly, you want to have a separate VRF that contains the routes for the BGP peers and make FRR use that VRF? Do you also want to import the learned routes into that VRF, or into a separate VRF?

Maybe I can look into this in the coming days/weeks when I have some more time on my hands; it would be interesting to see if there's something I can come up with.
 
You certainly caught my interest with your setup ;). I am currently working on improving the existing EVPN features, so I'm always interested in what kind of setups people want to run. If I understand correctly, you want to have a separate VRF that contains the routes for the BGP peers and make FRR use that VRF? Do you also want to import the learned routes into that VRF, or into a separate VRF?

Maybe I can look into this in the coming days/weeks when I have some more time on my hands; it would be interesting to see if there's something I can come up with.
Hello!

I'm in the same boat; I can probably document my use cases.

I have two environments right now that can be of use:

E1. PVE overlay with different isolated tenants.
Each tenant should have a matching routing instance on an EVPN/VXLAN IP fabric, JunOS-based.

a. One use case would be to export/import EVPN type 5 routes between the overlays on both sides. A single peering setup should handle this, with additional RT setup for segregation.
b. Another use case would be to "merge" both overlays, passing type 5 & type 2 routes. That would allow having a PVE VM and a bare-metal node in the IP fabric in the same VNI.

E2. PVE overlay connected to a traditional network.
The peer doesn't support EVPN, only plain BGP.

a. EVPN type 5 subnets should be published to the external network. Forwarding should work between the external peer and a dedicated interface (not through the hypervisor default gateway, as it does today).
b. Same scenario as above, but importing routes from the external peer into the VRF of interest.
Isolation between tenants/VRFs will require as many BGP instances as there are VRFs and peering parties.

Note: I mentioned "instance" for simplicity, but for HA each integration will require N BGP or EVPN/MP-BGP instances for redundancy.

Other use cases that apply to either environment above:

C1. Merging of different external routing tables towards a single EVPN VRF in the overlay. External peers only understand BGP, not EVPN.

C2. BGP peering from a VM (see the sketch after this list).
Usually required to implement anycast services like DNS.
VMs inject/export /32 routes at will, usually without importing any routes from the network.
The network elements are expected to redistribute those routes.

In any case, we don't want to merge workload/VM traffic with the host's management network.
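For C2, a minimal sketch of the VM side in FRR terms (hypothetical ASNs, anycast address and peer - just to show the shape, not a tested config):

Code:
router bgp 65010
 neighbor 10.10.10.1 remote-as 65000
 address-family ipv4 unicast
  ! announce only the anycast /32
  network 192.0.2.53/32
  ! and do not accept anything back from the network
  neighbor 10.10.10.1 route-map DENY-ALL in
 exit-address-family
!
route-map DENY-ALL deny 10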

# current state of my tests:

Status 1. I failed miserably at my first attempt at VXLAN/EVPN integration with JunOS.

Status 2. I was able to peer the PVE nodes to an OPNsense running FRR too; the firewall receives the subnets I have created on the PVE EVPN side. However, forwarding of the VM traffic is trying to exit the PVE nodes towards the firewall via the management interface (not the peering VLAN). I might be missing the export of the default gateway on the firewall side, but I don't want that to affect the host traffic. PVE management should have its own next hop/gateway, separated from the VM forwarding.
 
Hey Stefan,

So I spent the last three days labbing this up so I could give you some sample FRR configs, and I can now see WHY the Proxmox SDN doesn't offer this functionality. While I can get Huawei, Cisco, and Extreme all happily doing exactly what I want, Linux (plain Debian+FRR and VyOS) just will not establish a BGP session with source/destination addresses inside a VRF.

I think we'll just have to trunk VLANs and do the routing externally, which is continuing old grossness, but it will meet the 'hypervisors must not be publicly accessible' requirement I've been given. :\

Thanks for your help

Damien
I got a manual setup establishing two sets of BGP sessions to maintain the BGP separation.

I can clean up my tests and share them if they are of interest.
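In the meantime, a rough sketch of the idea in plain FRR terms (hypothetical ASNs; the VRF name and peer address are just borrowed from earlier in the thread, this is not what Proxmox generates):

Code:
router bgp 65000 vrf vrf_evx1
 neighbor 100.127.254.113 remote-as 65001
 address-family ipv4 unicast
  neighbor 100.127.254.113 activate
  ! announce the locally connected overlay subnets;
  ! the default route is learned from the peer into this VRF
  redistribute connected
 exit-address-family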

The only sticking point so far is VM <-> host traffic for things like Datacenter Manager, LibreNMS and an OIDC authentication service running in the overlay.
 
Only exit nodes import the zone VRF into the default VRF (to route traffic between the EVPN network and the "real" network).

If you don't want this, you need an external exit node (on a VM, or directly on a physical router, peering in EVPN with your hypervisor nodes and announcing the EVPN type-5 default route).
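For example, on an external FRR-based exit node it can look roughly like this (hypothetical ASN and VNI, zone VRF name taken from this thread; the type-5 default is originated from inside the zone VRF, on top of the usual l2vpn evpn peering towards the nodes):

Code:
vrf vrf_evx1
 vni 10000
exit-vrf
!
router bgp 65000 vrf vrf_evx1
 address-family l2vpn evpn
  ! announce a type-5 default route into the EVPN fabric
  default-originate ipv4
  advertise ipv4 unicast
 exit-address-family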
 
Only exit nodes import the zone VRF into the default VRF (to route traffic between the EVPN network and the "real" network).

If you don't want this, you need an external exit node (on a VM, or directly on a physical router, peering in EVPN with your hypervisor nodes and announcing the EVPN type-5 default route).
That is fine and it works for VM-to-external-network traffic; the issue is that outgoing connections from the host to a VM don't work through the external peering node - it seems to try to resolve locally (not desired).

I have:
pve01/02/03/04 are integrated with EVPN/VXLAN

pve{01,02,03,04}/fw{01,02} are integrated with BGP. The BGP peering interface is tied to the service VRF.

Having:
- VM01 running in pve01
- VM01 connected to a segment in the overlay
- pve01 peering to fw01 via vlanX
- vlanX enclosed in VRF01
- fw01 as default gateway for management vlanY

I would expect connections from the host to the VM to follow:

pve01 (management) -> (vlanY) fw01 (vlanX) -> pve01 (peering) -> VM01 (eth0/overlay)

But it seems it doesn't work like that.