SDN: BGP controllers share routes to IPs of VMs incorrectly

May 24, 2023
Hello,

We are currently trying to set up a new cluster with an EVPN zone and BGP controllers to share routes with our internal firewall.
The communication is working, but there seems to be a problem with the BGP routing: each node only seems to share routes to the IPs of VMs that run on the other nodes.

About our current setup:
The cluster consists of 3 nodes and a firewall:
Host           IP address
firewall       172.16.9.1
pve-green-01   172.16.9.11
pve-green-02   172.16.9.12
pve-green-03   172.16.9.13

We configured an EVPN Controller + Zone with a single VNET (testvnet / 192.168.0.0/24) to simplify testing. (Configuration is attached)
Additionally, we configured BGP controllers for all pve nodes to share the routing information with the firewall.

There are 3 VMs in the testvnet:
VM     IP address      PVE node
test1  192.168.0.101   pve-green-01
test2  192.168.0.102   pve-green-02
test3  192.168.0.103   pve-green-03

So, I would expect the Firewall to receive these routes:
from pve-green-01: 192.168.0.0/24, 192.168.0.101/32
from pve-green-02: 192.168.0.0/24, 192.168.0.102/32
from pve-green-03: 192.168.0.0/24, 192.168.0.103/32

But at the moment, the firewall receives these routes:
from pve-green-01: 192.168.0.0/24, 192.168.0.102/32, 192.168.0.103/32
from pve-green-02: 192.168.0.0/24, 192.168.0.101/32, 192.168.0.103/32
from pve-green-03: 192.168.0.0/24, 192.168.0.101/32, 192.168.0.102/32

So every node only shares routes to the IP addresses of VMs that run on the other nodes, but not the route for its own local VM.

Maybe this is caused by the routes that each node sees locally?
pve-green-01 only shows the routes to the VMs on the other nodes, but not the route to the VM on the node itself:
Code:
root@pve-green-01:/etc/pve/sdn# vtysh -c "sh ip route vrf vrf_evpnzone"
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

VRF vrf_evpnzone:
C>* 10.255.255.0/30 is directly connected, xvrfp_evpnzone, 00:01:07
C>* 192.168.0.0/24 is directly connected, testvnet, 00:45:52
B>* 192.168.0.102/32 [200/0] via 172.16.9.12, vrfbr_evpnzone onlink, weight 1, 00:26:42
B>* 192.168.0.103/32 [200/0] via 172.16.9.13, vrfbr_evpnzone onlink, weight 1, 00:26:33
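
To double-check what each node actually advertises to the firewall peer, the BGP session can be inspected from FRR's vtysh with something like:
Code:
root@pve-green-01:~# vtysh -c "show ip bgp neighbors 172.16.9.1 advertised-routes"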


Communication between VMs in the EVPN network, and also to and from networks outside of the PVE cluster, is working, but the routing isn't ideal.
Because of the routes received from the PVE cluster, the firewall routes packets for VM test1 via pve-green-02 or pve-green-03, but never directly to pve-green-01.
Is there a way to enable the firewall to directly route the packets to the correct nodes?
 

Attachments

  • controllers.cfg.txt
  • pveversion.txt
  • subnets.cfg.txt
  • vnets.cfg.txt
  • zones.cfg.txt
Hi,

I don't see how your firewall could receive routes, as you don't peer BGP or EVPN with your firewall?


The way it should be done:

If your firewall supports EVPN: peer the EVPN controller directly with it, then configure the exit-node on the firewall directly.

If your firewall doesn't support EVPN: you need to define 1 exit-node (or 2 for redundancy), then create a BGP controller for each exit-node and peer BGP with your firewall IP.
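For the EVPN case, this basically just means adding the firewall IP to the peers list of the EVPN controller. A sketch with the IPs from this thread (untested):
Code:
evpn: evpnctrl
    asn 65000
    peers 172.16.9.1,172.16.9.11,172.16.9.12,172.16.9.13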
 
Our firewall doesn't support EVPN, so we created bgp controllers on all 3 nodes:
(controllers.cfg)
Code:
bgp: bgppve-green-01
    asn 65000
    node pve-green-01
    peers 172.16.9.1
    bgp-multipath-as-path-relax 0
    ebgp 0

bgp: bgppve-green-02
    asn 65000
    node pve-green-02
    peers 172.16.9.1
    bgp-multipath-as-path-relax 0
    ebgp 0

bgp: bgppve-green-03
    asn 65000
    node pve-green-03
    peers 172.16.9.1
    bgp-multipath-as-path-relax 0
    ebgp 0

They are also configured as exit nodes in the evpn zone:
(zones.cfg)
Code:
 evpn: evpnzone
    controller evpnctrl
    vrf-vxlan 1
    exitnodes pve-green-01,pve-green-02,pve-green-03
    ipam pve
    mac BC:24:11:1C:8F:59
    mtu 1500
    nodes pve-green-03,pve-green-02,pve-green-01

With the current configuration, the firewall receives the routes sent by the PVE cluster, but each node only sends routes for the VM IPs that run on the other nodes.
 
Oh, sorry, I didn't see the BGP controllers with the firewall IP as peer.


>>So every node only shares routes to the IP addresses of VMs that run on the other nodes, but not the route for its own local VM.
>>
>>Maybe this is caused by the routes that each node sees locally?
>>pve-green-01 only shows the routes to the VMs on the other nodes, but not the route to the VM on the node itself:

This is normal: the local VM IP is bridged, so you don't need any route to reach it.
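The local VM IP is still announced to the other VTEPs as an EVPN type-2 (MAC/IP) route; you can check this on the node with something like:
Code:
root@pve-green-01:~# vtysh -c "show bgp l2vpn evpn route type macip"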


>>Because of the routes received from the PVE cluster, the firewall routes packets for VM test1 via pve-green-02 or pve-green-03, but never directly to pve-green-01.
>>Is there a way to enable the firewall to directly route the packets to the correct nodes?

AFAIK, this is not possible (as we don't have any local route to redistribute).

In a "real" evpn network, the exit-nodes are different physical machines. We add support to proxmox nodes themself, to avoid the need of extra machines. (and in a "real" evpn network, the firewall should be the exit-node with evpn directly).

Generally, 2 exit-nodes are enough even for a 20-node cluster. (You don't need one exit-node per node.)

The only open-source firewall appliance supporting EVPN is VyOS:
https://vyos.io/

(pfSense and OPNsense are not compatible because EVPN is not implemented on BSD.)
 
>>AFAIK, this is not possible (as we don't have any local route to redistribute).
Thanks for the clarification, I thought this would be possible.

>>Generally, 2 exit-nodes are enough even for a 20-node cluster. (You don't need one exit-node per node.)
I removed pve-green-03 from the exit nodes, so only -01 and -02 are the remaining exit nodes. I also removed the BGP controller on pve-green-03.


Initially, we found this routing behaviour when troubleshooting issues after enabling the pve firewall.
In this scenario, the firewall routes packets for VM test1 to pve-green-02 and packets for VM test2 to pve-green-01, but the responses to these packets are routed directly back to the firewall.
This seems to confuse the firewall: the first SYN packet is passed, the SYN/ACK response too, but the following packets are dropped. Is this "normal" behaviour, or did we misconfigure something?
Should we synchronize firewall states between our exit nodes with conntrackd?


Today, we also found another hiccup with exit node local routing. When enabling this option for the zone, the exit nodes no longer distribute the routes to the EVPN networks via BGP.
After comparing the configurations, this import command is missing in the config with exit node local routing enabled:
Code:
router bgp 65000
 address-family ipv4 unicast
  import vrf vrf_evpnzone
But this seems intentional, as configuring it manually leads to another issue: the nodes can't access VMs on other nodes anymore, while connections to VMs on the same node as well as communication to the outside still work.
Is there a way to enable both exit node local routing and the BGP distribution of the networks?
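(For testing we added the import statement manually via vtysh. As an aside, and this is only an assumption on my part that I haven't verified: if such a local addition is supposed to survive SDN reloads, it would presumably have to go into /etc/frr/frr.conf.local rather than the generated frr.conf, e.g.:)
Code:
! /etc/frr/frr.conf.local -- ASSUMPTION: merged into the generated /etc/frr/frr.conf by the SDN tooling
router bgp 65000
 address-family ipv4 unicast
  import vrf vrf_evpnzone
 exit-address-family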


I know that our setup isn't ideal, but we currently can't get physically separate exit nodes. I thought about implementing the exit nodes as VMs that are directly attached to the underlying network, not to the SDN bridges. They could then act as exit nodes like physical machines would.
But I'm not sure if this is a good idea?
 
>>Initially, we found this routing behaviour when troubleshooting issues after enabling the pve firewall.
>>In this scenario, the firewall routes packets for VM test1 to pve-green-02 and packets for VM test2 to pve-green-01, but the responses to these packets are routed directly back to the firewall.
>>This seems to confuse the firewall: the first SYN packet is passed, the SYN/ACK response too, but the following packets are dropped. Is this "normal" behaviour, or did we misconfigure something?
For the VM firewall, it shouldn't have any impact (as it's done at bridge level).
I really don't know if you have a firewall protecting the PVE hosts themselves.

But, in any case, if you use multiple exit-nodes, you need to disable reverse path filtering:

https://pve.proxmox.com/pve-docs/chapter-pvesdn.html#_multiple_evpn_exit_nodes
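i.e. something along these lines, as described in the linked docs:
Code:
# /etc/sysctl.conf (or a drop-in file under /etc/sysctl.d/)
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.all.rp_filter=0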

>>Today, we also found another hiccup with exit node local routing. When enabling this option for the zone, the exit nodes no longer distribute the routes to the EVPN networks via BGP.
The "exit node local routing" have been added, because some users would like to access to vm ip from the hypervisor management ip.
They are both in different vrf, the normal setup, for security, is to avoid to access from hypervisor ip to the vm.

What is your use case for needing access to the VMs from the hypervisor exit-node IP?



>>After comparing the configurations, this import command is missing in the config with exit node local routing enabled:


Code:
router bgp 65000
 address-family ipv4 unicast
  import vrf vrf_evpnzone


>>But this seems intentional, as configuring it manually leads to another issue: the nodes can't access VMs on other nodes anymore, while connections to VMs on the same node as well as communication to the outside still work.
I need to verify this, but "exit node local routing" is really a trick for specific setups. AFAIK, it was added to avoid routing loops.

>>Is there a way to enable both exit node local routing and the BGP distribution of the networks?
I'm not sure. (But you don't need exit node local routing to get BGP distribution working.)


>>I know that our setup isn't ideal, but we currently can't get physically separate exit nodes. I thought about implementing the exit nodes as VMs that are directly attached to the underlying network, not to the SDN bridges. They could then act as exit nodes like physical machines would.
>>But I'm not sure if this is a good idea?
Yes, it should work. In the future, we are looking at managing exit-nodes with VMs too: https://bugzilla.proxmox.com/show_bug.cgi?id=3382
 
>>What is your use case for needing access to the VMs from the hypervisor exit-node IP?
We need to access a variety of VMs from the host system and the other way around, e.g. for the monitoring system, local apt repository, LDAP Server and some more.

>>I'm not sure. (But you don't need exit node local routing to get BGP distribution working.)
Yes, we don't need exit node local routing for BGP. But if we enable both at the same time, the node stops sharing the routes via BGP.

About using a VM as the exit node:
I implemented a PoC based on a Debian VM with FRR.
At the moment, this seems to work great. I'll have to do some more testing, but it should solve our issues.
Then we also don't need exit node local routing on the PVE nodes, since they aren't exit nodes anymore.
(Except this only moves the problem to the exit router VMs, but I'm still working on that.)
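A rough sketch of the kind of FRR config such an exit-router VM needs (placeholder values: 172.16.9.21 is a made-up underlay IP for the VM; the VXLAN, bridge and VRF interfaces have to be created outside of FRR):
Code:
! /etc/frr/frr.conf on the exit-router VM -- sketch, adapt to your setup
router bgp 65000
 bgp router-id 172.16.9.21
 no bgp default ipv4-unicast
 ! EVPN peering with the PVE nodes (the other VTEPs)
 neighbor VTEP peer-group
 neighbor VTEP remote-as 65000
 neighbor 172.16.9.11 peer-group VTEP
 neighbor 172.16.9.12 peer-group VTEP
 neighbor 172.16.9.13 peer-group VTEP
 ! plain IPv4 BGP towards the firewall
 neighbor 172.16.9.1 remote-as 65000
 address-family ipv4 unicast
  neighbor 172.16.9.1 activate
  import vrf vrf_evpnzone
 exit-address-family
 address-family l2vpn evpn
  neighbor VTEP activate
  advertise-all-vni
 exit-address-family
!
router bgp 65000 vrf vrf_evpnzone
 address-family l2vpn evpn
  default-originate ipv4
 exit-address-family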
 
After testing some more with VMs as exit nodes, an issue with the pve firewall appeared: packets are dropped for stateful connections if the routing isn't symmetrical, i.e. traffic from the physical network to EVPN VMs is routed via one exit router VM and traffic from EVPN VMs to the physical network via the other exit router VM.

We found the issue is related to the pve firewall rule that drops packets with ctstate INVALID:
Code:
Chain PVEFW-FORWARD (1 references)
target     prot opt source               destination         
DROP       all  --  anywhere             anywhere             ctstate INVALID
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
PVEFW-FWBR-IN  all  --  anywhere             anywhere             PHYSDEV match --physdev-in fwln+ --physdev-is-bridged
PVEFW-FWBR-OUT  all  --  anywhere             anywhere             PHYSDEV match --physdev-out fwln+ --physdev-is-bridged
           all  --  anywhere             anywhere             /* PVESIG:qnNexOcGa+y+jebd4dAUqFSp5nw */

After inserting custom rules with iptables, the connection is working as expected:
(VM ID 10005 and 10006 are the exit routers)
Code:
Chain PVEFW-FORWARD (1 references)
target     prot opt source               destination         
ACCEPT     all  --  anywhere             anywhere             PHYSDEV match --physdev-out tap10005i0 --physdev-is-bridged
ACCEPT     all  --  anywhere             anywhere             PHYSDEV match --physdev-in tap10005i0 --physdev-is-bridged
ACCEPT     all  --  anywhere             anywhere             PHYSDEV match --physdev-out tap10006i0 --physdev-is-bridged
ACCEPT     all  --  anywhere             anywhere             PHYSDEV match --physdev-in tap10006i0 --physdev-is-bridged
DROP       all  --  anywhere             anywhere             ctstate INVALID
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
PVEFW-FWBR-IN  all  --  anywhere             anywhere             PHYSDEV match --physdev-in fwln+ --physdev-is-bridged
PVEFW-FWBR-OUT  all  --  anywhere             anywhere             PHYSDEV match --physdev-out fwln+ --physdev-is-bridged
           all  --  anywhere             anywhere             /* PVESIG:qnNexOcGa+y+jebd4dAUqFSp5nw */

This makes sense to me: the exit router that only sees the responses doesn't know that the packets belong to an already existing connection, so these packets are in ctstate INVALID and get dropped.
But I think in this scenario it is a valid use case to be able to route asymmetrically, so I would like to add these rules to the pve firewall (or other rules, if someone can recommend a better approach).

But I expect these rules to be overwritten by pve-firewall, or at least be cleared after a reboot. Is there a way to add custom rules to the rules created by the pve firewall?
 
>>But I expect these rules to be overwritten by pve-firewall, or at least be cleared after a reboot. Is there a way to add custom rules to the rules created by the pve firewall?

you can add "nf_conntrack_allow_invalid: 1" in host.fw


and add "net.ipv4.conf.default.rp_filter=0" in sysctl.conf too.
 
>>We need to access a variety of VMs from the host system and the other way around, e.g. for the monitoring system, local apt repository, LDAP server and some more.

Do you really need to access the VMs from every node of your PVE cluster?
I mean, it's only really a problem on the exit-nodes; for the other (non-exit) nodes it's not a problem. (You can add a route on the non-exit Proxmox nodes, with the exit-nodes as gateway and the VM subnets as destination.)
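Something like this on a non-exit node, using the subnet and exit-node IPs from this thread (untested sketch; repeat or adjust for the second exit-node):
Code:
ip route add 192.168.0.0/24 via 172.16.9.11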

Then put these VMs on non-exit nodes.
 
>>In a "real" EVPN network, the exit-nodes are separate physical machines. We added support for using the Proxmox nodes themselves to avoid the need for extra machines. (And in a "real" EVPN network, the firewall should be the exit-node, speaking EVPN directly.)

Hi!

Awesome work so far. I've been looking at this for a new deployment, and I'm not comfortable merging L3 for PVE management with workload traffic.

More often than not, you want your workloads isolated from the management plane of the hypervisor (network-wise).

Could you explore implementing isolation?

Ways I can think of to solve it:

A - VyOS virtual machines, as you mentioned: very similar to NSX-T edge VMs
B - network contexts in the host: more complex, but doesn't need lifecycle management of a virtual machine OS.
 
>>Do you really need to access the VMs from every node of your PVE cluster?
I just caught up with the thread.

If there's actual isolation (that's a good thing) and you need your monitoring VM to reach a host, the correct way to do it is to have the route leaking done in an external border node (your DCFW, for example).

A. IaaS Platform in its own VRF
B. One Tenant per VRF.

We do leaking in the DCFW to integrate two environments:
IaaS Platform VRF vs Company Tooling Team (support tooling) VRF
 
>>A - VyOS virtual machines, as you mentioned: very similar to NSX-T edge VMs
>>B - network contexts in the host: more complex, but doesn't need lifecycle management of a virtual machine OS.

A third way, if you have a bigger deployment, is to use switches/routers that support EVPN.

Personally, I'm using my Arista switches as EVPN exit-nodes.


The current Proxmox exit-node feature is really here for small setups where users don't have a router supporting EVPN.

About VyOS, it should be easy to maintain, as it's mostly stateless: you can rebuild a new VM with a newer image version, send the configuration, and it's done.
But don't expect it for 2024, maybe end of 2025, as I really don't have time to work on this currently.
 
>>Personally, I'm using my Arista switches as EVPN exit-nodes.

I have EVPN-capable JunOS switches around; could you share a config example? Are servers and switches participating in the same VTEP underlay for EVPN/VXLAN? Can I isolate different VRFs with RTs?
 
>>I have EVPN-capable JunOS switches around; could you share a config example? Are servers and switches participating in the same VTEP underlay for EVPN/VXLAN? Can I isolate different VRFs with RTs?
I really don't have any experience with JunOS syntax.

You need to create a symmetric routing EVPN config:
https://www.juniper.net/documentati...m-t2-evpn-vxlan-ov__section_sym-routing-model

Basically, you'll have the same config as a Proxmox exit-node. The same anycast IP must be present on the Juniper side too, and you need to add the Juniper routers' IPs to the EVPN peers list on the Proxmox side.
And the Juniper switches need to announce the default EVPN type-5 route 0.0.0.0/0.

Basically, the Juniper switches are just another VTEP in the network, but they announce the default route.
 
>>Basically, you'll have the same config as a Proxmox exit-node. The same anycast IP must be present on the Juniper side too, and you need to add the Juniper routers' IPs to the EVPN peers list on the Proxmox side.

I was thinking more of an example of what's required on the PVE/SDN side as configuration for external exit nodes.
 
