SDN: BGP controllers share routes to IPs of VMs incorrectly

May 24, 2023
Hello,

We are currently trying to set up a new cluster with an EVPN zone and BGP controllers to share routes with our internal firewall.
The communication is working, but there seems to be a problem with the BGP routing: the nodes only seem to share routes to the IPs of VMs that run on the other nodes.

About our current setup:
The cluster consists of 3 nodes and a firewall:
Host          IP address
firewall      172.16.9.1
pve-green-01  172.16.9.11
pve-green-02  172.16.9.12
pve-green-03  172.16.9.13

We configured an EVPN Controller + Zone with a single VNET (testvnet / 192.168.0.0/24) to simplify testing. (Configuration is attached)
Additionally, we configured BGP controllers for all pve nodes to share the routing information with the firewall.

There are 3 VMs in the testvnet:
VM     IP address     PVE node
test1  192.168.0.101  pve-green-01
test2  192.168.0.102  pve-green-02
test3  192.168.0.103  pve-green-03

So, I would expect the Firewall to receive these routes:
from pve-green-01: 192.168.0.0/24, 192.168.0.101/32
from pve-green-02: 192.168.0.0/24, 192.168.0.102/32
from pve-green-03: 192.168.0.0/24, 192.168.0.103/32

But at the moment, the firewall receives these routes:
from pve-green-01: 192.168.0.0/24, 192.168.0.102/32, 192.168.0.103/32
from pve-green-02: 192.168.0.0/24, 192.168.0.101/32, 192.168.0.103/32
from pve-green-03: 192.168.0.0/24, 192.168.0.101/32, 192.168.0.102/32

So every node only shares routes to the IP addresses of VMs that it does not host itself, i.e. exactly the routes that should not come from that specific node.

Maybe this is caused by the routes that each node sees locally?
pve-green-01 only shows the routes to the VMs on the other nodes, but not the route to the VM running on the node itself:
Code:
root@pve-green-01:/etc/pve/sdn# vtysh -c "sh ip route vrf vrf_evpnzone"
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

VRF vrf_evpnzone:
C>* 10.255.255.0/30 is directly connected, xvrfp_evpnzone, 00:01:07
C>* 192.168.0.0/24 is directly connected, testvnet, 00:45:52
B>* 192.168.0.102/32 [200/0] via 172.16.9.12, vrfbr_evpnzone onlink, weight 1, 00:26:42
B>* 192.168.0.103/32 [200/0] via 172.16.9.13, vrfbr_evpnzone onlink, weight 1, 00:26:33


Communication between VMs in the EVPN network, and also to and from networks outside of the pve cluster, is working, but the routing isn't ideal.
Because of the routes received from the pve cluster, the firewall tries to route packets to the VM test1 via pve-green-02 or pve-green-03, but never routes the packets directly to pve-green-01.
Is there a way to make the firewall route the packets directly to the correct nodes?
 

Attachments

  • controllers.cfg.txt (397 bytes)
  • pveversion.txt (1.5 KB)
  • subnets.cfg.txt (69 bytes)
  • vnets.cfg.txt (52 bytes)
  • zones.cfg.txt (189 bytes)
Hi,

I don't see how your firewall could receive routes, as you don't peer BGP or EVPN with your firewall?


The way it should be done:

If your firewall supports EVPN: peer the EVPN controller directly with it, then configure the exit node on the firewall directly.

If your firewall doesn't support EVPN: you need to define 1 exit node (or 2 for redundancy), then create a BGP controller for each exit node and peer BGP with your firewall IP.
 
Our firewall doesn't support EVPN, so we created BGP controllers on all 3 nodes:
(controllers.cfg)
Code:
bgp: bgppve-green-01
    asn 65000
    node pve-green-01
    peers 172.16.9.1
    bgp-multipath-as-path-relax 0
    ebgp 0

bgp: bgppve-green-02
    asn 65000
    node pve-green-02
    peers 172.16.9.1
    bgp-multipath-as-path-relax 0
    ebgp 0

bgp: bgppve-green-03
    asn 65000
    node pve-green-03
    peers 172.16.9.1
    bgp-multipath-as-path-relax 0
    ebgp 0

They are also configured as exit nodes in the evpn zone:
(zones.cfg)
Code:
 evpn: evpnzone
    controller evpnctrl
    vrf-vxlan 1
    exitnodes pve-green-01,pve-green-02,pve-green-03
    ipam pve
    mac BC:24:11:1C:8F:59
    mtu 1500
    nodes pve-green-03,pve-green-02,pve-green-01

With the current configuration, the firewall receives the routes sent by the pve cluster, but each node still only sends routes for the IPs of VMs that run on the other nodes.
 
Oh, sorry, I didn't see the BGP controllers with the firewall IP as peer.


>>So every node only shares routes to the IP addresses of VMs that it does not host itself, i.e. exactly the routes that should not come from that specific node.
>>
>>Maybe this is caused by the routes that each node sees locally?
>>pve-green-01 only shows the routes to the VMs on the other nodes, but not the route to the VM running on the node itself:

This is normal: the local VM IPs are bridged on the node itself, so no route is needed to reach them.
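As a hedged illustration (the lookup command is standard FRR vtysh; the expected result is inferred from the route table above): on pve-green-01 the locally hosted VM IP should resolve via the connected vnet subnet route, not via a /32 host route.
Code:
# Hypothetical check on pve-green-01: the locally hosted VM IP is expected to
# resolve via the connected 192.168.0.0/24 route on the testvnet bridge,
# not via a BGP/EVPN /32 host route.
vtysh -c "show ip route vrf vrf_evpnzone 192.168.0.101"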


>>Because of the routes received from the pve cluster, the firewall tries to route packets to the VM test1 via pve-green-02 or pve-green-03, but never routes the packets directly to pve-green-01.
>>Is there a way to make the firewall route the packets directly to the correct nodes?

AFAIK, this is not possible. (As we don't have any local route to redistribute.)

In a "real" EVPN network, the exit nodes are separate physical machines. We added support for using the Proxmox nodes themselves, to avoid the need for extra machines. (And in a "real" EVPN network, the firewall should be the exit node, speaking EVPN directly.)

Generally, 2 exit nodes are enough even for a 20-node cluster. (You don't need one exit node per node.)

The only open-source firewall appliance supporting EVPN is VyOS:
https://vyos.io/

(pfSense and OPNsense are not compatible because EVPN is not implemented on BSD.)
 
AFAIK, this is not possible. (As we don't have any local route to redistribute.)
Thanks for the clarification, I thought this would be possible.

Generally, 2 exit nodes are enough even for a 20-node cluster. (You don't need one exit node per node.)
I removed pve-green-03 from the exit nodes, so only -01 and -02 remain as exit nodes. I also removed the BGP controller on pve-green-03.


Initially, we found this routing behaviour while troubleshooting issues after enabling the pve firewall.
In this scenario, the firewall routes packets for VM test1 to pve-green-02 and packets for VM test2 to pve-green-01, but the responses to these packets are routed directly back to the firewall.
This seems to confuse the firewall: the first SYN packet is passed, the SYN/ACK response too, but the following packets are dropped. Is this "normal" behaviour, or did we misconfigure something?
Should we synchronize firewall states between our exit nodes with conntrackd?


Today, we also found another hiccup with exit node local routing. When enabling this option for the zone, the exit nodes no longer distribute the routes to the EVPN networks via BGP.
After comparing the configurations, this import statement is missing from the config when exit node local routing is enabled:
Code:
router bgp 65000
 address-family ipv4 unicast
  import vrf vrf_evpnzone
But this seems intentional, as configuring it manually leads to another issue: the nodes can't access VMs on other nodes anymore, while connections to VMs on the same node as well as communication to the outside keep working.
Is there a way to enable both exit node local routing and the BGP distribution of the networks?
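For reference, a rough sketch of how the missing statement could be applied manually for testing (plain FRR vtysh commands; assumed to be non-persistent, since the SDN-generated FRR configuration will overwrite it on the next reload):
Code:
# Sketch: add the VRF import by hand in FRR's running config (not persistent)
vtysh -c "configure terminal" \
      -c "router bgp 65000" \
      -c "address-family ipv4 unicast" \
      -c "import vrf vrf_evpnzone"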


I know that our setup isn't ideal, but we currently can't get physically separate exit nodes. I thought about implementing the exit nodes as VMs that are directly attached to the underlying network, not to the SDN bridges. They could then act as exit nodes like physical machines would.
But I'm not sure if this is a good idea?
 
Initially, we found this routing behaviour while troubleshooting issues after enabling the pve firewall.
In this scenario, the firewall routes packets for VM test1 to pve-green-02 and packets for VM test2 to pve-green-01, but the responses to these packets are routed directly back to the firewall.
This seems to confuse the firewall: the first SYN packet is passed, the SYN/ACK response too, but the following packets are dropped. Is this "normal" behaviour, or did we misconfigure something?
For the VM firewall, it shouldn't have any impact (as it's done at the bridge level).
I really don't know whether you have a firewall protecting the pve hosts themselves.

But, in any case, if you use multiple exit nodes, you need to disable reverse path filtering:

https://pve.proxmox.com/pve-docs/chapter-pvesdn.html#_multiple_evpn_exit_nodes
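A minimal sketch of what that looks like (my reading of the linked documentation; the exact sysctl file location is a matter of taste):
Code:
# /etc/sysctl.conf (or a file under /etc/sysctl.d/) on each exit node:
# disable strict reverse path filtering, since traffic can leave via a
# different exit node than it entered through
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.all.rp_filter=0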

Today, we also found another hiccup with exit node local routing. When enabling this option for the zone, the exit nodes no longer distribute the routes to the EVPN networks via BGP.
The "exit node local routing" option was added because some users want to access VM IPs from the hypervisor management IP.
The two are in different VRFs; the normal setup, for security reasons, is to not allow access from the hypervisor IP to the VMs.

What is your use case for needing access to the VMs from the hypervisor exit node's IP?



After comparing the configurations, this import statement is missing from the config when exit node local routing is enabled:


Code:
router bgp 65000
 address-family ipv4 unicast
  import vrf vrf_evpnzone


But this seems intentional, as configuring it manually leads to another issue: the nodes can't access VMs on other nodes anymore, while connections to VMs on the same node as well as communication to the outside keep working.
I need to verify this, but "exit node local routing" is really a trick for specific setups. AFAIK, I think it was done this way to avoid routing loops.

Is there a way to enable both exit node local routing and the BGP distribution of the networks?
I'm not sure. (But you don't need exit node local routing for the BGP distribution to work.)


I know that our setup isn't ideal, but we currently can't get physically separate exit nodes. I thought about implementing the exit nodes as VMs that are directly attached to the underlying network, not to the SDN bridges. They could then act as exit nodes like physical machines would.
But I'm not sure if this is a good idea?
Yes, it should work. In the future, we are also looking into managing exit nodes with VMs: https://bugzilla.proxmox.com/show_bug.cgi?id=3382
 
What is your use case for needing access to the VMs from the hypervisor exit node's IP?
We need to access a variety of VMs from the host systems and the other way around, e.g. for the monitoring system, the local apt repository, the LDAP server and some more.

I'm not sure. (But you don't need exit node local routing for the BGP distribution to work.)
Yes, we don't need exit node local routing for BGP. But if we enable both at the same time, the nodes stop sharing the routes via BGP.

About using a VM as the exit node:
I implemented a PoC based on a Debian VM with FRR.
At the moment, this seems to work great. I'll have to do some more testing, but it should solve our issues.
Then we also don't need exit node local routing on the pve nodes, since they aren't exit nodes anymore.
(Except that this only moves the problem to the exit router VMs, but I'm still working on this.)
 
After testing some more with VMs as exit nodes, an issue with the pve firewall appeared. Packets are dropped for stateful connections if the routing isn't symmetrical, i.e. traffic from the physical network to the EVPN VMs is routed via one exit router VM and traffic from the EVPN VMs to the physical network via the other exit router VM.

We found the issue is related to the pve firewall rule that drops packets with ctstate INVALID:
Code:
Chain PVEFW-FORWARD (1 references)
target     prot opt source               destination         
DROP       all  --  anywhere             anywhere             ctstate INVALID
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
PVEFW-FWBR-IN  all  --  anywhere             anywhere             PHYSDEV match --physdev-in fwln+ --physdev-is-bridged
PVEFW-FWBR-OUT  all  --  anywhere             anywhere             PHYSDEV match --physdev-out fwln+ --physdev-is-bridged
           all  --  anywhere             anywhere             /* PVESIG:qnNexOcGa+y+jebd4dAUqFSp5nw */

After inserting custom rules with iptables, the connection works as expected:
(VM IDs 10005 and 10006 are the exit routers)
Code:
Chain PVEFW-FORWARD (1 references)
target     prot opt source               destination         
ACCEPT     all  --  anywhere             anywhere             PHYSDEV match --physdev-out tap10005i0 --physdev-is-bridged
ACCEPT     all  --  anywhere             anywhere             PHYSDEV match --physdev-in tap10005i0 --physdev-is-bridged
ACCEPT     all  --  anywhere             anywhere             PHYSDEV match --physdev-out tap10006i0 --physdev-is-bridged
ACCEPT     all  --  anywhere             anywhere             PHYSDEV match --physdev-in tap10006i0 --physdev-is-bridged
DROP       all  --  anywhere             anywhere             ctstate INVALID
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
PVEFW-FWBR-IN  all  --  anywhere             anywhere             PHYSDEV match --physdev-in fwln+ --physdev-is-bridged
PVEFW-FWBR-OUT  all  --  anywhere             anywhere             PHYSDEV match --physdev-out fwln+ --physdev-is-bridged
           all  --  anywhere             anywhere             /* PVESIG:qnNexOcGa+y+jebd4dAUqFSp5nw */
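
For completeness, commands along these lines should reproduce the extra ACCEPT rules shown above (a sketch using the standard iptables physdev match; the tap interface names are derived from the exit router VM IDs):
Code:
# Sketch: accept bridged traffic to/from the exit router VMs' tap interfaces
# ahead of the ctstate INVALID drop in PVEFW-FORWARD
iptables -I PVEFW-FORWARD -m physdev --physdev-in  tap10005i0 --physdev-is-bridged -j ACCEPT
iptables -I PVEFW-FORWARD -m physdev --physdev-out tap10005i0 --physdev-is-bridged -j ACCEPT
iptables -I PVEFW-FORWARD -m physdev --physdev-in  tap10006i0 --physdev-is-bridged -j ACCEPT
iptables -I PVEFW-FORWARD -m physdev --physdev-out tap10006i0 --physdev-is-bridged -j ACCEPT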

This makes sense to me: the exit router that only sees the responses doesn't know that these packets belong to an already established connection, so they are in ctstate INVALID and get dropped.
But I think that in this scenario it is a valid use case to route asymmetrically, so I would like to add these rules to the pve firewall (or other rules, if someone can recommend a better approach).

But I expect these rules to be overwritten by pve-firewall, or at least be cleared after a reboot. Is there a way to add custom rules to the rules created by the pve firewall?
 
After testing some more with VMs as exit nodes, an issue with the pve firewall appeared. Packets are dropped for stateful connections if the routing isn't symmetrical.

But I expect these rules to be overwritten by pve-firewall, or at least be cleared after a reboot. Is there a way to add custom rules to the rules created by the pve firewall?
You can add "nf_conntrack_allow_invalid: 1" in host.fw.

And also add "net.ipv4.conf.default.rp_filter=0" to sysctl.conf.
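A minimal sketch of the host.fw part (assuming the usual per-node firewall file; the option name is the one quoted above):
Code:
# /etc/pve/nodes/<nodename>/host.fw
[OPTIONS]
nf_conntrack_allow_invalid: 1
The rp_filter sysctl is the same setting already sketched above for the multiple exit node case.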
 
>>We need to access a variety of VMs from the host systems and the other way around, e.g. for the monitoring system, the local apt repository, the LDAP server and some more.

Do you really need to access the VMs from every node of your pve cluster?
I mean, it's really only a problem on the exit nodes; on the other (non-exit) nodes it's not a problem. (You can add a route on the non-exit Proxmox nodes with the exit nodes as gateway and the VM networks as destination, as sketched below.)

Then put these VMs on non-exit nodes.
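A hedged sketch of what I understand that suggestion to mean (subnet and exit node IP taken from this thread; whether a plain static route in the default routing table is sufficient here is an assumption):
Code:
# Sketch: on a non-exit pve node, reach the EVPN subnet via one of the exit nodes
ip route add 192.168.0.0/24 via 172.16.9.11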
 
