[SOLVED] SDN with EVPN controller: Routing loop when using multiple exit nodes

May 24, 2023
Hi all,

I'm currently trying to set up a cluster using SDN with an EVPN controller and multiple exit nodes, but I can't get around a routing loop.
The packets I send get stuck in a loop between the two exit nodes and are never forwarded outside the cluster.
Each exit node has a default route pointing to the other exit node, so it forwards the packet to the other node instead of sending it out of the cluster.

I confirmed this using tcpdump on both nodes, which shows the same packet (same sequence number) over and over again (sometimes also TTL exceeded messages):
Code:
root@pve-red-01:~# tcpdump -i any icmp
[...]
11:24:21.140478 vrfvx_redzone In  IP 192.168.0.10 > 172.16.12.1: ICMP echo request, id 1442, seq 4, length 64
11:24:21.140478 vrfbr_redzone In  IP 192.168.0.10 > 172.16.12.1: ICMP echo request, id 1442, seq 4, length 64
11:24:21.140488 vrfbr_redzone Out IP 192.168.0.10 > 172.16.12.1: ICMP echo request, id 1442, seq 4, length 64
11:24:21.140490 vrfvx_redzone Out IP 192.168.0.10 > 172.16.12.1: ICMP echo request, id 1442, seq 4, length 64
11:24:21.140535 vrfvx_redzone In  IP 192.168.0.10 > 172.16.12.1: ICMP echo request, id 1442, seq 4, length 64
11:24:21.140535 vrfbr_redzone In  IP 192.168.0.10 > 172.16.12.1: ICMP echo request, id 1442, seq 4, length 64
11:24:21.140545 vrfbr_redzone Out IP 192.168.0.10 > 172.16.12.1: ICMP echo request, id 1442, seq 4, length 64
11:24:21.140546 vrfvx_redzone Out IP 192.168.0.10 > 172.16.12.1: ICMP echo request, id 1442, seq 4, length 64
11:24:21.140615 vrfvx_redzone In  IP 192.168.0.10 > 172.16.12.1: ICMP echo request, id 1442, seq 4, length 64
11:24:21.140615 vrfbr_redzone In  IP 192.168.0.10 > 172.16.12.1: ICMP echo request, id 1442, seq 4, length 64
11:24:21.140625 vrfbr_redzone Out IP 192.168.0.10 > 172.16.12.1: ICMP echo request, id 1442, seq 4, length 64
11:24:21.140627 vrfvx_redzone Out IP 192.168.0.10 > 172.16.12.1: ICMP echo request, id 1442, seq 4, length 64
11:24:21.140768 vrfvx_redzone In  IP 192.168.0.1 > 192.168.0.10: ICMP time exceeded in-transit, length 92
11:24:21.140768 vrfbr_redzone In  IP 192.168.0.1 > 192.168.0.10: ICMP time exceeded in-transit, length 92
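
A minimal sketch of how the repetition in a capture like the one above could be counted automatically. It assumes the text format printed by `tcpdump -i any icmp` as shown here; the function name `find_looping_echo_requests` is made up for illustration. A single ping normally crosses a given interface in a given direction once per sequence number, so a count above 1 hints at a loop.

```shell
# Reads `tcpdump -i any icmp` text on stdin and prints every
# (interface, direction, src, dst, id, seq) combination that was
# seen more than once, prefixed by how often it appeared.
find_looping_echo_requests() {
    awk '
        /ICMP echo request/ {
            # $2=interface $3=In/Out $5=source $7=destination $12=id $14=seq
            key = $2 " " $3 " " $5 " > " $7 " id " $12 " seq " $14
            gsub(/[,:]/, "", key)   # drop tcpdump punctuation
            count[key]++
        }
        END {
            for (k in count)
                if (count[k] > 1)
                    print count[k], k
        }
    ' | sort
}
```

Usage on a node could look like `tcpdump -i any -c 200 icmp 2>/dev/null | find_looping_echo_requests`; an empty result means no echo request was repeated within the capture.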

So my question is: is this a mistake in my configuration or a bug in the current PVE version (8.1.3)? What can I do to fix this?
I already found this old commit (April 2022) that seems to fix this issue. The configuration added in that commit seems to be present in my vtysh running config (see attached file), but I'm not familiar with this configuration.

For reference, I attached the complete SDN configuration from /etc/pve/sdn/.

Overview of nodes and IPs
Host                                  IP
Firewall                              172.16.12.1/25
pve-red-01                            172.16.12.11/25
pve-red-02                            172.16.12.12/25
pve-red-03                            172.16.12.13/25
Test VNet Gateway (SDN VNet Config)   192.168.0.1/24
Test VM                               192.168.0.10/24

Routes on all nodes
Code:
root@pve-red-01:/tmp/sdn# vtysh -c "sh ip route"
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

B>* 0.0.0.0/0 [200/0] via 172.16.12.12, vrfbr_redzone (vrf vrf_redzone) onlink, weight 1, 00:24:50
C>* 172.16.12.0/25 is directly connected, vmbr0, 00:25:26
C>* 172.16.12.128/25 is directly connected, bond1, 00:25:24
B>* 192.168.0.0/24 [20/0] is directly connected, test (vrf vrf_redzone), weight 1, 00:25:27

root@pve-red-02:~# vtysh -c "sh ip route"
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

B>* 0.0.0.0/0 [200/0] via 172.16.12.11, vrfbr_redzone (vrf vrf_redzone) onlink, weight 1, 00:06:00
C>* 172.16.12.0/25 is directly connected, vmbr0, 00:06:36
C>* 172.16.12.128/25 is directly connected, bond1, 00:06:39
B>* 192.168.0.0/24 [20/0] is directly connected, test (vrf vrf_redzone), weight 1, 00:06:39

root@pve-red-03:~# vtysh -c "sh ip route"
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

C>* 172.16.12.0/25 is directly connected, vmbr0, 00:06:44
C>* 172.16.12.128/25 is directly connected, bond1, 00:06:46

RP-Filter
Just to be sure, I checked the RP-Filter config; it is disabled on all nodes:
Code:
root@pve-red-01:~# sysctl -a | grep -P "net.ipv4.conf.(default|all).rp_filter"
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
 


Hi,

are you trying to ping from 192.168.0.1 (in the evpn network) to 192.168.0.10 (in the real network)?

If yes, that cannot work currently.
The only way would be to bridge the exit nodes at layer 2, but currently it's not possible to do that with multiple nodes. It would need a solution to do MLAG across Proxmox nodes, and there is no open-source solution for this.

You need different subnets for your evpn networks and your real networks.
 
Hi,

I am trying to ping from 192.168.0.10 (EVPN network) to 172.16.12.1 (real network). 192.168.0.1 is just the gateway in the EVPN network.
The real network is 172.16.12.0/25; the EVPN network is 192.168.0.0/24.
The ping between the EVPN and the real network works as expected as long as I configure only one exit node. But if I add another exit node, the traffic seems to be stuck in a loop between both exit nodes.
 
ah ok, sorry.

So you are trying to ping your firewall IP 172.16.12.1?

You should have a route on the firewall like "ip route 192.168.0.0/24 gw 172.16.12.11", because it can't see the evpn subnets by itself and doesn't know where to send the replies.

The best way is to run BGP between the exit nodes and your firewall to announce the evpn subnets dynamically.
 
Yes, I am trying to ping the firewall IP. The route is already configured, pointing to 172.16.12.11. The ping works when I configure only pve-red-01 as exit node, so I am sure the underlying network is configured correctly.

But the problem is: once I configure a second exit node, the packet doesn't leave the cluster. It doesn't show up in the firewall logs, and I can see the packet looping between both nodes with "tcpdump -i any icmp".
I think the cause is that the exit nodes accept default routes to each other (see the vtysh -c "sh ip route" output in the original post), but I don't understand yet why this happens.

About BGP: currently it's a static route, but BGP is planned once I have time to configure it.
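
To make the "default routes to each other" check repeatable, here is a rough sketch. It assumes the `vtysh -c "sh ip route"` line format shown in the original post; `bgp_default_nexthop` and `is_mutual_default` are hypothetical helper names, not part of FRR or Proxmox.

```shell
# Prints the next hop of the selected BGP default route found in
# `vtysh -c "sh ip route"` output read from stdin (nothing if absent).
bgp_default_nexthop() {
    awk '$1 ~ /^B>/ && $2 == "0.0.0.0/0" && $4 == "via" {
        sub(/,$/, "", $5)   # strip the trailing comma after the address
        print $5
    }'
}

# Succeeds (exit status 0) when node A's default points at node B
# and node B's default points back at node A.
# Arguments: A's address, A's next hop, B's address, B's next hop.
is_mutual_default() {
    [ "$2" = "$3" ] && [ "$4" = "$1" ]
}
```

On each exit node, `vtysh -c "sh ip route" | bgp_default_nexthop` gives the next hop; with the output from the original post, `is_mutual_default 172.16.12.11 172.16.12.12 172.16.12.12 172.16.12.11` succeeds, which matches the loop seen in tcpdump.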
 
Hmm, it's strange that it's looping.

You should see the incoming ping on the firewall.
The default route (vtysh -c "sh ip route") is only a BGP route (B>*), and it's not injected into the kernel (K>*).
#ip route show on the host should show the kernel routes.

But even if the default route were injected, the firewall is in the same subnet as the Proxmox nodes, so they should see its IP/MAC in their ARP table, and that should have priority over the default route.

I'll try to reproduce this on my side. Maybe it's a kernel bug; this looks really strange.

I'll reply in this thread next week.
 
The output of vtysh -c "sh bgp l2vpn evpn" is quite long, so I attached it as files.

I also checked the output of ip route show for each host.
One thing I noticed is that the exit nodes (01+02) have routes to the individual IPs and the subnet in the evpn network (192.168.x.x), but the non-exit node 03 doesn't. It only shows routes in the real network (172.16.x.x).

So I tried to add 03 as a third exit node. After this change, that host also sees individual IPs in the evpn subnet. Is this the correct behaviour?

Node 01 (Exit Node)
Code:
root@pve-red-01:~# ip route show
default via 172.16.12.1 dev vmbr0 proto kernel onlink
default nhid 33 via 172.16.12.12 dev vrfbr_redzone proto bgp metric 20 onlink
172.16.12.0/25 dev vmbr0 proto kernel scope link src 172.16.12.11
172.16.12.128/25 dev bond1 proto kernel scope link src 172.16.12.131
192.168.0.0/24 nhid 4 dev test proto bgp metric 20
192.168.0.11 nhid 33 via 172.16.12.12 dev vrfbr_redzone proto bgp metric 20 onlink
192.168.0.12 nhid 31 via 172.16.12.13 dev vrfbr_redzone proto bgp metric 20 onlink

Node 02 (Exit Node)
Code:
root@pve-red-02:~# ip route show
default via 172.16.12.1 dev vmbr0 proto kernel onlink
default nhid 19 via 172.16.12.11 dev vrfbr_redzone proto bgp metric 20 onlink
172.16.12.0/25 dev vmbr0 proto kernel scope link src 172.16.12.12
172.16.12.128/25 dev bond1 proto kernel scope link src 172.16.12.132
192.168.0.0/24 nhid 81 dev test proto bgp metric 20
192.168.0.10 nhid 19 via 172.16.12.11 dev vrfbr_redzone proto bgp metric 20 onlink
192.168.0.12 nhid 76 via 172.16.12.13 dev vrfbr_redzone proto bgp metric 20 onlink

Node 03 (non-exit node): doesn't show routes to evpn network
Code:
root@pve-red-03:~# ip route show
default via 172.16.12.1 dev vmbr0 proto kernel onlink
172.16.12.0/25 dev vmbr0 proto kernel scope link src 172.16.12.13
172.16.12.128/25 dev bond1 proto kernel scope link src 172.16.12.133

Node 03 (as the third exit node): shows routes to evpn network
Code:
root@pve-red-03:~# ip route show
default via 172.16.12.1 dev vmbr0 proto kernel onlink
default nhid 93 proto bgp metric 20
    nexthop via 172.16.12.11 dev vrfbr_redzone weight 1 onlink
    nexthop via 172.16.12.12 dev vrfbr_redzone weight 1 onlink
172.16.12.0/25 dev vmbr0 proto kernel scope link src 172.16.12.13
172.16.12.128/25 dev bond1 proto kernel scope link src 172.16.12.133
192.168.0.0/24 nhid 103 dev test proto bgp metric 20
192.168.0.10 nhid 28 via 172.16.12.11 dev vrfbr_redzone proto bgp metric 20 onlink
192.168.0.11 nhid 86 via 172.16.12.12 dev vrfbr_redzone proto bgp metric 20 onlink
 


#ip route show only shows the kernel routes in the default vrf.

To show the kernel routes inside a vrf (a zone in Proxmox), you need to run

#ip route show vrf vrf_<zone>

The exit nodes import routes from vrf_<zone> into the default vrf (to be able to route from the real network (default vrf) to the evpn network).
That's why you don't see evpn routes inside the default vrf on non-exit nodes.
This is the correct behaviour.
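
As a small convenience, the per-zone lookup can be wrapped like this. This is only a sketch built on the vrf_<zone> naming described above; the helper names `zone_vrf` and `show_zone_routes` are invented here, and the second one needs iproute2 on the node.

```shell
# Maps an SDN zone name to its VRF device name, following the
# vrf_<zone> convention mentioned above.
zone_vrf() {
    printf 'vrf_%s\n' "$1"
}

# Prints the kernel routes inside a zone's VRF.
# The zone name is the only argument.
show_zone_routes() {
    ip route show vrf "$(zone_vrf "$1")"
}
```

For the zone in this thread, `show_zone_routes redzone` would run `ip route show vrf vrf_redzone`, which is where the evpn default route should be visible on every node.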
 
Hmm, something looks wrong in "sh bgp l2vpn evpn".

It's as if "exitnodes-primary pve-red-01" is not used.
In "sh bgp l2vpn evpn", you have 0.0.0.0 routes announced by the exit nodes. When a primary exit node is configured, the routes from the other exit nodes should have "metric = 200", and I don't see that.

Can you send the content of /etc/frr/frr.conf from each node?

And also: what is your pve version (and pve-network package version)? #pveversion -v
 
Also, another bug (but I think it's on my side): an exit node imports the default route announced by the other exit node from the evpn vrf into the default vrf. That's wrong, really wrong... we should only import the evpn subnets.

That's why we have a loop here.

Code:
root@pve-red-01:~# ip route show
default nhid 33 via 172.16.12.12 dev vrfbr_redzone proto bgp metric 20 onlink

Code:
root@pve-red-02:~# ip route show
default nhid 19 via 172.16.12.11 dev vrfbr_redzone proto bgp metric 20 onlink

I'll look at the code. Maybe it's a regression in filtering in frr; I'm really not sure.
 
pve 8.1.3 / libpve-network-perl 0.9.4 (complete output in pveversion.txt)
frr configs are attached

Also, I would prefer not to define a primary exit node, but to use two or three exit nodes in parallel.
However, the option isn't optional in the current version of the GUI (at least for me), although I remember it being optional in the past.
 


Ok, I'm able to reproduce.

It's an frr bug / regression with evpn route-map filtering.
In /etc/frr/frr.conf:
Code:
route-map MAP_VTEP_IN deny 1
 match evpn vni 1
 match evpn route-type prefix
exit

The "match evpn" is not working anymore.

That's why we have these buggy default routes imported between the exit nodes:
https://github.com/FRRouting/frr/issues/14419

I'm going to debug this.
 
Hi,

I have found the problem. It really is a bug in frr, and it's currently not fixed.
I have worked around it by generating a different configuration.


Can you try this patch:


wget https://mutulin1.odiso.net/libpve-network-perl_0.9.5_all.deb
dpkg -i libpve-network-perl_0.9.5_all.deb

on each node, then regenerate the sdn config.

You should then have something like this in /etc/frr/frr.conf on the exit nodes:


on primary exit-node
Code:
!
ip prefix-list only_default seq 1 permit 0.0.0.0/0
!
ipv6 prefix-list only_default_v6 seq 1 permit ::/0
!
route-map MAP_VTEP_IN deny 1
 match ip address prefix-list only_default
exit
!
route-map MAP_VTEP_IN deny 2
 match ip address prefix-list only_default_v6
exit
!
route-map MAP_VTEP_IN permit 3
exit
!
route-map MAP_VTEP_OUT permit 1
exit
!

on secondary exit-node

Code:
!
ip prefix-list only_default seq 1 permit 0.0.0.0/0
!
ipv6 prefix-list only_default_v6 seq 1 permit ::/0
!
route-map MAP_VTEP_IN permit 1
exit
!
route-map MAP_VTEP_OUT permit 1
 match ip address prefix-list only_default
 set metric 200
exit
!
route-map MAP_VTEP_OUT permit 2
 match ipv6 address prefix-list only_default_v6
 set metric 200
exit
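
To check which variant a node actually ended up with after regenerating, something like the following could be used. It is a sketch that only looks at the route-map lines quoted in this thread; the function name `vtep_in_filter_style` and its three result strings are made up for illustration.

```shell
# Classifies the MAP_VTEP_IN filter in an frr.conf-style file ($1).
# Prints "evpn-match"  if a MAP_VTEP_IN entry still uses "match evpn",
#        "prefix-list" if it matches the only_default prefix-list,
#        "unknown"     otherwise.
vtep_in_filter_style() {
    awk '
        /^route-map / { in_map = ($2 == "MAP_VTEP_IN") }
        /^exit/       { in_map = 0 }
        in_map && /^[ \t]*match evpn /                               { evpn = 1 }
        in_map && /^[ \t]*match ip address prefix-list only_default/ { plist = 1 }
        END {
            if (evpn)       print "evpn-match"
            else if (plist) print "prefix-list"
            else            print "unknown"
        }
    ' "$1"
}
```

Running `vtep_in_filter_style /etc/frr/frr.conf` on the primary exit node should print "prefix-list" with the patched package. On a secondary exit node, MAP_VTEP_IN is a plain permit in the config above, so "unknown" would be the expected result there.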



Previously, the buggy conf was

Code:
route-map MAP_VTEP_IN deny 1
 match evpn vni 10000
 match evpn route-type prefix
 ...
exit
!
 
Hi,

Thanks for debugging this issue. I just got to test the patch today. It works like a charm; the config is as expected and the network behaves as expected.
As this will probably break on the next upgrade, is there anything I can watch to find out when the fix has made it into the official release?
 
Hey guys,

I saved this thread for when I upgraded to Prox8. First cluster: no problem, the upgrade went fairly seamlessly. Second cluster: a bit more complicated, as it has 2 different EVPN zones with separate exit nodes that share an EVPN controller:

Zones (EVPN):
* VM100
* VM200

VNETS:
* VM100 has 1 /24
* VM200 has 2 with multiple /26's

CONTROLLERS:
* bgphostA - BGP (peers with border leaf) <-- Exit node for VM100
* bgphostB - BGP (peers with border leaf) <-- Exit node for VM200
(note: i've removed my redundant exit nodes until upgrade is done)
* evpncontroller - used for both zones, each host in my cluster is peered (9 total)

In Prox7, I had to adjust /usr/share/perl5/PVE/Network/SDN/Controllers/EvpnPlugin.pm (around line 169) on exit nodes to:
Perl:
push @{$routemap_config}, "match evpn vni $vrfvxlan";

This way, if I had a VM using VM100 on hostB (the exit node for VM200), traffic would route to the correct exit node. For example:
Code:
route-map MAP_VTEP_IN deny 1
 match evpn vni 10301
 match evpn route-type prefix
exit
!
route-map MAP_VTEP_IN permit 2
exit
!
route-map MAP_VTEP_OUT permit 1
exit
!

With the updated config filtering out all default routes, the exit node for VM200 cannot host a VM attached to VM100, as traffic cannot exit since there is no default route in the vrf_VM100 routing table.

Any suggestions on how I should handle this? I've got some ideas, but I hate modifying native code, as it's always a problem with updates.

For now, I've reverted the exit node to a host still using Prox7, but I would like to get this all upgraded this week.

On a separate note, I've noticed a couple of times, including in one of the messages above, set metric 200 being used on a secondary exit host. With BGP the default local preference is 100, so if you're setting the secondary node to 200, traffic is going to prefer that (highest number is preferred):

on secondary exit-node
Code:
!
ip prefix-list only_default seq 1 permit 0.0.0.0/0
!
ipv6 prefix-list only_default_v6 seq 1 permit ::/0
!
route-map MAP_VTEP_IN permit 1
exit
!
route-map MAP_VTEP_OUT permit 1
 match ip address prefix-list only_default
 set metric 200
exit
!
route-map MAP_VTEP_OUT permit 2
 match ipv6 address prefix-list only_default_v6
 set metric 200

Love all the work you do Spirit and prox team! We continue to push the limits of Prox everyday and it never disappoints!
 
Hi,
There was a regression in frr when pve8 was released, and "match evpn vni x" was not working anymore,
so I replaced it with "match ip address prefix-list only_default".

https://git.proxmox.com/?p=pve-network.git;a=commit;h=e614da43f13e3c61f9b78ee9984364495eff91b6

I think this is fixed in frr now. Have you tried adding "match evpn vni x" again?

Technically, you don't need to edit the Perl code; you could create a
/etc/frr/frr.conf.local

containing, for example, only:

Code:
route-map MAP_VTEP_OUT permit 1
   match evpn vni 1234
exit


It should be merged into /etc/frr/frr.conf when you generate the sdn config.
 
Cool, thanks for your swift reply.

Yeah, it looks like they did patch it: https://github.com/FRRouting/frr/issues/14419, so I'm assuming it should be good now.

I've got to schedule a maintenance window and I'll give it a test (full production here, so I have to make changes gently).

I'll report back with the results regardless!

Update: Did a quick test. It looks like anything added to /etc/frr/frr.conf.local is appended, so the new entry gets a sequence number after the existing permit or deny (depending on the rule).

Using your example:

Code:
route-map MAP_VTEP_OUT permit 1
   match evpn vni 1234
exit

Is appended like this:
Code:
!
ip prefix-list only_default seq 1 permit 0.0.0.0/0
!
ipv6 prefix-list only_default_v6 seq 1 permit ::/0
!
route-map MAP_VTEP_IN deny 1
 match ip address prefix-list only_default
exit
!
route-map MAP_VTEP_IN deny 2
 match ipv6 address prefix-list only_default_v6
exit
!
route-map MAP_VTEP_IN permit 3
exit
!
route-map MAP_VTEP_OUT permit 1
exit
!
route-map MAP_VTEP_OUT permit 2
 match evpn vni 1234
exit
!
line vty

Given this is production, I'm pretty limited in the tests I can do live. Perhaps for now the quickest solution is to just modify the EvpnPlugin.pm script as we did previously?

Replace:
Code:
push @{$routemap_config}, "match ip address prefix-list only_default";

With:
Code:
push @{$routemap_config}, "match evpn vni $vrfvxlan";
 

Yeah, inserting rules is quite difficult. What I do is remove the default route-map and replace it with a fully custom one:

Code:
router bgp 65000

 address-family l2vpn evpn

  no neighbor VTEP route-map MAP_VTEP_IN in

  neighbor VTEP route-map MAP_VTEP_IN_CUSTOM in

 exit-address-family

exit

Keep in mind that you then have to manage that one fully manually.
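
Putting the pieces from this thread together, a complete local snippet might be generated roughly as follows. This is an untested sketch: the route-map body simply reuses the "match evpn vni" filter quoted earlier under the new name MAP_VTEP_IN_CUSTOM, the ASN and VNI are placeholders to replace with your own values, and `write_custom_vtep_in` is a hypothetical helper name.

```shell
# Writes a frr.conf.local-style snippet to the path given as $1.
# $2 is the BGP ASN and $3 the VNI to filter; both are placeholders.
write_custom_vtep_in() {
    cat > "$1" <<EOF
router bgp $2
 address-family l2vpn evpn
  no neighbor VTEP route-map MAP_VTEP_IN in
  neighbor VTEP route-map MAP_VTEP_IN_CUSTOM in
 exit-address-family
exit
!
route-map MAP_VTEP_IN_CUSTOM deny 1
 match evpn vni $3
 match evpn route-type prefix
exit
!
route-map MAP_VTEP_IN_CUSTOM permit 2
exit
EOF
}
```

Writing to a scratch file first, for example `write_custom_vtep_in /tmp/frr.conf.local 65000 10301`, makes it possible to review the result before copying it to /etc/frr/frr.conf.local and regenerating the sdn config.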
 
