exitnode local-routing breaks EVPN SDN

skukunin

Aug 20, 2024
Hello. Sorry, I'm not a network specialist but a DevOps engineer (I work at a higher level). I've spent about 4 days reading tons of manuals and fundamental documents, and I'm still confused about how EVPN SDN is supposed to work and why it doesn't work in my case. The goal is to have an L2 VXLAN for my VMs regardless of the node they are running on.

I have a bunch of questions, and hopefully someone can help me fill in the gaps and help with my network setup. Because I'm going to work with it a lot, I really want to dive deep so that there are no gaps.

I have a three-node cluster. Each node has only a single NIC, with a public /29 network assigned. I configured only an EVPN controller, with all peers using their public IPs. I added static routes to the peers via the gateway so that the BGP peers can connect. I configured each node as an exit node, so there is a gateway and SNAT configured, which is super useful.

I have already spent a couple of days on it, and so far it works fine: every VM can ping any other VM in the same VNet, regardless of the node. But problems start when I try to enable the "local-routing" setting to get access from my nodes.

Problem 1. It doesn't work. Node 1 still can't ping VMs on Node 2. VXLAN works fine, I see traffic on UDP port 4789. But I don't understand how it is supposed to work at all:

Both nodes have veths with 10.255.255.1 and 10.255.255.2. The ICMP packet with SRC IP 10.255.255.1 and DST IP 10.1.0.15 (VM on node 2) gets sent from node 1 to node 2. It's routed perfectly, and the VM receives it and replies with an ICMP packet with SRC IP 10.1.0.15 and DST IP 10.255.255.1. But 10.255.255.1 also exists on node 2, so the reply gets routed to it instead of going back to node 1. How is it supposed to work at all? I guess that to make it work, each node should have a unique IP address.

Is it supposed to work only with a single exit node?
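
The reasoning above can be written down as a toy model. This is only a sketch of my assumption, not of how PVE or the kernel actually implements it: it ignores VRFs, policy routing and NAT, and the "unique" addresses 10.255.1.1 and 10.255.2.1 are made up for illustration.

```python
from ipaddress import ip_address


def reply_lands_on(replying_node, dst, local_addrs):
    """Return the node that consumes a reply to `dst` sent by a VM on `replying_node`.

    Toy rule: a node that owns `dst` locally delivers the packet to itself and
    never forwards it; otherwise the packet goes to whichever node owns `dst`.
    """
    dst = ip_address(dst)
    if dst in local_addrs[replying_node]:
        return replying_node
    for node, addrs in local_addrs.items():
        if dst in addrs:
            return node
    return None  # nobody owns the address


# Setup as described in the post: both nodes carry the same veth addresses.
shared = {
    "node1": {ip_address("10.255.255.1"), ip_address("10.255.255.2")},
    "node2": {ip_address("10.255.255.1"), ip_address("10.255.255.2")},
}

# Hypothetical alternative: one unique address per node (made-up values).
unique = {
    "node1": {ip_address("10.255.1.1")},
    "node2": {ip_address("10.255.2.1")},
}

print(reply_lands_on("node2", "10.255.255.1", shared))  # node2
print(reply_lands_on("node2", "10.255.1.1", unique))    # node1
```

With the shared addresses the reply is consumed on node 2 and never reaches node 1, which matches what I observe. With a unique address per node the same rule would send it back to node 1.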

Problem 2. VM internet access stops working. While a VM can ping any address on the internet (L3), L4 traffic is broken. Here is a tcpdump of a DNS request from a VM:
Code:
root@nocix-kz-1:/home/customer# tcpdump -ni any host 192.187.107.16
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
05:58:33.416077 veth104i0 P   IP 10.1.0.2.53468 > 192.187.107.16.53: 44383+ A? ifconfig.me. (29)
05:58:33.416089 fwln104i0 Out IP 10.1.0.2.53468 > 192.187.107.16.53: 44383+ A? ifconfig.me. (29)
05:58:33.416090 fwpr104p0 P   IP 10.1.0.2.53468 > 192.187.107.16.53: 44383+ A? ifconfig.me. (29)
05:58:33.416090 platform In  IP 10.1.0.2.53468 > 192.187.107.16.53: 44383+ A? ifconfig.me. (29)
05:58:33.416131 vmbr0 Out IP 69.197.xxx.xx.53468 > 192.187.107.16.53: 44383+ A? ifconfig.me. (29)
05:58:33.416135 enp5s0f1 Out IP 69.197.xxx.xx.53468 > 192.187.107.16.53: 44383+ A? ifconfig.me. (29)
05:58:33.416328 enp5s0f1 In  IP 192.187.107.16.53 > 69.197.xxx.xx.53468: 44383 1/0/0 A 34.160.111.145 (45)
05:58:33.416328 vmbr0 In  IP 192.187.107.16.53 > 69.197.xxx.xx.53468: 44383 1/0/0 A 34.160.111.145 (45)
05:58:33.416360 xvrf_primary Out IP 192.187.107.16.53 > 10.1.0.2.53468: 44383 1/0/0 A 34.160.111.145 (45)
05:58:33.416363 xvrfp_primary In  IP 192.187.107.16.53 > 10.1.0.2.53468: 44383 1/0/0 A 34.160.111.145 (45)
05:58:33.416394 platform Out IP 192.187.107.16.392 > 10.1.0.2.53468: UDP, length 45
05:58:33.416400 fwpr104p0 Out IP 192.187.107.16.392 > 10.1.0.2.53468: UDP, length 45
05:58:33.416402 fwln104i0 P   IP 192.187.107.16.392 > 10.1.0.2.53468: UDP, length 45
05:58:33.416407 veth104i0 Out IP 192.187.107.16.392 > 10.1.0.2.53468: UDP, length 45
05:58:33.416437 veth104i0 P   IP 10.1.0.2 > 192.187.107.16: ICMP 10.1.0.2 udp port 53468 unreachable, length 81
05:58:33.416440 fwln104i0 Out IP 10.1.0.2 > 192.187.107.16: ICMP 10.1.0.2 udp port 53468 unreachable, length 81
05:58:33.416441 fwpr104p0 P   IP 10.1.0.2 > 192.187.107.16: ICMP 10.1.0.2 udp port 53468 unreachable, length 81
05:58:33.416441 platform In  IP 10.1.0.2 > 192.187.107.16: ICMP 10.1.0.2 udp port 53468 unreachable, length 81
05:58:33.416451 vmbr0 Out IP 10.1.0.2 > 192.187.107.16: ICMP 10.1.0.2 udp port 53468 unreachable, length 81
05:58:33.416454 enp5s0f1 Out IP 10.1.0.2 > 192.187.107.16: ICMP 10.1.0.2 udp port 53468 unreachable, length 81

The request passed through the `platform` interface, but the response comes back via `xvrf_primary`. It eventually reaches the `platform` interface, but notice how the source port changes between these two hops:

Code:
05:58:33.416363 xvrfp_primary In  IP 192.187.107.16.53 > 10.1.0.2.53468: 44383 1/0/0 A 34.160.111.145 (45)
05:58:33.416394 platform Out IP 192.187.107.16.392 > 10.1.0.2.53468: UDP, length 45

I guess it happens because xvrfp_primary knows nothing about the original packet. I still don't understand how it's supposed to work.
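
To spot this kind of rewrite without reading the dump by eye, a small script can compare consecutive hops. This is a minimal sketch that assumes the `tcpdump -ni any` line layout shown above; `parse` and `port_changes` are hypothetical helper names, and lines it cannot parse (for example the ICMP ones) are simply skipped.

```python
import re

# Matches lines such as:
# 05:58:33.416394 platform Out IP 192.187.107.16.392 > 10.1.0.2.53468: UDP, length 45
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<iface>\S+)\s+(?P<dir>In|Out|P)\s+IP\s+"
    r"(?P<src>\d+(?:\.\d+){3})\.(?P<sport>\d+)\s+>\s+"
    r"(?P<dst>\d+(?:\.\d+){3})\.(?P<dport>\d+):"
)


def parse(line):
    """Return a dict of fields for one tcpdump line, or None if it has no ports."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["sport"] = int(fields["sport"])
    fields["dport"] = int(fields["dport"])
    return fields


def port_changes(lines):
    """List (iface_before, iface_after, sport_before, sport_after) for consecutive
    packets with the same src, dst and dport but a different source port."""
    packets = [p for p in map(parse, lines) if p is not None]
    changes = []
    for prev, cur in zip(packets, packets[1:]):
        same_flow = (prev["src"], prev["dst"], prev["dport"]) == (
            cur["src"], cur["dst"], cur["dport"])
        if same_flow and prev["sport"] != cur["sport"]:
            changes.append((prev["iface"], cur["iface"], prev["sport"], cur["sport"]))
    return changes


dump = [
    "05:58:33.416363 xvrfp_primary In  IP 192.187.107.16.53 > 10.1.0.2.53468: 44383 1/0/0 A 34.160.111.145 (45)",
    "05:58:33.416394 platform Out IP 192.187.107.16.392 > 10.1.0.2.53468: UDP, length 45",
]
print(port_changes(dump))  # [('xvrfp_primary', 'platform', 53, 392)]
```

It only reports where the port changes, not why. It does confirm that the rewrite happens between `xvrfp_primary` and `platform`.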

Problem 3 (bonus). I don't understand the point of the VRF VXLAN ID, and why it's a required option. I have my VNet (`platform`) with an assigned ID. It has a bridge on each node, and the VMs are connected to it. I also see that vrfvx_primary and vrfbr_primary are created.

Sometimes `show bgp l2vpn evpn` returns routes to a VM both via the VNet ID and via the VRF VXLAN ID. I'm not sure if this causes any problem, but I saw some VNet packets going via the VRF VXLAN ID instead of my VNet ID. They end up on vrfbr_primary, which is not connected to any VM, so the packets might get lost.

If I check my VRF routes, I see that VM IPs are routed via vrfbr_primary, not via the `platform` bridge (which is connected to the VM itself). I see no way for packets to move from vrfbr_primary to the `platform` bridge: these two bridges are not connected, and each has its own VXLAN interface.

Code:
# ip route show vrf vrf_primary
10.1.0.10 nhid 304 via 63.141.xxx.xx dev vrfbr_primary proto bgp metric 20 onlink
10.1.0.12 nhid 303 via 63.141.xxx.xxx dev vrfbr_primary proto bgp metric 20 onlink
10.1.0.14 nhid 303 via 63.141.xxx.xxx dev vrfbr_primary proto bgp metric 20 onlink
10.1.0.15 nhid 304 via 63.141.xxx.xx dev vrfbr_primary proto bgp metric 20 onlink
10.1.0.36 nhid 303 via 63.141.xxx.xxx dev vrfbr_primary proto bgp metric 20 onlink
10.1.0.37 nhid 303 via 63.141.xxx.xxx dev vrfbr_primary proto bgp metric 20 onlink
10.1.0.185 nhid 304 via 63.141.xxx.xx dev vrfbr_primary proto bgp metric 20 onlink

How is this supposed to work? I've found in the code that in theory it was optional: https://github.com/proxmox/pve-network/blob/master/src/PVE/Network/SDN/Zones/EvpnPlugin.pm#L229.
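
For reference, here is how the routes above group by output device and next-hop ID. This is a minimal sketch that assumes the `ip route show vrf` layout shown above; `group_routes` is a hypothetical helper name, and the redacted next-hop addresses are kept exactly as posted.

```python
import re
from collections import defaultdict

ROUTE_RE = re.compile(
    r"^(?P<prefix>\S+) nhid (?P<nhid>\d+) via (?P<via>\S+) dev (?P<dev>\S+)"
)


def group_routes(text):
    """Group route prefixes by output device, then by next-hop ID."""
    by_dev = defaultdict(lambda: defaultdict(list))
    for line in text.splitlines():
        m = ROUTE_RE.match(line.strip())
        if m:
            by_dev[m["dev"]][int(m["nhid"])].append(m["prefix"])
    return {dev: dict(nexthops) for dev, nexthops in by_dev.items()}


routes = """\
10.1.0.10 nhid 304 via 63.141.xxx.xx dev vrfbr_primary proto bgp metric 20 onlink
10.1.0.12 nhid 303 via 63.141.xxx.xxx dev vrfbr_primary proto bgp metric 20 onlink
10.1.0.14 nhid 303 via 63.141.xxx.xxx dev vrfbr_primary proto bgp metric 20 onlink
10.1.0.15 nhid 304 via 63.141.xxx.xx dev vrfbr_primary proto bgp metric 20 onlink
10.1.0.36 nhid 303 via 63.141.xxx.xxx dev vrfbr_primary proto bgp metric 20 onlink
10.1.0.37 nhid 303 via 63.141.xxx.xxx dev vrfbr_primary proto bgp metric 20 onlink
10.1.0.185 nhid 304 via 63.141.xxx.xx dev vrfbr_primary proto bgp metric 20 onlink
"""

groups = group_routes(routes)
for dev, nexthops in groups.items():
    for nhid, prefixes in nexthops.items():
        print(dev, nhid, prefixes)
```

Every VM route in the VRF points at `vrfbr_primary`, split across two next-hop IDs (one per remote node). None of them points at `platform`.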

---
So far, my PVE SDN configuration is the following:

Code:
# cat /etc/pve/sdn/*
evpn: primary
    asn 65000
    peers 69.197.xxx.xx,63.141.xxx.xx,63.141.xxx.xxx

subnet: primary-10.1.0.0-22
    vnet platform
    gateway 10.1.0.1
    snat 1

vnet: platform
    zone primary
    tag 101

evpn: primary
    controller primary
    vrf-vxlan 100
    exitnodes nocix-kz-3,nocix-kz-2,nocix-kz-1
    exitnodes-local-routing 1
    ipam pve
    mac BC:24:11:3A:B9:6E

I'm trying to keep it simple and as close to stock as possible. I'm not sure why I'm having all of these problems, but it's a good teacher.

Thank you. Sorry if any of my questions are lame.
 
