EVPN SDN not working across multiple nodes

lunam

I'm trying to set up an EVPN-based SDN across two Proxmox nodes. Both nodes have EVPN configured (they have peered and are exchanging prefixes). The nodes also peer over standard BGP with my switch via the BGP controller, and the switch is receiving the subnets. The switch is configured for ECMP, and I can see multiple paths to each route:

[screenshot: switch routing table with ECMP paths] What is weird is that for the /32 routes, the next hop is the PVE host where the VM is not hosted, i.e. 10.1.0.3/32 should be on 10.0.50.3.

My VMs can ping out to the internet and to anything on my network, but VMs on two different nodes can't ping each other. I also can't ping the VMs from my external network. I have made sure the firewall is disabled, so that isn't the issue. Here is the zone config:

[screenshot: EVPN zone configuration]

..and here is the EVPN controller config:

[screenshot: EVPN controller configuration]

..and here is the BGP controller config:

[screenshot: BGP controller configuration]

The VNets are nothing special, with source NAT disabled. I'm not sure where to go from here.
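A few things I still plan to check (just a sketch of ideas; vmbr0 here stands for whatever the uplink bridge on the nodes is, and the vtysh calls are standard FRR show commands):

Bash:
# Watch for VXLAN-encapsulated traffic (default UDP port 4789) on the underlay
# while the two VMs ping each other - if nothing shows up on either node,
# the overlay traffic never leaves the source node.
tcpdump -ni vmbr0 udp port 4789

# Check that each node lists the VNet's VNI with the other node as a remote
# VTEP, and that the EVPN peering is established.
vtysh -c "show evpn vni"
vtysh -c "show bgp l2vpn evpn summary"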
 
Weirdly enough, sometimes I do actually get a ping response and then it goes quiet:
[screenshot: intermittent ping replies]
 

It seems this is an issue with ECMP? After removing the BGP controller for the second node, I can now access the machines on the first node but not on the second. It's as if the VXLAN isn't actually working between the nodes.
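One underlay check worth doing before digging further into ECMP (a rough sketch; it assumes the default VNet MTU of 1450 on a 1500-byte physical network, and 10.0.50.3 stands in for the other node's underlay address):

Bash:
# Don't-fragment ping from one node to the other node's underlay IP:
# 1472 bytes of ICMP payload + 28 bytes of headers = a full 1500-byte packet.
# If this fails while smaller pings work, the underlay can't carry the extra
# ~50 bytes of VXLAN overhead and encapsulated packets get dropped.
ping -M do -s 1472 10.0.50.3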
 
Hi, did you find a solution? Unfortunately I don't have one either, but I encountered a very similar problem on a simpler setup that should be easy to reproduce.
Maybe someone will be able to spot what we are missing here.

I have two Proxmox nodes, Dev1 (10.2.5.1/24) and Dev2 (10.2.5.2/24).

My SDN configuration is:

Code:
Dev1:~# cat /etc/pve/sdn/*.cfg
evpn: control1
        asn 65655
        peers 10.2.5.1,10.2.5.2


subnet: zone1-10.2.99.0-24
        vnet z1vnet99
        gateway 10.2.99.100


subnet: zone1-10.2.98.0-24
        vnet z1vnet98
        gateway 10.2.98.100


vnet: z1vnet99
        zone zone1
        tag 10


vnet: z1vnet98
        zone zone1
        tag 20


evpn: zone1
        controller control1
        vrf-vxlan 222
        advertise-subnets 1
        ipam pve
        mac 52:E6:87:55:FC:13

Then I created 3 containers
  • Dev1:
    • CT102 (z1vnet99 10.2.99.1/24, gw: 10.2.99.100)
    • CT104 (z1vnet98 10.2.98.1/24, gw: 10.2.98.100)
  • Dev2:
    • CT103 (z1vnet98 10.2.98.2/24, gw: 10.2.98.100)

I would expect all CTs to be reachable from each other, because they all share the same zone, but that is not the case.

This works as expected (in both directions):
  • Ping from CT104 to CT103 works OK - same network across two nodes - which suggests the VXLAN part is working, because packets have to go through the underlay network between Dev1 and Dev2
  • Ping from CT104 to CT102 works OK - routing between networks on the same node is working

But when I try to combine both scenarios, it fails:
  • Ping from CT102 to CT103 fails with Destination Host Unreachable - it seems the controller does not know where to route, but the VRF route table is populated and working, and the VXLAN should also work. What am I missing here? (I sketch a few checks below.)
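Here is what I would check on Dev1 (standard FRR and iproute2 show commands, nothing specific to my setup):

Bash:
# Did Dev1 learn CT103's MAC/IP (10.2.98.2 on Dev2) as an EVPN type-2 route?
vtysh -c "show bgp l2vpn evpn route type macip"

# With symmetric routing, a /32 host route towards 10.2.98.2 via the L3 VNI
# would be expected in the zone VRF on Dev1:
ip route show vrf vrf_zone1 | grep 10.2.98.2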

I've only been trying to grasp EVPN-VXLAN for a few days now, so it could totally be some stupid config error, but I am not able to spot a reason why this config should not work. Am I missing something obvious, or is it a bug of some sort?

Bash:
Dev1:~# pveversion
pve-manager/7.4-3/9002ab8a (running kernel: 5.15.104-1-pve)

Bash:
Dev1:~# ip route show vrf vrf_zone1
unreachable default metric 4278198272
10.2.98.0/24 dev z1vnet98 proto kernel scope link src 10.2.98.100
10.2.99.0/24 dev z1vnet99 proto kernel scope link src 10.2.99.100

Bash:
Dev1:~# cat /etc/network/interfaces.d/sdn
#version:31


auto vrf_zone1
iface vrf_zone1
        vrf-table auto
        post-up ip route add vrf vrf_zone1 unreachable default metric 4278198272


auto vrfbr_zone1
iface vrfbr_zone1
        bridge-ports vrfvx_zone1
        bridge_stp off
        bridge_fd 0
        mtu 1450
        vrf vrf_zone1


auto vrfvx_zone1
iface vrfvx_zone1
        vxlan-id 222
        vxlan-local-tunnelip 10.2.5.1
        bridge-learning off
        bridge-arp-nd-suppress on
        mtu 1450


auto vxlan_z1vnet98
iface vxlan_z1vnet98
        vxlan-id 20
        vxlan-local-tunnelip 10.2.5.1
        bridge-learning off
        bridge-arp-nd-suppress on
        mtu 1450


auto vxlan_z1vnet99
iface vxlan_z1vnet99
        vxlan-id 10
        vxlan-local-tunnelip 10.2.5.1
        bridge-learning off
        bridge-arp-nd-suppress on
        mtu 1450


auto z1vnet98
iface z1vnet98
        address 10.2.98.100/24
        hwaddress 52:E6:87:55:FC:13
        bridge_ports vxlan_z1vnet98
        bridge_stp off
        bridge_fd 0
        mtu 1450
        ip-forward on
        arp-accept on
        vrf vrf_zone1


auto z1vnet99
iface z1vnet99
        address 10.2.99.100/24
        hwaddress 52:E6:87:55:FC:13
        bridge_ports vxlan_z1vnet99
        bridge_stp off
        bridge_fd 0
        mtu 1450
        ip-forward on
        arp-accept on
        vrf vrf_zone1


Bash:
Dev1:~# cat /etc/frr/frr.conf
frr version 8.2.2
frr defaults datacenter
hostname Dev1
log syslog informational
service integrated-vtysh-config
!
!
vrf vrf_zone1
 vni 222
exit-vrf
!
router bgp 65655
 bgp router-id 10.2.5.1
 no bgp default ipv4-unicast
 coalesce-time 1000
 neighbor VTEP peer-group
 neighbor VTEP remote-as 65655
 neighbor VTEP bfd
 neighbor 10.2.5.2 peer-group VTEP
 !
 address-family l2vpn evpn
  neighbor VTEP route-map MAP_VTEP_IN in
  neighbor VTEP route-map MAP_VTEP_OUT out
  neighbor VTEP activate
  advertise-all-vni
 exit-address-family
exit
!
router bgp 65655 vrf vrf_zone1
 bgp router-id 10.2.5.1
 !
 address-family ipv4 unicast
  redistribute connected
 exit-address-family
 !
 address-family ipv6 unicast
  redistribute connected
 exit-address-family
 !
 address-family l2vpn evpn
  advertise ipv4 unicast
  advertise ipv6 unicast
 exit-address-family
exit
!
route-map MAP_VTEP_IN permit 1
exit
!
route-map MAP_VTEP_OUT permit 1
exit
!
line vty

================

Just noticed: when the Disable Arp-Nd Suppression option is enabled, it suddenly starts working. And when it is turned off again and the zone MAC address is changed (to fully forget the old state), it stops working again. So would that indicate that some part of MAC discovery or EVPN multicast is not working properly?

Code:
Disable Arp-Nd Suppression
Optional. Don't suppress ARP or ND packets. This is required if you use floating IPs in your guest VMs (IP and MAC addresses are being moved between systems).
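If it helps, I think the kernel side of this can be checked per bridge port (a sketch; as far as I understand, ifupdown2's bridge-arp-nd-suppress option maps to the kernel's neigh_suppress flag on the VXLAN bridge port):

Bash:
# Detailed bridge-port settings for the VNet's VXLAN interface;
# "neigh_suppress on" means ARP/ND requests are answered locally
# instead of being flooded over the VXLAN.
bridge -d link show dev vxlan_z1vnet98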
 
@wardpire

"disable arp-nd suppression" shouldn't be necessary, until you have some some vip, where the ip is moving to a new node, but not the mac address.

FRR learns MAC && IP by looking at the ARP table of the host and the bridge MAC address table of the vnet.

Disabling arp-nd suppression allows the ARP && ND packets to be forwarded over the VXLAN (so other nodes can see the new MAC/address location).
This is normally not needed, as MAC && IP are exchanged through the BGP protocol.
(Disabling it helps when a VIP moves without its MAC, for faster learning; otherwise you have to wait ~30s for a refresh, until the ARP table entry expires on the source host.)
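you can compare what frr has learned with what the kernel sees, something like this (standard frr/iproute2 show commands, using your vnet names):

Bash:
# frr's view: MACs and IP/MAC bindings per VNI
vtysh -c "show evpn mac vni all"
vtysh -c "show evpn arp-cache vni all"

# kernel view: bridge MAC table of the vnet and ARP entries on the gateway interface
bridge fdb show br z1vnet98
ip neigh show dev z1vnet98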


stupid question:
are you sure you don't have duplicate MAC addresses on your containers' NICs?


could you send the result of

# vtysh -c "sh bgp l2vpn evpn"

on each node?
 
Thank you for the quick response. Initially all CTs were clones, so that would be easy to overlook. But Proxmox generates random MAC addresses for clones, and all interfaces were recreated during testing anyway, so unfortunately no, the MAC addresses are not the same.

Bash:
root@test:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth99@if74: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 82:d5:a0:9f:1a:8a brd ff:ff:ff:ff:ff:ff link-netnsid 0


root@test:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth98@if50: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 42:6f:43:60:21:ad brd ff:ff:ff:ff:ff:ff link-netnsid 0


root@test:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth98@if31: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether de:be:28:38:b9:f3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
 
Sure:

Dev1
Bash:
Dev1:~# vtysh -c "sh bgp l2vpn evpn"
BGP table version is 22, local router ID is 10.2.5.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[EthTag]:[ESI]:[IPlen]:[VTEP-IP]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]


   Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 10.2.5.1:2
*> [5]:[0]:[24]:[10.2.98.0]
                    10.2.5.1(Dev1)
                                             0         32768 ?
                    ET:8 RT:119:224 Rmac:56:84:c4:ba:03:6d
*> [5]:[0]:[24]:[10.2.99.0]
                    10.2.5.1(Dev1)
                                             0         32768 ?
                    ET:8 RT:119:224 Rmac:56:84:c4:ba:03:6d
Route Distinguisher: 10.2.5.1:3
*> [3]:[0]:[32]:[10.2.5.1]
                    10.2.5.1(Dev1)
                                                       32768 i
                    ET:8 RT:119:20
Route Distinguisher: 10.2.5.1:4
*> [2]:[0]:[48]:[82:d5:a0:9f:1a:8a]
                    10.2.5.1(Dev1)
                                                       32768 i
                    ET:8 RT:119:10
*> [2]:[0]:[48]:[82:d5:a0:9f:1a:8a]:[32]:[10.2.99.1]
                    10.2.5.1(Dev1)
                                                       32768 i
                    ET:8 RT:119:10 RT:119:224 Rmac:56:84:c4:ba:03:6d
*> [3]:[0]:[32]:[10.2.5.1]
                    10.2.5.1(Dev1)
                                                       32768 i
                    ET:8 RT:119:10
Route Distinguisher: 10.2.5.2:3
*>i[3]:[0]:[32]:[10.2.5.2]
                    10.2.5.2(Dev2)
                                                  100      0 i
                    RT:119:20 ET:8
Route Distinguisher: 10.2.5.2:4
*>i[3]:[0]:[32]:[10.2.5.2]
                    10.2.5.2(Dev2)
                                                  100      0 i
                    RT:119:10 ET:8


Displayed 8 out of 8 total prefixes

Dev2
Bash:
Dev2:~# vtysh -c "sh bgp l2vpn evpn"
BGP table version is 37, local router ID is 10.2.5.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[EthTag]:[ESI]:[IPlen]:[VTEP-IP]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]


   Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 10.2.5.1:3
*>i[3]:[0]:[32]:[10.2.5.1]
                    10.2.5.1(Dev1)
                                                  100      0 i
                    RT:119:20 ET:8
Route Distinguisher: 10.2.5.1:4
*>i[2]:[0]:[48]:[82:d5:a0:9f:1a:8a]
                    10.2.5.1(Dev1)
                                                  100      0 i
                    RT:119:10 ET:8
*>i[3]:[0]:[32]:[10.2.5.1]
                    10.2.5.1(Dev1)
                                                  100      0 i
                    RT:119:10 ET:8
Route Distinguisher: 10.2.5.2:2
*> [5]:[0]:[24]:[10.2.98.0]
                    10.2.5.2(Dev2)
                                             0         32768 ?
                    ET:8 RT:119:224 Rmac:56:84:c4:ba:03:6d
*> [5]:[0]:[24]:[10.2.99.0]
                    10.2.5.2(Dev2)
                                             0         32768 ?
                    ET:8 RT:119:224 Rmac:56:84:c4:ba:03:6d
Route Distinguisher: 10.2.5.2:3
*> [2]:[0]:[48]:[de:be:28:38:b9:f3]
                    10.2.5.2(Dev2)
                                                       32768 i
                    ET:8 RT:119:20
*> [2]:[0]:[48]:[de:be:28:38:b9:f3]:[32]:[10.2.98.2]
                    10.2.5.2(Dev2)
                                                       32768 i
                    ET:8 RT:119:20 RT:119:224 Rmac:56:84:c4:ba:03:6d
*> [3]:[0]:[32]:[10.2.5.2]
                    10.2.5.2(Dev2)
                                                       32768 i
                    ET:8 RT:119:20
Route Distinguisher: 10.2.5.2:4
*> [3]:[0]:[32]:[10.2.5.2]
                    10.2.5.2(Dev2)
                                                       32768 i
                    ET:8 RT:119:10


Displayed 9 out of 9 total prefixes
 
mmm, I don't see any problem.

I have tried to reproduce your setup on my side, with 2 Debian 10 CTs, and it's working out of the box.

I really don't know why it only works for you with "Disable Arp-Nd Suppression"...
 
Yes, all of the gateway IPs are accessible from each node and always have the same correct MAC address:

CT102
Bash:
root@test:~# arp 10.2.99.100
Address                  HWtype  HWaddress           Flags Mask            Iface
10.2.99.100              ether   52:e6:87:55:fc:16   C                     eth99


test:~# ping -c 1 10.2.99.100
PING 10.2.99.100 (10.2.99.100) 56(84) bytes of data.
64 bytes from 10.2.99.100: icmp_seq=1 ttl=64 time=0.101 ms


--- 10.2.99.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.101/0.101/0.101/0.000 ms


test:~# ping -c 1 10.2.98.100
PING 10.2.98.100 (10.2.98.100) 56(84) bytes of data.
64 bytes from 10.2.98.100: icmp_seq=1 ttl=64 time=0.063 ms


--- 10.2.98.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.063/0.063/0.063/0.000 ms

CT104
Bash:
test:~# arp 10.2.98.100
Address                  HWtype  HWaddress           Flags Mask            Iface
10.2.98.100              ether   52:e6:87:55:fc:16   C                     eth98

test:~# ping -c 1 10.2.99.100
PING 10.2.99.100 (10.2.99.100) 56(84) bytes of data.
64 bytes from 10.2.99.100: icmp_seq=1 ttl=64 time=0.093 ms

--- 10.2.99.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.093/0.093/0.093/0.000 ms

test:~# ping -c 1 10.2.98.100
PING 10.2.98.100 (10.2.98.100) 56(84) bytes of data.
64 bytes from 10.2.98.100: icmp_seq=1 ttl=64 time=0.087 ms

--- 10.2.98.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.087/0.087/0.087/0.000 ms

CT103
Bash:
test:~# ping -c 1 10.2.99.100 
PING 10.2.99.100 (10.2.99.100) 56(84) bytes of data.
64 bytes from 10.2.99.100: icmp_seq=1 ttl=64 time=0.130 ms

--- 10.2.99.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.130/0.130/0.130/0.000 ms

test:~# ping -c 1 10.2.98.100
PING 10.2.98.100 (10.2.98.100) 56(84) bytes of data.
64 bytes from 10.2.98.100: icmp_seq=1 ttl=64 time=0.120 ms

--- 10.2.98.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.120/0.120/0.120/0.000 ms

I tried it with Debian images too, with the same result. I also gave the CTs unique hostnames. The firewall settings were the defaults from the Proxmox installation, but I also tried turning everything off in the Proxmox web GUI (without success).

For the full picture: my nodes are a virtualized Proxmox cluster (nested virtualization), so the underlying "physical network" is actually a Linux bridge, but I guess that shouldn't affect anything.

Thank you for your time! I appreciate it. For now I can use it without ARP suppression for dev purposes - it is not mission critical - but I was curious about the technology and whether I'm configuring it correctly.
 
