EVPN+VXLAN with multiple exit nodes: firewall drops packets with asymmetric routing

adofou

Member
Mar 14, 2020
Hello,

I've been trying for weeks to set up a 3-node cluster with EVPN+VXLAN between the nodes, and a BGP controller that announces unicast prefixes to my network.
Each node has its own BGP controller and announces the EVPN subnets. This now works fine, with a few filters at the entrance to my network.
On paper, traffic should egress at the nearest connection to the network, i.e. the local host, and ingress from the Internet as close to the cluster as possible from the network's point of view, i.e. on the node nearest to where the packet enters our network (even if it then has to travel over VXLAN between two nodes).

The problem is that I've noticed some strange behaviour with VMs in this EVPN network. Some destinations were unreachable (for example, a Debian mirror).
And the unreachable destinations differed depending on which host the VM was on: a destination that didn't work on one host could start working once the VM was moved to another host.

The problem magically disappeared when I configured a node as a "Primary Exit Node". It doesn't matter which node is selected.
I've set all the uRPF check values as carefully as possible. uRPF is disabled both on the network side and on the Proxmox machines (I had initially missed this point; I find the note on this subject easy to overlook, since it only appears in an example much further down than the help text).
Code:
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.all.rp_filter=0
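To double-check the values actually applied (a quick sketch; the interface name is just the EVPN VRF bridge that appears in the captures below):
Code:
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.default.rp_filter
# per-interface values count as well, e.g. for the EVPN VRF bridge:
sysctl net.ipv4.conf.vrfbr_evpn.rp_filter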

So it's not a VXLAN tunnel or EVPN announcement issue, since everything works when a "Primary Exit Node" is configured, and through the right tunnels.
By the way, all BGP sessions are UP, and VMs on the 3 nodes can ping each other without any problem.

So I continued debugging and realized that the problem only appeared when the routing was asymmetric.
In other words, if the packet egresses and ingresses via the same node from the network's point of view, it works.
But as soon as the packet egresses locally from a node (no Primary Exit Node) and ingresses on another node because of BGP routing (and therefore has to pass through EVPN+VXLAN), it no longer works.

If egress and ingress occur on the same node
Code:
EGRESS ICMP REQUEST :
01:47:49.054950 veth100i0 P   IP 213.152.X.X > 8.8.8.8: ICMP echo request, id 40637, seq 1, length 64
01:47:49.054975 fwln100i0 Out IP 213.152.X.X > 8.8.8.8: ICMP echo request, id 40637, seq 1, length 64
01:47:49.054976 fwpr100p0 P   IP 213.152.X.X > 8.8.8.8: ICMP echo request, id 40637, seq 1, length 64
01:47:49.054976 evpn  In  IP 213.152.X.X > 8.8.8.8: ICMP echo request, id 40637, seq 1, length 64
01:47:49.055003 vmbr0.3 Out IP 213.152.X.X > 8.8.8.8: ICMP echo request, id 40637, seq 1, length 64
01:47:49.055005 vmbr0 Out IP 213.152.X.X > 8.8.8.8: ICMP echo request, id 40637, seq 1, length 64
01:47:49.055012 ens10f0np0 Out IP 213.152.X.X > 8.8.8.8: ICMP echo request, id 40637, seq 1, length 64
INGRESS REPLY :
01:47:49.055497 ens10f0np0 In  IP 8.8.8.8 > 213.152.X.X: ICMP echo reply, id 40637, seq 1, length 64
01:47:49.055497 vmbr0 In  IP 8.8.8.8 > 213.152.X.X: ICMP echo reply, id 40637, seq 1, length 64
01:47:49.055497 vmbr0.3 In  IP 8.8.8.8 > 213.152.X.X: ICMP echo reply, id 40637, seq 1, length 64
01:47:49.055525 evpn  Out IP 8.8.8.8 > 213.152.X.X: ICMP echo reply, id 40637, seq 1, length 64
01:47:49.055530 fwpr100p0 Out IP 8.8.8.8 > 213.152.X.X: ICMP echo reply, id 40637, seq 1, length 64
01:47:49.055532 fwln100i0 P   IP 8.8.8.8 > 213.152.X.X: ICMP echo reply, id 40637, seq 1, length 64
01:47:49.055544 veth100i0 Out IP 8.8.8.8 > 213.152.X.X: ICMP echo reply, id 40637, seq 1, length 64
Note: this capture is from a CT, but it's the same from a VM.

If, due to network routing, egress and ingress are not via the same node
Code:
Node 2 with VM, egress ICMP :
01:51:15.462128 tap105i0 P   IP 213.152.X.X > 8.8.8.8: ICMP echo request, id 579, seq 1, length 64
01:51:15.462140 fwln105i0 Out IP 213.152.X.X > 8.8.8.8: ICMP echo request, id 579, seq 1, length 64
01:51:15.462144 fwpr105p0 P   IP 213.152.X.X > 8.8.8.8: ICMP echo request, id 579, seq 1, length 64
01:51:15.462147 evpn  In  IP 213.152.X.X > 8.8.8.8: ICMP echo request, id 579, seq 1, length 64
01:51:15.462157 vmbr0.3 Out IP 213.152.X.X > 8.8.8.8: ICMP echo request, id 579, seq 1, length 64
01:51:15.462158 vmbr0 Out IP 213.152.X.X > 8.8.8.8: ICMP echo request, id 579, seq 1, length 64
01:51:15.462162 ens10f0np0 Out IP 213.152.X.X > 8.8.8.8: ICMP echo request, id 579, seq 1, length 64

Node 1 - ingress (due to BGP perspective) :
01:51:15.463390 ens10f0np0 In  IP 8.8.8.8 > 213.152.X.X: ICMP echo reply, id 579, seq 1, length 64
01:51:15.463390 vmbr0 In  IP 8.8.8.8 > 213.152.X.X: ICMP echo reply, id 579, seq 1, length 64
01:51:15.463390 vmbr0.3 In  IP 8.8.8.8 > 213.152.X.X: ICMP echo reply, id 579, seq 1, length 64
Note: this capture is from a VM, but it's the same from a CT.
We can see that, inexplicably, the reply comes in but stops at vmbr0.3 instead of being forwarded to vrfbr_evpn (the VRF bridge for EVPN+VXLAN).
However, the routing has not changed; the route is still the same and is present in the routing table.
Code:
213.152.X.X nhid 353 via 172.31.255.5 dev vrfbr_evpn proto bgp src 213.152.Y.Y metric 20 onlink
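For what it's worth, a quick way to confirm what the kernel would select for that prefix (a sketch; the X'd address is the masked one from the captures):
Code:
# which route would be chosen for the VM prefix (it should point at vrfbr_evpn)
ip route get 213.152.X.X
# list all BGP-learned routes for comparison
ip route show proto bgp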

Worse still, during my troubleshooting, I noticed behavior that made no sense at the time.
A ping to 8.8.8.8 didn't work, but a DNS query did!
Code:
Node 2 with VM, egress DNS request :
02:04:01.597952 tap105i0 P   IP 213.152.X.X.42023 > 8.8.8.8.53: 11008+ [1au] A? google.Fr. (50)
02:04:01.597964 fwln105i0 Out IP 213.152.X.X.42023 > 8.8.8.8.53: 11008+ [1au] A? google.Fr. (50)
02:04:01.597968 fwpr105p0 P   IP 213.152.X.X.42023 > 8.8.8.8.53: 11008+ [1au] A? google.Fr. (50)
02:04:01.597971 evpn  In  IP 213.152.X.X.42023 > 8.8.8.8.53: 11008+ [1au] A? google.Fr. (50)
02:04:01.597979 vmbr0.3 Out IP 213.152.X.X.42023 > 8.8.8.8.53: 11008+ [1au] A? google.Fr. (50)
02:04:01.597980 vmbr0 Out IP 213.152.X.X.42023 > 8.8.8.8.53: 11008+ [1au] A? google.Fr. (50)
02:04:01.597983 ens10f0np0 Out IP 213.152.X.X.42023 > 8.8.8.8.53: 11008+ [1au] A? google.Fr. (50)

02:04:01.603899 vrfvx_evpn In  IP 8.8.8.8.53 > 213.152.X.X.42023: 11008 1/0/1 A 142.250.179.99 (54) <- VXLAN from Node 1
02:04:01.603902 vrfbr_evpn In  IP 8.8.8.8.53 > 213.152.X.X.42023: 11008 1/0/1 A 142.250.179.99 (54)
02:04:01.603907 evpn  Out IP 8.8.8.8.53 > 213.152.X.X.42023: 11008 1/0/1 A 142.250.179.99 (54)
02:04:01.603909 fwpr105p0 Out IP 8.8.8.8.53 > 213.152.X.X.42023: 11008 1/0/1 A 142.250.179.99 (54)
02:04:01.603911 fwln105i0 P   IP 8.8.8.8.53 > 213.152.X.X.42023: 11008 1/0/1 A 142.250.179.99 (54)
02:04:01.603913 tap105i0 Out IP 8.8.8.8.53 > 213.152.X.X.42023: 11008 1/0/1 A 142.250.179.99 (54)

02:04:01.608743 tap105i0 P   IP 213.152.X.X.48363 > 8.8.8.8.53: 17495+ [1au] A? 8.8.8.8. (48)
02:04:01.608752 fwln105i0 Out IP 213.152.X.X.48363 > 8.8.8.8.53: 17495+ [1au] A? 8.8.8.8. (48)
02:04:01.608754 fwpr105p0 P   IP 213.152.X.X.48363 > 8.8.8.8.53: 17495+ [1au] A? 8.8.8.8. (48)
02:04:01.608756 evpn  In  IP 213.152.X.X.48363 > 8.8.8.8.53: 17495+ [1au] A? 8.8.8.8. (48)
02:04:01.608763 vmbr0.3 Out IP 213.152.X.X.48363 > 8.8.8.8.53: 17495+ [1au] A? 8.8.8.8. (48)
02:04:01.608764 vmbr0 Out IP 213.152.X.X.48363 > 8.8.8.8.53: 17495+ [1au] A? 8.8.8.8. (48)
02:04:01.608767 ens10f0np0 Out IP 213.152.X.X.48363 > 8.8.8.8.53: 17495+ [1au] A? 8.8.8.8. (48)

02:04:01.613895 vrfvx_evpn In  IP 8.8.8.8.53 > 213.152.X.X.48363: 17495 NXDomain$ 0/1/1 (111)  <- VXLAN from Node 1
02:04:01.613901 vrfbr_evpn In  IP 8.8.8.8.53 > 213.152.X.X.48363: 17495 NXDomain$ 0/1/1 (111)
02:04:01.613915 evpn  Out IP 8.8.8.8.53 > 213.152.X.X.48363: 17495 NXDomain$ 0/1/1 (111)
02:04:01.613921 fwpr105p0 Out IP 8.8.8.8.53 > 213.152.X.X.48363: 17495 NXDomain$ 0/1/1 (111)
02:04:01.613925 fwln105i0 P   IP 8.8.8.8.53 > 213.152.X.X.48363: 17495 NXDomain$ 0/1/1 (111)
02:04:01.613933 tap105i0 Out IP 8.8.8.8.53 > 213.152.X.X.48363: 17495 NXDomain$ 0/1/1 (111)


Node 1 - ingress (due to BGP perspective) :
02:04:01.604374 ens10f0np0 In  IP 8.8.8.8.53 > 213.152.X.X.42023: 11008 1/0/1 A 142.250.179.99 (54)
02:04:01.604374 vmbr0 In  IP 8.8.8.8.53 > 213.152.X.X.42023: 11008 1/0/1 A 142.250.179.99 (54)
02:04:01.604374 vmbr0.3 In  IP 8.8.8.8.53 > 213.152.X.X.42023: 11008 1/0/1 A 142.250.179.99 (54)
02:04:01.604407 vrfbr_evpn Out IP 8.8.8.8.53 > 213.152.X.X.42023: 11008 1/0/1 A 142.250.179.99 (54)
02:04:01.604413 vrfvx_evpn Out IP 8.8.8.8.53 > 213.152.X.X.42023: 11008 1/0/1 A 142.250.179.99 (54)  -> To VXLAN

02:04:01.614347 ens10f0np0 In  IP 8.8.8.8.53 > 213.152.X.X.48363: 17495 NXDomain$ 0/1/1 (111)
02:04:01.614347 vmbr0 In  IP 8.8.8.8.53 > 213.152.X.X.48363: 17495 NXDomain$ 0/1/1 (111)
02:04:01.614347 vmbr0.3 In  IP 8.8.8.8.53 > 213.152.X.X.48363: 17495 NXDomain$ 0/1/1 (111)
02:04:01.614379 vrfbr_evpn Out IP 8.8.8.8.53 > 213.152.X.X.48363: 17495 NXDomain$ 0/1/1 (111)
02:04:01.614384 vrfvx_evpn Out IP 8.8.8.8.53 > 213.152.X.X.48363: 17495 NXDomain$ 0/1/1 (111) -> To VXLAN

I really racked my brains for a long time, until at some point I remembered that I had activated the firewall on the cluster.
So I started looking at that, and there are a lot of default rules on the host that I don't fully understand.
So I decided to simply turn off the firewall on the cluster and... see.

And like magic, it works!

<capture node1-2 firewall.txt attachment>

In other words, as soon as I activate the cluster's firewall, probably because of a rule or the conntrack, it completely breaks my asymmetric incoming traffic.
The firewall must be matching some rule and dropping the packet, instead of forwarding it to vrfbr_evpn.

Does anyone have a solution to this problem?

Johann
 

What are your firewall rules between your hosts (firewall at host level)? You need to have the BGP && VXLAN ports open between hosts (in|out), or it'll drop the traffic.

I'm not aware of a reported bug with asymmetric routing && firewall.

but maybe you can try to add:

nf_conntrack_allow_invalid: 1

in /etc/pve/nodes/<nodename>/host.fw options.
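For example (the [OPTIONS] section being the usual layout of pve-firewall config files; only the option itself is the suggestion here):
Code:
# /etc/pve/nodes/<nodename>/host.fw
[OPTIONS]
nf_conntrack_allow_invalid: 1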



(Personally, I don't use the firewall on my hosts where I use EVPN: since the EVPN runs in a different VRF, the VMs can't reach the host's SSH or other open ports.)
 
What are your firewall rules between your hosts (firewall at host level)? You need to have the BGP && VXLAN ports open between hosts (in|out), or it'll drop the traffic.
No custom rules; I just activated the firewall during configuration. No VM/CT rules either.
Attached are the current iptables rules (generated when I activated the firewall on the cluster), the network files and the ip routes.
I only included NODE1 and NODE2, but I can provide the files for NODE3 if needed.

We use two loopbacks, lo and lo:0.
Both let me test EVPN+VXLAN over either the public or the private VLAN.
The goal is to create redundancy between the two network ports, since they are connected to two different network devices. Only the private part was kept during debugging. We advertise this loopback via BGP (via an addition in /etc/frr/frr.conf.local) and set up the VXLAN over it.
I thought this might be the problem, so I rolled the VXLAN tunnels back directly onto the /31 IPs in the private VLAN (connected to each other by an L3VPN + a static route; BGP to come, as I also have strange problems when I enable it, so only the loopback for the moment, first things first).
But that didn't change the problem.
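For context, the loopback advertisement added in /etc/frr/frr.conf.local looks roughly like this in FRR syntax (a sketch only; the ASN and loopback prefix below are placeholders, not our real values):
Code:
router bgp 65001
 address-family ipv4 unicast
  network 10.255.255.1/32
 exit-address-family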

I don't see the loopbacks in the firewall rules. But I still have a problem with the direct interconnect IPs, which do appear in PVEFW-HOST-IN && PVEFW-HOST-OUT.

I'm not aware of a reported bug with asymmetric routing && firewall.

but maybe you can try to add:

nf_conntrack_allow_invalid: 1

in /etc/pve/nodes/<nodename>/host.fw options.

When I added this on node1, the issues stopped and routing started working.
This seems to confirm a problem with the conntrack on NODE 1 (which sees the ingress traffic, but not the egress).
The question is why, if you're not aware of a potential bug :-/

Stupid question: I tried disabling the firewall on the node only, but that seems to do nothing (I still see the rules in iptables).
So what is the purpose of this option?


(Personally, I don't use the firewall on my hosts where I use EVPN: since the EVPN runs in a different VRF, the VMs can't reach the host's SSH or other open ports.)
In fact, I want to be able to put certain firewall settings directly on the hypervisor side, where they cannot be touched from the VM (by end users). This is a requirement of our security team.
If I deactivate the firewall at cluster level, that disables this feature.
But if I deactivate the firewall on the node, is that the same? Or does it only disable firewalling of "host traffic"?

Many thanks!
 


In fact, I want to be able to put certain firewall settings directly on the hypervisor side, where they cannot be touched from the VM (by end users). This is a requirement of our security team.
If I deactivate the firewall at cluster level, that disables this feature.
But if I deactivate the firewall on the node, is that the same? Or does it only disable firewalling of "host traffic"?

Many thanks!
mmm, disabling the firewall at host level indeed only removes the host rules. But the conntrack is still there (because it's shared between all VMs && the host). (There is a default rule on top of all the others, looking in the conntrack for already-established connections.)

-A PVEFW-FORWARD -m conntrack --ctstate INVALID -j DROP #this rule is removed with nf_conntrack_allow_invalid
-A PVEFW-FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

I'll add a note in the docs about nf_conntrack_allow_invalid. With asymmetric routing, it's quite possible that traffic goes out from one node and comes back to another node where no conntrack entry was opened (so the INVALID rule drops the packet).
I don't think we can do much about this (we don't have any conntrack sync between hosts).
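A quick way to see this on the ingress node could look like this (a sketch, assuming the conntrack-tools package is installed; the address is the one used in the captures above):
Code:
# list the conntrack table and look for the flow; on the node that only sees the
# return traffic there is no matching entry, so the reply hits the INVALID state match
conntrack -L | grep 8.8.8.8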
 
Yes, although this could be set up manually with conntrackd [1]. I've looked into integrating this with SDN a bit, but there's so much to do at the moment that I don't know if/when I'll get around to it.

[1] https://manpages.debian.org/testing/conntrackd/conntrackd.8.en.html
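For reference, a very rough idea of what a manual two-node conntrackd setup could look like, loosely based on the example config shipped with conntrack-tools (all addresses and the interface are placeholders, this is not an official Proxmox integration, and conntrackd is normally designed for pairs of nodes, so a 3-node cluster needs extra thought; check conntrackd.conf(5) before using anything like this):
Code:
# /etc/conntrackd/conntrackd.conf (sketch)
Sync {
    Mode FTFW {
    }
    UDP {
        IPv4_address 192.0.2.1              # this node's sync address (placeholder)
        IPv4_Destination_Address 192.0.2.2  # peer node's sync address (placeholder)
        Port 3780
        Interface vmbr0
        Checksum on
    }
}
General {
    HashSize 32768
    HashLimit 131072
    LockFile /var/lock/conntrack.lock
    UNIX {
        Path /var/run/conntrackd.ctl
    }
}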

Thanks, good to know!
I'll be following the work on this subject. In the meantime, I think I'll do without :-)

I have just one question: do you know why the file "/usr/lib/sysctl.d/pve-firewall.conf" contains this?
Worse: this seems to apply even with the firewall disabled on the cluster and/or host!
Code:
# Enables source route verification
net.ipv4.conf.all.rp_filter = 2

This seems to override my /etc/sysctl.conf and break EVPN routing with my multiple exit nodes after each reboot,
even though /etc/sysctl.conf is supposed to be applied last (which apparently isn't the case at boot; perhaps it only is when running this command):

Code:
root@prox1~# sysctl --system
* Applying /usr/lib/sysctl.d/10-pve-ct-inotify-limits.conf ...
* Applying /usr/lib/sysctl.d/10-pve.conf ...
* Applying /usr/lib/sysctl.d/50-pid-max.conf ...
* Applying /usr/lib/sysctl.d/99-protect-links.conf ...
* Applying /etc/sysctl.d/99-sysctl.conf ...
* Applying /usr/lib/sysctl.d/pve-firewall.conf ...
* Applying /etc/sysctl.conf ...
fs.inotify.max_queued_events = 8388608
fs.inotify.max_user_instances = 65536
fs.inotify.max_user_watches = 4194304
vm.max_map_count = 262144
net.ipv4.neigh.default.gc_thresh3 = 8192
net.ipv6.neigh.default.gc_thresh3 = 8192
kernel.keys.maxkeys = 2000
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-filter-vlan-tagged = 0
net.ipv4.igmp_link_local_mcast_reports = 0
fs.aio-max-nr = 1048576
kernel.pid_max = 4194304
fs.protected_fifos = 1
fs.protected_hardlinks = 1
fs.protected_regular = 2
fs.protected_symlinks = 1
vm.swappiness = 100
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.all.rp_filter = 2
vm.swappiness = 100
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.all.rp_filter = 0

After a reboot :
Code:
root@prox1:~# sysctl -a | grep rp_filter
net.ipv4.conf.all.arp_filter = 0
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.default.arp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.ens10f0np0.arp_filter = 0

I need to run "sysctl -p" manually to correct this.

Thanks!
 
Thanks, good to know!
I'll be following the work on this subject. In the meantime, I think I'll do without :-)

I have just one question: do you know why the file "/usr/lib/sysctl.d/pve-firewall.conf" contains this?
Worse: this seems to apply even with the firewall disabled on the cluster and/or host!
Code:
# Enables source route verification
net.ipv4.conf.all.rp_filter = 2
ah..!!!... I was not aware of this config. It must be deployed by the pve-firewall package. (If you delete it, I'm not sure whether it'll be reinstated on a package update.)

For now, as a workaround, you can set net.ipv4.conf.all.rp_filter = 0 in "/usr/lib/sysctl.d/pve-firewall.conf", and run "chattr +i /usr/lib/sysctl.d/pve-firewall.conf" so it doesn't get overwritten.
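In shell form, that workaround would be roughly (a sketch; review before running):
Code:
# change the shipped value and make the file immutable so package updates can't overwrite it
sed -i 's/^net.ipv4.conf.all.rp_filter.*/net.ipv4.conf.all.rp_filter = 0/' /usr/lib/sysctl.d/pve-firewall.conf
chattr +i /usr/lib/sysctl.d/pve-firewall.conf
# re-apply sysctl settings now
sysctl --system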
 
ah..!!!... I was not aware of this config. It must be deployed by the pve-firewall package. (If you delete it, I'm not sure whether it'll be reinstated on a package update.)

For now, as a workaround, you can set net.ipv4.conf.all.rp_filter = 0 in "/usr/lib/sysctl.d/pve-firewall.conf", and run "chattr +i /usr/lib/sysctl.d/pve-firewall.conf" so it doesn't get overwritten.
I had the same issue. If you create a file like /etc/sysctl.d/z-10-fix-traffic.conf, sysctl will apply your override correctly. You can see the order in which the files are read with the sysctl --system command:
* Applying /usr/lib/sysctl.d/10-pve-ct-inotify-limits.conf ...
* Applying /usr/lib/sysctl.d/10-pve.conf ...
* Applying /etc/sysctl.d/30-ceph-osd.conf ...
* Applying /usr/lib/sysctl.d/50-pid-max.conf ...
* Applying /usr/lib/sysctl.d/99-protect-links.conf ...
* Applying /etc/sysctl.d/99-sysctl.conf ...
* Applying /usr/lib/sysctl.d/pve-firewall.conf ...
* Applying /etc/sysctl.d/z-10-fix-traffic.conf ...
* Applying /etc/sysctl.conf ...


The file order is lowest -> highest priority. pve-firewall.conf should probably get a new name like 10-pve-firewall.conf; then it would be easier to override the setting.
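For example, the override file can simply contain the two rp_filter values from earlier in the thread (the filename is the one suggested above):
Code:
# /etc/sysctl.d/z-10-fix-traffic.conf
# applied after /usr/lib/sysctl.d/pve-firewall.conf, so these values win at boot
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.all.rp_filter = 0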
 
The file order is lowest -> highest priority. pve-firewall.conf should probably get a new name like 10-pve-firewall.conf; then it would be easier to override the setting.
That is pretty much what came up when I was discussing it with a friend this week.
sysctl.conf always seems to take priority over everything else (like it or not, that's another debate), but the way the pve-firewall.conf filename is structured gives it priority over my files (sort order: letters come after numbers).

Spirit told me that this file dates from a 2016 patch, back when SDN wasn't even imagined. I don't know if anyone remembers why.
But I don't know what impact a rename of this file would have on Proxmox installations in production, to be honest. I know people have modified this file to add things to it.

I suppose creating a new file like yours can be a temporary solution that leaves pve-firewall.conf "package standard",
pending a more in-depth reflection on how this option and SDN should fit together in the future.

I think adding a note on this in the documentation might be useful for other users :)
 
I had the same issue. If you create a file like /etc/sysctl.d/z-10-fix-traffic.conf, sysctl will apply your override correctly. You can see the order in which the files are read with the sysctl --system command:
* Applying /usr/lib/sysctl.d/10-pve-ct-inotify-limits.conf ...
* Applying /usr/lib/sysctl.d/10-pve.conf ...
* Applying /etc/sysctl.d/30-ceph-osd.conf ...
* Applying /usr/lib/sysctl.d/50-pid-max.conf ...
* Applying /usr/lib/sysctl.d/99-protect-links.conf ...
* Applying /etc/sysctl.d/99-sysctl.conf ...
* Applying /usr/lib/sysctl.d/pve-firewall.conf ...
* Applying /etc/sysctl.d/z-10-fix-traffic.conf ...
* Applying /etc/sysctl.conf ...


The file order is lowest -> highest priority. pve-firewall.conf should probably get a new name like 10-pve-firewall.conf; then it would be easier to override the setting.
Ah, good catch! Renaming the file makes sense indeed. I'll look into sending a patch next week for the next pve-firewall package version.
 
