VM loss of communications in EVPN after SDN Apply

What would you consider a large number of bridges? When I first started looking into this problem, I noticed we had about 2000 EVPN routes in BGP and searched for FRR limits, but didn't find any. Our hosts have about 100 bridges from vnets alone.
I mean the VM firewall bridges (my hosts have around 150 VMs, each with 3 interfaces, so 450 fwbr bridges), and FRR is listening to netlink events coming from all of those bridges.
For vnets, I think I have around 40-50 per host.

I have 100k EVPN routes.

(In my case, I was also wondering if it could be a flood of netlink messages when FRR is reloading on every host at the same time. I have seen that some buffers can be tuned here: https://github.com/FRRouting/frr/issues/18680 , https://github.com/FRRouting/frr/discussions/16486 )
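For anyone who wants to experiment with that, a minimal sketch of the kind of tuning discussed in those FRR threads; the buffer sizes are illustrative assumptions, not values I have validated:
Code:
# Raise the kernel socket receive buffer ceilings used by netlink (illustrative values)
sysctl -w net.core.rmem_max=67108864
sysctl -w net.core.rmem_default=67108864

# Give zebra a larger netlink receive buffer via its -s/--nl-bufsize option,
# for example in /etc/frr/daemons, then restart frr
zebra_options=" -A 127.0.0.1 -s 67108864"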
 
It's nice to know that I've not scaled past where others have had success.

Another thing in my setup: I'm using ECMP with 4 paths, and I just found an issue with a workaround here:
https://vyos.dev/T5424
I did not see the errors that the other users reported in our logs, so I'm thinking we may not have this specific issue.


As I mentioned yesterday, there is an issue with our freshly patched and rebooted host. It may or may not be caused by the same underlying problem.

Symptoms:

After migrating to the affected host, the VM may stop responding to pings for anywhere from a few seconds to over 20 minutes. The ping originates from my workstation, which is outside of the virtual infrastructure.

Unlike the first VM/host issue, this VM appears to allow most communications. It's communicating with NTP servers on the internet and DNS servers inside our PVE cluster; only the ping appeared to be impacted.

Findings:

I found that the incoming ping packets are being dropped by the firewall between the zone vrf bridge (vrf_BSDnet) and the vnet bridge (buildup).

After I enabled firewall logging on the affected host, it logged the drops:
Code:
0 5 PVEFW-HOST-OUT 19/Aug/2025:11:30:11 -0700 policy DROP: OUT=buildup SRC=10.1.220.60 DST=10.7.1.103 LEN=60 TOS=0x00 PREC=0x00 TTL=124 ID=36189 PROTO=ICMP TYPE=8 CODE=0 ID=1 SEQ=51263

The host the VM is migrating from appears to have a significant impact on whether the issue occurs. We have one host that causes failure most of the time and another that is successful most of the time. Both of these hosts were patched to current, including FRR, and rebooted recently. The difference is that one is an SDN exit node (mostly successful) and the other (mostly failure) is not.

From within the guest VM, pinging the gateway IP, an unused IP on the subnet, or another VM on the same or another subnet has no impact.

Disabling the host firewall prevents the issue from happening.

Changing the host firewall to use nftables prevents the issue from happening.

Disabling ebtables in the datacenter firewall configuration had no impact.

Executing "pve-firewall restart" has no impact.

We tested at small scale for several months with the SDN configuration, minus the production SDN vnets. Then we started migrating the production workload and SDN vnets in May with minimal issues. We didn't start to encounter significant issues until we were adding the final hosts to the cluster at the end of July. There were no config changes at that time, except for adding the hosts.

Relevant configuration:

Datacenter firewall:

Status: Enabled
Input and Output policy: Drop
Forward policy: Accept
Rules: A couple of temporary rules and two security groups intended to allow acceptable host communications
Security Groups, Aliases, IPsets: Many

Host firewall:

Status: Enabled
Rules: none
nf_conntrack_allow_invalid: 1
Other settings: default

VNet firewall: all defaults (disabled, forward policy Accept, no rules)

PVE VM firewall of the test VM: Status: disabled

Guest VM firewall (inside of the vm): disabled

A typical VM firewall config for us would be enabled, input and output policy set to drop, and several security groups included from the datacenter configuration.
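For reference, roughly how that configuration maps onto the firewall config files; the paths and option names below are a sketch from memory rather than copied from our cluster:
Code:
# /etc/pve/firewall/cluster.fw (datacenter firewall)
[OPTIONS]
enable: 1
policy_in: DROP
policy_out: DROP
policy_forward: ACCEPT

# /etc/pve/nodes/<node>/host.fw (host firewall)
[OPTIONS]
enable: 1
nf_conntrack_allow_invalid: 1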


Thoughts and Assumptions:

I've not seen packets dropped at this location before. It seems like a place where a VNet firewall would take action, if it were enabled.

Since the problem resolves on its own after a random interval and doesn't occur with nftables enabled, it seems less likely to be an errant firewall rule and more likely to be a bug of some kind.

The reason I said this could be the same underlying issue is that if the firewall errantly drops ARP or all traffic, I think it would produce the same results seen when I originally created this thread.

When a VM migrates, does the firewall connection tracking information migrate with the VM? If yes, that might explain why the source host of a migration affects the likelihood of the problem presenting itself.
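One way to check, as a sketch (this assumes the conntrack CLI from the conntrack package is installed; the IP is the test VM from the log above):
Code:
# On the migration target, right after the VM arrives, list any tracked flows
# that already reference the VM, i.e. whether any state followed it
conntrack -L -d 10.7.1.103
conntrack -L -s 10.7.1.103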

Thanks,

Erik
 
I figured out problem 2. It is not the same as the original problem.

Problem: The PVEFW-HOST-OUT firewall chain is inappropriately applied to in-transit EVPN traffic.

Steps to reproduce:

FRR 10.2.3-1+pve1
Datacenter firewall enabled with output policy: drop
VNet firewall was disabled during testing
VM firewall status had no impact on the results
Ping the VM from outside the virtual infrastructure, so that the packets enter the host server directly, not through vxlan.
Ensure no connection tracking entry from prior tests exists on the host (see the sketch below this list)
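For that last step, something along these lines should work (a sketch using the conntrack CLI; flags as I remember them):
Code:
# Delete any leftover ICMP conntrack entry for the test flow, then verify it is gone
conntrack -D -p icmp -d 10.7.1.103
conntrack -L -p icmp -d 10.7.1.103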

Results:

Ping packets will not be delivered to the VM even though there are no firewall rules blocking them.

This behavior does not exist when using frr 8.5.2-1+pve1


The reason this issue is intermittent in our configuration is that the firewall rule chain blocking the traffic only does so if the packets arrive via the physical network interfaces. If the packets come in via vxlan, they are passed to the VM. This creates a connection tracking entry for the flow, because the firewall rule chain allows established connections. Packets then continue to pass, regardless of the path taken, until the connection tracking entry expires.
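A quick way to see that mechanism on the host, as a sketch (assuming the legacy iptables backend is in use):
Code:
# Show the host OUT chain that logged the drop; the rule allowing established
# connections should be visible here, along with its packet counters
iptables -nvL PVEFW-HOST-OUT

# The conntrack entry created by a vxlan-delivered packet is what keeps the flow passing
conntrack -L -p icmp | grep 10.7.1.103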

The reason for the random interval before the VM would respond to pings was that external influences in our network caused the ping packets to be delivered to a different host and then forwarded to the correct host via vxlan.

The reason the issue presented itself more or less often for certain hosts is that the hosts have differing upstream network connections to our main network. After the VM is migrated, the external network routing tables take time to update, which, depending on where the VM was and now is in relation to the pinging workstation, would cause anywhere from zero to a few packets to be sent to the wrong host and then forwarded via vxlan.

To show the issue further, I captured packets on the host using "tcpdump -npi any" so that the path the packets take through the SDN layers is visible. I also filtered the output to a single received packet of each type to make it easy to read.
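The capture was roughly the following; the filter expression here is an assumption added to narrow the output to the test flow, the capture shown below was taken unfiltered:
Code:
# Capture the test ICMP flow on all interfaces, without name resolution or promiscuous mode
tcpdump -npi any 'icmp and host 10.7.1.103'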

Code:
## packet from external network delivered directly to the host for the vm
# During this test there were 20-30 minutes of these packets

12:56:10.484401 eno12409np1 In  IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45952, length 40
12:56:10.484401 bond0 In  IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45952, length 40
12:56:10.484401 bond0.150 In  IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45952, length 40
12:56:10.484401 vmbr0 In  IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45952, length 40
12:56:10.484418 vrf_BSDnet Out IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45952, length 40

# pve-firewall log entry
0 5 PVEFW-HOST-OUT 20/Aug/2025:12:56:11 -0700 policy DROP: OUT=buildup SRC=10.1.220.60 DST=10.7.1.103 LEN=60 TOS=0x00 PREC=0x00 TTL=124 ID=38165 PROTO=ICMP TYPE=8 CODE=0 ID=2 SEQ=45952 


## packet from external network that was delivered to a different host and forwarded via vxlan
# There were a few of these packets

12:56:15.557642 eno12409np1 In  IP 10.3.150.109.46439 > 10.3.150.105.4789: VXLAN, flags [I] (0x08), vni 1001096
IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45958, length 40

12:56:15.557642 vxlan_buildup P   IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45958, length 40
12:56:15.557674 fwpr123p0 Out IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45958, length 40
12:56:15.557676 fwln123i0 P   IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45958, length 40
12:56:15.557682 tap123i0 Out IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45958, length 40

# This created the following connection tracking entry
# icmp     1 29 src=10.1.220.60 dst=10.7.1.103 type=8 code=0 id=2 src=10.7.1.103 dst=10.1.220.60 type=0 code=0 id=2 mark=0 use=1


## packet from external network delivered directly to the host for the vm, after connection tracking entry created

12:56:16.504479 eno12409np1 In  IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45960, length 40
12:56:16.504479 bond0 In  IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45960, length 40
12:56:16.504479 bond0.150 In  IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45960, length 40
12:56:16.504479 vmbr0 In  IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45960, length 40
12:56:16.504493 vrf_BSDnet Out IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45960, length 40
12:56:16.504501 buildup Out IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45960, length 40
12:56:16.504504 fwpr123p0 Out IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45960, length 40
12:56:16.504506 fwln123i0 P   IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45960, length 40
12:56:16.504514 tap123i0 Out IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 45960, length 40

This is a packet trace showing the packet flow when using FRR 8.4. Note that vrf_BSDnet is not present in this trace.
Code:
13:31:13.208804 eno12409np1 In  IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 49963, length 40
13:31:13.208804 bond0 In  IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 49963, length 40
13:31:13.208804 bond0.150 In  IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 49963, length 40
13:31:13.208804 vmbr0 In  IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 49963, length 40
13:31:13.208823 buildup Out IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 49963, length 40
13:31:13.208825 fwpr123p0 Out IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 49963, length 40
13:31:13.208826 fwln123i0 P   IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 49963, length 40
13:31:13.208836 tap123i0 Out IP 10.1.220.60 > 10.7.1.103: ICMP echo request, id 2, seq 49963, length 40

I previously mentioned that changing the host firewall to nftables also fixed the issue. It turns out that checking the box did fix the issue, but in the wrong way: there was an error installing the firewall rules that I didn't notice, so the test was invalid.


Potential workarounds:

I could downgrade back to FRR 8.4. It may have intermittent problems when applying SDN changes, but those can be mitigated by evacuating a host and then applying the change.
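If we go that route, the downgrade would look roughly like this (a sketch; the target version is the older package version mentioned earlier in this thread, and pinning frr-pythontools alongside it is an assumption):
Code:
# Pin frr back to the previously installed version and hold it there
apt install frr=8.5.2-1+pve1 frr-pythontools=8.5.2-1+pve1
apt-mark hold frr frr-pythontools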

Another option would be to add a datacenter firewall rule that allows the impacted traffic. Further testing would need to ensure that there are no adverse effects on vxlan-tunneled traffic, either at the source or destination host.

Any other ideas or opinions?

Thanks,

Erik
 
It seems like the traffic gets dropped on the buildup interface; I suspect that the blanket DROP rule on the FORWARD chain for invalid conntrack is responsible. It would be interesting to investigate further. The nftables firewall should have no such blanket DROP rule, so it would be interesting to see whether a functioning nftables ruleset prevents the issue.
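A quick way to check for that on the legacy iptables firewall would be something like this (a sketch):
Code:
# Look for a conntrack INVALID drop and inspect the forward chain it would sit in
iptables-save | grep -i invalid
iptables -nvL PVEFW-FORWARD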

A failure to install the nftables ruleset is often caused by one of two things:
  1. IPSets still have the legacy names in the firewall ruleset (without the dc/ or guest/ prefix)
  2. IPSets have overlapping IP ranges, which makes nftables choke
Both have been on my list to fix for quite a while, but I haven't gotten around to it yet. You can check "systemctl status proxmox-firewall" for more information (nftables should have better error output since PVE 9).
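The corresponding checks would be along these lines (a sketch):
Code:
# Check whether the nftables-based firewall service applied its ruleset cleanly
systemctl status proxmox-firewall
journalctl -u proxmox-firewall -b

# If it did, the generated ruleset should show up here
nft list ruleset | head -n 40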

Could you also post the resulting routes for the guests in the default routing table for both FRR versions? I wonder why the old FRR version doesn't hit the VRF, which might be explained by the generated routes.
 
The VM's IP is 10.7.1.103, and the VM was migrated to each host before running the commands.

FRR 10:
Code:
# ip route list |grep 10.7.1
10.7.1.96/27 nhid 58 dev vrf_BSDnet proto bgp metric 20 
10.7.1.120 nhid 728 via 10.6.150.105 dev vrfbr_BSDnet proto bgp metric 20 onlink 

# vtysh -c "show ip route 10.7.1.103"
Routing entry for 10.7.1.96/27
  Known via "bgp", distance 20, metric 0, best
  Last update 17:29:23 ago
  * directly connected, vrf_BSDnet(vrf vrf_BSDnet), weight 1

FRR 8.4:
Code:
# ip route list | grep 10.7.1
10.7.1.96/27 nhid 442 dev buildup proto bgp metric 20 
10.7.1.120 nhid 625 via 10.6.150.105 dev vrfbr_BSDnet proto bgp metric 20 onlink 

# vtysh -c "show ip route 10.7.1.103"
Routing entry for 10.7.1.96/27
  Known via "bgp", distance 20, metric 0, best
  Last update 02w6d01h ago
  * directly connected, buildup(vrf vrf_BSDnet), weight 1
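
(For completeness, the kernel nexthop objects behind those nhid values can be inspected like this; the ids are the ones from the outputs above.)
Code:
# Show the nexthop object each route references (run on the respective host)
ip nexthop show id 58
ip nexthop show id 442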

Thanks,

Erik
 
Interesting, that would explain the difference: why it hits the VRF with FRR 10 but not with FRR 8. Then it probably triggers the host chain? I'd have to try to reproduce this on my machine to be sure.

Does creating a blanket ACCEPT rule at the datacenter level on that specific VRF interface, locked down to the IP ranges from EVPN, work?
 
The following rule did not work:
Code:
OUT ACCEPT -i vrf_BSDnet -dest 10.7.0.0/16 -log notice # TEMP Testing Rule for FRR 10

The following rule did work:
Code:
OUT ACCEPT -i buildup -dest 10.7.0.0/16 -log notice # TEMP Testing Rule for FRR 10

Thanks,

Erik
 
I've gotten my cluster upgraded to FRR 10 and the pending SDN changes applied. It took a couple of days because we wanted to evacuate each node first. The final firewall rule that we used to work around the firewall blocking in-transit SDN traffic is below. We installed this rule in the host firewall table rather than the datacenter one, so that we can test it on a single host without impacting the entire cluster.

Code:
OUT ACCEPT -dest 10.7.0.0/16 -log nolog

While getting this done, we had a few VMs lose communications until they were migrated. Our plan for the next few days is to let it sit and make sure it's stable, then start making small changes to build confidence.

Should I file a bug report for the FRR 10 / firewall issue of dropping in-transit packets?

Thanks,

Erik
 
Yes, please, and you can also mention me; I hope I can find the time to look into this further. Could you also provide the information from this thread (FRR configuration, SDN configuration, tcpdump output) so we have it in a centralized place?