I mean the VM firewall bridges (my hosts have around 150 VMs, each with 3 interfaces, so 450 fwbr bridges), and FRR is listening to netlink events coming from all of those bridges.
For vnets, I think I have around 40-50 per host.
I have 100k evpn routes.
It's nice to know that I've not scaled past where others have had success.
(In my case, I was also wondering if it could be a flood of netlink messages when FRR reloads on every host at the same time. I have seen that some buffers can be tuned here:
https://github.com/FRRouting/frr/issues/18680 ,
https://github.com/FRRouting/frr/discussions/16486 )
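For reference, the knobs discussed in those threads are the kernel's socket receive buffer ceiling and zebra's own netlink buffer. A hedged sketch (the values are illustrative assumptions, not recommendations; verify the option against your FRR version):

```shell
# Raise the kernel ceiling for socket receive buffers
# (example value, tune for your environment):
sysctl -w net.core.rmem_max=67108864

# zebra can be started with a larger netlink receive buffer via its
# -s / --nl-bufsize option; on Debian-based FRR packages this is
# typically set in /etc/frr/daemons, e.g.:
#   zebra_options=" -A -s 67108864"
# then restart FRR:
#   systemctl restart frr
```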
Another thing about my setup: I'm using ECMP with 4 paths, and I just found an issue with a workaround here:
https://vyos.dev/T5424
I did not see the errors that the other users reported in our logs, so I'm thinking we may not have this specific issue.
As I mentioned yesterday, there is an issue with our freshly patched and rebooted host. It may or may not be caused by the same underlying problem.
Symptoms:
After migrating to the affected host, the VM may stop responding to pings for anywhere from a few seconds to over 20 minutes. The ping originates from my workstation, which is outside of the virtual infrastructure.
Unlike the first VM/Host issue, this VM appears to allow most communications. It's communicating with NTP servers on the internet and DNS servers inside of our PVE cluster. I only saw that the ping was impacted.
Findings:
I found that the incoming ping packets are being dropped by the firewall between the zone vrf bridge (vrf_BSDnet) and the vnet bridge (buildup).
After enabling firewall logging on the affected host, it logged the drops:
Code:
0 5 PVEFW-HOST-OUT 19/Aug/2025:11:30:11 -0700 policy DROP: OUT=buildup SRC=10.1.220.60 DST=10.7.1.103 LEN=60 TOS=0x00 PREC=0x00 TTL=124 ID=36189 PROTO=ICMP TYPE=8 CODE=0 ID=1 SEQ=51263
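For anyone reproducing this, host-level drop logging like the line above can be enabled through the host firewall options (a sketch using the standard PVE locations; substitute your own node name):

```shell
# /etc/pve/nodes/<node>/host.fw  (or Node -> Firewall -> Options in the GUI)
# [OPTIONS]
# log_level_in: info
# log_level_out: info

# Dropped packets then appear in the firewall log:
tail -f /var/log/pve-firewall.log
```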
The host the VM is migrating from appears to have a significant impact on whether the issue occurs. We have one host that causes failure most of the time and another that is successful most of the time. Both hosts were patched to current, including frr, and rebooted recently. The difference is that one is an SDN exit node (mostly successful) and the other (mostly failure) is not.
From within the guest VM, pinging the gateway IP, an unused IP on the subnet, or another VM on the same or a different subnet has no impact.
Disabling the host firewall prevents the issue from happening.
Changing the host firewall to use nftables prevents the issue from happening.
Disabling ebtables in the datacenter firewall configuration had no impact.
Executing "pve-firewall restart" has no impact.
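For reference, the nftables test was done by flipping the per-host option (a tech preview in current PVE; I'm assuming your version exposes it the same way):

```shell
# /etc/pve/nodes/<node>/host.fw  (or Node -> Firewall -> Options in the GUI)
# [OPTIONS]
# nftables: 1

# With this set, the nftables-based proxmox-firewall service takes over
# from the iptables-based pve-firewall; confirm with:
systemctl status proxmox-firewall
```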
We tested at small scale for several months with the SDN configuration, minus the production SDN vnets. Then we started migrating the production workload and SDN vnets in May with minimal issues. We didn't start to encounter significant issues until we were adding the final hosts to the cluster at the end of July. There were no config changes at that time, other than adding the hosts.
Relevant configuration:
Datacenter firewall:
Status: Enabled
Input and Output policy: Drop
Forward policy: Accept
Rules: A couple of temporary rules and two security groups intended to allow acceptable host communications
Security Groups, Aliases, IPsets: Many
Host firewall:
Status: Enabled
Rules: none
nf_conntrack_allow_invalid: 1
Other settings: default
Vnet firewall: All are default: Disabled, Forward policy accept, no rules
PVE VM firewall of the test VM: Status: disabled
Guest VM firewall (inside of the vm): disabled
A typical VM firewall config for us would be enabled, input and output policy set to drop, and several security groups included from the datacenter configuration.
Thoughts and Assumptions:
I've not seen packets dropped at this location before. It seems like a place where a vnet firewall would take action, if it were enabled.
Since the problem resolves on its own after a random interval and doesn't occur with nftables enabled, it seems less likely to be an errant firewall rule and more likely to be a bug of some kind.
The reason I said this could be the same underlying issue is that if the firewall errantly drops ARP or all traffic, I think it would produce the same results seen when I originally created this thread.
When a VM migrates, does the firewall connection tracking information migrate with the VM? If yes, that might explain why the source host of a migration affects the likelihood of the problem presenting itself.
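One way to probe that hypothesis on the destination host (hypothetical commands using the example IPs from the log line above; requires the conntrack tool from conntrack-tools):

```shell
# Right after migration, check whether a stale ICMP conntrack entry
# exists for the affected flow:
conntrack -L -p icmp --dst 10.1.220.60

# Deleting the entry and re-testing the ping would show whether stale
# state is what clears itself after the random interval:
conntrack -D -p icmp --dst 10.1.220.60
```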
Thanks,
Erik