VM loss of communications in EVPN after SDN Apply

Erik Horn

PVE 8.4.1
frr and frr-pythontools have been rolled back to 8.5.2-1+pve1 due to bfd issues with v10

We've encountered an issue in our cluster where some VMs lose connectivity after clicking Apply in the SDN configuration. In at least one case, the issue was triggered with no configuration changes being applied.

When this happens, it seems to impact all VMs on the host, and often multiple hosts. It does not impact all hosts in the cluster. The only way we've found to get communications back is to migrate affected VMs to a different host. What's weirder is that we can immediately migrate the VM back to the host it was on and it remains functional.

We managed to get lucky and currently have a test VM exhibiting this issue on a host with no other VMs on it, so I can troubleshoot the issue without users complaining about services being down while I try to migrate a couple hundred VMs.

Pings from systems external to the virtual infrastructure, as well as from VMs connected to other vnets, receive a "Destination host unreachable" from the gateway. Pings from a VM connected to the same vnet succeed.

Using tcpdump on the host, I found that the affected VM is rapidly receiving ARP requests, several per second, from the gateway IP. The VM is replying to the ARP requests and the VM firewall is not blocking them. This behavior indicates that the gateway either didn't receive the ARP reply, or determined it to be invalid.

The host arp table has a valid entry for the IP address of the broken VM.

I've tried restarting frr and pve-firewall, and neither helps. When reviewing the logs on the host, I didn't see anything that indicates a problem. I don't know where to look next. I have some hosts that need to be moved in and out of the cluster, so I'm not at the point where I can stop making changes and let it run. Any assistance or insight would be appreciated.

Thanks,

Erik
 
Using tcpdump on the host, I found that the affected VM is rapidly receiving ARP requests, several per second, from the gateway IP. The VM is replying to the ARP requests and the VM firewall is not blocking them. This behavior indicates that the gateway either didn't receive the ARP reply, or determined it to be invalid.
Do you have ARP suppression disabled?

When this happens, it seems to impact all VMs on the host, and often multiple hosts. It does not impact all hosts in the cluster. The only way we've found to get communications back is to migrate affected VMs to a different host. What's weirder is that we can immediately migrate the VM back to the host it was on and it remains functional.
Can you check whether the affected nodes can establish a BGP session and get routes?

Code:
vtysh -c 'show bgp summary'
vtysh -c 'show bgp l2vpn evpn'
 
The BGP summary shows working BGP sessions to both the upstream physical routers and all the other PVE hosts.

The EVPN route list has around 2000 routes in it. However, the route for the impacted VM is missing.
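
(For reference, the check boiled down to something like this, using the broken VM's IP and MAC that come up later in the thread:)
Code:
# rough size of the EVPN table, and a search for anything belonging to the broken VM
vtysh -c 'show bgp l2vpn evpn' | wc -l
vtysh -c 'show bgp l2vpn evpn' | egrep '10\.7\.1\.106|bc:24:11:68:3a:36'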
 
Is this a silent VM? Does pinging the gateway from inside the VM make the route appear?
 
For the test case only, it's a Fedora live CD without any storage, specifically set up for testing. However, when I looked, it was receiving ARP requests and responding at a rate of several per second.

From the VM, I pinged the default gateway. Communications resumed until 10 seconds after stopping the ping. During this test, there is no /32 route in the kernel or bgp routing tables. All of our other VMs have /32 routes in both places.


Thanks,

Erik
 
While reviewing the thread, I realized I didn't answer the question about ARP suppression. In the EVPN zone configuration, the "Disable ARP-nd suppression" checkbox is not checked.
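
For what it's worth, the runtime state can also be checked on the VXLAN device itself; something like this should report neigh_suppress on when suppression is active (device name taken from the outputs further down):
Code:
# ARP/ND suppression state of the vnet's VXLAN device as seen by the bridge
bridge -d link show dev vxlan_buildup | grep -o 'neigh_suppress [a-z]*'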
 
From the VM, I pinged the default gateway. Communications resumed until 10 seconds after stopping the ping. During this test, there is no /32 route in the kernel or bgp routing tables. All of our other VMs have /32 routes in both places.
No /32 routes even during the ping? The issue usually is that silent VMs do not get entries in ip neigh, or the entries expire after a while. Do the other hosts continuously send traffic and therefore not run into this issue, perhaps? It might be that FRR momentarily loses the routes when the configuration is applied and FRR gets restarted. If the VMs do not send traffic afterwards, it is possible that the EVPN routes never get created.
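
A quick way to compare what the kernel has learned with what zebra has picked up for the VNI would be something like (placeholders for your vnet name and VNI):
Code:
# neighbour entries the kernel has on the vnet
ip neigh show dev <vnet>
# what zebra has imported into its EVPN ARP cache for that VNI
vtysh -c 'show evpn arp-cache vni <vni>'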

Do you have the task log for the SRV Reload Networking tasks on hosts where you know the issue appeared after reloading?


Although this is a bit curious:
The host arp table has a valid entry for the IP address of the broken VM.
Do you have the ip neigh output of the host which contains the VM at a time where there is no route in the EVPN table?


frr and frr-pythontools have been rolled back to 8.5.2-1+pve1 due to bfd issues with v10
We'll be deploying an updated FRR version soon that should contain a backported fix, I'll mention it here when it hits the repositories.
 
Do you have the task log for the SRV Reload Networking tasks on hosts where you know the issue appeared after reloading?
I don't have the full log, but I did note that the only output that seemed important was that all or most hosts were reporting the following. The problem seemed to happen as each host refreshed.

Code:
vrf_BSDnet : warning: vrf_BSDnet: post-up cmd 'ip route del vrf vrf_BSDnet unreachable default metric 4278198272' failed: returned 2 (RTNETLINK answers: No such process

From the VM, I pinged the default gateway. Communications resumed until 10 seconds after stopping the ping. During this test, there is no /32 route in the kernel or bgp routing tables. All of our other VMs have /32 routes in both places.

No /32 routes even during the ping? The issue usually is that silent VMs do not get entries in ip neigh, or the entries expire after a while. Do the other hosts continuously send traffic and therefore not run into this issue, perhaps? It might be that FRR momentarily loses the routes when the configuration is applied and FRR gets restarted. If the VMs do not send traffic afterwards, it is possible that the EVPN routes never get created.
When I was preparing this reply I discovered I'd made an error with this prior test: I'd pinged the subnet IP rather than the gateway. Pinging the default gateway from the broken guest does not affect the problem. I did find that pinging any unused IP on the subnet does allow communications until about 10 seconds after the pings stop.
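
For anyone following along, the /32 appearing and then aging out is easy to watch during such a test with something like this (VRF name and IP as in the outputs below):
Code:
# watch the kernel VRF route and the BGP EVPN table for the VM's /32
watch -n1 "ip route show vrf vrf_BSDnet | grep 10.7.1.106; vtysh -c 'show bgp l2vpn evpn' | grep 10.7.1.106"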

I've done a bunch of digging along a couple of lines of thought, gathered a lot of information, and came to some conclusions/assumptions. I'm not sure of the best way to present it all, so I'll start with the conclusions/assumptions, go on to what I did and found, then end with some potentially relevant configuration information. I hope it isn't too confusing or information overload.

Conclusion/Assumption: Whatever process is responsible for synchronizing the kernel ARP table into the EVPN starts, upon an SDN refresh, ignoring VMs that existed at the time of the refresh. The problem is not always triggered when refreshing the SDN configuration. VMs created on, or migrated to, the host after the refresh are not affected.

Reason: The output from "ip neighbor" shows the VM's IP address. The output from "show bgp l2vpn evpn" does not have a matching entry. I duplicated the affected VM and started it up on the same host. It works properly.
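
The checks behind that reasoning boil down to roughly this (vnet name and IP as used below):
Code:
# the kernel neighbour table knows the VM...
ip neigh show dev buildup | grep 10.7.1.106
# ...but there is no matching type-2 route in the EVPN table
vtysh -c 'show bgp l2vpn evpn' | grep 10.7.1.106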


Troubleshooting track #1:

I previously mentioned that the VM with communications issues was receiving, and replying to, ARP requests from the gateway at a rate of 1/s or faster. I was able to determine the reason for this.

We have "advertise subnets" enabled in our configuration. This was done to ease troubleshooting, so that pings and traceroutes for unused IPs in allocated ranges reach the PVE environment. It also allows the normal ARP process to find quiet VMs.

In our configuration, all nodes but one are configured as exit and entry nodes.

When pinging our broken VM from a workstation external to the PVE infrastructure, we found that the following happened:

Ping received by pmhost-dsc-8
pmhost-dsc-8 doesn't have an ARP entry for the IP, so it generates an ARP request from the subnet gateway IP
the ARP request is broadcast to all SDN zone members
pmhost-cc-1 receives the ARP request and forwards it to the VM
the VM replies to the ARP request
pmhost-cc-1 receives the ARP reply and processes it by updating the entry in the kernel
the ARP reply is not forwarded to the originator
pmhost-dsc-8 didn't get the ARP reply, so it responds to the external IP with destination host unreachable

As long as traffic for the broken VM is received by any host, this process repeats, causing ARP requests to be created once per second.
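
The repeating requests are easy to see with a capture along these lines (interface name assumed to be the vnet bridge):
Code:
# gateway ARP requests for the broken VM, and its replies, as seen on the vnet bridge
tcpdump -eni buildup arp and host 10.7.1.106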


Troubleshooting track #2:

Here is what the ARP and routing tables look like for the broken VM when it's doing nothing and when it's pinging an unknown IP on the local subnet.

In the output from these commands, I filtered it down to the following IPs:

10.7.1.96/27: The subnet the VM is connected to
10.7.1.97: The gateway
10.7.1.103: A clone of the broken vm, on the same host, that was created after the problem started. This vm works correctly.
10.7.1.106: The broken vm
10.7.1.120: Another VM on the same subnet, but different host. It works correctly.

From the broken host, while I'm attempting to ping the VM from outside of PVE. The VM is receiving and replying to ARP requests at least once per second.
Code:
# ip route|grep 10.7.1
10.7.1.96/27 nhid 9480 dev buildup proto bgp metric 20 
10.7.1.120 nhid 20098 via 10.3.150.106 dev vrfbr_BSDnet proto bgp metric 20 onlink 

# ip nei|grep 10.7.1
10.7.1.120 dev buildup lladdr bc:24:11:97:a6:d9 extern_learn NOARP proto zebra 
10.7.1.106 dev buildup lladdr bc:24:11:68:3a:36 REACHABLE 
10.7.1.103 dev buildup lladdr bc:24:11:7b:db:e9 REACHABLE 

# vtysh -c "show ip route" |fgrep 10.7.1
B>* 10.7.1.96/27 [20/0] is directly connected, buildup (vrf vrf_BSDnet), weight 1, 5d02h08m
B>* 10.7.1.120/32 [200/0] via 10.3.150.106, vrfbr_BSDnet (vrf vrf_BSDnet) onlink, weight 1, 05:05:36

# vtysh -c "show bgp l2vpn evpn" |egrep '68:3a:36|7b:db:e9|10\.7\.1\.'
 *>i[2]:[0]:[48]:[bc:24:11:97:a6:d9]:[32]:[10.7.1.120]
 *> [2]:[0]:[48]:[bc:24:11:7b:db:e9]
 *> [2]:[0]:[48]:[bc:24:11:7b:db:e9]:[32]:[10.7.1.103]
 *>i[5]:[0]:[27]:[10.7.1.96]

From the host where the pings are entering the SDN from the external network. The VM is otherwise not doing anything.
Code:
# ip route|grep 10.7.1
10.7.1.96/27 nhid 440 dev buildup proto bgp metric 20 
10.7.1.103 nhid 34689 via 10.6.150.101 dev vrfbr_BSDnet proto bgp metric 20 onlink 
10.7.1.120 nhid 943 via 10.3.150.106 dev vrfbr_BSDnet proto bgp metric 20 onlink 

# ip nei|grep 10.7.1
10.7.1.120 dev buildup lladdr bc:24:11:97:a6:d9 extern_learn NOARP proto zebra 
10.7.1.103 dev buildup lladdr bc:24:11:7b:db:e9 extern_learn NOARP proto zebra 
10.7.1.106 dev buildup INCOMPLETE 

# vtysh -c "show ip route" |fgrep 10.7.1
B>* 10.7.1.96/27 [20/0] is directly connected, buildup (vrf vrf_BSDnet), weight 1, 4d03h39m
B>* 10.7.1.103/32 [200/0] via 10.6.150.101, vrfbr_BSDnet (vrf vrf_BSDnet) onlink, weight 1, 00:13:30
B>* 10.7.1.120/32 [200/0] via 10.3.150.106, vrfbr_BSDnet (vrf vrf_BSDnet) onlink, weight 1, 05:10:10

# vtysh -c "show bgp l2vpn evpn" |egrep '68:3a:36|7b:db:e9|10\.7\.1\.'
 *>i[2]:[0]:[48]:[bc:24:11:97:a6:d9]:[32]:[10.7.1.120]
 *>i[2]:[0]:[48]:[bc:24:11:7b:db:e9]
 *>i[2]:[0]:[48]:[bc:24:11:7b:db:e9]:[32]:[10.7.1.103]
 *>i[5]:[0]:[27]:[10.7.1.96]

From the broken host, while the VM is pinging an unknown IP on the local subnet.
Code:
# ip route|grep 10.7.1
10.7.1.96/27 nhid 9480 dev buildup proto bgp metric 20 
10.7.1.120 nhid 20098 via 10.3.150.106 dev vrfbr_BSDnet proto bgp metric 20 onlink

# ip nei|grep 10.7.1
10.7.1.120 dev buildup lladdr bc:24:11:97:a6:d9 extern_learn NOARP proto zebra 
10.7.1.106 dev buildup lladdr bc:24:11:68:3a:36 REACHABLE 
10.7.1.103 dev buildup lladdr bc:24:11:7b:db:e9 REACHABLE

# vtysh -c "show ip route" |fgrep 10.7.1
B>* 10.7.1.96/27 [20/0] is directly connected, buildup (vrf vrf_BSDnet), weight 1, 4d07h28m
B>* 10.7.1.103/32 [200/0] via 10.6.150.101, vrfbr_BSDnet (vrf vrf_BSDnet) onlink, weight 1, 04:01:59
B>* 10.7.1.120/32 [200/0] via 10.3.150.106, vrfbr_BSDnet (vrf vrf_BSDnet) onlink, weight 1, 08:58:39

# vtysh -c "show bgp l2vpn evpn" |egrep '68:3a:36|7b:db:e9|10\.7\.1\.'
 *>i[2]:[0]:[48]:[bc:24:11:97:a6:d9]:[32]:[10.7.1.120]
 *> [2]:[0]:[48]:[bc:24:11:7b:db:e9]
 *> [2]:[0]:[48]:[bc:24:11:7b:db:e9]:[32]:[10.7.1.103]
 *>i[5]:[0]:[27]:[10.7.1.96]

From the host where the pings are entering the SDN from the external network, while the VM is pinging an unknown IP on the local subnet.
Code:
# ip route|grep 10.7.1
10.7.1.96/27 nhid 440 dev buildup proto bgp metric 20 
10.7.1.103 nhid 34689 via 10.6.150.101 dev vrfbr_BSDnet proto bgp metric 20 onlink 
10.7.1.120 nhid 943 via 10.3.150.106 dev vrfbr_BSDnet proto bgp metric 20 onlink 

# ip nei|grep 10.7.1
10.7.1.120 dev buildup lladdr bc:24:11:97:a6:d9 extern_learn NOARP proto zebra 
10.7.1.103 dev buildup lladdr bc:24:11:7b:db:e9 extern_learn NOARP proto zebra 
10.7.1.106 dev buildup FAILED 
### The entry for 10.7.1.106 cycles between DELAY, PROBE, FAILED, and INCOMPLETE

# vtysh -c "show ip route" |fgrep 10.7.1
B>* 10.7.1.96/27 [20/0] is directly connected, buildup (vrf vrf_BSDnet), weight 1, 4d07h34m
B>* 10.7.1.103/32 [200/0] via 10.6.150.101, vrfbr_BSDnet (vrf vrf_BSDnet) onlink, weight 1, 04:08:23
B>* 10.7.1.120/32 [200/0] via 10.3.150.106, vrfbr_BSDnet (vrf vrf_BSDnet) onlink, weight 1, 09:05:03

# vtysh -c "show bgp l2vpn evpn" |egrep '68:3a:36|7b:db:e9|10\.7\.1\.'
 *>i[2]:[0]:[48]:[bc:24:11:97:a6:d9]:[32]:[10.7.1.120]
 *>i[2]:[0]:[48]:[bc:24:11:7b:db:e9]
 *>i[2]:[0]:[48]:[bc:24:11:7b:db:e9]:[32]:[10.7.1.103]
 *>i[5]:[0]:[27]:[10.7.1.96]

Additional configuration information:

controllers.cfg: the EVPN controller for the zone, and a representative BGP uplink
Code:
evpn: BSDnet
        asn 65200
        peers 10.3.150.104,10.3.150.105,10.3.150.106,10.3.150.107,10.3.150.108,10.3.150.109,10.6.150.101,10.6.150.104,10.6.150.105,10.6.150.106,10.6.150.107,10.6.150.108,10.6.150.109,10.254.30.230

bgp: bgppmhost-cc-1
        asn 65200
        node pmhost-cc-1
        peers 10.6.150.10,10.6.150.11
        bgp-multipath-as-path-relax 0
        ebgp 1

zones.cfg
Code:
evpn: BSDnet
        controller BSDnet
        vrf-vxlan 1000000
        advertise-subnets 1
        exitnodes pmhost-dsc-6,pmhost-dsc-4,pmhost-cc-5,pmhost-cc-4,pmhost-dsc-9,pmhost-cc-9,pmhost-cc-6,pmhost-cc-1,pmhost-dsc-5,pmhost-dsc-8,pmhost-dsc-7,pmhost-cc-8,pmhost-cc-7
        ipam pve
        mac BC:24:11:A8:25:90
        mtu 9148
        nodes pmhost-dsc-8,pmhost-dsc-7,pmhost-cc-8,pmhost-cc-7,pmhost-cc-9,pmhost-cc-6,pmhost-cc-1,pmhost-dsc-5,pmhost-cc-4,pmhost-dsc-9,pmhost-dsc-6,pmhost-dsc-4,pmhost-cc-5,pmhost-witness

The vnets.cfg definition for the vnet used by the broken VM
Code:
vnet: buildup
        zone BSDnet
        alias buildup
        tag 1001096

The subnets.cfg definition for the subnet used by the broken VM
Code:
subnet: BSDnet-10.7.1.96-27
        vnet buildup
        gateway 10.7.1.97

Thanks,

Erik
 
It seems like the issue is then that for that particular VM FRR does not announce a Type 2 Route via EVPN. Is this an isolated host, or a host that you can isolate? Then we could turn on some FRR debugging options and check if something interesting is hidden there.
 
It seems like the issue is then that for that particular VM FRR does not announce a Type 2 Route via EVPN. Is this an isolated host, or a host that you can isolate? Then we could turn on some FRR debugging options and check if something interesting is hidden there.
Yes, I can run debugging commands on this host.

Thanks,

Erik
 
frr and frr-pythontools have been rolled back to 8.5.2-1+pve1 due to bfd issues with v10

FYI, we have deployed a fix for the BFD issues with FRR 10.2.3-1+pve1 to our no-subscription repositories.


Can you add the following lines to your frr.conf (right under log syslog):

Code:
debug zebra events
debug zebra vxlan
debug zebra rib
debug zebra kernel
debug zebra nexthop
debug bgp zebra


and change

Code:
log syslog informational

to:

Code:
log syslog debug


Make sure that there is an entry in ip neigh for the broken VM and then restart FRR and recheck ip neigh. Can you then post the logs:

Code:
journalctl -u frr > frr_debug.txt
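
It might also help to follow the log live while reproducing the issue, e.g.:
Code:
journalctl -u frr -f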
 
I have not yet tried the newer version of frr. I plan to do this after posting this message.

I made the config changes, restarted frr, and captured the log after it had been running for a while.

My non-expert assessment of the log is that the broken VM's IP and MAC are getting added to the EVPN, and then removed shortly after.

In trying to better understand the logs, my googling led me to the following command, which shows that the MAC and IPs are in the EVPN but inactive. Again, I'm not sure why, but I thought it might be useful. The first command is for the broken VM, and the second is for the working VM on the same host.

Code:
root@pmhost-cc-1:~# vtysh -c 'show evpn mac vni 1001096 mac bc:24:11:68:3a:36'
MAC: bc:24:11:68:3a:36
 Auto Mac 
 Sync-info: neigh#: 0
 Local Seq: 0 Remote Seq: 0
 Uptime: 06:27:04
 Neighbors:
    10.7.1.106 Inactive
    fe80::f59e:8dc6:fd97:2d4 Inactive

root@pmhost-cc-1:~# vtysh -c 'show evpn mac vni 1001096 mac bc:24:11:7b:db:e9'
MAC: bc:24:11:7b:db:e9
 Intf: fwpr123p0(556) VLAN: 0
 Sync-info: neigh#: 0
 Local Seq: 0 Remote Seq: 0
 Uptime: 06:27:15
 Neighbors:
    10.7.1.103 Active

I trimmed it to only include output from after frr was started. I tried to attach it, but it was too large. I shared it via google docs.

Useful information:

Host: pmhost-cc-1, 10.6.150.101/24
Vnet/Subnet: Name buildup, IP Range 10.7.1.96/27, Gateway 10.7.1.97, VNI 1001096
Broken VM: ID 247, IP 10.7.1.106, Mac bc:24:11:68:3a:36
Working VM on the same host: ID 123, IP 10.7.1.103, Mac bc:24:11:7b:db:e9

Thanks,

Erik
 
Upgraded to frr 10.2.3-1+pve1 on the impacted host. The BFD issue we were experiencing with older versions of v10 has been resolved.

The issue from this post remains. I did not see any change in behaviour after the update. I did capture a new log, frr_10.log.

Thanks,

Erik
 
I've continued my attempts to troubleshoot this issue and believe the log snippet below shows the likely cause of the issue I'm seeing. It appears shortly after restarting frr on the impacted node. The snippet below is newer, with additional debugging options enabled, so it's not exactly the same as the uploaded logs from yesterday.

In the uploaded logs, it's found starting on line 2884 of the original v8 log, and line 3039 of the v10 log. Across the many logs I've reviewed, the IP of the VTEP claiming the IP address has not changed.

Code:
1024074-Aug 13 14:29:51 pmhost-cc-1 zebra[3656309]: [KMXEB-K771Y] netlink_parse_info: netlink-cmd (NS 0) type RTM_NEWNEIGH(28), len=68, seq=107, pid=4173761997
1024075:Aug 13 14:29:51 pmhost-cc-1 zebra[3656309]: [HM5M4-AQPPX] Rx RTM_NEWNEIGH AF_BRIDGE IF 372 st 0x2 fl 0x12 MAC bc:24:11:68:3a:36 dst 10.3.150.105 nhg 0 vni 0
1024076:Aug 13 14:29:51 pmhost-cc-1 zebra[3656309]: [QEDXC-E5122] dpAdd remote MAC bc:24:11:68:3a:36 VID 1
1024077:Aug 13 14:29:51 pmhost-cc-1 zebra[3656309]: [VAK89-SNH8J] Add/update remote MAC bc:24:11:68:3a:36 intf vxlan_buildup(372) VNI 1001096 flags 0x1 - del local
1024078:Aug 13 14:29:51 pmhost-cc-1 zebra[3656309]: [RWQPR-6BEC9] Send MACIP Del f None  state 1 MAC bc:24:11:68:3a:36 IP (null) seq 0 L2-VNI 1001096 ESI - to bgp
1024079:Aug 13 14:29:51 pmhost-cc-1 zebra[3656309]: [JWQ3J-TKSAT] zebra_evpn_mac_del: MAC bc:24:11:68:3a:36 flags LOC
1024080:Aug 13 14:29:51 pmhost-cc-1 zebra[3656309]: [N0ZA5-S7FHE] VNI 1001096 MAC bc:24:11:68:3a:36 unlinked from ifp fwpr247p0 (546)

What I believe this means is that a routing update was received indicating that the MAC address of my broken VM is active on another VTEP, 10.3.150.105. It then disconnects it.

The VTEP mentioned is a PVE cluster member. I can't find the MAC address in the ip neighbor list, FRR EVPN routes, arp-cache, etc. The MAC address is not duplicated between VMs.

Additional troubleshooting steps that I tried:

Evacuated the VMs from the 10.3.150.105 host.
Restarted frr on the host containing the broken VM. Problem and log entries still present.
Restarted frr on the 10.3.150.105 host.
Restarted frr on the host containing the broken VM. Problem and log entries still present.
Upgraded frr on the 10.3.150.105 host to the v10 version and ensured it restarted.
Restarted frr on the host containing the broken VM. Problem and log entries still present.
Stopped frr on the 10.3.150.105 host.
Restarted frr on the host containing the broken VM. Problem and log entries still present.
Started frr on the 10.3.150.105 host.
On the host containing the broken VM, ran "clear bgp l2vpn evpn *".
Restarted frr on the host containing the broken VM. Problem and log entries still present.

Based on these results, it seems like this MAC address is stuck somewhere. It's not coming from the 10.3.150.105 host either.

Unless somebody has a better idea, I think the next steps are to go through each node, evacuate the VMs and upgrade frr. Once all of the nodes are running the same version, test again.

Thanks,

Erik
 
Possibly changing the MAC address of the VM could resolve this problem? It would nevertheless still be interesting to find out what's causing this.


What I believe this means is that a routing update was received indicating that the MAC address of my broken VM is active on another VTEP, 10.3.150.105. It then disconnects it.
I think this means that the neighbor table gets a new entry, since RTM_NEWNEIGH is a netlink message that indicates such an event (see [1]) - so the interesting part would be to find out why this neighbor table entry gets created. So probably not received as an update via BGP, but rather as a response to a message from the kernel?

This then seems to in turn call zebra_vxlan_check_del_local_mac, which logs the 'Add/update remote MAC' message. The description of this says the following:

Code:
/*
 * Handle notification of MAC add/update over VxLAN. If the kernel is notifying
 * us, this must involve a multihoming scenario. Treat this as implicit delete
 * of any prior local MAC.
 */

Which seems to explain the behavior.

Maybe tcpdumping on the bridge would show packets that cause the kernel to learn the wrong destination for your MAC address?
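
A capture along these lines might show where the bogus learn is coming from (interface names taken from your earlier outputs, so treat them as examples):
Code:
# frames carrying the VM's MAC, as seen on the VXLAN device and on the vnet bridge
tcpdump -eni vxlan_buildup ether host bc:24:11:68:3a:36
tcpdump -eni buildup ether host bc:24:11:68:3a:36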

[1] https://man7.org/linux/man-pages/man7/rtnetlink.7.html
 
Seems like the same happens for 10.7.1.106 as well:

Code:
Aug 12 15:34:40 pmhost-cc-1 zebra[771154]: [KKAC1-JMWTB] Rx RTM_NEWNEIGH family ipv4 IF buildup(373) vrf vrf_BSDnet(219) IP 10.7.1.106 MAC bc:24:11:68:3a:36 state 0x2 flags 0x0 ext_flags 0x0
Aug 12 15:34:40 pmhost-cc-1 zebra[771154]: [J1Q9Y-TFAYN] Add/Update neighbor 10.7.1.106 MAC bc:24:11:68:3a:36 intf buildup(373) state 0x2 -> L2-VNI 1001096
Aug 12 15:34:40 pmhost-cc-1 zebra[771154]: [Q256S-D2B4T] AUTO MAC bc:24:11:68:3a:36 created for neigh 10.7.1.106 on VNI 1001096
Aug 12 15:34:40 pmhost-cc-1 zebra[771154]: [JWQ3J-TKSAT] zebra_evpn_mac_add: MAC bc:24:11:68:3a:36 flags None

Instead of grepping for the IP, could you try grepping for the MAC address in the neighbor table and see if there are any entries?
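
For example, something along these lines (also checking the bridge forwarding database, since that is where remote VXLAN MACs end up):
Code:
ip neigh | grep -i 'bc:24:11:68:3a:36'
bridge fdb show | grep -i 'bc:24:11:68:3a:36'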
 
I think this means that the neighbor table gets a new entry, since RTM_NEWNEIGH is a netlink message that indicates such an event (see [1]) - so the interesting part would be to find out why this neighbor table entry gets created. So probably not received as an update via BGP, but rather as a response to a message from the kernel?
You are correct. I confirmed it by capturing packets on the vnet bridge and the host uplink while restarting frr. The MAC address/route was removed before any BGP sessions opened, any VXLAN traffic was received at the host, or any traffic destined for the guest arrived.

Knowing that the source of the issue was not external narrowed the search. I still saw nothing suspicious in the neighbor table, but then I found the bridge FDB table. There I found what I thought was suspicious: old remote entries for my broken VM, and the broken VM's MAC address associated with a firewall device of the working VM. Unfortunately I didn't copy the data.

While looking at the bridge tables, I was migrating my working VM in and out of the host and restarting frr while watching how its bridge entries changed, in an attempt to figure out what they should look like. Then the working VM suddenly wasn't working any more, showing the same symptoms as the broken VM. I then backtracked through the last few things I'd done to isolate the issue further.

I found that migrating the broken VM to another host caused it to resume network communications, even after migrating it back to the original host. This is the same as originally reported. However, if a broken VM is running on the host where it initially stopped working (after being migrated away and back again), restarting frr will stop communications. This confirmed that there was an issue with the host that isn't random, and that what I thought was a resolution to the many VMs losing communications (migrating them to another host) did not solve the underlying issue.

At this point I believed the issue was in the bridging table, but, not knowing exactly what the issue was and getting frustrated trying to figure out the solution, I decided to take the sledgehammer approach and simply start clearing the bridging tables associated with the broken VMs until I had cleared them all or the problem disappeared.

I started with "bridge fdb flush dev vxlan_buildup", since that is where all of the VTEP IPs were listed and that somehow played into the issue. By itself, that didn't fix it. After clearing the entries, though, nothing recreated the ones that should have been there, so I restarted frr. Entries were created in that table and the broken VMs started working. I tried everything I could think of to break them again: migrating them back and forth, as well as restarting frr. They kept on working. My assumption is that I cleared out whatever was broken.
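
For reference, the sequence was roughly this (interface name as above):
Code:
# inspect the VXLAN forwarding database, flush it, then let frr repopulate it
bridge fdb show dev vxlan_buildup
bridge fdb flush dev vxlan_buildup
systemctl restart frr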

Since I could no longer reproduce the issue, there was no reason to keep the system in its current state, so I triggered a network reload to remove the manual modifications to frr.conf, patched the host to current, and rebooted it.

After the reboot, I migrated the test VMs back to it, tried again to break it, and luckily failed. When reviewing the bridge table entries, there are still entries that seem suspicious, but it's working, so it must be normal.

I'm going to be optimistic and hope that whatever problem I ran into was somehow related to rolling back the frr packages, not having all parts of the system updated in sync, or maybe a bug that will be squashed by patching. Next steps will be to carefully roll the updates through all the nodes.

Thanks,

Erik
 
Do you have MAC learning disabled on your bridge via our configuration? Maybe there's an issue related to that? That's the only thing I could think of for now.
 
Hi,

I have already seen this in production (on PVE 8), but I never found what the problem was. (So I'm currently doing network config updates carefully, with an empty host, and reloading FRR one by one.)

I wonder if it could be related to the fwbr bridges for iptables, where FRR might have trouble listening to the netlink messages coming from a lot of bridges and end up losing some messages. (I need to retest with nftables, without fwbr, to see.)

It's really random, I can't reproduce it easily, and the only way to fix it was to restart/migrate the VM to another host.
 
Do you have MAC learning disabled on your bridge via our configuration?
I'm unsure of what you are asking for. I googled "proxmox mac learning" and found references to setting bridge-disable-mac-learning in interfaces or interfaces.d/sdn. We have not done this.
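
This is roughly how I checked (a sketch; paths as on a standard PVE install):
Code:
# no bridge-disable-mac-learning anywhere in the network configuration
grep -rn 'bridge-disable-mac-learning' /etc/network/interfaces /etc/network/interfaces.d/
# count bridge ports with learning turned off (0 means learning is on everywhere)
bridge -d link show | grep -c 'learning off'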

I have already seen this in production (on PVE 8), but I never found what the problem was. (So I'm currently doing network config updates carefully, with an empty host, and reloading FRR one by one.)
This is what I was planning to do. However, I stumbled into a similar but different issue when trying to migrate VMs onto a host that I had patched to current (including frr), pushed the SDN config to, and rebooted a few days ago. The new issue is just as odd as the first one, and I'm still trying to figure out what's happening well enough to post about it.

I wonder if it could be related to the fwbr bridges for iptables, where FRR might have trouble listening to the netlink messages coming from a lot of bridges and end up losing some messages. (I need to retest with nftables, without fwbr, to see.)
What would you consider a large number of bridges? When I first started looking into this problem, I noticed we had about 2000 evpn routes in bgp and searched for frr limits, but didn't find any. Our hosts have about 100 bridges from vnets alone.

It's really random, I can't reproduce it easily, and the only way to fix it was to restart/migrate the VM to another host.
Unfortunately, I can reproduce this too often, but mostly on production systems where I can't take the time to troubleshoot.

I have hardware for a test environment, but it's currently in the production cluster because we needed additional capacity to migrate our vmware VMs and hosts to proxmox. I encountered this issue when attempting to add the final batch of hosts to production. Once that's done, I'll remove the test hardware from prod and be able to test and troubleshoot this issue in a safe place.

Thanks,

Erik