Networking somewhat broken after upgrade (BFD + BGP)

Dec 3, 2024
After upgrading to 8.4.1 (which took FRR from 8.x to 10.x), BFD and BGP stopped working: BFD constantly reports its sessions as DOWN even though the links are up. One machine in each cluster is affected, and on each affected machine it hits *all* connections. Disabling BFD works around the issue, but I don't understand why it's broken.
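For anyone who wants to compare, the state can be checked with standard vtysh show commands plus the kernel link state:

```
# BFD session state as seen by bfdd
vtysh -c "show bfd peers brief"

# BGP session state per peer / address family
vtysh -c "show bgp summary"

# kernel view of link state, to rule out a physical problem
ip -br link show
```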

Configuration:

```
router bgp 4280323229
 bgp router-id 10.1.1.3
 no bgp ebgp-requires-policy
 no bgp default ipv4-unicast
 bgp deterministic-med
 bgp bestpath as-path multipath-relax
 bgp bestpath compare-routerid
 timers bgp 3 9
 neighbor haqua_default peer-group
 neighbor haqua_default remote-as external
 neighbor haqua_default bfd
 neighbor haqua_default capability extended-nexthop
 neighbor sf_fw peer-group
 neighbor sf_fw remote-as external
 neighbor sf_fw bfd
 neighbor sf_fw capability extended-nexthop
 neighbor sl_fw peer-group
 neighbor sl_fw remote-as external
 neighbor sl_fw bfd
 neighbor sl_fw capability extended-nexthop
 neighbor underlay peer-group
 neighbor underlay remote-as external
 neighbor underlay capability extended-nexthop
 neighbor vlan345 interface peer-group haqua_default
 neighbor 2a13:2142:1:9::f1 peer-group sf_fw
 neighbor 2a13:2142:1:9::f2 peer-group sf_fw
 neighbor 2a13:2142:1:9::f3 peer-group sf_fw
 neighbor vlan108 interface peer-group sl_fw
 neighbor ens1f0np0 interface peer-group underlay
 neighbor ens1f1np1 interface peer-group underlay
 neighbor ens1f3np3 interface peer-group underlay
 !
 address-family ipv4 unicast
  redistribute connected route-map loopback
  neighbor sf_fw activate
  neighbor sf_fw prefix-list default_only in
  neighbor sf_fw prefix-list lo out
  neighbor sl_fw activate
  neighbor sl_fw prefix-list to_sl_fw_adv in
  neighbor sl_fw prefix-list default_adv out
  neighbor underlay activate
 exit-address-family
 !
 address-family ipv6 unicast
  redistribute connected
  neighbor haqua_default activate
  neighbor haqua_default prefix-list avernus_adv in
  neighbor haqua_default prefix-list default_adv out
  neighbor sf_fw activate
  neighbor sf_fw prefix-list to_sf_fw_adv in
  neighbor sf_fw prefix-list default_adv out
  neighbor sl_fw activate
  neighbor sl_fw prefix-list to_sl_fw_adv in
  neighbor sl_fw prefix-list default_adv out
  neighbor underlay activate
 exit-address-family
 !
 address-family l2vpn evpn
  neighbor underlay activate
  advertise-all-vni
  vni 261
   rd 1103:261
  exit-vni
  advertise-svi-ip
 exit-address-family
exit
!
```

The underlay peer group is what's affected.
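For the unnumbered underlay sessions specifically, the per-neighbor output shows what bgpd thinks BFD is doing (the interface name here is just one of ours as an example):

```
# per-neighbor detail; the BFD section shows the status bgpd is acting on
vtysh -c "show bgp neighbors ens1f3np3"

# bfdd's own view of the sessions, plus counters
vtysh -c "show bfd peers"
vtysh -c "show bfd peers counters"
```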

A short log excerpt:

```
2025-04-14T19:09:48.882141+02:00 shiki3 watchfrr[23394]: [QDG3Y-BY5TN] zebra state -> up : connect succeeded
2025-04-14T19:09:48.882194+02:00 shiki3 watchfrr[23394]: [QDG3Y-BY5TN] mgmtd state -> up : connect succeeded
2025-04-14T19:09:48.882242+02:00 shiki3 watchfrr[23394]: [QDG3Y-BY5TN] bgpd state -> up : connect succeeded
2025-04-14T19:09:48.882269+02:00 shiki3 watchfrr[23394]: [QDG3Y-BY5TN] staticd state -> up : connect succeeded
2025-04-14T19:09:48.882292+02:00 shiki3 watchfrr[23394]: [QDG3Y-BY5TN] bfdd state -> up : connect succeeded
2025-04-14T19:09:48.882318+02:00 shiki3 watchfrr[23394]: [KWE5Q-QNGFC] all daemons up, doing startup-complete notify
2025-04-14T19:09:49.606852+02:00 shiki3 zebra[23406]: [V98V0-MTWPF] client 51 says hello and bids fair to announce only bgp routes vrf=0
2025-04-14T19:09:52.994648+02:00 shiki3 bgpd[23413]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
2025-04-14T19:09:52.994817+02:00 shiki3 bgpd[23413]: [H4B4J-DCW2R][EC 33554455] ens1f3np3 [Error] bgp_read_packet error: Connection reset by peer
2025-04-14T19:09:55.183662+02:00 shiki3 bgpd[23413]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
2025-04-14T19:09:56.221385+02:00 shiki3 bgpd[23413]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
2025-04-14T19:10:09.451370+02:00 shiki3 zebra[23406]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-04-14T19:10:09.455245+02:00 shiki3 watchfrr[23394]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-04-14T19:10:09.456423+02:00 shiki3 bfdd[23423]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-04-14T19:10:09.505545+02:00 shiki3 mgmtd[23411]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-04-14T19:10:09.778039+02:00 shiki3 bgpd[23413]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-04-14T19:10:09.852405+02:00 shiki3 zebra[23406]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-04-14T19:10:09.856607+02:00 shiki3 watchfrr[23394]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-04-14T19:10:09.857741+02:00 shiki3 bfdd[23423]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-04-14T19:10:09.904466+02:00 shiki3 mgmtd[23411]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-04-14T19:10:10.021029+02:00 shiki3 bgpd[23413]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-04-14T19:10:10.081139+02:00 shiki3 watchfrr[23394]: [WFP93-1D146] configuration write completed with exit code 0
2025-04-14T19:10:11.579500+02:00 shiki3 watchfrr[23394]: [WFP93-1D146] configuration write completed with exit code 0
2025-04-14T19:10:14.024161+02:00 shiki3 bgpd[23413]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
2025-04-14T19:10:14.024278+02:00 shiki3 bgpd[23413]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
2025-04-14T19:10:14.024309+02:00 shiki3 bgpd[23413]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
2025-04-14T19:10:14.024335+02:00 shiki3 bgpd[23413]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
2025-04-14T19:10:14.024360+02:00 shiki3 bgpd[23413]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
2025-04-14T19:10:14.024385+02:00 shiki3 bgpd[23413]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
2025-04-14T19:10:15.064907+02:00 shiki3 watchfrr[23394]: [NG1AJ-FP2TQ] Terminating on signal
2025-04-14T19:10:15.174976+02:00 shiki3 zebra[23406]: [N5M5Y-J5BPG][EC 4043309121] Client 'bfd' (session id 0) encountered an error and is shutting down.
2025-04-14T19:10:15.175152+02:00 shiki3 bgpd[23413]: [ZW1GY-R46JE] Terminating on signal
2025-04-14T19:10:15.175456+02:00 shiki3 mgmtd[23411]: [X3G8F-PM93W] BE-adapter: mgmt_msg_read: got EOF/disconnect
2025-04-14T19:10:15.175500+02:00 shiki3 zebra[23406]: [JPSA8-5KYEA] client 44 disconnected 0 bfd routes removed from the rib
2025-04-14T19:10:15.175525+02:00 shiki3 zebra[23406]: [S929C-NZR3N] client 44 disconnected 0 bfd nhgs removed from the rib
2025-04-14T19:10:15.175549+02:00 shiki3 mgmtd[23411]: [J2RAS-MZ95C] Terminating on signal
2025-04-14T19:10:15.175604+02:00 shiki3 zebra[23406]: [N5M5Y-J5BPG][EC 4043309121] Client 'static' (session id 0) encountered an error and is shutting down.
2025-04-14T19:10:15.175644+02:00 shiki3 zebra[23406]: [X3G8F-PM93W] BE-client: mgmt_msg_read: got EOF/disconnect
2025-04-14T19:10:15.175674+02:00 shiki3 zebra[23406]: [XVBTQ-5QTVQ] Terminating on signal
2025-04-14T19:10:15.176151+02:00 shiki3 zebra[23406]: [JPSA8-5KYEA] client 18 disconnected 58 bgp routes removed from the rib
2025-04-14T19:10:15.176204+02:00 shiki3 zebra[23406]: [S929C-NZR3N] client 18 disconnected 0 bgp nhgs removed from the rib
2025-04-14T19:10:15.176267+02:00 shiki3 bgpd[23413]: [YAF85-253AP][EC 100663299] buffer_write: write error on fd 15: Broken pipe
2025-04-14T19:10:15.176298+02:00 shiki3 bgpd[23413]: [X6B3Y-6W42R][EC 100663302] zclient_send_message: buffer_write failed to zclient fd 15, closing
2025-04-14T19:10:15.176335+02:00 shiki3 zebra[23406]: [JPSA8-5KYEA] client 32 disconnected 0 vnc routes removed from the rib
2025-04-14T19:10:15.176365+02:00 shiki3 zebra[23406]: [S929C-NZR3N] client 32 disconnected 0 vnc nhgs removed from the rib
2025-04-14T19:10:15.176411+02:00 shiki3 zebra[23406]: [JPSA8-5KYEA] client 39 disconnected 0 static routes removed from the rib
```
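In case it helps with debugging: the reason bfdd declares a session down should show up once the BFD and BGP debugs are enabled. A minimal sketch (the log file path is just an example):

```
# log file needs config mode; the debug toggles work from enable mode
vtysh -c "configure terminal" -c "log file /var/log/frr/debug.log debugging"
vtysh -c "debug bfd peer" -c "debug bfd network" -c "debug bgp neighbor-events"
```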
 
Seems like there is an issue with BFD being flaky in 10.2.1; I was able to reproduce this on my test cluster as well. What worked for me was resetting BFD as follows on the nodes where the errors occurred:

Code:

```
$ vtysh
shiki3# conf t
shiki3(config)# router bgp <asn>

! for each neighbor / peer group using bfd:
shiki3(config-router)# no neighbor <name> bfd
shiki3(config-router)# neighbor <name> bfd
```
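If you need to do this on several nodes, here is a minimal non-interactive sketch using vtysh -c, assuming the BFD-enabled peer groups and the ASN from the config above (haqua_default, sf_fw, sl_fw, 4280323229); adjust the names to your own setup:

```
#!/bin/sh
# Toggle BFD off and back on for each BFD-enabled peer group.
# Peer-group names and ASN are taken from the config above -- adjust for your setup.
for pg in haqua_default sf_fw sl_fw; do
    vtysh -c "configure terminal" \
          -c "router bgp 4280323229" \
          -c "no neighbor $pg bfd" \
          -c "neighbor $pg bfd"
done
```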

This seems to be fixed in 10.2.2; we'll see if we can get that version onto our test environment soon.
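For reference, you can double-check which FRR version a node is actually running with the following (the package query depends on the distro):

```
# FRR version as reported by vtysh
vtysh -c "show version" | head -n 1

# package version (Debian-based or RPM-based)
dpkg -s frr 2>/dev/null | grep Version || rpm -q frr
```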