Using the Proxmox SDN to manage host connectivity with BGP

thecurseofrng

New Member
May 2, 2022
Paris, France
Hello,

We're currently trying out the Proxmox SDN to manage not only guests (using the EVPN controller) but also host connectivity over BGP, to experiment with interface redundancy using L3 protocols. It's a bit spotty, of course, since it obviously wasn't designed for that, but I wanted to give some feedback and see whether anything could be changed to make it smoother, whether the core idea is manageable, or whether it's absolutely out of the question and we should rather keep hosts and guests fully separated one way or another. (We haven't really tried to see whether two BGP daemons can cooperate on the same host, for example by installing bird alongside frr.)

To give a bit of context, our IP fabric uses the CGNAT range (100.64.0.0/10). Bare metal hosts rely on two network interfaces, each carrying an underlay IP, and the "main" overlay IP is configured on a dummy loopback. Each interface is wired to one of two TOR switches, which also act as BGP peers (the 100.79.10.158 and 100.79.10.190 IPs below).

The ifupdown configuration looks like this:

Code:
auto lo
iface lo inet loopback


auto dummy1
iface dummy1 inet static
        address 100.79.18.1/32
        pre-up ip link add dummy1 type dummy
        post-up ip addr add 100.79.18.1/32 dev dummy1 # the address directive can be somewhat unreliable with interfaces you just created during pre-up
        post-up ip route add blackhole default metric 100

auto eno1
iface eno1 inet static
        address 100.79.10.129/27
        mtu 9000

auto eno2
iface eno2 inet static
        address 100.79.10.161/27
        mtu 9000

source /etc/network/interfaces.d/*
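
As a quick sanity check of the addressing above, here is a minimal Python sketch (standard library only). It confirms that the loopback and both underlay addresses sit inside the CGNAT range, and that each TOR peer falls inside the /27 of one interface. The pairing of eno1 with 100.79.10.158 and eno2 with 100.79.10.190 is an assumption inferred from the subnets, and the function name is made up for illustration.

```python
import ipaddress

CGNAT = ipaddress.ip_network("100.64.0.0/10")

# Addresses copied from the ifupdown configuration above.
loopback = ipaddress.ip_interface("100.79.18.1/32")
underlay = {
    "eno1": ipaddress.ip_interface("100.79.10.129/27"),
    "eno2": ipaddress.ip_interface("100.79.10.161/27"),
}
# Assumed pairing of each interface with its TOR peer (inferred from the /27s).
tor_peer = {
    "eno1": ipaddress.ip_address("100.79.10.158"),
    "eno2": ipaddress.ip_address("100.79.10.190"),
}


def check_fabric():
    """Return a dict of named boolean checks on the addressing plan."""
    results = {"loopback_in_cgnat": loopback.ip in CGNAT}
    for name, iface in underlay.items():
        results[f"{name}_in_cgnat"] = iface.ip in CGNAT
        # The peer must be on-link for a directly connected eBGP session.
        results[f"{name}_peer_on_link"] = tor_peer[name] in iface.network
    return results


if __name__ == "__main__":
    for key, ok in check_fabric().items():
        print(f"{key}: {ok}")
```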

We started our tests using the Proxmox 7.2 SDN. This is the zone definition:

Code:
evpn: fmdc5
    controller fmdc5
    vrf-vxlan 80000
    ipam pve
    mac 00:03:20:00:F3:EE
    mtu 8950
    rt-import 64600:90007,64600:90008,64600:90009,64600:80000,64600:11200

And our current controller definitions:

Code:
evpn: fmdc5
    asn 64620
    peers 100.79.10.158,100.79.10.190
   
bgp: bgppx1
    asn 64620
    node px1
    peers 100.79.10.158,100.79.10.190
    bgp-multipath-as-path-relax 0
    ebgp 1
    loopback dummy1

We noticed the EVPN controller code had provisions to load side config from frr.conf.local, and ended up with these additional configuration adjustments:

Code:
!
router bgp 64620
 neighbor BGP remote-as 64600
 no neighbor VTEP peer-group
 neighbor 100.79.10.158 peer-group BGP
 neighbor 100.79.10.190 peer-group BGP
 !
 address-family ipv4 unicast
  redistribute connected
  neighbor BGP soft-reconfiguration inbound
  neighbor BGP prefix-list NO_ROUTE_BACK_TO_TOR out
  neighbor BGP allowas-in 1
 exit-address-family
 !
 address-family l2vpn evpn
  neighbor BGP activate
  neighbor BGP soft-reconfiguration inbound
  neighbor BGP allowas-in 1
  neighbor BGP route-map MAP_VTEP_IN in
  neighbor BGP route-map MAP_VTEP_OUT out
  advertise-all-vni
 exit-address-family
exit
!
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 10 permit 100.76.0.0/16 ge 32
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 20 permit 100.79.0.0/16 ge 32
!
end

We also noticed the merge code was limited to the VRF part of the structure, and changed that to a full merge with the following patch:

Diff:
--- /usr/share/perl5/PVE/Network/SDN/Controllers/EvpnPlugin.pm    2022-04-27 10:33:13.000000000 +0200
+++ /usr/share/perl5/PVE/Network/SDN/Controllers/EvpnPlugin.pm    2022-09-05 14:06:33.404722426 +0200
@@ -388,11 +388,11 @@
     push @{$final_config}, "log syslog informational";
     push @{$final_config}, "service integrated-vtysh-config";
     push @{$final_config}, "!";

     if (-e "/etc/frr/frr.conf.local") {
-    generate_frr_recurse($final_config, $config->{frr}->{vrf}, "vrf", 1);
+    generate_frr_recurse($final_config, $config->{frr}, undef, 0);
     generate_frr_routemap($final_config, $config->{frr_routemap});
     push @{$final_config}, "!";

     my $local_conf = file_get_contents("/etc/frr/frr.conf.local");
     chomp ($local_conf);

A bit hackish but it did work aside from one big drawback I'll leave for later.

Yesterday, we took a look at Proxmox 7.3 and noticed the frr.conf.local merge was now a lot more extensive. Almost all of our local changes were now taken into account out of the box, except the two prefix-lists at the end. We ended up with the following patch:

Diff:
--- /usr/share/perl5/PVE/Network/SDN/Controllers/EvpnPlugin.pm    2022-11-24 16:54:06.307305735 +0100
+++ /usr/share/perl5/PVE/Network/SDN/Controllers/EvpnPlugin.pm    2022-11-25 10:11:01.699094885 +0100
@@ -479,10 +479,13 @@
         $routemap = undef;
         $routemap_action = undef;
         $routemap_config = ();
         }
         next;
+    } elsif ($line =~ m/^ip (.+)$/) {
+        push(@{$config->{frr}->{''}}, $line);
+        next;
     } elsif($line =~ m/!/) {
         next;
     }

     next if !$section;

Now this is still a bit hackish and I'm not really comfortable with keeping in-house patches like this. Reading the BgpPlugin.pm code, it seems to be the one responsible for adding other similar lines. Hence my first question: would it be possible to add a way in the BGP controller code to add our own prefix-lists?
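
To make the intent of the 7.3 patch concrete, here is a rough Python sketch of the same idea (the real code is Perl, in EvpnPlugin.pm): unindented lines starting with "ip " are collected into a global bucket instead of being dropped. The function name and the returned list are made up for illustration and do not mirror the plugin's internal data structures.

```python
import re


def collect_global_ip_lines(local_conf):
    """Return the unindented 'ip ...' lines found in an frr.conf.local text."""
    collected = []
    for line in local_conf.splitlines():
        # Same pattern as the Perl patch: only lines that START with "ip ".
        if re.match(r"^ip (.+)$", line):
            collected.append(line)
    return collected


# Shortened version of the frr.conf.local shown earlier in this post.
sample = """\
!
router bgp 64620
 neighbor BGP remote-as 64600
 !
 address-family ipv4 unicast
  neighbor BGP prefix-list NO_ROUTE_BACK_TO_TOR out
 exit-address-family
exit
!
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 10 permit 100.76.0.0/16 ge 32
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 20 permit 100.79.0.0/16 ge 32
!
end
"""

if __name__ == "__main__":
    for kept in collect_global_ip_lines(sample):
        print(kept)
```

Note that the pattern is deliberately broad: any top-level "ip ..." directive is kept, not only prefix-lists.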

The second question is probably a bit more involved and cuts to the main drawback of our current system. As our additional configuration doesn't play well with the FRR reload code, a restart is mandatory, which of course cuts the whole host connectivity each time we apply the SDN, and the guests along with it. Not only that, but as we're sending pvesh set /cluster/sdn to one of our Proxmox nodes, that node also runs pvesh set /nodes/${NODE}/network on itself, cuts its own network, and stops sending similar tasks to the other nodes since they are now unreachable. This often leaves the cluster in an inconsistent state, with only part of the nodes having the latest configuration applied. Re-applying the whole SDN doesn't work much better, since it generates a new version number and it's basically Russian roulette time again. So I'm wondering whether the PUT /cluster/sdn API call could be made a bit more resilient and retry for a while with a back-off mechanism rather than giving up immediately. It would not necessarily have to be the default; it could be gated behind an API parameter.
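
To illustrate the kind of behaviour I have in mind, here is a minimal Python sketch of a per-node apply loop that retries unreachable nodes with exponential back-off. It is not Proxmox code: apply_node is a hypothetical callable standing in for whatever triggers the network reload on one node, and the parameter names and default values are assumptions.

```python
import time


def apply_with_backoff(nodes, apply_node, retries=5, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Apply to every node, retrying unreachable ones with exponential back-off.

    apply_node(node) is expected to raise ConnectionError when the node
    cannot be reached. Returns the nodes that still failed after all retries.
    """
    pending = list(nodes)
    delay = base_delay
    for attempt in range(retries + 1):
        still_failing = []
        for node in pending:
            try:
                apply_node(node)
            except ConnectionError:
                still_failing.append(node)
        pending = still_failing
        if not pending or attempt == retries:
            break
        sleep(delay)                       # wait before the next round
        delay = min(delay * 2, max_delay)  # double the wait, up to a cap
    return pending
```

A caller would pass the node list and its own apply function, then report whatever comes back in the returned list as nodes needing manual attention.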
 
I left the vnets and subnets out of the first post since they are a bit off topic and not terribly interesting, but here they are in case anyone asks.

Vnets:

Code:
vnet: fminfra
    zone fmdc5
    tag 11200

Subnets:

Code:
subnet: fmdc5-100.76.1.0-24
    vnet fminfra
    gateway 100.76.1.254

The full and final FRR config file looks like this:

Code:
frr version 8.2.2
frr defaults datacenter
hostname px1
log syslog informational
service integrated-vtysh-config
!
ip prefix-list loopbacks_ips seq 10 permit 0.0.0.0/0 le 32
ip protocol bgp route-map correct_src
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 10 permit 100.76.0.0/16 ge 32
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 20 permit 100.79.0.0/16 ge 32
!
vrf vrf_fmdc5
 vni 80000
exit-vrf
!
router bgp 64620
 bgp router-id 100.79.18.1
 no bgp default ipv4-unicast
 coalesce-time 1000
 bgp disable-ebgp-connected-route-check
 neighbor BGP peer-group
 neighbor BGP remote-as external
 neighbor BGP bfd
 neighbor 100.79.10.158 peer-group BGP
 neighbor 100.79.10.190 peer-group BGP
 neighbor VTEP peer-group
 neighbor VTEP remote-as 64620
 neighbor VTEP bfd
 neighbor 100.79.10.158 peer-group VTEP
 neighbor 100.79.10.190 peer-group VTEP
 neighbor BGP remote-as 64600
 no neighbor VTEP peer-group
 neighbor 100.79.10.158 peer-group BGP
 neighbor 100.79.10.190 peer-group BGP
 !
 address-family ipv4 unicast
  network 100.79.18.1/32
  neighbor BGP activate
  neighbor BGP soft-reconfiguration inbound
  redistribute connected
  neighbor BGP soft-reconfiguration inbound
  neighbor BGP prefix-list NO_ROUTE_BACK_TO_TOR out
  neighbor BGP allowas-in 1
 exit-address-family
 !
 address-family l2vpn evpn
  neighbor VTEP route-map MAP_VTEP_IN in
  neighbor VTEP route-map MAP_VTEP_OUT out
  neighbor VTEP activate
  advertise-all-vni
  neighbor BGP activate
  neighbor BGP soft-reconfiguration inbound
  neighbor BGP allowas-in 1
  neighbor BGP route-map MAP_VTEP_IN in
  neighbor BGP route-map MAP_VTEP_OUT out
  advertise-all-vni
 exit-address-family
exit
!
router bgp 64620 vrf vrf_fmdc5
 bgp router-id 100.79.18.1
 !
 address-family l2vpn evpn
  route-target import 64600:11200
  route-target import 64600:80000
  route-target import 64600:90007
  route-target import 64600:90008
  route-target import 64600:90009
 exit-address-family
exit
!
route-map MAP_VTEP_IN permit 1
exit
!
route-map MAP_VTEP_OUT permit 1
exit
!
route-map correct_src permit 1
 match ip address prefix-list loopbacks_ips
 set src 100.79.18.1
exit
!
line vty
!
 
Hi, about frr.conf.local:

The code was rewritten some months ago to merge your frr.conf.local with the generated frr.conf.

https://git.proxmox.com/?p=pve-network.git;a=commit;h=78f249bcc8377436f0b5ccff0723d0464f588ad8

Previously, frr.conf.local was used in full, plus the vrf section generated by Proxmox.
(I haven't updated the doc yet, sorry.)

So you can add prefix-lists, route-maps, or whatever you want in your frr.conf.local.
Proxmox will parse and merge each section from the local conf.


For example, if the conf generated by Proxmox is something like:

Code:
frr version 8.2.2
frr defaults datacenter
hostname px1
log syslog informational
service integrated-vtysh-config
!
vrf vrf_fmdc5
 vni 80000
exit-vrf
!
router bgp 64620
 bgp router-id 100.79.18.1
 no bgp default ipv4-unicast
 coalesce-time 1000
 bgp disable-ebgp-connected-route-check
 neighbor BGP peer-group
 neighbor BGP remote-as external
 neighbor BGP bfd
 neighbor 100.79.10.158 peer-group BGP
 neighbor 100.79.10.190 peer-group BGP

and you add a frr.conf.local like:

Code:
ip prefix-list loopbacks_ips seq 10 permit 0.0.0.0/0 le 32
ip protocol bgp route-map correct_src
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 10 permit 100.76.0.0/16 ge 32
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 20 permit 100.79.0.0/16 ge 32
!
router bgp 64620
 neighbor 1.2.3.4 peer-group BGP


you should get a final frr.conf like:

Code:
frr version 8.2.2
frr defaults datacenter
hostname px1
log syslog informational
service integrated-vtysh-config
!
ip prefix-list loopbacks_ips seq 10 permit 0.0.0.0/0 le 32
ip protocol bgp route-map correct_src
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 10 permit 100.76.0.0/16 ge 32
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 20 permit 100.79.0.0/16 ge 32
!
vrf vrf_fmdc5
 vni 80000
exit-vrf
!
router bgp 64620
 bgp router-id 100.79.18.1
 no bgp default ipv4-unicast
 coalesce-time 1000
 bgp disable-ebgp-connected-route-check
 neighbor BGP peer-group
 neighbor BGP remote-as external
 neighbor BGP bfd
 neighbor 100.79.10.158 peer-group BGP
 neighbor 100.79.10.190 peer-group BGP
 neighbor 1.2.3.4 peer-group BGP
 
The second question is probably a bit more involved and cuts to the main drawback of our current system. As our additional configuration doesn't play well with the FRR reload code, a restart is mandatory, which of course ends up cutting the whole host connectivity each time we apply the SDN, and the guests along with it.
Does something specific in your configuration prevent a reload?
The current Proxmox code tries to reload frr. It works about 85% of the time, but sometimes it fails, frr crashes, and Proxmox restarts the frr service right after.
(I have seen that when adding a route-map to a previous conf that had no route-map.) I think it's an frr bug, with the frr parser not being able to apply the diff with vtysh.

Do you have an example of a previous frr.conf and a new frr.conf where the reload does not work?


Not only that, but as we're sending pvesh set /cluster/sdn to one of our Proxmox nodes, it's also running pvesh set /nodes/${NODE}/network on itself, cuts its own network and stop sending similar tasks to other nodes as they now are unreachable, which often ends up with the cluster in an inconsistent state and only part of the nodes with the latest configuration applied.
When you call /cluster/sdn on a node, that node calls "pvesh set /nodes/${NODE}/network" on itself and on all the other nodes, in a loop, one at a time.
I'll check the code, but I thought the calling node was supposed to call itself at the end of the loop, to avoid the case where that node has a network interruption.
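
The ordering idea can be sketched in a few lines of Python. This is only an illustration of "apply to the calling node last"; the function name is hypothetical and it is not taken from the Proxmox code.

```python
def apply_order(nodes, local_node):
    """Return the nodes with the calling node moved to the end.

    The relative order of the other nodes is preserved, so a network
    interruption on the calling node can only happen once every other
    node has already been asked to reload.
    """
    others = [node for node in nodes if node != local_node]
    tail = [local_node] if local_node in nodes else []
    return others + tail
```

With nodes px1, px2 and px3 and the call made on px1, the loop would visit px2, then px3, then px1.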


Re-applying the whole SDN again doesn't work much better since it generates a new version number and it's basically Russian roulette time again. So, I'm wondering if the PUT /cluster/sdn API call could be made a bit more resilient and retry for a bit with a back-off mechanism rather than giving up immediately. Not necessarily by default, it could be gated behind an API parameter.
Yes, it needs to be improved, maybe with a daemon running on the nodes themselves to manage the reload.

If something bad happens, you can just reload the network manually on a specific node; there is no need to call /cluster/sdn again.


Personally, I'm running EVPN in production, and I have never had a reload problem.

Your config seems pretty basic, just some extra prefix-lists, so maybe you can try adding only the needed extra options in frr.conf.local and try the code again.
 
Thanks for this answer!

About the prefix-lists, I tried moving them to the beginning of the frr.conf.local file like you did, but they're getting filtered out by the vanilla EvpnPlugin.pm anyway.

The modified frr.conf.local:

Code:
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 10 permit 100.76.0.0/16 ge 32
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 20 permit 100.79.0.0/16 ge 32
!
router bgp 64620
 neighbor BGP remote-as 64600
 no neighbor VTEP peer-group
 neighbor 100.79.10.158 peer-group BGP
 neighbor 100.79.10.190 peer-group BGP
 !
 address-family ipv4 unicast
  redistribute connected
  neighbor BGP soft-reconfiguration inbound
  neighbor BGP prefix-list NO_ROUTE_BACK_TO_TOR out
  neighbor BGP allowas-in 1
 exit-address-family
 !
 address-family l2vpn evpn
  neighbor BGP activate
  neighbor BGP soft-reconfiguration inbound
  neighbor BGP allowas-in 1
  neighbor BGP route-map MAP_VTEP_IN in
  neighbor BGP route-map MAP_VTEP_OUT out
  advertise-all-vni
 exit-address-family
exit

The final frr config file with vanilla code:

Code:
frr version 8.2.2
frr defaults datacenter
hostname px1
log syslog informational
service integrated-vtysh-config
!
ip prefix-list loopbacks_ips seq 10 permit 0.0.0.0/0 le 32
ip protocol bgp route-map correct_src
!
vrf vrf_fmdc5
 vni 80000
exit-vrf
!
router bgp 64620
 bgp router-id 100.79.18.1
 no bgp default ipv4-unicast
 coalesce-time 1000
 bgp disable-ebgp-connected-route-check
 neighbor BGP peer-group
 neighbor BGP remote-as external
 neighbor BGP bfd
 neighbor 100.79.10.158 peer-group BGP
 neighbor 100.79.10.190 peer-group BGP
 neighbor VTEP peer-group
 neighbor VTEP remote-as 64620
 neighbor VTEP bfd
 neighbor 100.79.10.158 peer-group VTEP
 neighbor 100.79.10.190 peer-group VTEP
 neighbor BGP remote-as 64600
 no neighbor VTEP peer-group
 neighbor 100.79.10.158 peer-group BGP
 neighbor 100.79.10.190 peer-group BGP
 !
 address-family ipv4 unicast
  network 100.79.18.1/32
  neighbor BGP activate
  neighbor BGP soft-reconfiguration inbound
  redistribute connected
  neighbor BGP soft-reconfiguration inbound
  neighbor BGP prefix-list NO_ROUTE_BACK_TO_TOR out
  neighbor BGP allowas-in 1
 exit-address-family
 !
 address-family l2vpn evpn
  neighbor VTEP route-map MAP_VTEP_IN in
  neighbor VTEP route-map MAP_VTEP_OUT out
  neighbor VTEP activate
  advertise-all-vni
  neighbor BGP activate
  neighbor BGP soft-reconfiguration inbound
  neighbor BGP allowas-in 1
  neighbor BGP route-map MAP_VTEP_IN in
  neighbor BGP route-map MAP_VTEP_OUT out
  advertise-all-vni
 exit-address-family
exit
!
router bgp 64620 vrf vrf_fmdc5
 bgp router-id 100.79.18.1
 !
 address-family l2vpn evpn
  route-target import 64600:11200
  route-target import 64600:80000
  route-target import 64600:90007
  route-target import 64600:90008
  route-target import 64600:90009
 exit-address-family
exit
!
route-map MAP_VTEP_IN permit 1
exit
!
route-map MAP_VTEP_OUT permit 1
exit
!
route-map correct_src permit 1
 match ip address prefix-list loopbacks_ips
 set src 100.79.18.1
exit
!
line vty

Notice how the two NO_ROUTE_BACK_TO_TOR prefix-lists are missing. The two other ip directives found in the final frr config file are added by BgpPlugin.pm when the loopback option (dummy1 here) is set in this node's BGP controller. They are not part of our own modifications, and hence not part of the frr.conf.local merge.
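
This kind of omission is easy to miss by eye, so here is a small Python sketch that lists prefix-lists referenced in a config text but never defined in it. It is a plain text scan written for this post, not an FRR or Proxmox tool, and the function name is made up.

```python
import re


def undefined_prefix_lists(frr_conf):
    """Return the sorted names of prefix-lists that are used but not defined."""
    defined = set()
    referenced = set()
    for raw in frr_conf.splitlines():
        line = raw.strip()
        # Definition: "ip prefix-list NAME seq ..." (or the ipv6 variant).
        match = re.match(r"ip(?:v6)? prefix-list (\S+) ", line)
        if match:
            defined.add(match.group(1))
            continue
        # Any other mention, e.g. "neighbor BGP prefix-list NAME out".
        match = re.search(r"\bprefix-list (\S+)", line)
        if match:
            referenced.add(match.group(1))
    return sorted(referenced - defined)


# Excerpt of the final config generated by the vanilla code.
sample = """\
ip prefix-list loopbacks_ips seq 10 permit 0.0.0.0/0 le 32
ip protocol bgp route-map correct_src
!
router bgp 64620
 address-family ipv4 unicast
  neighbor BGP prefix-list NO_ROUTE_BACK_TO_TOR out
 exit-address-family
exit
!
route-map correct_src permit 1
 match ip address prefix-list loopbacks_ips
exit
"""

if __name__ == "__main__":
    print(undefined_prefix_lists(sample))
```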
 
About the reload, I indeed noticed Proxmox tried to do one, but this is what I get if I try to run the reload command manually:

/usr/lib/frr/frr-reload.py --stdout --reload /etc/frr/frr.conf

Code:
2022-11-28 11:19:23,190  INFO: Called via "Namespace(input=None, reload=True, test=False, debug=False, log_level='info', stdout=True, pathspace=None, filename='/etc/frr/frr.conf', overwrite=False, bindir='/usr/bin', confdir='/etc/frr', rundir='/var/run/frr', vty_socket=None, daemon='', test_reset=False)"
2022-11-28 11:19:23,190  INFO: Loading Config object from file /etc/frr/frr.conf
2022-11-28 11:19:23,249  INFO: Loading Config object from vtysh show running
2022-11-28 11:19:23,316  INFO: Executed "no hostname px1"
2022-11-28 11:19:23,316  INFO: /var/run/frr/reload-PR7ENR.txt content
['router bgp 64620\n neighbor BGP remote-as external\n',
 'router bgp 64620\n neighbor VTEP peer-group\n',
 'router bgp 64620\n neighbor VTEP remote-as 64620\n',
 'router bgp 64620\n neighbor VTEP bfd\n',
 'router bgp 64620\n neighbor 100.79.10.158 peer-group VTEP\n',
 'router bgp 64620\n neighbor 100.79.10.190 peer-group VTEP\n',
 'router bgp 64620\n no neighbor VTEP peer-group\n',
 'router bgp 64620\n'
 ' address-family l2vpn evpn\n'
 '  neighbor VTEP route-map MAP_VTEP_IN in\n',
 'router bgp 64620\n'
 ' address-family l2vpn evpn\n'
 '  neighbor VTEP route-map MAP_VTEP_OUT out\n',
 'router bgp 64620\n address-family l2vpn evpn\n  neighbor VTEP activate\n',
 'hostname px1\n',
 'line vty\n']
% Cannot change the peer-group. Deconfigure first
line 14: Failure to communicate[13] to bgpd, line:  neighbor 100.79.10.158 peer-group VTEP

% Cannot change the peer-group. Deconfigure first
line 17: Failure to communicate[13] to bgpd, line:  neighbor 100.79.10.190 peer-group VTEP

% Specify remote-as or peer-group commands first
line 24: Failure to communicate[13] to bgpd, line:   neighbor VTEP route-map MAP_VTEP_IN in

% Specify remote-as or peer-group commands first
line 28: Failure to communicate[13] to bgpd, line:   neighbor VTEP route-map MAP_VTEP_OUT out

% Specify remote-as or peer-group commands first
line 32: Failure to communicate[13] to bgpd, line:   neighbor VTEP activate

2022-11-28 11:19:23,352 WARNING: frr-reload.py failed due to
vtysh (exec file) exited with status 13
2022-11-28 11:19:23,352  INFO: Loading Config object from vtysh show running
2022-11-28 11:19:23,390  INFO: /var/run/frr/reload-WKV0LO.txt content
['router bgp 64620\n neighbor VTEP peer-group\n',
 'router bgp 64620\n neighbor VTEP remote-as 64620\n',
 'router bgp 64620\n neighbor VTEP bfd\n',
 'router bgp 64620\n neighbor 100.79.10.158 peer-group VTEP\n',
 'router bgp 64620\n neighbor 100.79.10.190 peer-group VTEP\n',
 'router bgp 64620\n neighbor BGP remote-as 64600\n',
 'router bgp 64620\n no neighbor VTEP peer-group\n',
 'router bgp 64620\n'
 ' address-family l2vpn evpn\n'
 '  neighbor VTEP route-map MAP_VTEP_IN in\n',
 'router bgp 64620\n'
 ' address-family l2vpn evpn\n'
 '  neighbor VTEP route-map MAP_VTEP_OUT out\n',
 'router bgp 64620\n address-family l2vpn evpn\n  neighbor VTEP activate\n',
 'hostname px1\n',
 'line vty\n',
 'router bgp 64620\n neighbor BGP remote-as external\n',
 'router bgp 64620\n neighbor VTEP peer-group\n',
 'router bgp 64620\n neighbor VTEP remote-as 64620\n',
 'router bgp 64620\n neighbor VTEP bfd\n',
 'router bgp 64620\n neighbor 100.79.10.158 peer-group VTEP\n',
 'router bgp 64620\n neighbor 100.79.10.190 peer-group VTEP\n',
 'router bgp 64620\n no neighbor VTEP peer-group\n',
 'router bgp 64620\n'
 ' address-family l2vpn evpn\n'
 '  neighbor VTEP route-map MAP_VTEP_IN in\n',
 'router bgp 64620\n'
 ' address-family l2vpn evpn\n'
 '  neighbor VTEP route-map MAP_VTEP_OUT out\n',
 'router bgp 64620\n address-family l2vpn evpn\n  neighbor VTEP activate\n',
 'hostname px1\n',
 'line vty\n']
% Cannot change the peer-group. Deconfigure first
line 11: Failure to communicate[13] to bgpd, line:  neighbor 100.79.10.158 peer-group VTEP

% Cannot change the peer-group. Deconfigure first
line 14: Failure to communicate[13] to bgpd, line:  neighbor 100.79.10.190 peer-group VTEP

% Specify remote-as or peer-group commands first
line 24: Failure to communicate[13] to bgpd, line:   neighbor VTEP route-map MAP_VTEP_IN in

% Specify remote-as or peer-group commands first
line 28: Failure to communicate[13] to bgpd, line:   neighbor VTEP route-map MAP_VTEP_OUT out

% Specify remote-as or peer-group commands first
line 32: Failure to communicate[13] to bgpd, line:   neighbor VTEP activate

% Cannot change the peer-group. Deconfigure first
line 51: Failure to communicate[13] to bgpd, line:  neighbor 100.79.10.158 peer-group VTEP

% Cannot change the peer-group. Deconfigure first
line 54: Failure to communicate[13] to bgpd, line:  neighbor 100.79.10.190 peer-group VTEP

% Specify remote-as or peer-group commands first
line 61: Failure to communicate[13] to bgpd, line:   neighbor VTEP route-map MAP_VTEP_IN in

% Specify remote-as or peer-group commands first
line 65: Failure to communicate[13] to bgpd, line:   neighbor VTEP route-map MAP_VTEP_OUT out

% Specify remote-as or peer-group commands first
line 69: Failure to communicate[13] to bgpd, line:   neighbor VTEP activate

2022-11-28 11:19:23,425 WARNING: frr-reload.py failed due to
vtysh (exec file) exited with status 13

I also tried reloading FRR using systemd, but it just silently kills FRR with the following logs:

Code:
Nov 28 11:09:30 px1 systemd[1]: Reloading FRRouting.
Nov 28 11:09:30 px1 watchfrr[2862624]: [NG1AJ-FP2TQ] Terminating on signal
Nov 28 11:09:31 px1 frrinit.sh[2863750]: Stopped watchfrr.
Nov 28 11:09:31 px1 watchfrr[2863767]: [T83RR-8SM5G] watchfrr 8.2.2 starting: vty@0
Nov 28 11:09:31 px1 watchfrr[2863767]: [QDG3Y-BY5TN] zebra state -> up : connect succeeded
Nov 28 11:09:31 px1 frrinit.sh[2863750]: Started watchfrr.
Nov 28 11:09:31 px1 watchfrr[2863767]: [QDG3Y-BY5TN] bgpd state -> up : connect succeeded
Nov 28 11:09:31 px1 watchfrr[2863767]: [QDG3Y-BY5TN] staticd state -> up : connect succeeded
Nov 28 11:09:31 px1 watchfrr[2863767]: [QDG3Y-BY5TN] bfdd state -> up : connect succeeded
Nov 28 11:09:31 px1 watchfrr[2863767]: [KWE5Q-QNGFC] all daemons up, doing startup-complete notify
Nov 28 11:09:31 px1 bgpd[2862650]: [V1CHF-JSGRR] %NOTIFICATION: sent to neighbor 100.79.10.158 6/6 (Cease/Other Configuration Change) 0 bytes
Nov 28 11:09:31 px1 bgpd[2862650]: [V1CHF-JSGRR] %NOTIFICATION: sent to neighbor 100.79.10.190 6/6 (Cease/Other Configuration Change) 0 bytes
Nov 28 11:09:31 px1 bgpd[2862650]: [VB567-F0EDJ] %ADJCHANGE: neighbor 100.79.10.158(Unknown) in vrf default Down BGP Notification send
Nov 28 11:09:31 px1 bgpd[2862650]: [VB567-F0EDJ] %ADJCHANGE: neighbor 100.79.10.190(Unknown) in vrf default Down BGP Notification send
Nov 28 11:09:31 px1 frrinit.sh[2863791]: % Cannot change the peer-group. Deconfigure first
Nov 28 11:09:31 px1 frrinit.sh[2863791]: line 14: Failure to communicate[13] to bgpd, line:  neighbor 100.79.10.158 peer-group VTEP
Nov 28 11:09:31 px1 frrinit.sh[2863791]: % Cannot change the peer-group. Deconfigure first
Nov 28 11:09:31 px1 frrinit.sh[2863791]: line 17: Failure to communicate[13] to bgpd, line:  neighbor 100.79.10.190 peer-group VTEP
Nov 28 11:09:31 px1 frrinit.sh[2863791]: % Specify remote-as or peer-group commands first
Nov 28 11:09:31 px1 frrinit.sh[2863791]: line 24: Failure to communicate[13] to bgpd, line:   neighbor VTEP route-map MAP_VTEP_IN in
Nov 28 11:09:31 px1 frrinit.sh[2863791]: % Specify remote-as or peer-group commands first
Nov 28 11:09:31 px1 frrinit.sh[2863791]: line 28: Failure to communicate[13] to bgpd, line:   neighbor VTEP route-map MAP_VTEP_OUT out
Nov 28 11:09:31 px1 frrinit.sh[2863791]: % Specify remote-as or peer-group commands first
Nov 28 11:09:31 px1 frrinit.sh[2863791]: line 32: Failure to communicate[13] to bgpd, line:   neighbor VTEP activate
Nov 28 11:09:31 px1 zebra[2862643]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Nov 28 11:09:31 px1 bgpd[2862650]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Nov 28 11:09:31 px1 staticd[2862657]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Nov 28 11:09:31 px1 bfdd[2862660]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Nov 28 11:09:31 px1 frrinit.sh[2863794]: % Cannot change the peer-group. Deconfigure first
Nov 28 11:09:31 px1 frrinit.sh[2863794]: line 11: Failure to communicate[13] to bgpd, line:  neighbor 100.79.10.158 peer-group VTEP
Nov 28 11:09:31 px1 frrinit.sh[2863794]: % Cannot change the peer-group. Deconfigure first
Nov 28 11:09:31 px1 frrinit.sh[2863794]: line 14: Failure to communicate[13] to bgpd, line:  neighbor 100.79.10.190 peer-group VTEP
Nov 28 11:09:31 px1 frrinit.sh[2863794]: % Specify remote-as or peer-group commands first
Nov 28 11:09:31 px1 frrinit.sh[2863794]: line 24: Failure to communicate[13] to bgpd, line:   neighbor VTEP route-map MAP_VTEP_IN in
Nov 28 11:09:31 px1 frrinit.sh[2863794]: % Specify remote-as or peer-group commands first
Nov 28 11:09:31 px1 frrinit.sh[2863794]: line 28: Failure to communicate[13] to bgpd, line:   neighbor VTEP route-map MAP_VTEP_OUT out
Nov 28 11:09:31 px1 frrinit.sh[2863794]: % Specify remote-as or peer-group commands first
Nov 28 11:09:31 px1 frrinit.sh[2863794]: line 32: Failure to communicate[13] to bgpd, line:   neighbor VTEP activate
Nov 28 11:09:31 px1 frrinit.sh[2863794]: % Cannot change the peer-group. Deconfigure first
Nov 28 11:09:31 px1 frrinit.sh[2863794]: line 51: Failure to communicate[13] to bgpd, line:  neighbor 100.79.10.158 peer-group VTEP
Nov 28 11:09:31 px1 frrinit.sh[2863794]: % Cannot change the peer-group. Deconfigure first
Nov 28 11:09:31 px1 frrinit.sh[2863794]: line 54: Failure to communicate[13] to bgpd, line:  neighbor 100.79.10.190 peer-group VTEP
Nov 28 11:09:31 px1 frrinit.sh[2863794]: % Specify remote-as or peer-group commands first
Nov 28 11:09:31 px1 frrinit.sh[2863794]: line 61: Failure to communicate[13] to bgpd, line:   neighbor VTEP route-map MAP_VTEP_IN in
Nov 28 11:09:31 px1 frrinit.sh[2863794]: % Specify remote-as or peer-group commands first
Nov 28 11:09:31 px1 frrinit.sh[2863794]: line 65: Failure to communicate[13] to bgpd, line:   neighbor VTEP route-map MAP_VTEP_OUT out
Nov 28 11:09:31 px1 frrinit.sh[2863794]: % Specify remote-as or peer-group commands first
Nov 28 11:09:31 px1 frrinit.sh[2863794]: line 69: Failure to communicate[13] to bgpd, line:   neighbor VTEP activate
Nov 28 11:09:31 px1 zebra[2862643]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Nov 28 11:09:31 px1 bgpd[2862650]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Nov 28 11:09:31 px1 staticd[2862657]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Nov 28 11:09:31 px1 bfdd[2862660]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Nov 28 11:09:31 px1 systemd[1]: frr.service: Control process exited, code=exited, status=1/FAILURE
Nov 28 11:09:31 px1 watchfrr[2863767]: [NG1AJ-FP2TQ] Terminating on signal
Nov 28 11:09:31 px1 frrinit.sh[2863795]: Stopped watchfrr.
Nov 28 11:09:31 px1 zebra[2862643]: [VXKFG-8SJRV][EC 4043309121] Client 'bfd' encountered an error and is shutting down.
Nov 28 11:09:31 px1 staticd[2862657]: [MRN6F-AYZC4] Terminating on signal
Nov 28 11:09:31 px1 bgpd[2862650]: [ZW1GY-R46JE] Terminating on signal
Nov 28 11:09:31 px1 zebra[2862643]: [JPSA8-5KYEA] client 43 disconnected 0 bfd routes removed from the rib
Nov 28 11:09:31 px1 zebra[2862643]: [S929C-NZR3N] client 43 disconnected 0 bfd nhgs removed from the rib
Nov 28 11:09:31 px1 zebra[2862643]: [XVBTQ-5QTVQ] Terminating on signal
Nov 28 11:09:31 px1 zebra[2862643]: [JPSA8-5KYEA] client 28 disconnected 0 bgp routes removed from the rib
Nov 28 11:09:31 px1 zebra[2862643]: [S929C-NZR3N] client 28 disconnected 0 bgp nhgs removed from the rib
Nov 28 11:09:31 px1 zebra[2862643]: [JPSA8-5KYEA] client 31 disconnected 0 vnc routes removed from the rib
Nov 28 11:09:31 px1 zebra[2862643]: [S929C-NZR3N] client 31 disconnected 0 vnc nhgs removed from the rib
Nov 28 11:09:31 px1 bgpd[2862650]: [YAF85-253AP][EC 100663299] buffer_write: write error on fd 15: Broken pipe
Nov 28 11:09:31 px1 bgpd[2862650]: [X6B3Y-6W42R][EC 100663302] zclient_send_message: buffer_write failed to zclient fd 15, closing
Nov 28 11:09:31 px1 bgpd[2862650]: [WVAM7-7ZYKQ][EC 33554499] sendmsg_nexthop: zclient_send_message() failed
Nov 28 11:09:31 px1 zebra[2862643]: [JPSA8-5KYEA] client 38 disconnected 0 static routes removed from the rib
Nov 28 11:09:31 px1 zebra[2862643]: [S929C-NZR3N] client 38 disconnected 0 static nhgs removed from the rib
Nov 28 11:09:31 px1 bgpd[2862650]: [WVAM7-7ZYKQ][EC 33554499] sendmsg_nexthop: zclient_send_message() failed
Nov 28 11:09:31 px1 bgpd[2862650]: [MNSF9-KVB43] _bfd_sess_send: BFD session 100.79.10.129 -> 100.79.10.158 interface eno1 VRF default(0) was not uninstalled
Nov 28 11:09:31 px1 bgpd[2862650]: [WVAM7-7ZYKQ][EC 33554499] sendmsg_nexthop: zclient_send_message() failed
Nov 28 11:09:31 px1 bgpd[2862650]: [MNSF9-KVB43] _bfd_sess_send: BFD session 100.79.10.161 -> 100.79.10.190 interface eno2 VRF default(0) was not uninstalled
Nov 28 11:09:31 px1 zebra[2862643]: [QS0NJ-H5QKJ] Zebra final shutdown
Nov 28 11:09:31 px1 frrinit.sh[2863806]: Stopped bfdd
Nov 28 11:09:31 px1 frrinit.sh[2863807]: Stopped staticd
Nov 28 11:09:31 px1 frrinit.sh[2863809]: Stopped zebra
Nov 28 11:09:31 px1 frrinit.sh[2863808]: Stopped bgpd
Nov 28 11:09:31 px1 frrinit.sh[2863807]: .
Nov 28 11:09:31 px1 frrinit.sh[2863806]: .
Nov 28 11:09:31 px1 frrinit.sh[2863809]: .
Nov 28 11:09:31 px1 frrinit.sh[2863808]: .
Nov 28 11:09:31 px1 systemd[1]: frr.service: Succeeded.
Nov 28 11:09:31 px1 systemd[1]: Reloaded FRRouting.

I guess it interprets a reload failure as a config error and just stops.
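
Since the same failures repeat many times in both outputs, here is a short Python sketch that boils them down to the distinct rejected commands. The regular expression is written against the "Failure to communicate" lines exactly as they appear in the logs above; the function name is made up for illustration.

```python
import re

# Matches e.g. "line 14: Failure to communicate[13] to bgpd, line:  neighbor ..."
FAILURE = re.compile(
    r"line (\d+): Failure to communicate\[(\d+)\] to (\w+), line:\s*(.*)$"
)


def failed_commands(output):
    """Return the distinct rejected commands, in order of first appearance."""
    seen = []
    for line in output.splitlines():
        match = FAILURE.search(line)
        if match:
            command = match.group(4).strip()
            if command not in seen:
                seen.append(command)
    return seen


# A few lines copied from the frr-reload.py and journal outputs above.
sample = """\
% Cannot change the peer-group. Deconfigure first
line 14: Failure to communicate[13] to bgpd, line:  neighbor 100.79.10.158 peer-group VTEP
Nov 28 11:09:31 px1 frrinit.sh[2863791]: line 17: Failure to communicate[13] to bgpd, line:  neighbor 100.79.10.190 peer-group VTEP
line 24: Failure to communicate[13] to bgpd, line:   neighbor VTEP route-map MAP_VTEP_IN in
line 11: Failure to communicate[13] to bgpd, line:  neighbor 100.79.10.158 peer-group VTEP
"""

if __name__ == "__main__":
    for command in failed_commands(sample):
        print(command)
```

Run against the full logs, every rejected command involves the VTEP peer-group.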
 
Notice how the two NO_ROUTE_BACK_TO_TOR prefix-list are missing. The two other ip directives fond in the final frr config file are added by BgpPlugin.pm when the dummy option is activated in this node BGP controller, it's not part of our own modifications, and hence not part of the frr.conf.local merge.
I'll try to reproduce on my side; maybe it's a bug in the parser.

Edit:

Oh, it seems that only route-map and access-list are implemented. I need to add support for prefix-list. I'll try to fix that tomorrow.
 
About the reload, I indeed noticed Proxmox tried to do one, but this is what I get if I try to run the reload command manually:

/usr/lib/frr/frr-reload.py --stdout --reload /etc/frr/frr.conf

Nov 28 11:09:31 px1 systemd[1]: Reloaded FRRouting.

I guess it interprets a reload fail as a config error and just stops.
Yes, some changes can't be reloaded. (Proxmox tries to use frr-reload.py, and if that fails, it restarts frr.)

In your example, I think it's because you have these lines twice:

neighbor 100.79.10.158 peer-group BGP
neighbor 100.79.10.190 peer-group BGP

I don't think you need to add them in frr.conf.local, as they are already generated by BgpPlugin.

You really only need to add the additional options in your frr.conf.local,

something like this (not tested):

Code:
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 10 permit 100.76.0.0/16 ge 32
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 20 permit 100.79.0.0/16 ge 32
!
router bgp 64620
 neighbor BGP remote-as 64600
 !
 address-family ipv4 unicast
  neighbor BGP prefix-list NO_ROUTE_BACK_TO_TOR out
 exit-address-family
 !
exit
 
I tried removing the two BGP directives from frr.conf.local and, indeed, they were redundant and added back anyway by BgpPlugin.pm. But that didn't change anything about the actual errors, which are triggered by the VTEP peer-group, not the BGP one.
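
To show where the VTEP conflict comes from, here is a minimal Python sketch that scans a config text for neighbors assigned to more than one peer-group. It is only a static check on the text and does not model how bgpd applies commands; the function name is made up, and the sample lines are copied from the generated "router bgp 64620" section.

```python
import re
from collections import defaultdict


def conflicting_peer_groups(frr_conf):
    """Return {neighbor: [groups]} for neighbors put in several peer-groups."""
    groups = defaultdict(list)
    for raw in frr_conf.splitlines():
        # Matches "neighbor <peer> peer-group <name>", but neither the bare
        # definition "neighbor <name> peer-group" nor the "no neighbor" form.
        match = re.match(r"neighbor (\S+) peer-group (\S+)$", raw.strip())
        if match and match.group(2) not in groups[match.group(1)]:
            groups[match.group(1)].append(match.group(2))
    return {peer: names for peer, names in groups.items() if len(names) > 1}


sample = """\
router bgp 64620
 neighbor BGP peer-group
 neighbor BGP remote-as external
 neighbor 100.79.10.158 peer-group BGP
 neighbor 100.79.10.190 peer-group BGP
 neighbor VTEP peer-group
 neighbor VTEP remote-as 64620
 neighbor 100.79.10.158 peer-group VTEP
 neighbor 100.79.10.190 peer-group VTEP
 no neighbor VTEP peer-group
"""

if __name__ == "__main__":
    for peer, names in conflicting_peer_groups(sample).items():
        print(peer, names)
```

Both TOR peers show up as members of BGP and VTEP, which matches the "Cannot change the peer-group. Deconfigure first" messages naming the VTEP assignment lines.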

Hunting a few other redundant directives, I slimed down the frr.conf.local like this:

Code:
!
router bgp 64620
 neighbor BGP remote-as 64600
 no neighbor VTEP peer-group
 !
 address-family ipv4 unicast
  redistribute connected
  neighbor BGP prefix-list NO_ROUTE_BACK_TO_TOR out
  neighbor BGP allowas-in 1
 exit-address-family
 !
 address-family l2vpn evpn
  neighbor BGP activate
  neighbor BGP soft-reconfiguration inbound
  neighbor BGP allowas-in 1
  neighbor BGP route-map MAP_VTEP_IN in
  neighbor BGP route-map MAP_VTEP_OUT out
 exit-address-family
exit
!
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 10 permit 100.76.0.0/16 ge 32
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 20 permit 100.79.0.0/16 ge 32
!
end

The resulting frr.conf:

Code:
frr version 8.2.2
frr defaults datacenter
hostname px1
log syslog informational
service integrated-vtysh-config
!
ip prefix-list loopbacks_ips seq 10 permit 0.0.0.0/0 le 32
ip protocol bgp route-map correct_src
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 10 permit 100.76.0.0/16 ge 32
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 20 permit 100.79.0.0/16 ge 32
!
vrf vrf_fmdc5
 vni 80000
exit-vrf
!
router bgp 64620
 bgp router-id 100.79.18.1
 no bgp default ipv4-unicast
 coalesce-time 1000
 bgp disable-ebgp-connected-route-check
 neighbor BGP peer-group
 neighbor BGP remote-as external
 neighbor BGP bfd
 neighbor 100.79.10.158 peer-group BGP
 neighbor 100.79.10.190 peer-group BGP
 neighbor VTEP peer-group
 neighbor VTEP remote-as 64620
 neighbor VTEP bfd
 neighbor 100.79.10.158 peer-group VTEP
 neighbor 100.79.10.190 peer-group VTEP
 neighbor BGP remote-as 64600
 no neighbor VTEP peer-group
 !
 address-family ipv4 unicast
  network 100.79.18.1/32
  neighbor BGP activate
  neighbor BGP soft-reconfiguration inbound
  redistribute connected
  neighbor BGP prefix-list NO_ROUTE_BACK_TO_TOR out
  neighbor BGP allowas-in 1
 exit-address-family
 !
 address-family l2vpn evpn
  neighbor VTEP route-map MAP_VTEP_IN in
  neighbor VTEP route-map MAP_VTEP_OUT out
  neighbor VTEP activate
  advertise-all-vni
  neighbor BGP activate
  neighbor BGP soft-reconfiguration inbound
  neighbor BGP allowas-in 1
  neighbor BGP route-map MAP_VTEP_IN in
  neighbor BGP route-map MAP_VTEP_OUT out
 exit-address-family
exit
!
router bgp 64620 vrf vrf_fmdc5
 bgp router-id 100.79.18.1
 !
 address-family l2vpn evpn
  route-target import 64600:11200
  route-target import 64600:80000
  route-target import 64600:90007
  route-target import 64600:90008
  route-target import 64600:90009
 exit-address-family
exit
!
route-map MAP_VTEP_IN permit 1
exit
!
route-map MAP_VTEP_OUT permit 1
exit
!
route-map correct_src permit 1
 match ip address prefix-list loopbacks_ips
 set src 100.79.18.1
exit
!
line vty
!

And to summarize the errors raised by the reload code:

Code:
% Cannot change the peer-group. Deconfigure first
line 14: Failure to communicate[13] to bgpd, line:  neighbor 100.79.10.158 peer-group VTEP

% Cannot change the peer-group. Deconfigure first
line 17: Failure to communicate[13] to bgpd, line:  neighbor 100.79.10.190 peer-group VTEP

% Specify remote-as or peer-group commands first
line 24: Failure to communicate[13] to bgpd, line:   neighbor VTEP route-map MAP_VTEP_IN in

% Specify remote-as or peer-group commands first
line 28: Failure to communicate[13] to bgpd, line:   neighbor VTEP route-map MAP_VTEP_OUT out

% Specify remote-as or peer-group commands first
line 32: Failure to communicate[13] to bgpd, line:   neighbor VTEP activate
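The first two errors are consistent with each TOR address being assigned to the BGP peer-group and then to the VTEP peer-group in the same file; the later ones presumably follow from the VTEP group no longer existing once `no neighbor VTEP peer-group` has been applied. As a rough illustration, the sketch below lists neighbors that a configuration assigns to more than one peer-group. The function name `peer_group_conflicts` is hypothetical and only simple `neighbor <peer> peer-group <group>` lines are recognised.

```python
import re
from collections import defaultdict

# Matches membership lines only, e.g. "neighbor 100.79.10.158 peer-group BGP".
# Group definitions ("neighbor BGP peer-group") and "no ..." lines do not match.
MEMBER_RE = re.compile(r"^neighbor (\S+) peer-group (\S+)$")


def peer_group_conflicts(config_text):
    """Return {neighbor: [groups]} for neighbors placed in several peer-groups."""
    groups = defaultdict(list)
    for raw in config_text.splitlines():
        match = MEMBER_RE.match(raw.strip())
        if match:
            peer, group = match.group(1), match.group(2)
            if group not in groups[peer]:
                groups[peer].append(group)
    return {peer: names for peer, names in groups.items() if len(names) > 1}
```

Run against the frr.conf above, this would be expected to report both 100.79.10.158 and 100.79.10.190 as members of BGP and VTEP.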
 
Hi,
Here is a patched deb with prefix-list support:
https://mutulin1.odiso.net/libpve-network-perl_0.7.2_all.deb

Can you test it?


About your reload error:
why do you want to use "neighbor BGP" in the l2vpn evpn section?

The BGP and VTEP groups use the same IPs. I see you only added
"neighbor BGP remote-as 64600" previously; can't you simply add "neighbor VTEP remote-as 64600" too?


BTW, if you enabled the "ebgp" option on the BGP controller,
you should get a generated "neighbor BGP remote-as external" and "neighbor VTEP remote-as external", which should work with any remote-as (like a wildcard).
 
Hi,

Sorry for the time it took to answer, I had a lot of work and had to do some back and forth with our network team before I could write something of interest.

First, I would like to thank you for the modified package, it works as is and the prefix list is now kept without any patch. Also, the neighbor BGP remote-as we added in frr.conf.local was indeed superfluous, as neighbor BGP remote-as external from your own code worked fine as a replacement and it's one less variable to bother with for us.

Now, for the reload issue, we actually noticed while looking at FRR that the running config had a few differences with the on-disk one.

Diff:
--- frr.conf    2022-12-12 15:11:20.116972424 +0100
+++ frr.conf.running    2022-12-12 15:27:31.357939844 +0100
@@ -1,53 +1,43 @@
+!
 frr version 8.2.2
 frr defaults datacenter
 hostname px1
 log syslog informational
 service integrated-vtysh-config
 !
-ip protocol bgp route-map correct_src
-!
 vrf vrf_fmdc5
  vni 80000
 exit-vrf
 !
 router bgp 64620
  bgp router-id 100.79.18.1
  no bgp default ipv4-unicast
- coalesce-time 1000
  bgp disable-ebgp-connected-route-check
+ coalesce-time 1000
  neighbor BGP peer-group
  neighbor BGP remote-as external
  neighbor BGP bfd
  neighbor 100.79.10.158 peer-group BGP
  neighbor 100.79.10.190 peer-group BGP
- neighbor VTEP peer-group
- neighbor VTEP remote-as 64620
- neighbor VTEP bfd
- neighbor 100.79.10.158 peer-group VTEP
- neighbor 100.79.10.190 peer-group VTEP
- no neighbor VTEP peer-group
  !
  address-family ipv4 unicast
   network 100.79.18.1/32
+  redistribute connected
   neighbor BGP activate
   neighbor BGP soft-reconfiguration inbound
-  redistribute connected
-  neighbor BGP prefix-list NO_ROUTE_BACK_TO_TOR out
   neighbor BGP allowas-in 1
+  neighbor BGP prefix-list NO_ROUTE_BACK_TO_TOR out
  exit-address-family
  !
  address-family l2vpn evpn
-  neighbor VTEP route-map MAP_VTEP_IN in
-  neighbor VTEP route-map MAP_VTEP_OUT out
-  neighbor VTEP activate
-  advertise-all-vni
   neighbor BGP activate
   neighbor BGP soft-reconfiguration inbound
   neighbor BGP allowas-in 1
   neighbor BGP route-map MAP_VTEP_IN in
   neighbor BGP route-map MAP_VTEP_OUT out
+  advertise-all-vni
  exit-address-family
 exit
 !
 router bgp 64620 vrf vrf_fmdc5
  bgp router-id 100.79.18.1
@@ -74,7 +64,8 @@
 route-map correct_src permit 1
  match ip address prefix-list loopbacks_ips
  set src 100.79.18.1
 exit
 !
-line vty
+ip protocol bgp route-map correct_src
 !
+end

The main catch is that all the VTEP neighbors were discarded by FRR when it started, and only the BGP ones remain. We now have a single BGP session with each TOR switch, which also carries the EVPN routes; this is what we intended, not an accident. It works, but we now have a major difference between the running and on-disk configurations, which blows up at reload time when the reload tries to reconcile them. If I dump the running configuration to disk and reload, the reload from the systemd unit works as intended instead of killing FRR.
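Part of the diff above is pure reordering (coalesce-time, redistribute connected, advertise-all-vni). To isolate the lines that really exist on only one side, an order-insensitive comparison can be sketched as follows. The names `normalized` and `config_drift` are hypothetical, and the comparison deliberately ignores which section a line belongs to, so identical lines in different address families collapse into one.

```python
def normalized(text):
    """Return the set of meaningful, whitespace-stripped configuration lines."""
    ignored = {"", "!", "end"}
    return {line.strip() for line in text.splitlines()} - ignored


def config_drift(on_disk, running):
    """Return (only_on_disk, only_running) as sorted lists, ignoring line order."""
    disk_lines = normalized(on_disk)
    running_lines = normalized(running)
    return sorted(disk_lines - running_lines), sorted(running_lines - disk_lines)
```

Fed with the two files from the diff, the first list would be expected to contain the VTEP lines and `line vty`, while reordered lines drop out of both lists.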

It seems that in our case the VTEP neighbors are actually unnecessary, as they're superseded by the BGP ones. Which raises a question: would it be possible to have only the BGP neighbors written to the configuration in this kind of case, maybe through an SDN configuration variable?
 
