Hello,
We're currently trying out the Proxmox SDN to manage not only guests (using the EVPN controller), but also host connectivity via BGP, to experiment with interface redundancy through L3 protocols. It's a bit spotty of course, since the SDN obviously wasn't designed for that, but I wanted to give some feedback and find out whether anything could be changed to make it smoother, whether the core idea is manageable, or whether it's out of the question and we should rather keep hosts and guests fully separated one way or another (we haven't really tried running two BGP daemons on the same host, e.g. installing bird alongside frr).
To give a bit of context, our IP fabric uses the CGNAT range (100.64.0.0/10). Bare-metal hosts rely on two network interfaces, each bearing an underlay IP, while the "main" overlay IP is configured on a dummy loopback. Each interface is wired to one of two TOR switches, which also act as BGP peers (the 100.79.10.158 and 100.79.10.190 IPs below).
The ifupdown configuration looks like this:
Code:
auto lo
iface lo inet loopback
auto dummy1
iface dummy1 inet static
address 100.79.18.1/32
pre-up ip link add dummy1 type dummy
post-up ip addr add 100.79.18.1/32 dev dummy1 # the address directive can be somewhat unreliable with interfaces you just created during pre-up
post-up ip route add blackhole default metric 100
auto eno1
iface eno1 inet static
address 100.79.10.129/27
mtu 9000
auto eno2
iface eno2 inet static
address 100.79.10.161/27
mtu 9000
source /etc/network/interfaces.d/*
We started our tests using the Proxmox 7.2 SDN. This is the zone definition:
Code:
evpn: fmdc5
controller fmdc5
vrf-vxlan 80000
ipam pve
mac 00:03:20:00:F3:EE
mtu 8950
rt-import 64600:90007,64600:90008,64600:90009,64600:80000,64600:11200
And these are our current controller definitions:
Code:
evpn: fmdc5
asn 64620
peers 100.79.10.158,100.79.10.190
bgp: bgppx1
asn 64620
node px1
peers 100.79.10.158,100.79.10.190
bgp-multipath-as-path-relax 0
ebgp 1
loopback dummy1
We noticed the EVPN controller code had provisions to load extra config from frr.conf.local, and ended up with these additional configuration adjustments:
Code:
!
router bgp 64620
neighbor BGP remote-as 64600
no neighbor VTEP peer-group
neighbor 100.79.10.158 peer-group BGP
neighbor 100.79.10.190 peer-group BGP
!
address-family ipv4 unicast
redistribute connected
neighbor BGP soft-reconfiguration inbound
neighbor BGP prefix-list NO_ROUTE_BACK_TO_TOR out
neighbor BGP allowas-in 1
exit-address-family
!
address-family l2vpn evpn
neighbor BGP activate
neighbor BGP soft-reconfiguration inbound
neighbor BGP allowas-in 1
neighbor BGP route-map MAP_VTEP_IN in
neighbor BGP route-map MAP_VTEP_OUT out
advertise-all-vni
exit-address-family
exit
!
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 10 permit 100.76.0.0/16 ge 32
ip prefix-list NO_ROUTE_BACK_TO_TOR seq 20 permit 100.79.0.0/16 ge 32
!
end
We also noticed the merge code was limited to the VRF part of the structure, and changed it to a full merge with the following patch:
Diff:
--- /usr/share/perl5/PVE/Network/SDN/Controllers/EvpnPlugin.pm 2022-04-27 10:33:13.000000000 +0200
+++ /usr/share/perl5/PVE/Network/SDN/Controllers/EvpnPlugin.pm 2022-09-05 14:06:33.404722426 +0200
@@ -388,11 +388,11 @@
push @{$final_config}, "log syslog informational";
push @{$final_config}, "service integrated-vtysh-config";
push @{$final_config}, "!";
if (-e "/etc/frr/frr.conf.local") {
- generate_frr_recurse($final_config, $config->{frr}->{vrf}, "vrf", 1);
+ generate_frr_recurse($final_config, $config->{frr}, undef, 0);
generate_frr_routemap($final_config, $config->{frr_routemap});
push @{$final_config}, "!";
my $local_conf = file_get_contents("/etc/frr/frr.conf.local");
chomp ($local_conf);
A bit hackish but it did work aside from one big drawback I'll leave for later.
Yesterday, we took a look at Proxmox 7.3 and noticed the frr.conf.local merge was now much more extensive. Almost all of our local changes were now taken into account out of the box, except the two prefix-lists at the end. We ended up with the following patch:
Diff:
--- /usr/share/perl5/PVE/Network/SDN/Controllers/EvpnPlugin.pm 2022-11-24 16:54:06.307305735 +0100
+++ /usr/share/perl5/PVE/Network/SDN/Controllers/EvpnPlugin.pm 2022-11-25 10:11:01.699094885 +0100
@@ -479,10 +479,13 @@
$routemap = undef;
$routemap_action = undef;
$routemap_config = ();
}
next;
+ } elsif ($line =~ m/^ip (.+)$/) {
+ push(@{$config->{frr}->{''}}, $line);
+ next;
} elsif($line =~ m/!/) {
next;
}
next if !$section;
Now, this is still a bit hackish and I'm not really comfortable with keeping in-house patches like this. Reading the BgpPlugin.pm code, it seems to be the one responsible for adding other similar lines. Hence my first question: could a way be added to the BGP controller code to inject our own prefix-lists?
The second question is probably a bit more involved and cuts to the main drawback of our current setup. As our additional configuration doesn't play well with the FRR reload code, a restart is mandatory, which of course cuts the whole host connectivity each time we apply the SDN, and the guests along with it. Not only that, but as we're sending
pvesh set /cluster/sdn
to one of our Proxmox nodes, that node also runs
pvesh set /nodes/${NODE}/network
on itself, cuts its own network, and stops sending similar tasks to the other nodes as they are now unreachable. This often leaves the cluster in an inconsistent state, with only part of the nodes running the latest configuration. Re-applying the whole SDN doesn't work much better, since it generates a new version number and it's basically Russian roulette time again. So, I'm wondering if the
PUT /cluster/sdn
API call could be made a bit more resilient and retry for a while with a back-off mechanism rather than giving up immediately. Not necessarily by default; it could be gated behind an API parameter.
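Until something like that exists server-side, a crude client-side workaround is to wrap the apply call in a retry loop with exponential back-off. A minimal POSIX shell sketch (the function name and retry parameters are our own invention; it only assumes the wrapped command exits non-zero while the node is unreachable):

```shell
# Retry a command with exponential back-off, up to 5 attempts.
# Intended use: apply_with_backoff "pvesh set /cluster/sdn"
apply_with_backoff() {
    cmd=$1
    delay=1
    attempt=1
    while [ "$attempt" -le 5 ]; do
        if $cmd; then
            return 0            # command succeeded
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))    # back off: 1s, 2s, 4s, 8s ...
        attempt=$((attempt + 1))
    done
    return 1                    # give up after 5 attempts
}
```

This only papers over the problem from the driving side; it doesn't help the node that cut its own connectivity mid-task, which is why a back-off inside the API itself would be preferable.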