When setting up the EVPN SDN a few weeks back, I ran into an issue: selecting more than one exit node for a zone broke all external connectivity, regardless of whether a primary exit node was selected and regardless of whether SNAT was turned on. (And yes, I set my rp_filter values correctly.)
A bit of research confirmed that others had encountered the same issue; see here and here.
I looked into it, found the bug, and have thoroughly verified the fix on the latest release of libpve-network-perl (0.9.5).
The problem:
- By advertising a default route on all gateway nodes, packets are guaranteed to loop until TTL death
- VXLAN interfaces only tunnel packets across VRFs when no entries are matched in the forwarding table
- (See https://www.kernel.org/doc/Documentation/networking/vxlan.txt)
- If multiple nodes are advertising default routes, then every node will have a default route, and nodes will always prefer that default route over popping the packet across the VXLAN interface, leading to packet looping
The solution:
- ONLY 'default-originate' if the node is a "primary exit node"
- This allows multiple exit nodes to function properly with or without a primary exit node; it also allows SNAT to work with multiple exit nodes with or without a primary exit node. I've seen some of the SDN developers claim SNAT "requires" a primary exit node. That is absolutely not true: it is working in my lab after fixing the actual root cause.
Original (/usr/share/perl5/PVE/Network/SDN/Controllers/EvpnPlugin.pm)
Perl:
@controller_config = ();
#add default originate to announce 0.0.0.0/0 type5 route in evpn
push @controller_config, "default-originate ipv4";
push @controller_config, "default-originate ipv6";
push(@{$config->{frr}->{router}->{"bgp $asn vrf $vrf"}->{"address-family"}->{"l2vpn evpn"}}, @controller_config);
Fixed
Perl:
if ($exitnodes_primary eq $local_node) {
    @controller_config = ();
    #add default originate to announce 0.0.0.0/0 type5 route in evpn
    push @controller_config, "default-originate ipv4";
    push @controller_config, "default-originate ipv6";
    push(@{$config->{frr}->{router}->{"bgp $asn vrf $vrf"}->{"address-family"}->{"l2vpn evpn"}}, @controller_config);
}
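A rough way to sanity-check the change, offered as a sketch rather than an official procedure: once the SDN configuration has been reapplied, only the primary exit node should still carry the two default-originate lines in its generated FRR config. The helper below simply greps for them. The function name is made up for illustration, and the /etc/frr/frr.conf default is an assumption about where the generated config ends up; pass a different path if yours differs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PVE): succeeds if an FRR config file
# contains a "default-originate ipv4" or "default-originate ipv6" line,
# i.e. the lines EvpnPlugin.pm pushes into the "l2vpn evpn" address-family.
# Assumption: the generated config lives at /etc/frr/frr.conf.
has_default_originate() {
    local conf="${1:-/etc/frr/frr.conf}"
    # -E: extended regex, -q: no output, exit status only
    grep -Eq '^[[:space:]]*default-originate ipv[46]' "$conf"
}

# Example, to be run on each exit node:
# if has_default_originate; then
#     echo "this node advertises a default route"
# else
#     echo "no default-originate here"
# fi
```

With the fix in place, the expectation is that this succeeds on the primary exit node and fails on the other exit nodes.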
If you want to do some quick testing, I also scripted out the updates with perl. Note that I'm also changing get_standard_option('pve-node') to get_standard_option('pve-node-list'); in the latest release of PVE, you can't leave the primary exit node unselected (in the UI at least) unless you make this change.
Bash:
perl -i -pe "s/\'exitnodes-primary\' => get_standard_option\(\'pve-node\'/\'exitnodes-primary\' => get_standard_option\(\'pve-node-list\'/" /usr/share/perl5/PVE/Network/SDN/Zones/EvpnPlugin.pm;
perl -i -p0e 's/^*\s\@controller_config = \(\);\s*\#add default originate to announce 0.0.0.0\/0 type5 route in evpn\s*push \@controller_config, "default-originate ipv4";\s*push \@controller_config, "default-originate ipv6";\s*push\(\@\{\$config->\{frr\}->\{router\}->\{"bgp \$asn vrf \$vrf"\}->\{"address-family"\}->\{"l2vpn evpn"\}\}, \@controller_config\);/\tif \(\$exitnodes_primary eq \$local_node\) \{\n\t\t\@controller_config = \(\);\n\t\t\#add default originate to announce 0.0.0.0\/0 type5 route in evpn\n\t\tpush \@controller_config, "default-originate ipv4";\n\t\tpush \@controller_config, "default-originate ipv6";\n\t\tpush\(\@\{\$config->\{frr\}->\{router\}->\{"bgp \$asn vrf \$vrf"\}->\{"address-family"\}->\{"l2vpn evpn"\}\}, \@controller_config\);\n\t\}/s' /usr/share/perl5/PVE/Network/SDN/Controllers/EvpnPlugin.pm;
systemctl restart pveproxy.service pvedaemon.service;
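Because the one-liners above edit the plugin files in place, a cautious variant is to back the files up first and confirm afterwards that the guard actually landed. This is a minimal sketch; the function names and the .orig suffix are made up for illustration and are not part of PVE.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of PVE.

# Copy a file to <file>.orig unless a backup already exists,
# so re-running never overwrites the pristine copy.
backup_once() {
    local f="$1"
    [ -e "$f.orig" ] || cp -p -- "$f" "$f.orig"
}

# Succeed if the file contains the guard line introduced by the fix.
# -F treats the pattern as a fixed string, so "$" and "(" are literal.
evpn_patch_applied() {
    grep -Fq 'if ($exitnodes_primary eq $local_node) {' "$1"
}

# Usage, before and after running the perl commands:
# f=/usr/share/perl5/PVE/Network/SDN/Controllers/EvpnPlugin.pm
# backup_once "$f"
# evpn_patch_applied "$f" && echo "patched" || echo "not patched"
```

Checking with evpn_patch_applied before running the perl command also avoids applying the substitution twice, since the original block is still present inside the new if statement and would match again.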
I'm new to the forum and unsure how to tag members. I'd like to pull in spirit, who I know has been very active on the SDN project. If anyone can help me out there, I'd appreciate it.
I found several other (smaller) bugs, and I have some additional suggestions: things I've implemented in my own lab that I believe make the user experience easier and/or nicer. If all goes well with this bugfix, I'm hoping to work with the team on the others as well.