EVPN Multiple Tenants With Overlapping Subnets

janus
Feb 10, 2024
Hello,

I've been playing with EVPN SDN on Proxmox VE 8.1.4 and I love it so far, but there are a few things that got me scratching my head.

I apologize in advance if this topic has already been discussed, but the forum is just too huge to comb through it all.

Anyway, I have a 2-node cluster and I've been testing SDN between the nodes, first with one EVPN zone, which works fine (except for the routing loop when using multiple exit nodes, which has been discussed before). Then I wanted to test another EVPN zone alongside the first one ...

The way the current implementation works, when exit nodes are used, vnet subnets are:

a) exported from BGP into the kernel's main routing table, if exit node local routing is not enabled
b) set as static routes in the main routing table, using a veth interface pair between the VRF and the main routing table as the return path into the VRF, if exit node local routing is enabled

In either case, the very fact that VRF routes are leaked into the main routing table makes it impossible to have multiple tenants with overlapping subnets. This is fine if tenants use public addresses, which are unique, or if you manage the tenants' address plan and can divide private subnets between them, but if you want to let tenants manage their own address space, you're in trouble. When I added the same subnet to both EVPN zones, Proxmox managed to add/leak the route only for the first zone I added the subnet to, as expected. As a result, only the first zone had working outside connectivity.
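To see what actually gets leaked, it is enough to compare the main routing table with the zone's VRF table, e.g. (the zone name below is just a placeholder; adjust the VRF device name to your zone):

Code:
# routes the exit node has installed or imported into the main routing table
ip route show
# the per-zone VRF table
ip route show vrf vrf_evpnzone1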

Of course, this is easily solvable by not enabling exit nodes at all and provisioning a gateway container/VM, even an appliance VM like VyOS/MikroTik/pfSense/OPNsense, with one interface connected to the main routing table and the other connected to the VRF. But that is beside the point. My goal is to achieve SNAT on the exit nodes without leaking vnet subnets into the main routing table, thus allowing tenants to use whatever addresses they want, even overlapping ones, without external elements. It goes without saying that this model implies complete isolation between tenants at all times, which may not suit everyone, but it could be useful to (small) private cloud providers.

Manually, I can do this by enabling a single exit node with local routing enabled, which provisions a veth pair between the main routing table and the VRF, and then manually removing the vnet routes. Next, I noticed that the veth pairs in both EVPN zones use the same private subnet, 10.255.255.0/30, which is a topic for another discussion, so I manually changed the second zone's veth pair to 10.255.255.4/30, with .5 on the "outside" interface (xvrf_<zone>) and .6 on the "inside" interface (xvrfp_<zone>). Lastly, I added a default route pointing to 10.255.255.5, the outside end of the veth pair, to the VRF. Now VRF traffic can reach the host directly, but it cannot return, because I removed the vnet routes from the main table.
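For reference, the manual steps above boil down to something like this (zone/VRF names and the example vnet subnet are placeholders, not the exact values Proxmox generates):

Code:
# re-address the second zone's veth pair so the two /30s no longer overlap
ip addr flush dev xvrf_evpnzone2
ip addr add 10.255.255.5/30 dev xvrf_evpnzone2
ip addr flush dev xvrfp_evpnzone2
ip addr add 10.255.255.6/30 dev xvrfp_evpnzone2
# default route inside the VRF, pointing at the "outside" end of the veth pair
ip route add default via 10.255.255.5 vrf vrf_evpnzone2
# remove the leaked vnet route from the main table (example subnet)
ip route del 10.0.2.0/24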

This is where it gets ugly. In order to avoid adding any routes to the host, especially vnet ones, I had to do a double masquerade:

1) first, when traffic exits the VRF via the xvrfp_ interface and enters the veth pipe, in order to translate its source address into something that is already known to the host via the connected route on the xvrf_ interface (because both xvrf_ and xvrfp_ are in the same /30 subnet)
2) second, when traffic exiting the veth pipe and entering the host via the xvrf_ interface needs to reach anything outside the host itself, by masquerading it into the host's LAN IP

Code:
# 1) masquerade traffic leaving the VRF via its veth end (xvrfp_) into the shared /30
iptables -t nat -I POSTROUTING -o xvrfp_+ -j MASQUERADE
# 2) mark traffic arriving from the veth pair (xvrf_) and masquerade it into the host's LAN IP
iptables -t mangle -I PREROUTING -i xvrf_+ -j MARK --set-mark 123
iptables -t nat -I POSTROUTING -m mark --mark 123 -o <hosts lan interface> -j MASQUERADE

It gets worse if the host's LAN IP is also a private address. Then the internet gateway at the network edge will do yet another SNAT, but at least it works.

On the other hand, if the /30 subnets that are automatically assigned to the veth pairs could be configured manually via the Proxmox interface/API, a public subnet could be assigned, making the second masquerade unnecessary. Of course, this would require advertising those public subnets from the Proxmox host to the rest of the network via BGP (recommended) or IS-IS/OSPF (not recommended), or setting static routes for those subnets on the device the Proxmox host uses as its default gateway (not ideal, but a must if no routing protocols are in use network-wide).
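A rough sketch of what advertising such a public /30 from the host's FRR instance could look like (the ASN and prefix are made up; 65001 stands for whatever ASN the cluster already uses, and in practice this would have to be integrated with the SDN-generated FRR config rather than typed by hand):

Code:
# announce the (public) veth /30 via the existing FRR/BGP instance
vtysh -c 'configure terminal' \
      -c 'router bgp 65001' \
      -c 'address-family ipv4 unicast' \
      -c 'network 203.0.113.0/30' \
      -c 'exit-address-family'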

My questions for you folks are:

1. Has anyone had the same problem/requirements, or am I the only one who needs this?
2. Does anyone have a better idea or a more elegant solution for this, in general?
3. Does this scenario make enough sense to make its way into official Proxmox VE at some point, as one of the EVPN options?
 
Hi

Next, I noticed that the veth pairs in both EVPN zones use the same private subnet, 10.255.255.0/30, which is a topic for another discussion
...
first, when traffic exits the VRF via the xvrfp_ interface and enters the veth pipe, in order to translate its source address into something that is already known to the host via the connected route on the xvrf_ interface (because both xvrf_ and xvrfp_ are in the same /30 subnet)
...

This interface only exists if you have enabled "local node routing". It is not needed unless you want to access a VM from the host node's IP.
This was a feature request from a forum user for a specific use case.
For SNAT or exit routing, you don't need this option.

My goal is to achieve SNAT on the exit nodes without leaking vnet subnets into the main routing table, thus allowing tenants to use whatever addresses they want, even overlapping ones, without external elements. ...

About SNAT, I'm going to implement it cleanly soon (with a specific daemon), because I'm really limited with the current post-up scripts.


I think I have talked about this on the forum with another user.

Also a possible implementation here:

https://www.linuxquestions.org/questions/linux-newbie-8/iptables-nat-for-vrf-4175721876/

Code:
# mangle table: mark connections per VRF so that return traffic can be routed back into it
iptables -t mangle -A PREROUTING -i {vrfA_incoming_intf} -s {private_ip} -d 0.0.0.0/0 -j CONNMARK --set-mark 10
iptables -t mangle -A PREROUTING -s 0.0.0.0/0 -d {vrfA_server_public_ip} -j CONNMARK --set-mark 11
# same for VRF B, but using two different connmark values
iptables -t mangle -A PREROUTING -j CONNMARK --restore-mark

# nat table: 1:1 NAT between the public and private addresses
iptables -t nat -A PREROUTING -m connmark --mark 11 -j DNAT --to-destination {private_ip}
iptables -t nat -A POSTROUTING -m connmark --mark 10 -j SNAT --to-source {server_public_ip}

# ip rules: route marked packets using VRF A's routing table
ip rule add fwmark 10 lookup {vrfA_table_id}
ip rule add fwmark 11 lookup {vrfA_table_id}


Edit:

https://forum.proxmox.com/threads/d...-subnets-in-multiple-zones.134187/post-593204
 
Hi Spirit,

ah, yes, I forgot about CONNMARK ...

I get it. Makes perfect sense ... in typically convoluted Linux fashion :) It solves the problem of return traffic, which I'd been trying to solve with the veth pair.

So, for SNAT:

1. match internet-bound packets coming from the vnet interface and set a mark on the resulting conntrack entry
2. outgoing packets are routed via the implicit l3mdev rule, because the VRF interface is set as master for all vnet interfaces
3. the first packet matching the marked conntrack entry hits the SNAT rule and updates the conntrack entry, so that subsequent packets are also SNATed
4. set the packet mark on all packets coming from outside that match a marked conntrack entry (return packets) with --restore-mark (the vnet interface could also be excluded here to save some processing, since the restore is only needed in the opposite direction)
5. return packets that got marked with --restore-mark have to be routed using the VRF table in order to reach the VMs connected to the vnet, but they don't match the implicit l3mdev rule, because the inbound interface is not a slave of the VRF interface; hence the ip rule add fwmark <mark> lookup <vrf_table>
6. send return packets out via the vnet interface (a few commands to sanity-check this flow are below)
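Sanity checks, with the mark value and table ID as placeholders (the last command needs the conntrack-tools package):

Code:
# the policy rule that steers marked return traffic into the VRF table
ip rule show
# the VRF routing table the fwmark rule points to (example table ID 1000)
ip route show table 1000
# conntrack entries carrying the mark set from the vnet
conntrack -L --mark 10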

It also works if you only need masquerading, in which case you don't need the DNAT part of your example, nor individual rules for each VM. But yes, for VPS/cloud providers, VMs inside VRFs would definitely have to be accessible from outside, which requires DNAT as well.

Great stuff. Thank you!
 

If you are able to test, and if you end up with a working configuration (without DNAT, with DNAT, ...), please share it. (I don't have too much time to test it.)

It'll help me to implement it later.

I'm a bit busy with DHCP/IPAM currently, but the NAT daemon should be the next thing to implement.
 

Hi Spirit,

your example works just fine; it just implements full two-way NAT for each VM, which may not always be needed and certainly goes beyond what the EVPN SNAT option is supposed to do. Meaning, you can initiate connections both from a VM to the outside world and from the outside world to the VM, which is the best solution, but it requires a public IP plus a DNAT rule per VM. That is too much to configure if you only need VM-initiated traffic and the VM is not exposed to the outside world (the equivalent of the existing EVPN SNAT option, but without leaking VRF routes into the main routing table). In that case, you only need simple masquerading:

Code:
# mark connections originating from the vnet; restore the mark on packets arriving from anywhere else
iptables -t mangle -I PREROUTING -i <vnet> -s <subnet> -j CONNMARK --set-mark <m>
iptables -t mangle -I PREROUTING ! -i <vnet> -j CONNMARK --restore-mark

# masquerade marked (VM-initiated) traffic leaving via the host's LAN interface
iptables -t nat -I POSTROUTING -o <host's lan iface> -m connmark --mark <m> -j MASQUERADE

# route marked return packets using the zone's VRF table
ip rule add fwmark <m> lookup <vrf table>

That's it. This is basically the SNAT half of your example. It allows VMs connected to <vnet> to initiate connections to the outside world using the host's LAN address and, of course, lets their traffic return, but not the other way around.

Having DNAT as an option would be nice, of course, but it would require users to configure 1-to-1 mappings between public and private IPs for every VM they want exposed to the internet. Also, the ability to leak routes from VRFs shouldn't be removed, as it may come in handy sometimes, but it should be made optional.
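For completeness, a per-VM 1:1 mapping would look roughly like this; it just restates your earlier CONNMARK pattern for a single exposed VM, with made-up addresses, mark and table:

Code:
# mark the VM's outbound connections and inbound connections to its public address
iptables -t mangle -A PREROUTING -i <vnet> -s 10.0.1.10 -j CONNMARK --set-mark <m>
iptables -t mangle -A PREROUTING -d 203.0.113.10 -j CONNMARK --set-mark <m>
iptables -t mangle -A PREROUTING -j CONNMARK --restore-mark

# 1:1 NAT between the VM's public and private addresses
iptables -t nat -A PREROUTING -d 203.0.113.10 -j DNAT --to-destination 10.0.1.10
iptables -t nat -A POSTROUTING -s 10.0.1.10 -o <host's lan iface> -j SNAT --to-source 203.0.113.10

# route marked return traffic via the zone's VRF table
ip rule add fwmark <m> lookup <vrf table>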
 
Ok, perfect, thanks!
 
