Hello,
I've been playing with EVPN SDN on Proxmox VE 8.1.4 and I love it so far, but a few things have me scratching my head.
I apologize in advance if this topic has already been discussed, but the forum is just too huge to comb through it all.
Anyway, I have a 2-node cluster and I've been testing SDN between the nodes, first with a single EVPN zone, which works fine (except for the routing loop when using multiple exit nodes, which has been discussed before). Then I wanted to test another EVPN zone alongside the first one ...
The way the current implementation works, when exit nodes are used, vnet subnets are:
a) exported from BGP into the kernel's main routing table, if exit node local routing is not enabled
b) set as static routes in the main routing table, using a veth interface pair between the vrf and the main routing table as a return path into the vrf, if exit node local routing is enabled
In either case, the very fact that vrf routes are leaked into the main routing table makes it impossible to have multiple tenants with overlapping subnets. This is fine if tenants use public addresses, which are unique, or if you manage the tenants' address plan and can divide the private subnets between them. But if you want to let tenants manage their own address spaces, you're in trouble. When I added the same subnet to both EVPN zones, Proxmox managed to add/leak the route only for the first zone I added the subnet to, as expected. As a result, only the first zone had functional outside connectivity.
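For reference, the leak is easy to observe on the exit node. Something along these lines should show it (the zone name zone1, table name vrf_zone1 and subnet 10.0.0.0/24 are made-up examples, not taken from my actual setup):

```shell
# Routes inside the zone's vrf (table name follows the vrf_<zone> convention)
ip route show vrf vrf_zone1

# The same vnet subnet also shows up in the main table - that is the leak.
# With overlapping subnets, only the first zone's copy can exist here.
ip route show table main | grep 10.0.0.0/24
```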
Of course, this is easily solvable by not enabling exit nodes at all and provisioning a gateway container/VM, even an appliance VM like VyOS/MikroTik/pfSense/OPNsense, with one interface connected to the main routing table and the other connected to the vrf. But that is beside the point. My goal is to achieve SNAT on exit nodes without leaking vnet subnets into the main routing table, thus allowing tenants to use whatever addresses they want, even overlapping ones - without external elements. It goes without saying that this model implies complete isolation between tenants at all times, which may not be suitable for everyone, but could be useful to (small) private cloud providers.
Manually, I can do this by enabling a single exit node with local routing enabled, which provisions a veth pair between the main routing table and the vrf, and then manually removing the vnet routes. Next, I noticed that the veth pairs in both EVPN zones use the same private subnet 10.255.255.0/30, which is a topic for another discussion, so I manually changed the veth pair in the second EVPN zone to 10.255.255.4/30, with .5 on the "outside" interface (xvrf_&lt;zone&gt;) and .6 on the "inside" interface (xvrfp_&lt;zone&gt;). Lastly, I added a default route to the vrf pointing to 10.255.255.5, the outside end of the veth pair. Now the vrf traffic can directly reach the host, but cannot return, because I have removed the vnet routes from the main table.
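For the record, the manual re-plumbing described above boils down to roughly the following (zone name zone2 and the vnet subnet 10.0.0.0/24 are examples; interface names follow the xvrf_/xvrfp_ convention):

```shell
# Remove the vnet route Proxmox leaked into the main table
ip route del 10.0.0.0/24

# Re-address the second zone's veth pair to avoid the 10.255.255.0/30 clash
ip addr replace 10.255.255.5/30 dev xvrf_zone2    # "outside" end, main table
ip addr replace 10.255.255.6/30 dev xvrfp_zone2   # "inside" end, inside the vrf

# Default route inside the vrf, pointing at the outside end of the veth pair
ip route add default via 10.255.255.5 dev xvrfp_zone2 vrf vrf_zone2
```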
This is where it gets ugly. In order to avoid adding any routes to the host, especially vnet ones, I had to do a double masquerade:
1) first, when traffic exits the vrf via the xvrfp_ interface and enters the veth pipe, to translate its source address into something that is already known to the host via the connected route on the xvrf_ interface (because both xvrf_ and xvrfp_ are in the same /30 subnet).
2) second, when traffic exiting the veth pipe and entering the host via the xvrf_ interface needs to reach anything outside the host itself, by masquerading it to the host's LAN IP.
Code:
# NAT traffic leaving the vrf through the veth pipe to the pipe's own /30,
# so the host sees a source address it has a connected route for
iptables -t nat -I POSTROUTING -o xvrfp_+ -j MASQUERADE
# mark traffic arriving from any veth pair ...
iptables -t mangle -I PREROUTING -i xvrf_+ -j MARK --set-mark 123
# ... and masquerade it to the host's LAN IP when it leaves towards the outside
iptables -t nat -I POSTROUTING -m mark --mark 123 -o <hosts lan interface> -j MASQUERADE
It gets worse if the host's LAN IP is itself a private address: the internet gateway at the network's edge will then do yet another SNAT, but at least it works.
On the other hand, if the /30 subnets that are automatically assigned to the veth pairs could be manually configured via the Proxmox interface/API, a public subnet could be assigned instead, making the second masquerade unnecessary. Of course, this would require advertising those public subnets from the Proxmox host to the rest of the network via BGP (recommended) or IS-IS/OSPF (not recommended), or setting static routes for those subnets on the device the Proxmox host uses as its default gateway (not ideal, but a must if no routing protocols are in use network-wide).
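If such manual addressing were possible, advertising the public subnet from the host could look roughly like this with FRR, which Proxmox already ships for the EVPN control plane (the ASN 65001 and the prefix 203.0.113.0/30 are made-up examples):

```shell
vtysh <<'EOF'
configure terminal
router bgp 65001
 address-family ipv4 unicast
  network 203.0.113.0/30
 exit-address-family
EOF
```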
My questions for you folks are:
1. Has anyone had the same problem/requirements, or am I the only one who needs this?
2. Does anyone have a better idea or a more elegant solution for this, in general?
3. Does this scenario make enough sense to make its way into official Proxmox VE at some point, as one of the EVPN options?