SDN VXLAN for private network in a cluster - how to configure properly?

stevops

Aug 7, 2022
Hi Proxmoxers out there ;)

here is an abstraction of what I finally want to achieve in general:
Untitled Diagram.drawio.png

That means:
  • There is a cluster of at least two PVEs behind a firewall managed by PVE
    • An actor should be able to access the PVEs via SSH
  • Within the cluster I want to have an internal network
    • In general, instances in the internal network should be hidden from the outside world (incoming traffic) but should be able to communicate with the outside world (later this will be restricted to Linux update servers etc.; no restrictions for now)
    • The instances of the internal network are spread all over the nodes and might be migrated from node to node
    • An admin and other people should be able to access a reverse proxy that would also serve content from "More Services"
    • An admin should be able to access a vpn server to ultimately connect to instances in the internal network via SSH or other protocol
Then:
  • Giving the admin access to the PVE instances is straightforward using the PVE datacenter and PVE node firewall.
  • For building the internal network across multiple PVE instances I came across SDN and, more specifically, VXLANs, because separate Linux bridges on every node would overcomplicate the routing configuration.
    • Enabling SDN and creating a VXLAN without specific configuration was simple so far.
    • But how do the VXLAN/VNet/PVE nodes/VMs need to be configured so that the above scenario applies?
      • I would be happy if there is a solution that is independent of an additional gateway/router. Ideally, all the routing could be handled at the datacenter/node level.
      • How can incoming traffic from the internet be routed to specific instances in the internal network?
      • How can outgoing traffic from the internal network be routed to the internet?
      • Do I need a Subnet within the VNet?
      • ...

Please tell me if I need to provide more information. I think more details of what I have tried might be misleading at first, so I would like to keep this more abstract view of my problem if that is OK.
 
Hi,

Nice question. I have been thinking about something like this myself, but not exactly in the same way... I am only scratching the surface of this idea. My desired concept would be:

- I want to keep the PMX nodes with as few additional non-Proxmox packages as possible, so the idea of SDN on top of PMX is not what I would like to do (not 100% excluded)
- I would like to do VXLAN on my network devices (routers/switches; in my case all of them can do it), because I consider it easier and safer (I think a mistake in the SDN could affect all my PMX nodes, which would not be the case using VXLAN on routers/switches; maybe I am wrong, or I over-estimate my confidence...)
- regarding routing I would like to use OSPF; it is simpler, it is dynamic, and it can provide redundant routing paths

I have not started anything yet (only some draft drawings on paper and reading documentation about how VXLAN can be done on Linux), but in the near future I would like to spend more time on this concept.
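
Just to make the OSPF idea a bit more concrete, a minimal FRR snippet could look roughly like this (the router-id and network are placeholders, not a tested configuration):

    # /etc/frr/frr.conf (fragment) - assumed example values
    router ospf
     ospf router-id 192.168.2.10
     network 192.168.2.0/24 area 0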



Good luck / Bafta !
 
But how do the VXLAN/VNet/PVE nodes/VMs need to be configured so that the above scenario applies?
  • I would be happy if there is a solution that is independent of an additional gateway/router. Ideally, all the routing could be handled at the datacenter/node level.
  • How can incoming traffic from the internet be routed to specific instances in the internal network?
  • How can outgoing traffic from the internal network be routed to the internet?
  • Do I need a Subnet within the VNet?

You need BGP-EVPN VXLAN if you want to do routing between different VXLANs (each Proxmox node is then an anycast gateway for each VNet, and you need to define subnets with a gateway IP).

Exit nodes need to be configured in the EVPN zone to forward traffic to the real network.
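
As a rough illustration only (names like evpnctl, evzone and invnet3 are placeholders, and the exact option keys can differ between PVE versions), the resulting /etc/pve/sdn files could look something like this:

    # /etc/pve/sdn/controllers.cfg
    evpn: evpnctl
            asn 65000
            peers 192.168.2.10,192.168.2.11

    # /etc/pve/sdn/zones.cfg
    evpn: evzone
            controller evpnctl
            vrf-vxlan 10000
            exitnodes pve1,pve2
            mtu 1450

    # /etc/pve/sdn/vnets.cfg
    vnet: invnet3
            zone evzone
            tag 11000

    # /etc/pve/sdn/subnets.cfg
    subnet: evzone-10.0.0.0-24
            vnet invnet3
            gateway 10.0.0.1

After editing, the configuration still has to be applied from Datacenter -> SDN (or via pvesh set /cluster/sdn).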
 
Hi @spirit, thank you for answering. You wrote that BGP-EVPN is necessary if I want to do routing between different VXLANs. As far as I understand it, I would like to avoid multiple VXLANs; one VXLAN should be enough to connect the 10.0.0.x instances (they can ping each other within the same VXLAN/VNet).
 
I said that because you said:

I would be happy if there is a solution that is independent of an additional gateway/router. It would be the best in my eyes if all the routing can be enabled in the datacenter/node level.

I don't know where your current gateway is (the clusterfw on the schema?).
If yes, you need to route between the real world and the VXLAN there with 2 NICs + NAT; no need for EVPN, only a static VXLAN.
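
For what it's worth, on such a gateway the NAT part could look roughly like this (the interface name and subnet are assumptions, not a verified configuration):

    # enable routing between the two NICs (persist it in /etc/sysctl.conf)
    sysctl -w net.ipv4.ip_forward=1
    # masquerade traffic from the internal VXLAN subnet leaving via the WAN-facing NIC
    iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o eth0 -j MASQUERADE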
 
Oh, sorry, I forgot to answer. I had another issue to solve first, also wanted to investigate the whole technology first, and then forgot to reply ;)
@spirit Could you please have a look at my current setup again? Here is what I got:

- An EVPN controller
- An EVPN zone with both PVEs as exit nodes
- A VNet within the EVPN zone
- A subnet for my internal network
000049_2022-10-30 15_06_40-Document1 - Word.png

And here is the network map a little more simplified with focus on internal routing first:
- One PVE holds the reverse proxy
- One PVE holds VPN server
The network interfaces of both are attached to bridge invnet3 with the corresponding gateway.
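
For illustration, attaching a container or VM NIC to that VNet bridge looks roughly like this (the VMIDs and addresses here are just placeholders):

    # container: address and gateway set on the host side
    pct set 101 -net0 name=eth0,bridge=invnet3,ip=10.0.0.2/24,gw=10.0.0.1
    # VM: bridge only, IP/gateway configured inside the guest
    qm set 102 --net0 virtio,bridge=invnet3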



Untitled Diagram(2).drawio.png
Cluster FW is turned off for testing.
Right now: Should I be able to ping 10.0.0.3 from 10.0.0.2 and vice versa?
It seems that I can't, no matter whether I activate SNAT for the subnet or not (in the end I would like to enable SNAT, but that is another story for later).

Is EVPN not enough? Do I need BGP-EVPN? But as far as I understand, that would also mean I must have different subnets to route between (one for PVE1 and one for PVE2???), and that would be nasty if I wanted to migrate a machine to another PVE host and had to change its IP address.


EDIT: Given the configuration above, the following is now possible:
10.0.0.3 can ping 10.0.0.2 and vice versa.
PVE1 can ping 10.0.0.2 but PVE1 cannot ping 10.0.0.3
PVE2 can ping 10.0.0.3 but PVE2 cannot ping 10.0.0.2

That PVE1 cannot ping 10.0.0.3 and so on is a problem for me:
If I expose 10.0.0.3:1194 via PVE2 (iptables -t nat -A PREROUTING -i vmbr0 -p udp --dport 1194 -j DNAT --to-destination 10.0.0.3:1194) it works.
If I expose 10.0.0.3:1194 via PVE1 (iptables -t nat -A PREROUTING -i vmbr0 -p udp --dport 1194 -j DNAT --to-destination 10.0.0.3:1194) it will not work. What else needs to be done?
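
My assumption is that something along these lines is also needed on the forwarding node, but this is an untested sketch on my side:

    # allow the DNATed traffic to be forwarded into the VNet
    iptables -A FORWARD -p udp -d 10.0.0.3 --dport 1194 -j ACCEPT
    # SNAT/masquerade so replies from 10.0.0.3 return via the same node
    iptables -t nat -A POSTROUTING -p udp -d 10.0.0.3 --dport 1194 -j MASQUERADE
    # and IP forwarding has to be enabled
    sysctl -w net.ipv4.ip_forward=1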

In the end I want to have a keepalived setup where PVE1 and PVE2 share the virtual IP 192.168.2.15 and all exposed ports are accessible via that IP. I also want to be able to migrate containers from one PVE to the other without reconfiguring iptables etc. So ultimately there must be a solution to route traffic from, for example, PVE1 to a machine hosted on PVE2, like 10.0.0.3.
 

Okay, one decisive mistake of mine was not defining a primary exit node. Got that. With a primary exit node set (PVE1) it is as follows:

Connect from PC --> PVE1(192.168.2.10/10.0.0.1):1194 --> VPN Server (10.0.0.3):1194 works, ALTHOUGH the VPN server is hosted on PVE2
Connect from PC --> PVE2(192.168.2.11/10.0.0.1):1194 --> VPN Server (10.0.0.3):1194 works, BECAUSE the VPN server is hosted on PVE2

Next step with Keepalived (Master/Backup with vIP 192.168.2.15):
If I tie the primary exit node to being the HA master in Keepalived, it also works as expected:

A) Master (PVE1) is up and PVE1 is also the primary exit node if it is up
PC --> PVE HA(192.168.2.15/10.0.0.1):1194 --> VPN Server (10.0.0.3):1194 works, because traffic goes from 192.168.2.15 --> 192.168.2.10 --> ...

B) Master is down, backup (PVE2) jumps in, PVE2 is now also the primary exit node
PC --> PVE HA(192.168.2.15/10.0.0.1):1194 --> VPN Server (10.0.0.3):1194 works, because traffic goes from 192.168.2.15 --> 192.168.2.11 --> ...

I hope that is all intended behaviour :)

Now, what is finally interesting for me and what I will try out as soon as I can: what about PVE1, PVE2 and a third PVE3 together, if every PVE is also an exit node? If the primary exit node goes down, I expect that I have to align the Keepalived priorities and the primary exit node takeover in the same way (see the sketch after this list). Like so:
PVE 1: Keepalived Master and primary exit node
PVE 2: Keepalived Backup with higher priority than PVE3 (Keepalived and exit node takeover)
PVE 3: Keepalived Backup with lower priority than PVE2 (Keepalived and exit node takeover)
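
A minimal keepalived sketch for that priority scheme could look like this (the interface name, router id and priorities are assumptions; keepalived only moves the vIP, so the exit node takeover itself would presumably need something like a notify script on top):

    # /etc/keepalived/keepalived.conf on PVE1 (PVE2/PVE3 use state BACKUP and lower priorities)
    vrrp_instance VI_1 {
        state MASTER
        interface vmbr0
        virtual_router_id 51
        priority 150          # e.g. PVE2: 100, PVE3: 50
        advert_int 1
        virtual_ipaddress {
            192.168.2.15/24
        }
    }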
 
Hi, the "primary" exit node option, change the weight of the default evpn route, so outgoing traffic is going through this node on priority.
Other, secondary nodes, have currently same lower weight. (because nobody have asked about more than 2 exit-nodes with failover/active-backup. And it became to be complex if you want NAT + sync conntrack between more than 2 nodes.

So for now, only 2 exit nodes are supported with the primary option.
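
If you want to see which node currently carries the default route, one way (assuming the SDN layer creates a VRF named vrf_<zone>; adjust the name to your setup) is:

    # list the VRFs on the node
    ip -d link show type vrf
    # show the routing table of the zone VRF, including the default route
    ip route show vrf vrf_evzone
    # or ask FRR directly
    vtysh -c "show ip route vrf vrf_evzone"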
 
Hi, the "primary" exit node option, change the weight of the default evpn route, so outgoing traffic is going through this node on priority.
Other, secondary nodes, have currently same lower weight. (because nobody have asked about more than 2 exit-nodes with failover/active-backup. And it became to be complex if you want NAT + sync conntrack between more than 2 nodes.

So for now, only 2 exit nodes are supported with primary option.
Don't forget one thing about the SDN: the limit is roughly 10 VMs per 1 Gbit NIC.
Otherwise the performance goes down and you cannot even open a website served by a VM on one of the other nodes.
I learned it the hard way. After I put a 10 Gbit NIC in, the SDN worked properly again. I have roughly 10 SDNs as VXLAN and 4 nodes, with roughly 30 VMs. The performance is really a huge factor.
As a conclusion: it's not for big networks, or you need to get the traffic outside of your nodes (but I don't know how the performance would be in that scenario).
 
Well, I have some big hosts with 600 VMs + EVPN.
There is no correlation between bandwidth and the number of VMs.
Of course, VXLAN encapsulation uses more CPU (correlated with the bandwidth usage), but good NICs (generally >10 Gbit cards) have native VXLAN offloading.
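
If you want to check whether your NIC offers that offload, something like the following should show it (replace eth0 with your physical NIC):

    # look for the tx-udp_tnl-segmentation / tx-udp_tnl-csum-segmentation features
    ethtool -k eth0 | grep -i udp_tnl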
 

Sure there is a correlation: if you have 20 VMs across your cluster included in the SDN, they need to talk to each other. For example, your filer is on node 1 but your LDAP for auth is on node 4; every time you access the files your permissions get queried, plus SSH sessions (permanent streams). And now, like me, have 10 SDNs on 1 x 1 Gbit, with different subnets and so on.
I experienced the performance issue with roughly 12 VMs on a 1 Gbit SDN connection and 4 hosts (all SDNs stretch over all 4 nodes). I replaced the 1 Gbit with a 10 Gbit NIC (no other changes were made) and now I don't experience lags anymore (in Windows RDP it was laggy and delayed, and I could not open websites hosted on the same subnet from a server on another node).
Are you telling me that EVPN has better performance?
Can you tell me more about your config?
I am wondering how 600 VMs on different nodes talk to each other over a single 10 Gbit card!! (but maybe in a PM or a new thread)
I want to open a new post to get your experience on whether my cluster can be improved. Will do that in the next couple of days.
 
BTW: I use OPNsense.
All public IPs are on the OPNsense and I created as many subnets as I needed. OPNsense does the rest: proxy, VPN, load balancing, NAT, etc., all you need. If you like I can send you a small drawing.
 
Are you telling me that EVPN has better performance?
Well, this thread was about EVPN, so I assumed that you used EVPN too.
With EVPN, BUM traffic (ARP, broadcast, unknown unicast, ...) is not flooded to all VXLAN tunnels; instead, the BGP controller exchanges IP/MAC information between the hosts.

This is the only difference, and maybe because of this, without EVPN, the more VMs you have, the more ARP/broadcast traffic you'll have.
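
If you want to see what the controller has actually learned, FRR's vtysh on the nodes can show the EVPN routes:

    # type-2 routes are the MAC/IP advertisements exchanged between the hosts
    vtysh -c "show bgp l2vpn evpn route"
    # peering overview
    vtysh -c "show bgp l2vpn evpn summary"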


Can you tell me more about your config?
(I have 4000 VMs in production, on 100 hypervisors, with east-west and north-south traffic.) Mostly web/database for a very big website
(running 2x10 Gbit on each host).

But definitely, if you want to scale, you should try EVPN.
 
Hi Pille, I would appreciate that drawing, even if it is just for learning purposes :)
 
