SDN EVPN issue - routing unstable, all routing stops every few days

kwslavens

Member
Sep 8, 2022
We've got a 5-node Proxmox cluster running two VXLAN zones and an EVPN zone with exit nodes on all 5 servers.
The EVPN zone works as expected, but every few days routing completely stops for all virtual devices in the zone. Devices in the zone can still talk to each other, but no routing out through FRR is possible other than ICMP (pings). We can ping north/south and east/west, yet no other protocols seem to function. Re-applying the SDN configuration makes the problem disappear; I'm assuming that restarting the frr service is what clears it.
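For reference, the workaround looks roughly like this (just a sketch; I believe pvesh set /cluster/sdn is the CLI equivalent of the GUI "Apply" button, but check on your own setup):

pvesh set /cluster/sdn      (re-apply the SDN configuration cluster-wide)
systemctl restart frr       (or restart FRR directly on the affected node)
systemctl status frr        (confirm the daemons came back up)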

We have not found any error messages indicating a problem on the hosts. The frr.log doesn't show any issues.

Has anyone else experienced this issue? We're seeing the loss of routing every few days.
 
I've enabled debugging on all 5 hosts. I'm waiting for the issue to happen again, and I'll post log data then.
 
Hi. It's strange that ICMP works but other protocols don't. EVPN is about routing MAC and IP addresses, not specific protocols.

(I'm running 100 nodes with EVPN, so I'm 100% sure the current kernel and frr don't have a bug.)

How do you route traffic from the exit-nodes to the external world? (A simple static gateway? BGP?)

Do you use a primary exit-node, or do you load-balance between all exit-nodes? (ICMP could be balanced differently than TCP if all exit-nodes are active.)
 
Hi. It's strange that ICMP works but other protocols don't. EVPN is about routing MAC and IP addresses, not specific protocols.

(I'm running 100 nodes with EVPN, so I'm 100% sure the current kernel and frr don't have a bug.)

How do you route traffic from the exit-nodes to the external world? (A simple static gateway? BGP?)

Do you use a primary exit-node, or do you load-balance between all exit-nodes? (ICMP could be balanced differently than TCP if all exit-nodes are active.)

Just a simple default gateway. I have the 3rd node set as the primary exit-node, and all 5 nodes are configured as exit-nodes.
It's set up pretty much directly from the SDN example for EVPN.

It's good to hear that you've got 100 nodes using EVPN with no issues. Gives me hope. It also makes me wonder what I did wrong; if it works fine on a large deployment, then what is different about this 5-node cluster?

The cluster is running on 5 Cisco blades, so I couldn't use the ISO installer. I had to install Debian 11 first and get multipath working correctly for the storage controllers, then installed PVE on top. I wouldn't think that could be the source of the problem.
I attempted to upgrade to the newest stable frr and found out pretty quickly that it doesn't work; I'm assuming there are some changes in the Proxmox frr package to include the PVE config files. I rolled back to the standard PVE package version.

Any advice on how to troubleshoot the issue to track down the cause?
 
Just a simple default gateway. I have the 3rd node set as the primary exit-node, and all 5 nodes are configured as exit-nodes.
It's set up pretty much directly from the SDN example for EVPN.

How did you set up the routes on the other side (on your router)?
You should have done something like "route add <evpn zone subnets> gw xxxx".

So, depending on your router model, how have you implemented that? Are you able to use multiple gateways (ECMP routing)? If not, you need some kind of keepalived VIP on the Proxmox exit-nodes (or use BGP between your routers and the Proxmox exit-nodes).
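For example, on a Linux-based router the ECMP variant would look something like this (just a sketch with placeholder addresses, using iproute2):

ip route add <evpn zone subnet> nexthop via <exit-node-1 ip> nexthop via <exit-node-2 ip>
(or, with a single gateway: ip route add <evpn zone subnet> via <primary exit-node ip>)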



Note that 2 exit-nodes are enough for redundancy. (For example, in my network I'm using 2 Arista switches (supporting EVPN) as exit-nodes, with 100 hypervisors behind them.)


It's good to hear that you've got 100 nodes using EVPN with no issues. Gives me hope. It also makes me wonder what I did wrong; if it works fine on a large deployment, then what is different about this 5-node cluster?

The cluster is running on 5 Cisco blades, so I couldn't use the ISO installer. I had to install Debian 11 first and get multipath working correctly for the storage controllers, then installed PVE on top. I wouldn't think that could be the source of the problem.
I attempted to upgrade to the newest stable frr and found out pretty quickly that it doesn't work; I'm assuming there are some changes in the Proxmox frr package to include the PVE config files. I rolled back to the standard PVE package version.
You just need to use the frr and ifupdown2 packages from the Proxmox repository. (frr had a lot of bugs in the past; I have tested it well with EVPN, and the Proxmox package sometimes includes patches from the stable branch, because frr doesn't always release minor versions and you would otherwise need to compile it yourself. So use the Proxmox package version. ;)
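To check where your installed packages come from, something like this should work (just a sketch):

apt policy frr ifupdown2                (the installed/candidate versions should point at the Proxmox repository)
apt install --reinstall frr ifupdown2   (pull the Proxmox-packaged builds back in if they were replaced)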
Any advice on how to troubleshoot the issue to track down the cause?

If VMs are still able to communicate inside the EVPN network, then EVPN itself is working correctly, and the problem is with the path between the outside world and the EVPN network.
 
I'm not sure I understand completely. We have nothing configured outside of Proxmox. We're not routing to the EVPN zone from external devices; we're basically treating it as a firewalled/NAT network, with outbound-originated traffic only. For anything that needs to access resources in the EVPN zone, I'm using an HAProxy setup, or a VM jump box for SSH access.
 
I'm not sure I understand completely. We have nothing configured outside of Proxmox. We're not routing to the EVPN zone from external devices; we're basically treating it as a firewalled/NAT network, with outbound-originated traffic only. For anything that needs to access resources in the EVPN zone, I'm using an HAProxy setup, or a VM jump box for SSH access.
Oh, OK, so only outbound traffic with NAT. Could you share your /etc/pve/sdn/*.cfg?

I really don't see why it would drop (unless your primary exit-node goes down).
But if you have an exit-node failover, you need to use something like conntrackd to sync conntrack state between the exit-nodes, or currently established connections will be dropped.

Another possibility:

You may also need to increase the conntrack max, because the default is 32000. If the conntrack table is saturated (from a DDoS, for example), no new connections are possible.

To verify:

# apt install conntrack
# conntrack -L
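To see whether the table is anywhere near the limit, and to raise it, something like this (a sketch; pick your own value and persist it in /etc/sysctl.d/ if it helps):

# conntrack -C                                        (current number of tracked connections)
# sysctl net.netfilter.nf_conntrack_max               (current limit)
# sysctl -w net.netfilter.nf_conntrack_max=262144     (raise it)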
 
Config below. Only the vxprod network fails; the others keep working when vxprod stops.
Conntrack indicates fewer than 500 flow entries.
I'll increase the tracking max, but it doesn't look like that is the problem.

evpn: vxprod
    asn 65000
    peers 172.20.98.32, 172.20.98.33, 172.20.98.34, 172.20.98.35, 172.20.98.36

powerdns: pdnsauth1
    key -------------------------------------------------------------------------------
    url http://pdns-auth1.ccst.net:8081/api/v1/servers/localhost
    ttl 300

pve: pve

netbox: NetBox1
    token ----------------------------------------------------------------------------
    url http://netbox1.ccst.net:8000/api

subnet: vxrisky-10.9.0.0-24
    vnet HighRisk
    dnszoneprefix vxrisky.ccst.net
    gateway 10.9.0.1

subnet: vxtstlab-10.10.0.0-21
    vnet TestLab
    dnszoneprefix testlab.ccst.net
    gateway 10.10.0.1

subnet: vxprod-10.11.0.0-20
    vnet vxprod
    gateway 10.11.0.1
    snat 1

subnet: vxprod-192.168.100.0-24
    vnet vxprod
    gateway 192.168.100.1
    snat 1

subnet: vxprod-192.168.122.0-24
    vnet vxprod
    gateway 192.168.122.1
    snat 1

vnet: HighRisk
    zone vxrisky
    alias vxrisky-vnets
    tag 100000

vnet: TestLab
    zone vxtstlab
    alias vxtstlab-testlab
    tag 100001

vnet: vxprod
    zone vxprod
    alias vxprod Primary production
    tag 11000

vxlan: vxrisky
    peers 10.8.0.1,10.8.0.2,10.8.0.3,10.8.0.4,10.8.0.5
    dns pdnsauth1
    dnszone ccst.net
    ipam NetBox1
    mtu 1450
    reversedns pdnsauth1

vxlan: vxtstlab
    peers 10.8.0.1,10.8.0.2,10.8.0.3,10.8.0.4,10.8.0.5
    dns pdnsauth1
    dnszone ccst.net
    ipam NetBox1
    mtu 1450
    reversedns pdnsauth1

evpn: vxprod
    controller vxprod
    vrf-vxlan 10000
    dns pdnsauth1
    dnszone ccst.net
    exitnodes ccst-ostackbbu4,ccst-ostackbbu3
    exitnodes-primary ccst-ostackbbu3
    ipam NetBox1
    mac 32:F4:05:FE:6C:0A
    mtu 1450
    reversedns pdnsauth1
 
This may be nothing, but I can't seem to find the ip_conntrack_max value at all. I checked the path, and it truly isn't there, on all 5 hosts.

sysctl net.ipv4.netfilter.ip_conntrack_max
sysctl: cannot stat /proc/sys/net/ipv4/netfilter/ip_conntrack_max: No such file or directory

I did find this, however:
sysctl net.nf_conntrack_max
net.nf_conntrack_max = 262144



It appears at first glance that values are missing.
I don't have anything like conntrackd installed; I didn't know that I needed it. I'll look into that.
 
This may be nothing, but I can't seem to find the ip_conntrack_max value at all. I checked the path, and it truly isn't there, on all 5 hosts.

sysctl net.ipv4.netfilter.ip_conntrack_max
sysctl: cannot stat /proc/sys/net/ipv4/netfilter/ip_conntrack_max: No such file or directory

I did find this, however:
sysctl net.nf_conntrack_max
net.nf_conntrack_max = 262144
The proc path is /proc/sys/net/nf_conntrack_max
(it should be the same value).

It appears at first glance that values are missing.
I don't have anything like conntrackd installed; I didn't know that I needed it. I'll look into that.
conntrackd is not mandatory, but I recommend it in case of failover, to avoid hangs of currently established connections.
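If you do set it up, the rough shape is as follows (just a sketch; the actual sync section in /etc/conntrackd/conntrackd.conf has to be configured for your pair of exit-nodes, which I'm not showing here, so check conntrackd(8) for your version):

# apt install conntrackd        (on both exit-nodes)
# conntrackd -s                 (once configured, show synchronization statistics)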


Looking at your config, it seems pretty basic; I really don't know why it's hanging after X days.

For debugging, when the problem happens again, it would be interesting to see the output of the following (on each node):

# vtysh -c "sh ip bgp l2vpn evpn"
# vtysh -c "sh ip bgp summary"
# vtysh -c "sh ip route vrf all"


BTW, what is your currently running kernel version?
 
The proc path is /proc/sys/net/nf_conntrack_max
(it should be the same value).


conntrackd is not mandatory, but I recommend it in case of failover, to avoid hangs of currently established connections.


Looking at your config, it seems pretty basic; I really don't know why it's hanging after X days.

For debugging, when the problem happens again, it would be interesting to see the output of the following (on each node):

# vtysh -c "sh ip bgp l2vpn evpn"
# vtysh -c "sh ip bgp summary"
# vtysh -c "sh ip route vrf all"


BTW, what is your currently running kernel version?


I'll gather that output from all nodes as soon as the issue happens again. It last happened on 11/4, so it should be any time now. All 5 nodes are patched and current. Kernel version below.


uname -a

Linux ccst-ostackbbu1 5.15.64-1-pve #1 SMP PVE 5.15.64-1 (Thu, 13 Oct 2022 10:30:34 +0200) x86_64 GNU/Linux
 
Quick update: it's been more than a week now with no issue at all. We've been unable to troubleshoot further since the issue hasn't happened again. This is out of the norm; we'd been seeing the problem every 3-4 days at most for several months.

Only three things have changed:
1) The frr and tools packages were removed and completely reinstalled.
2) The number of exit nodes was reduced from 5 to 2 (one primary, one secondary).
3) All hosts were patched and updated so their package versions match 100%.

That's it. Either it's just random luck, or one of those three things solved the problem.
I will come back in a couple of weeks and update if the problem still has not resurfaced, or sooner if it happens again.

Fingers Crossed!
 