Slow speeds using EVPN

Manih

Member
Aug 18, 2020
Hello all,

I have been testing the SDN network feature and successfully set up an EVPN zone using the steps outlined in the documentation: https://pve.proxmox.com/pve-docs/chapter-pvesdn.html#pvesdn_zone_plugin_evpn

Everything seems to be working, but the speed I am getting between hosts is slower than I expected. I am using bonded 25G Mellanox adapters, which should support VXLAN offloading, so I would expect them to reach speeds similar to those with VLANs.

Does anyone have experience with this? Are these speeds normal, or is there something I am missing?


Here is the speed I am getting between two VMs with a single thread with only VLAN encapsulation:
# iperf -c 172.24.210.239
------------------------------------------------------------
Client connecting to 172.24.210.239, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  1] local 172.24.210.154 port 49416 connected with 172.24.210.239 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0167 sec  8.03 GBytes  6.89 Gbits/sec

On the same two VMs, over the EVPN (VXLAN) network, I am getting a bit less:
~# iperf -c 10.22.22.1
------------------------------------------------------------
Client connecting to 10.22.22.1, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  1] local 10.22.22.3 port 36520 connected with 10.22.22.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0251 sec  4.81 GBytes  4.12 Gbits/sec

As far as I can tell, all offloading options that might matter for VXLAN are enabled:

~# ethtool -k bond0
Features for bond0:
rx-checksumming: off [fixed]
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [requested on]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp-mangleid-segmentation: on
tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: on [fixed]
netns-local: on [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [requested on]
tx-esp-segmentation: off
tx-udp-segmentation: on
tx-gso-list: off [requested on]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: on [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off
esp-tx-csum-hw-offload: off
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
rx-udp-gro-forwarding: off
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]
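
To avoid reading the whole feature list by eye, the output can be filtered down to the flags that plausibly matter for VXLAN. This is only a sketch: `vxlan_offloads` is a hypothetical helper name, and the choice of flags (the `udp_tnl` ones plus GSO/GRO) is an assumption, not something confirmed in this thread.

```shell
# Hypothetical helper: keep only the offload flags that plausibly matter for VXLAN.
# Reads `ethtool -k` output on stdin and prints the matching lines.
vxlan_offloads() {
    grep -E '^[[:space:]]*(tx-udp_tnl-segmentation|tx-udp_tnl-csum-segmentation|generic-segmentation-offload|generic-receive-offload|rx-udp_tunnel-port-offload):'
}

# Usage on a host (interface names as in this thread):
#   ethtool -k bond0 | vxlan_offloads
#   ethtool -k ens10f0np0 | vxlan_offloads
```

Since bond0 is a software device, it may be worth running the same filter against the physical ports as well and comparing the results.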
 
I forgot to mention that I have adjusted the MTU on the host to 9000 and the interface on the VMs to 1500.

The host network is like:

Interfaces(ens10f0 and ens10f1)->bond0->vmbr0

Code:
2: ens10f0np0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
3: ens10f1np1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000

7: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue master vmbr0 state UP mode DEFAULT group default qlen 1000

10: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000

42: vxlan_testnet: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue master testnet state UNKNOWN mode DEFAULT group default qlen 1000

And on the VMs:

root@vxlantest:~# ip link show eth0
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
 
Hi, I'm testing with ConnectX-4 cards,
and I'm currently also getting around 4 Gbit/s from VM to VM.

I'm seeing one ksoftirqd process on the host spiking to 40% of one core. That seems curious.


Doing the same iperf2 test directly from the hypervisor, using -P2, I'm able to reach 20 Gbit/s without problems,
and with no ksoftirqd activity at all.

For testing between the two hosts, I'm using a direct vxlan interface, without any bridge:

host1
Code:
ip link add name vxlan16 type vxlan id 16 dev bond0 remote <ipofhost2> dstport 4789
ifconfig vxlan16 192.168.60.10/24 up
ifconfig vxlan16

host2
Code:
ip link add name vxlan16 type vxlan id 16 dev bond0 remote <ipofhost1> dstport 4789
ifconfig vxlan16 192.168.60.11/24 up
ifconfig vxlan16

then iperf between 192.168.60.X


I'll try to do more tests
 
I have done the same test with a veth pair created on the hypervisor, plugged into the bridge, with an IP address on the veth interface
(so no virtualisation involved).

I have the same problem: around 4-5 Gbit/s, with high ksoftirqd usage.

So it seems that offloading is not working when the bridge is forwarding packets to a logical interface.

I will dig into the kernel documentation.
 
Hi, that is interesting. I haven't tried using a bridge on the host.

I might try to install the newest drivers from Mellanox (mlnx-en), but I'm not sure which package to try, since Proxmox is based on Debian but uses an Ubuntu kernel. Have you tried installing those?
 
Just tested with kernel 5.10, and it seems to be working fine.

Kernels 5.13 through 5.19 have bad performance.

I'm going to try 5.11, but it looks like a regression (not sure if it comes from the Mellanox driver or from something else in the kernel).
 
Kernel 5.11 is slow.

So something seems to have changed there. (I have seen some offloading changes in the mlx5 driver in 5.11, but they were reverted in recent 5.15/5.19 kernels.)

Could you try installing kernel 5.10 to compare on your side?
 
Thank you so much for testing this. I downgraded two hosts to 5.10 and there is definitely a difference.

I'm still only getting around 5gbps with a single thread. But multiple threads perform much better.

I'm also seeing better performance on the VLAN bridge.

Here is a single thread test on VXLAN:

Code:
------------------------------------------------------------
Client connecting to 10.22.22.3, TCP port 5001
TCP window size: 1.39 MByte (default)
------------------------------------------------------------
[  3] local 10.22.22.1 port 34964 connected with 10.22.22.3 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  5.67 GBytes  4.86 Gbits/sec

But using multiple threads I get a lot more speed. Before I was always capped at 4gbps

Code:
------------------------------------------------------------
Client connecting to 10.22.22.3, TCP port 5001
TCP window size:  833 KByte (default)
------------------------------------------------------------
[  7] local 10.22.22.1 port 46146 connected with 10.22.22.3 port 5001
[  8] local 10.22.22.1 port 46162 connected with 10.22.22.3 port 5001
[  3] local 10.22.22.1 port 46134 connected with 10.22.22.3 port 5001
[  4] local 10.22.22.1 port 46114 connected with 10.22.22.3 port 5001
[  5] local 10.22.22.1 port 46116 connected with 10.22.22.3 port 5001
[  6] local 10.22.22.1 port 46132 connected with 10.22.22.3 port 5001
[  9] local 10.22.22.1 port 46172 connected with 10.22.22.3 port 5001
[ 10] local 10.22.22.1 port 46176 connected with 10.22.22.3 port 5001
[ ID] Interval       Transfer     Bandwidth
[  7]  0.0-10.0 sec  2.05 GBytes  1.76 Gbits/sec
[  6]  0.0-10.0 sec  2.21 GBytes  1.90 Gbits/sec
[  8]  0.0-10.0 sec  2.35 GBytes  2.02 Gbits/sec
[  3]  0.0-10.0 sec  2.11 GBytes  1.81 Gbits/sec
[  9]  0.0-10.0 sec  2.06 GBytes  1.76 Gbits/sec
[  5]  0.0-10.0 sec  2.36 GBytes  2.02 Gbits/sec
[ 10]  0.0-10.1 sec  1.95 GBytes  1.66 Gbits/sec
[  4]  0.0-10.1 sec  1.51 GBytes  1.29 Gbits/sec
[SUM]  0.0-10.1 sec  16.6 GBytes  14.1 Gbits/sec
 
OK, thanks.
(Note that you may be CPU limited with one thread for the iperf client (check whether iperf is at 100% CPU); this could explain why you get more bandwidth with multiple threads.)

I'll try to find servers with non-Mellanox cards to see whether it's a Mellanox driver bug or another kernel bug.
 
Just tested with a Broadcom bnx2x on kernel 5.15: I don't see any problem, it works like 5.10 with Mellanox.

So maybe it's coming from the mlx5 driver.

I'll try to contact the Mellanox developers or bisect the kernel myself, but it will take some time.
 
@Manih

I forgot to ask you: do you use an Intel or AMD processor?

In my case it's an AMD processor, and it seems that disabling the IOMMU fixes it (at least on the bad 5.11 kernel):

amd_iommu=off in /etc/default/grub.

(It's disabled by default on Intel; maybe that's why it was working with the Broadcom NIC, as that was an Intel server.)
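
After editing the grub configuration and rebooting, it can help to confirm that the parameter actually reached the running kernel. A minimal sketch, assuming a Linux host with /proc/cmdline; `has_kernel_param` is a hypothetical helper name, not an existing tool.

```shell
# Hypothetical helper: succeed if a kernel command line contains the given parameter.
# $1 = parameter to look for (e.g. amd_iommu=off)
# $2 = command line string to search (optional; defaults to the running kernel's)
has_kernel_param() {
    cmdline=${2:-$(cat /proc/cmdline 2>/dev/null)}
    for word in $cmdline; do
        [ "$word" = "$1" ] && return 0
    done
    return 1
}

if has_kernel_param amd_iommu=off; then
    echo "amd_iommu=off is active"
else
    echo "amd_iommu=off is not on the running kernel's command line"
fi
```

If the parameter is missing after a reboot, the boot configuration was probably not regenerated (for example with update-grub on a grub-based install).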
 
Hi, These are AMD Epyc processors.

I thought I would have to enable IOMMU specifically. But I see from dmesg that it seems to be enabled by default. I will disable it as you suggest and try the 5.15 kernel again.
 
It doesn't work quite as well on 5.15 with IOMMU off.

With just one thread I am back at 4gbps:

Code:
root@vxlantest:~# iperf -c 10.22.22.1
------------------------------------------------------------
Client connecting to 10.22.22.1, TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  1] local 10.22.22.3 port 56888 connected with 10.22.22.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0441 sec  4.74 GBytes  4.05 Gbits/sec

But with multiple threads it is a bit faster than it was with IOMMU on, though still a lot slower than on kernel 5.10:

Code:
[  1] local 10.22.22.3 port 49672 connected with 10.22.22.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  7] 0.0000-10.1084 sec   831 MBytes   690 Mbits/sec
[  5] 0.0000-10.1280 sec   847 MBytes   702 Mbits/sec
[  6] 0.0000-10.1279 sec  1.02 GBytes   866 Mbits/sec
[  1] 0.0000-10.1382 sec  1.38 GBytes  1.17 Gbits/sec
[  4] 0.0000-10.1385 sec  1.22 GBytes  1.04 Gbits/sec
[  8] 0.0000-10.1582 sec   959 MBytes   792 Mbits/sec
[  2] 0.0000-10.1785 sec   806 MBytes   664 Mbits/sec
[  3] 0.0000-10.4103 sec   985 MBytes   793 Mbits/sec
[SUM] 0.0000-10.3749 sec  7.95 GBytes  6.58 Gbits/sec
 
Note that the latest 5.15 kernels have the retbleed mitigation enabled, and I know it can impact performance a lot.
(Maybe you can add "retbleed=off" in grub.)

Currently, I'm running kernel 5.13 + amd_iommu=off; I'm at around 5.5 Gbit/s with 1 thread, and it scales fine with multiple threads.
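
Whether the mitigation is currently active can be read from sysfs on kernels that know about Retbleed. A minimal sketch; `retbleed_status` is a hypothetical helper name, and on kernels that predate the mitigation the file simply will not exist.

```shell
# Hypothetical helper: print the kernel's own report of the Retbleed mitigation state.
retbleed_status() {
    f=/sys/devices/system/cpu/vulnerabilities/retbleed
    if [ -r "$f" ]; then
        echo "retbleed: $(cat "$f")"
    else
        echo "retbleed: not reported by this kernel"
    fi
}

retbleed_status
```

Comparing this line before and after adding retbleed=off shows whether the boot parameter took effect.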
 
Hi,

Retbleed definitely has an effect. With IOMMU and retbleed off, I am getting around the same speed for single thread as on 5.10 and just a little bit less on multiple threads.

I guess we will just have to evaluate if the performance we are getting with retbleed mitigation on is sufficient.

Code:
------------------------------------------------------------
Client connecting to 10.22.22.3, TCP port 5001
TCP window size:  187 KByte (default)
------------------------------------------------------------
[  5] local 10.22.22.1 port 41682 connected with 10.22.22.3 port 5001
[  3] local 10.22.22.1 port 41672 connected with 10.22.22.3 port 5001
[  4] local 10.22.22.1 port 41696 connected with 10.22.22.3 port 5001
[  6] local 10.22.22.1 port 41708 connected with 10.22.22.3 port 5001
[  8] local 10.22.22.1 port 41732 connected with 10.22.22.3 port 5001
[  9] local 10.22.22.1 port 41746 connected with 10.22.22.3 port 5001
[  7] local 10.22.22.1 port 41724 connected with 10.22.22.3 port 5001
[ 13] local 10.22.22.1 port 41748 connected with 10.22.22.3 port 5001
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  1.20 GBytes  1.03 Gbits/sec
[  3]  0.0-10.0 sec   454 MBytes   381 Mbits/sec
[  4]  0.0-10.0 sec  1.12 GBytes   962 Mbits/sec
[  8]  0.0-10.0 sec   910 MBytes   764 Mbits/sec
[  7]  0.0-10.0 sec  1.03 GBytes   881 Mbits/sec
[ 13]  0.0-10.0 sec   610 MBytes   512 Mbits/sec
[  6]  0.0-10.0 sec   974 MBytes   816 Mbits/sec
[  9]  0.0-10.0 sec  2.87 GBytes  2.46 Gbits/sec
[SUM]  0.0-10.0 sec  9.10 GBytes  7.80 Gbits/sec
 
OK, thanks for testing. I think I'll disable retbleed for now, until a proper software mitigation is available (or until I have Zen 4 EPYC).
 
I apologize for bringing an old thread back to life, but this seems relevant to the issue I'm seeing. I am using EVPN between multiple VMs, both on the same host and across hosts. Communication between VMs on the same host sees about a 20-25% throughput loss in comparison to host-to-host communication (iperf3 between 2 hosts is CPU limited to ~25Gbit/s, and iperf3 between 2 VMs on the same host appears to be CPU limited to ~18-20Gbit/s). However, when I communicate between two VMs across hosts, the maximum throughput I see (both with iperf3 and real world) is 1.3-1.5Gbit/s.

Both hosts are Intel systems utilizing Mellanox CX4 100Gbit/s NICs for all intra-grid communications. The grid members are connected to a Mellanox SN2700 100Gbit/s switch running SONIC. These 2 systems also contain 2 4-port 10Gbit/s Intel NICs that are used to connect to upstream providers. As a result, I've enabled IOMMU and IOMMU pass-through.

Was a solution ever found for the extremely significant bandwidth issue with EVPN between hosts?
 
Well, you have the vxlan overhead when you are going between different hosts (when VMs are on the same host, they are not going through vxlan).

But 1.5 Gbit/s... that seems really, really low. (I'm using an SN2700 with Cumulus and ConnectX-4 or -5, and I don't have any particular problem with bandwidth.)

The vxlan overhead is 50 bytes for each frame.

Are you sure you don't have an MTU problem? You haven't enabled the IPsec encryption from the documentation?
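
The 50-byte figure gives a quick way to sanity-check MTU settings: the overlay MTU should be at most the underlay MTU minus 50 (for an IPv4 underlay). A small sketch; `overlay_mtu` is a hypothetical helper name used only for illustration.

```shell
# VXLAN over IPv4 adds 50 bytes per frame:
# inner Ethernet header (14) + VXLAN header (8) + UDP (8) + outer IPv4 (20).
VXLAN_OVERHEAD=50

# Hypothetical helper: largest MTU usable inside the overlay for a given underlay MTU.
overlay_mtu() {
    echo $(( $1 - VXLAN_OVERHEAD ))
}

overlay_mtu 9000   # prints 8950
overlay_mtu 1500   # prints 1450
```

Anything inside the overlay configured above that value would need fragmentation or get dropped on the underlay.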
 
The vmbr interfaces and switch interfaces are 9100. The zone and EVPN are 9050. I had gone from having the VM interfaces match the EVPN MTU to making them 8900 and there has been no improvement. IPsec encryption is not enabled.

On the same host they SHOULD be going through the EVPN as that is their only communications interface. The traffic just bounces from virtual interface to virtual interface through the EVPN on host.

UPDATE: IPsec entries existed in the ipsec.conf file, but I didn't believe IPsec to be active. As I had run out of ideas, I removed the configs altogether and throughput increased a bit more than tenfold. That leads me to believe that IPsec was actually functioning without my knowledge AND the encryption was not being HW accelerated for some reason. I'll have to look into why it wasn't being HW accelerated on Xeon v3 processors. The algorithm was AES. While the ~40% loss of throughput now is a massive improvement, it is still a surprising penalty.
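
For anyone wanting to rule this out quickly: the kernel's installed IPsec security associations can be listed with `ip xfrm state`, regardless of what the config files suggest. A minimal sketch, assuming kernel (XFRM) IPsec is what is in use; `count_xfrm_sas` is a hypothetical helper name.

```shell
# Hypothetical helper: count IPsec security associations in `ip xfrm state` output (stdin).
# Each SA starts with an unindented "src <addr> dst <addr>" line.
count_xfrm_sas() {
    grep -c '^src ' || true
}

# Usage on a host (needs root):
#   ip xfrm state | count_xfrm_sas
```

A non-zero count would suggest that traffic between those addresses is being encrypted by the kernel.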
 
Not a problem, but you can also use 9050 for the VM interface MTU in the VM OS.

Sorry, I meant to say that they are not going out through the vxlan interface (the data plane).
EVPN is the control plane: it learns MAC/IP addresses and exchanges them between hosts.
(But two VMs on the same host communicate directly if they are on the same vnet, and are routed by the host itself if they are on two different vnets.)


OK great :) Anyway, if you are on your own network, I don't think you need crypto?
IPsec is really CPU limited (it is not balanced well across cores).
If encryption is really needed, something like WireGuard could be better. (I'm planning to implement it for Proxmox 8.)