Slow speeds using EVPN

Manih

Hello all,

I have been testing the SDN network feature and successfully set up an EVPN zone using the steps outlined in the documentation: https://pve.proxmox.com/pve-docs/chapter-pvesdn.html#pvesdn_zone_plugin_evpn
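(As a side note for anyone reproducing this: assuming the standard FRR-based SDN controller, the EVPN control plane itself can be sanity-checked by querying FRR on the hosts, for example:)

Code:
# show the VNIs known to zebra and the EVPN routes/peers learned via BGP
vtysh -c 'show evpn vni'
vtysh -c 'show bgp l2vpn evpn summary'
vtysh -c 'show bgp l2vpn evpn'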

Everything seems to be working, but the speed I am getting between hosts is slower than I expected. I am using bonded 25G Mellanox adapters that should support VXLAN offloading, so I would expect speeds similar to what I get with plain VLANs.

Does anyone have experience with this? Are these speeds normal, or is there something I am missing?


Here is the speed I am getting between two VMs with a single thread, using only VLAN encapsulation:
Code:
# iperf -c 172.24.210.239
------------------------------------------------------------
Client connecting to 172.24.210.239, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  1] local 172.24.210.154 port 49416 connected with 172.24.210.239 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0167 sec  8.03 GBytes  6.89 Gbits/sec

Between the same two VMs, this time over the EVPN/VXLAN vnet, I am getting a bit less:
Code:
~# iperf -c 10.22.22.1
------------------------------------------------------------
Client connecting to 10.22.22.1, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  1] local 10.22.22.3 port 36520 connected with 10.22.22.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0251 sec  4.81 GBytes  4.12 Gbits/sec

As far as I can tell, all offloading options that might matter for VXLAN are enabled:

Code:
~# ethtool -k bond0
Features for bond0:
rx-checksumming: off [fixed]
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [requested on]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp-mangleid-segmentation: on
tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: on [fixed]
netns-local: on [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [requested on]
tx-esp-segmentation: off
tx-udp-segmentation: on
tx-gso-list: off [requested on]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: on [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off
esp-tx-csum-hw-offload: off
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
rx-udp-gro-forwarding: off
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]
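(The flags that matter most for VXLAN can be filtered out of that output with something like the following; the tx-udp_tnl-* bits are the UDP-tunnel/VXLAN segmentation offloads:)

Code:
# show only the TSO/GSO and UDP tunnel offload flags
ethtool -k bond0 | grep -E 'udp_tnl|tcp-segmentation|gso'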
 
I forgot to mention that I have adjusted the MTU on the host to 9000 and the VM interfaces to 1500.

The host network layout is:

Interfaces(ens10f0 and ens10f1)->bond0->vmbr0

Code:
2: ens10f0np0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
3: ens10f1np1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000

7: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue master vmbr0 state UP mode DEFAULT group default qlen 1000

10: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000

42: vxlan_testnet: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue master testnet state UNKNOWN mode DEFAULT group default qlen 1000

And on the VMs:

root@vxlantest:~# ip link show eth0
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
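For completeness, the bond/bridge part of /etc/network/interfaces looks roughly like this (the bond mode and address below are placeholders, not my exact values):

Code:
auto bond0
iface bond0 inet manual
        bond-slaves ens10f0np0 ens10f1np1
        bond-mode 802.3ad          # placeholder, adjust to your bond mode
        bond-miimon 100
        mtu 9000

auto vmbr0
iface vmbr0 inet static
        address 192.0.2.10/24      # placeholder address
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        mtu 9000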
 
Hi, I'm doing tests with a ConnectX-4, and I'm currently also getting around 4 Gbit/s from VM to VM.

I'm seeing one ksoftirqd process on the host spiking to 40% of one core. That seems curious.


Doing the same iperf2 test directly from the hypervisor, using -P2, I'm able to reach 20 Gbit/s without problem,
and with no ksoftirqd load at all.
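(To see where that CPU time goes while iperf runs, something like this works:)

Code:
# per-CPU softirq usage (%soft column); mpstat is in the sysstat package
mpstat -P ALL 1
# or watch the raw NET_RX/NET_TX softirq counters
watch -n1 "grep -E 'NET_RX|NET_TX' /proc/softirqs"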

For testing between the 2 hosts, I'm creating a direct VXLAN interface, without any bridge:

host1
Code:
ip link add name vxlan16 type vxlan id 16 dev bond0 remote <ipofhost2> dstport 4789
ifconfig vxlan16 192.168.60.10/24 up
ifconfig vxlan16

host2
Code:
ip link add name vxlan16 type vxlan id 16 dev bond0 remote <ipofhost1> dstport 4789
ifconfig vxlan16 192.168.60.11/24 up
ifconfig vxlan16

Then I run iperf between the 192.168.60.x addresses.


I'll try to do more tests
 
I have done the same test by creating a veth pair on the hypervisor, plugged into the bridge, with an IP address on the veth interface
(so no virtualisation involved).
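Roughly like this (interface names are just examples, use your own bridge and subnet):

Code:
# create a veth pair, plug one end into the bridge, put an IP on the other end
ip link add vethtest type veth peer name vethbr
ip link set vethbr master <vnet-bridge> up
ip link set vethtest up
ip addr add <ip-in-the-vnet-subnet>/24 dev vethtest
# then run iperf against the same subnet on the other host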

I have the same problem: around 4-5 Gbit/s with high ksoftirqd usage.

So it seems that offloading is not working when the bridge forwards packets to a logical interface.

I will dig into the kernel documentation.
 
Hi, that is interesting. I haven't tried testing through the bridge directly on the host.

I might try installing the newest drivers from Mellanox (mlnx-en), but I'm not sure which package to try since Proxmox uses a Debian userland with an Ubuntu-based kernel. Have you tried installing those?
 
I just tested with kernel 5.10; it seems to work fine.

Kernels 5.13 through 5.19 have bad performance.

I'm going to try 5.11, but it looks like a regression. (Not sure if it's from the Mellanox driver or something else in the kernel.)
 
Kernel 5.11 is slow.

So, something seems to have changed here. (I have seen some offloading changes in the mlx5 driver in 5.11, but they were reverted in recent 5.15/5.19.)

Could you try installing kernel 5.10 to compare on your side?
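(If you have an older pve-kernel package available and your proxmox-boot-tool version supports the kernel pin subcommand, something like this should work; the package and version names below are placeholders:)

Code:
apt install pve-kernel-<older-series>        # whichever older kernel series your repo provides
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin <abi-version>   # e.g. the full x.y.z-n-pve string from the list
reboot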
 
Thank you so much for testing this. I downgraded two hosts to 5.10 and there is definitely a difference.

I'm still only getting around 5 Gbit/s with a single thread, but multiple threads perform much better.

I'm also seeing better performance on the VLAN bridge.

Here is a single thread test on VXLAN:

Code:
------------------------------------------------------------
Client connecting to 10.22.22.3, TCP port 5001
TCP window size: 1.39 MByte (default)
------------------------------------------------------------
[  3] local 10.22.22.1 port 34964 connected with 10.22.22.3 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  5.67 GBytes  4.86 Gbits/sec

But using multiple threads I get a lot more speed; before, I was always capped at around 4 Gbit/s total:

Code:
------------------------------------------------------------
Client connecting to 10.22.22.3, TCP port 5001
TCP window size:  833 KByte (default)
------------------------------------------------------------
[  7] local 10.22.22.1 port 46146 connected with 10.22.22.3 port 5001
[  8] local 10.22.22.1 port 46162 connected with 10.22.22.3 port 5001
[  3] local 10.22.22.1 port 46134 connected with 10.22.22.3 port 5001
[  4] local 10.22.22.1 port 46114 connected with 10.22.22.3 port 5001
[  5] local 10.22.22.1 port 46116 connected with 10.22.22.3 port 5001
[  6] local 10.22.22.1 port 46132 connected with 10.22.22.3 port 5001
[  9] local 10.22.22.1 port 46172 connected with 10.22.22.3 port 5001
[ 10] local 10.22.22.1 port 46176 connected with 10.22.22.3 port 5001
[ ID] Interval       Transfer     Bandwidth
[  7]  0.0-10.0 sec  2.05 GBytes  1.76 Gbits/sec
[  6]  0.0-10.0 sec  2.21 GBytes  1.90 Gbits/sec
[  8]  0.0-10.0 sec  2.35 GBytes  2.02 Gbits/sec
[  3]  0.0-10.0 sec  2.11 GBytes  1.81 Gbits/sec
[  9]  0.0-10.0 sec  2.06 GBytes  1.76 Gbits/sec
[  5]  0.0-10.0 sec  2.36 GBytes  2.02 Gbits/sec
[ 10]  0.0-10.1 sec  1.95 GBytes  1.66 Gbits/sec
[  4]  0.0-10.1 sec  1.51 GBytes  1.29 Gbits/sec
[SUM]  0.0-10.1 sec  16.6 GBytes  14.1 Gbits/sec
 
OK, thanks.
(Note that with a single thread you could be CPU-limited on the iperf client side: just check whether the iperf process sits at 100% of one core. This could explain why you get more bandwidth with multiple threads.)
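(For example, while the single-stream test is running:)

Code:
# per-process CPU usage of the running iperf; pidstat is in the sysstat package
pidstat -u -p $(pgrep -n iperf) 1
# or run top and press '1' to see per-core usage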


I'll try to find servers with a non-Mellanox card to see whether it's a Mellanox driver bug or another kernel bug.
 
I just tested with a Broadcom bnx2x NIC on kernel 5.15: I don't see any problem, it works like 5.10 does with Mellanox.

So maybe it's coming from the mlx5 driver.

I'll try to contact the Mellanox devs or bisect the kernel on my side, but it will take some time.
 
@Manih

I forgot to ask you: do you use an Intel or AMD processor?

In my case it's an AMD processor, and it seems that disabling the IOMMU fixes it (at least on the bad 5.11 kernel):

amd_iommu=off in /etc/default/grub

(It's disabled by default on Intel; maybe that's why the Broadcom NIC was working, as that was an Intel server.)
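(For example, on a GRUB-booted host; hosts booted with systemd-boot take the change in /etc/kernel/cmdline followed by "proxmox-boot-tool refresh" instead:)

Code:
# /etc/default/grub - append amd_iommu=off to the existing default line
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=off"

# then regenerate the GRUB config and reboot
update-grub
reboot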
 
Hi, these are AMD Epyc processors.

I thought I would have to enable the IOMMU explicitly, but I see from dmesg that it seems to be enabled by default. I will disable it as you suggest and try the 5.15 kernel again.
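(A quick way to check is:)

Code:
# AMD-Vi messages indicate the IOMMU is active
dmesg | grep -iE 'iommu|amd-vi'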
 
With the IOMMU off, 5.15 still doesn't work quite as well as 5.10.

With just one thread I am back at around 4 Gbit/s:

Code:
root@vxlantest:~# iperf -c 10.22.22.1
------------------------------------------------------------
Client connecting to 10.22.22.1, TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  1] local 10.22.22.3 port 56888 connected with 10.22.22.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0441 sec  4.74 GBytes  4.05 Gbits/sec

But with multiple threads it is a bit faster than it was with the IOMMU on, though still a lot slower than kernel 5.10:

Code:
[  1] local 10.22.22.3 port 49672 connected with 10.22.22.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  7] 0.0000-10.1084 sec   831 MBytes   690 Mbits/sec
[  5] 0.0000-10.1280 sec   847 MBytes   702 Mbits/sec
[  6] 0.0000-10.1279 sec  1.02 GBytes   866 Mbits/sec
[  1] 0.0000-10.1382 sec  1.38 GBytes  1.17 Gbits/sec
[  4] 0.0000-10.1385 sec  1.22 GBytes  1.04 Gbits/sec
[  8] 0.0000-10.1582 sec   959 MBytes   792 Mbits/sec
[  2] 0.0000-10.1785 sec   806 MBytes   664 Mbits/sec
[  3] 0.0000-10.4103 sec   985 MBytes   793 Mbits/sec
[SUM] 0.0000-10.3749 sec  7.95 GBytes  6.58 Gbits/sec
 
Note that the latest 5.15 kernels have the retbleed mitigation enabled, and I know it can impact performance a lot.
(Maybe you can add "retbleed=off" in GRUB.)

Currently I'm running kernel 5.13 + amd_iommu=off; I'm at around 5.5 Gbit/s with 1 thread, and it scales fine with multiple threads.
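(On kernels that carry the retbleed patches you can check the current mitigation state and then disable it via the kernel command line:)

Code:
cat /sys/devices/system/cpu/vulnerabilities/retbleed
# to disable: add retbleed=off to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub,
# e.g. "quiet amd_iommu=off retbleed=off", then run update-grub and reboot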
 
Hi,

Retbleed definitely has an effect. With the IOMMU and retbleed mitigation off, I am getting around the same single-thread speed as on 5.10 and just a little bit less with multiple threads.

I guess we will just have to evaluate if the performance we are getting with retbleed mitigation on is sufficient.

Code:
------------------------------------------------------------
Client connecting to 10.22.22.3, TCP port 5001
TCP window size:  187 KByte (default)
------------------------------------------------------------
[  5] local 10.22.22.1 port 41682 connected with 10.22.22.3 port 5001
[  3] local 10.22.22.1 port 41672 connected with 10.22.22.3 port 5001
[  4] local 10.22.22.1 port 41696 connected with 10.22.22.3 port 5001
[  6] local 10.22.22.1 port 41708 connected with 10.22.22.3 port 5001
[  8] local 10.22.22.1 port 41732 connected with 10.22.22.3 port 5001
[  9] local 10.22.22.1 port 41746 connected with 10.22.22.3 port 5001
[  7] local 10.22.22.1 port 41724 connected with 10.22.22.3 port 5001
[ 13] local 10.22.22.1 port 41748 connected with 10.22.22.3 port 5001
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  1.20 GBytes  1.03 Gbits/sec
[  3]  0.0-10.0 sec   454 MBytes   381 Mbits/sec
[  4]  0.0-10.0 sec  1.12 GBytes   962 Mbits/sec
[  8]  0.0-10.0 sec   910 MBytes   764 Mbits/sec
[  7]  0.0-10.0 sec  1.03 GBytes   881 Mbits/sec
[ 13]  0.0-10.0 sec   610 MBytes   512 Mbits/sec
[  6]  0.0-10.0 sec   974 MBytes   816 Mbits/sec
[  9]  0.0-10.0 sec  2.87 GBytes  2.46 Gbits/sec
[SUM]  0.0-10.0 sec  9.10 GBytes  7.80 Gbits/sec
 
OK, thanks for the test. I think I'll disable retbleed for now, until a proper software mitigation is available (or once I have Zen 4 Epyc).
 
I apologize for bringing an old thread back to life, but this seems relevant to the issue I'm seeing. I am using an EVPN zone between multiple VMs, both on the same host and across hosts. Communications between VMs on the same host see about a 20-25% throughput loss compared to host-to-host communications (iperf3 between 2 hosts is CPU-limited to ~25 Gbit/s, and iperf3 between 2 VMs on the same host appears to be CPU-limited to ~18-20 Gbit/s). However, when I communicate between two VMs across hosts, the maximum throughput I see (both with iperf3 and in the real world) is 1.3-1.5 Gbit/s.

Both hosts are Intel systems utilizing Mellanox CX4 100 Gbit/s NICs for all intra-grid communications. The grid members are connected to a Mellanox SN2700 100 Gbit/s switch running SONiC. These 2 systems also contain two 4-port 10 Gbit/s Intel NICs that are used to connect to upstream providers. As a result, I've enabled IOMMU and IOMMU pass-through.

Was a solution ever found for the extremely significant bandwidth issue with EVPN between hosts?
 
Well, you have the VXLAN overhead when you are going between different hosts (when the VMs are on the same host, they don't go through VXLAN).

But 1.5 Gbit/s... that seems really, really low. (I'm using an SN2700 with Cumulus and ConnectX-4/5, and I don't have any particular bandwidth problems.)

The VXLAN overhead is 50 bytes for each frame.

Are you sure you don't have an MTU problem? You haven't enabled the IPsec encryption from the documentation, have you?
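(One quick way to rule out an MTU problem is to send maximum-size packets with the don't-fragment bit set; the ICMP payload size is the interface MTU minus 28 bytes of ICMP/IP headers:)

Code:
# e.g. for a 1500-byte MTU: 1500 - 28 = 1472 bytes of payload
ping -M do -s 1472 <remote VM IP on the vnet>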
 
The vmbr interfaces and switch interfaces are 9100. The zone and EVPN are 9050. I had gone from having the VM interfaces match the EVPN MTU to making them 8900 and there has been no improvement. IPsec encryption is not enabled.

On the same host they SHOULD be going through the EVPN, as that is their only communications interface. The traffic just bounces from virtual interface to virtual interface through the EVPN on the host.

UPDATE: IPsec entries existed in the ipsec.conf file, but I didn't believe IPsec to be active. As I had run out of ideas, I removed the configs altogether and throughput increased a bit more than tenfold. That leads me to believe that IPsec was actually functioning without my knowledge AND the encryption was not being HW accelerated for some reason. I'll have to look into why it wasn't being HW accelerated on Xeon v3 processors. The algorithm was AES. While the ~40% throughput loss I see now is a massive improvement, it is still a surprising penalty.
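(For anyone checking the same thing: Xeon v3 does support AES-NI, so it is worth verifying that the CPU flag is exposed, the kernel module is loaded, and which cipher the active IPsec SAs actually negotiated, e.g.:)

Code:
grep -m1 -o aes /proc/cpuinfo   # prints "aes" if AES-NI is exposed to the OS
lsmod | grep -i aes             # aesni_intel should be loaded
ip xfrm state                   # shows the active IPsec SAs and their ciphers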
 
The vmbr interfaces and switch interfaces are 9100. The zone and EVPN are 9050. I had gone from having the VM interfaces match the EVPN MTU to making them 8900 and there has been no improvement. IPsec encryption is not enabled.
Not a problem, but you can also use 9050 for the VM interface MTU in the VM OS.

On the same host they SHOULD be going through the EVPN, as that is their only communications interface. The traffic just bounces from virtual interface to virtual interface through the EVPN on the host.
Sorry, I meant to say: they are not going out through the VXLAN interface (the dataplane).
EVPN is the control plane, learning MAC/IP addresses and exchanging them between hosts.
(But 2 VMs on the same host and the same vnet communicate directly, and between 2 different vnets the traffic is routed by the host itself.)


UPDATE: IPsec entries existed in the ipsec.conf file, but I didn't believe IPsec to be active. As I had run out of ideas, I removed the configs altogether and throughput increased a bit more than tenfold. That leads me to believe that IPsec was actually functioning without my knowledge AND the encryption was not being HW accelerated for some reason. I'll have to look into why it wasn't being HW accelerated on Xeon v3 processors. The algorithm was AES. While the ~40% throughput loss I see now is a massive improvement, it is still a surprising penalty.
OK, great :) Anyway, if you are on your own network, I don't think you need crypto?
IPsec is really CPU-limited (not very well balanced between cores).
If encryption is really needed, something like WireGuard could be better. (I'm planning to implement it for Proxmox 8.)
 
