Mellanox ConnectX-4 LX and bridge-vlan-aware on Proxmox 8.0.1

joxx75

Hi everyone! This is my first post here so bear with me. I hope I don't break too many rules but I didn't find any guidelines.

I've been looking for a way to get bridge-vlan-aware working on a Mellanox ConnectX-4 LX (MCX4121A-ACAT, firmware 14.32.1010) under Proxmox 8.0.1 (kernel 6.2.16-4-pve) with the inbox driver. With bridge-vlan-aware enabled I didn't receive any network traffic. I didn't test the Mellanox OFED driver, but it doesn't support this kernel yet anyway. With vlan-aware enabled, outgoing traffic was sent correctly and reached my switch, but nothing came back unless the interface was set to promiscuous mode. That led me to dig through the Mellanox card's configuration to see whether any of its many options was filtering the incoming traffic. I tried toggling some possibly relevant options via ethtool and mlxconfig, with no success.
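
For reference, the kind of checks I mean looked roughly like this. This is only a sketch; the exact options differ per card and firmware, and none of the ones I toggled changed anything:
Code:
# current offload settings on the port (rx-vlan-filter was my main suspect)
ethtool -k enp1s0f0np0
# driver and firmware actually in use
ethtool -i enp1s0f0np0
# firmware-level configuration of the card (needs the mft tools installed)
mlxconfig -d 01:00.0 query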

Reading the mlx5 driver documentation I came across the bridge offload section, which mentions the eswitch and switchdev mode: https://docs.kernel.org/next/networking/device_drivers/ethernet/mellanox/mlx5.html#bridge-offload

Checking my system I saw that the eswitch on my Mellanox card was set to legacy mode. When I switched it to switchdev, traffic started to flow.
Code:
# devlink dev eswitch show pci/0000:01:00.0
pci/0000:01:00.0: mode legacy inline-mode none encap-mode basic
# devlink dev eswitch set pci/0000:01:00.0 mode switchdev
# devlink dev eswitch show pci/0000:01:00.0
pci/0000:01:00.0: mode switchdev inline-mode link encap-mode basic

I don't know why this makes a difference; it's too advanced for me.

Looking for a way to enable this at boot time, I came across this Intel document (I don't use SR-IOV VFs, so that part isn't relevant to me):
https://edc.intel.com/content/www/u...itchdev-mode-with-linux-bridge-configuration/

They only enable it after the bridge has been created and the physical ports have been added, which I guess corresponds to after the bridge interface has been brought up. I tried doing it earlier than that with no success.

I currently have things working at boot with this config in /etc/network/interfaces. I verified it on a second host with an identical card that I hadn't messed around with during testing, so I don't think I've changed anything else.

Code:
auto vmbr0
iface vmbr0 inet manual
        bridge-ports enp1s0f0np0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 10-520
        # 0000:01:00.0 corresponds to enp1s0f0np0
        post-up devlink dev eswitch set pci/0000:01:00.0 mode switchdev

auto vlan20
iface vlan20 inet static
        address 192.168.20.3/24
        vlan-raw-device vmbr0
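
After a reboot you can sanity-check that the post-up hook took effect and that the bridge carries the expected VLANs:
Code:
# eswitch should now report switchdev mode
devlink dev eswitch show pci/0000:01:00.0
# list the VLANs allowed on the bridge and its ports
bridge vlan show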

I needed vlan-aware because in my homelab I want to try running Gluster/CTDB/Samba directly on the hosts, and my VMs need to be able to reach the CTDB public IPs on the same VLAN. Without bridge-vlan-aware enabled I couldn't have any VM with a NIC attached to the same VLAN (vmbr0, VLAN 20) that CTDB/Samba on the host used; it broke the networking.
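
For context, a VM NIC on that same VLAN is just the usual tagged entry in the VM config, something like this (the MAC is a placeholder):
Code:
# /etc/pve/qemu-server/<vmid>.conf -- tag matches the host's VLAN 20
net0: virtio=BC:24:11:00:00:01,bridge=vmbr0,tag=20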

I hope this helps someone else stuck in the same situation, or that someone has a better fix for this problem than a post-up command.
 
@joxx75 I've just updated the FW (the latest is from 2020) on my Mellanox ConnectX-4 416-BCAT and now the VLAN-aware bridge is working as it should (traffic is passing).
We have cards from different families, and I already use the newest available firmware for my card. I haven't looked into this further or tried newer software yet to see whether the problem with legacy mode has been resolved.
 
We have cards from different families, and I already use the newest available firmware for my card. I haven't looked into this further or tried newer software yet to see whether the problem with legacy mode has been resolved.
By different families do you mean ConnectX-4 LX and ConnectX-4 EN?
 
Thank you for this post! Switching to "switchdev" worked. I've got a few varieties of Mellanox ConnectX-4/ConnectX-4 LX cards and all of them have this issue. I'm on the latest firmware on all of my cards and on Proxmox 8.
 
Thanks for this working fix :)

I've been using my ConnectX-4 LX for a while, just never with VLANs.

And to my dismay I couldn't get it to work :(
 
It seems like this is also the 'fix' to my issue.

My configuration includes a bridge with a 'stacked' VLAN interface on top for cluster communication. I was unable to get any traffic flowing over this VLAN device unless a VM was running on the 'lower' bridge, or I ran tcpdump on the bridge / manually set it to promiscuous mode.
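
(Toggling promiscuous mode on the bridge by hand is enough to confirm the symptom, e.g.:)
Bash:
# traffic over the stacked VLAN interface starts flowing...
ip link set vmbr0 promisc on
# ...and stops again once promiscuous mode is turned off
ip link set vmbr0 promisc off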

Configuring the physical interface directly, without any VLAN or bridge in between, works without any modification.

I am using a Mellanox ConnectX-5 (MCX512A-ACAT-ML). My configuration is something like this:

Code:
[...]
auto enp1s0f0np0
iface enp1s0f0np0 inet manual
# MCX512A-ACAT-ML P1

auto enp1s0f1np1
iface enp1s0f1np1 inet manual
# MCX512A-ACAT-ML P2

auto vmbr0
iface vmbr0 inet manual
    bridge-ports enp1s0f0np0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 33,34,35
    post-up devlink dev eswitch set pci/0000:01:00.0 mode switchdev

auto vmbr0.33
iface vmbr0.33 inet static
    address 10.33.100.10/24
#CLUSTER
[...]

Bash:
mstflint -d 01:00.0 q
Image type:            FS4
FW Version:            16.35.3502
FW Release Date:       27.12.2023
Product Version:       16.35.3502

Thanks @joxx75 !

If someone can help explain 'why' this works and 'whether' it is the correct way, I'd be pleased to learn. So far the internet doesn't have much information on this issue beyond 'install some proprietary NVIDIA driver, update your card's FW, it might fix it' [...].
 
Thanks @joxx75!

Also needed this fix in my 8.1.4 setup comprising MCX512A-ACAT (FW: 16.35.3006) and MCX4121A-XCAT (FW: 14.32.1010) cards.
 
To bring this thread back to life: I'm not sure anyone has really solved this issue.

Setting the e-switch to switchdev carries some implications, not all of which are clear to me. At a bare minimum, it means VFs can no longer be VLAN-filtered. Mellanox mentions VGT vs. VST mode, and that VST mode would solve this, but I can't find any further discussion of this mode or any instructions on how to set it. Switchdev also seems to create several vport representor interfaces, which seems undesirable; they are tied to the root PCI bus device, not exposed like additional VFs/PFs.
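
(Those extra ports are visible via devlink, roughly like this:)
Bash:
# in switchdev mode, representor ports appear next to the physical ports
devlink port show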

There is a second solution: if the interface stays in legacy e-switch mode but is put into promiscuous mode, it also works. Running in promiscuous mode certainly has implications that aren't great either, though.
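
(If you prefer that route, something along these lines in /etc/network/interfaces should persist it. Only a sketch, reusing the interface names from earlier posts; some posts above set promiscuous mode on the bridge itself instead, so adjust to whatever works in your setup:)
Code:
auto vmbr0
iface vmbr0 inet manual
        bridge-ports enp1s0f0np0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 10-520
        # stay in legacy eswitch mode, but force promiscuous mode at boot
        post-up ip link set enp1s0f0np0 promisc on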

To me this *looks* like a kernel bug introduced somewhere around 6.1 (6.5?). There was a commit to mlx5_core that was said to make promiscuous mode handling more efficient. It seems that patch may have made rx-vlan-filter off only take effect while you are also in promiscuous mode, instead of working independently.
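
(That interaction should be easy to probe on an affected kernel; a sketch, with the port name taken from earlier posts:)
Bash:
# current state of the hardware VLAN filter on the uplink port
ethtool -k enp1s0f0np0 | grep rx-vlan-filter
# turn it off, leave promiscuous mode off, then check whether tagged traffic reaches the bridge
ethtool -K enp1s0f0np0 rx-vlan-filter off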

Has anyone had success with this on 6.5+ kernels, or does anyone know the full implications and proper configuration for running the eswitch in switchdev mode?
 
To check whether this really is an mlx5 driver bug in the kernel, I just tried the mlnx_en Debian 12.1 driver from Mellanox: mlnx-en-24.01-0.3.3.1-debian12.1-x86_64

With no configuration changes, everything behaves normally in the default legacy eswitch mode: VLAN traffic passes without promiscuous mode.
 
To check whether this really is an mlx5 driver bug in the kernel, I just tried the mlnx_en Debian 12.1 driver from Mellanox: mlnx-en-24.01-0.3.3.1-debian12.1-x86_64

With no configuration changes, everything behaves normally in the default legacy eswitch mode: VLAN traffic passes without promiscuous mode.
Is there a solution for Proxmox 8.2? Since it's based on Debian 12.5, I assume the drivers aren't built for it yet?
 
I do not fully understand all the implications of running in switchdev mode, but I can tell that it significantly increased maximum network throughput (from 12 Gbps to 23) in my setup (A --> OpenWrt VM with one Mellanox port passed through via PCIe --> vlan-aware Linux bridge --> other Mellanox port --> B).

Weirdly, traffic flowing in the other direction was able to saturate the full link in either mode (and it remains the faster direction even with switchdev: it can fully saturate the link with a single iperf3 stream, whereas a single stream in the direction shown above maxes out at 14 Gbps).
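
(Plain iperf3 is enough to reproduce this kind of measurement, roughly like this, with A and B as in the setup above:)
Bash:
# on B: run the server
iperf3 -s
# on A: single stream in the A -> B direction
iperf3 -c B
# same client, reverse direction (B -> A)
iperf3 -c B -R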
 
Another issue is that I am getting a bunch of these errors on my MCX512A-ACAT-ML cards in switchdev mode, which seem (?) to have no further effect but spam the system log.

Bash:
Jul 21 19:25:06 node01 kernel: mlx5_core 0000:81:00.0 enp129s0f0np0: failed (err=-22) to add object (id=2)
Jul 21 19:25:06 node01 kernel: mlx5_core 0000:81:00.0 enp129s0f0np0: failed (err=-22) to add object (id=2)
Jul 21 19:25:06 node01 kernel: mlx5_core 0000:81:00.0 enp129s0f0np0: failed (err=-22) to add object (id=2)
Jul 21 19:25:12 node01 kernel: mlx5_core 0000:81:00.0 enp129s0f0np0: failed (err=-22) to add object (id=2)
Jul 21 19:26:11 node01 kernel: mlx5_core 0000:81:00.0 enp129s0f0np0: failed (err=-22) to add object (id=2)
Jul 21 19:26:11 node01 kernel: mlx5_core 0000:81:00.0 enp129s0f0np0: failed (err=-22) to add object (id=2)
Jul 21 19:26:11 node01 kernel: mlx5_core 0000:81:00.0 enp129s0f0np0: failed (err=-22) to add object (id=2)

Anyone else getting these errors?
 
To check whether this really is an mlx5 driver bug in the kernel, I just tried the mlnx_en Debian 12.1 driver from Mellanox: mlnx-en-24.01-0.3.3.1-debian12.1-x86_64

With no configuration changes, everything behaves normally in the default legacy eswitch mode: VLAN traffic passes without promiscuous mode.
I also tested on pve-manager/8.2.4/faa83925c9641325 (running kernel 6.8.8-3-pve) using the official NVIDIA EN Driver for Linux, which technically is not officially supported on Debian > 12.1. It installs successfully.


The NVIDIA EN Driver for Linux can be installed like this:

Bash:
# check sha256sum
sha256sum mlnx-en-24.04-0.7.0.0-debian12.1-x86_64.tgz

# extract and switch to directory
tar xzf mlnx-en-24.04-0.7.0.0-debian12.1-x86_64.tgz
cd mlnx-en-24.04-0.7.0.0-debian12.1-x86_64

# install without distro check (WARNING)
./install --skip-distro-check

# restart card with new NVIDIA EN Driver (WARNING Connection will be lost!)
/etc/init.d/mlnx-en.d restart

# after waiting about 2 minutes you should see the new driver in use
root@node03 ~ # ethtool -i <enXXXXXXXXXXXX>
driver: mlx5_core
version: 24.04-0.7.0
firmware-version: 16.35.4030 (MT_0000000080)
expansion-rom-version:
bus-info: 0000:85:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

# Tested after updating the card to FW 16.35.4030

After switching to this driver, I could remove the switchdev workaround from before, and VLAN traffic was still passed over the VM bridge without any other tweaks (!).

Bash:
[...]
auto enp133s0f0np0
iface enp133s0f0np0 inet manual
#MCX512A-ACAT-ML P1

auto vmbr0
iface vmbr0 inet manual
    bridge-ports enp133s0f0np0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
#    post-up devlink dev eswitch set pci/0000:85:00.0 mode switchdev
[...]

Yet, likely because this driver is not officially supported on Debian > 12.1 (or the respective kernel version), I am getting loads of kernel errors (which is to be expected) as well as dropped packets / bad performance. So we must either wait for upstream fixes or for official NVIDIA support.

Bash:
[  +0.000083]  k10temp ipmi_msghandler mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap nct6775_core hwmon_vid nfsd vfio_pci auth_rpcgss vfio_pci_core irqbypass nfs_acl vfio_iommu_type1 lockd grace vfio iommufd efi_pstore sunrpc dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 ib_uverbs macsec ib_core hid_generic usbmouse usbhid hid rndis_host cdc_ether usbnet mii crc32_pclmul nvme xhci_pci xhci_pci_renesas i2c_designware_pci i2c_ccgx_ucsi ahci tg3 libahci nvme_core xhci_hcd i2c_piix4 nvme_auth wmi [last unloaded: mlxfw]
[  +0.005662] CPU: 1 PID: 10170 Comm: kworker/1:3 Tainted: P    B   W  OE      6.8.8-3-pve #1
[  +0.000315] Hardware name: Supermicro Super Server/H12SSL-i, BIOS 2.8 02/27/2024
[  +0.000310] Workqueue: events mlx5e_rx_cache_reduce_work [mlx5_core]
[  +0.000381] Call Trace:
[  +0.000300]  <TASK>
[  +0.000296]  dump_stack_lvl+0x76/0xa0
[  +0.000303]  dump_stack+0x10/0x20
[  +0.000372]  bad_page+0x76/0x120
[  +0.000400]  free_page_is_bad_report+0x86/0xa0
[  +0.000353]  free_unref_page_prepare+0x279/0x3d0
[  +0.000427]  free_unref_page+0x34/0x140
[  +0.000391]  __folio_put+0x3c/0x90
[  +0.000316]  mlx5e_page_release_dynamic+0x17b/0x290 [mlx5_core]
[  +0.000465]  mlx5e_rx_cache_reduce_clean_pending+0x44/0x80 [mlx5_core]
[  +0.000397]  mlx5e_rx_cache_reduce_work+0x4f/0xa0 [mlx5_core]
[  +0.000386]  process_one_work+0x16d/0x350
[  +0.000339]  worker_thread+0x306/0x440
[  +0.000323]  ? __pfx_worker_thread+0x10/0x10
[  +0.000337]  kthread+0xf2/0x120
[  +0.000367]  ? __pfx_kthread+0x10/0x10
[  +0.000394]  ret_from_fork+0x47/0x70
[  +0.000415]  ? __pfx_kthread+0x10/0x10
[  +0.000306]  ret_from_fork_asm+0x1b/0x30
[  +0.000305]  </TASK>
[  +0.000335] BUG: Bad page state in process kworker/1:3  pfn:149070a
[  +0.000408] page:000000003b7b01cc refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x149070a
[  +0.000365] flags: 0x17ffffe0000000(node=0|zone=2|lastcpupid=0x3fffff)
[  +0.000385] page_type: 0xffffffff()
[  +0.000446] raw: 0017ffffe0000000 dead000000000040 ffff891a7d527800 0000000000000000
[  +0.000355] raw: 0000000000000000 0000000000000001 00000000ffffffff 0000000000000000
[  +0.000327] page dumped because: page_pool leak
[  +0.000397] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs netfs rbd libceph ebtable_filter ebtables ip6table_raw ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables iptable_raw ipt_REJECT nf_reject_ipv4 xt_mark xt_physdev xt_addrtype xt_comment xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_tcpudp iptable_filter ip_set_hash_net ip_set sctp ip6_udp_tunnel udp_tunnel nf_tables nvme_fabrics 8021q garp mrp bonding mlx5_core(OE) mlxfw(OE) psample mlxdevm(OE) mlx_compat(OE) tls pci_hyperv_intf softdog binfmt_misc nfnetlink_log nfnetlink zram ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel snd_hda_codec_hdmi sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd snd_hda_intel cryptd snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep acpi_ipmi snd_pcm snd_timer ucsi_ccg ipmi_si ast snd ipmi_devintf typec_ucsi joydev input_leds rapl wmi_bmof pcspkr typec i2c_algo_bit ccp soundcore ptdma
[  +0.000110]  k10temp ipmi_msghandler mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap nct6775_core hwmon_vid nfsd vfio_pci auth_rpcgss vfio_pci_core irqbypass nfs_acl vfio_iommu_type1 lockd grace vfio iommufd efi_pstore sunrpc dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 ib_uverbs macsec ib_core hid_generic usbmouse usbhid hid rndis_host cdc_ether usbnet mii crc32_pclmul nvme xhci_pci xhci_pci_renesas i2c_designware_pci i2c_ccgx_ucsi ahci tg3 libahci nvme_core xhci_hcd i2c_piix4 nvme_auth wmi [last unloaded: mlxfw]
[  +0.006046] CPU: 1 PID: 10170 Comm: kworker/1:3 Tainted: P    B   W  OE      6.8.8-3-pve #1
[  +0.000310] Hardware name: Supermicro Super Server/H12SSL-i, BIOS 2.8 02/27/2024
[  +0.000309] Workqueue: events mlx5e_rx_cache_reduce_work [mlx5_core]
[  +0.000378] Call Trace:
[  +0.000299]  <TASK>
[  +0.000322]  dump_stack_lvl+0x76/0xa0
[  +0.000300]  dump_stack+0x10/0x20
[  +0.000298]  bad_page+0x76/0x120
[  +0.000303]  free_page_is_bad_report+0x86/0xa0
[  +0.000304]  free_unref_page_prepare+0x279/0x3d0
[  +0.000305]  free_unref_page+0x34/0x140
[  +0.000303]  __folio_put+0x3c/0x90
[  +0.000364]  mlx5e_page_release_dynamic+0x17b/0x290 [mlx5_core]
[  +0.000466]  mlx5e_rx_cache_reduce_clean_pending+0x44/0x80 [mlx5_core]
[  +0.000441]  mlx5e_rx_cache_reduce_work+0x4f/0xa0 [mlx5_core]
[  +0.000508]  process_one_work+0x16d/0x350
[  +0.000416]  worker_thread+0x306/0x440
[  +0.000418]  ? __pfx_worker_thread+0x10/0x10
[  +0.000414]  kthread+0xf2/0x120
[  +0.000380]  ? __pfx_kthread+0x10/0x10
[  +0.000387]  ret_from_fork+0x47/0x70
[  +0.000415]  ? __pfx_kthread+0x10/0x10
[  +0.000419]  ret_from_fork_asm+0x1b/0x30
[  +0.000385]  </TASK>
 
Appreciation to joxx75: I can confirm that switchdev mode works fine for me too; the bridge now works correctly and traffic flows fine to the VMs through a Mellanox ConnectX-4 LX on Proxmox 8.2.2.
This topic saved me a lot of time!
 
The built-in Linux driver should really get fixed, though; it's annoying to lose the network every time a kernel update arrives.
 
