Failed to enable SR-IOV for Mellanox CX-4

kakan

Aug 12, 2023
Enabling SR-IOV for a CX-4 fails on PVE 8; dmesg shows the errors below. It works on PVE 7.4, so it may be caused by a change in mlx5_core. This link may be useful: https://git.kernel.org/pub/scm/linu...c?id=6496357aa5f710eec96f91345b9da1b37c3231f6

[ 364.522176] mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 3317): QUERY_HCA_CAP(0x100) op_mod(0x40) failed, status bad parameter(0x3), syndrome (0x5add95), err(-22)
[ 364.559520] mlx5_core 0000:01:00.0: mlx5_device_enable_sriov:82:(pid 3317): failed to enable eswitch SRIOV (-22)
[ 364.559534] mlx5_core 0000:01:00.0: mlx5_sriov_enable:168:(pid 3317): mlx5_device_enable_sriov failed : -22
 
Hello All,

I have the same problem on the latest version of Proxmox 8 with a Mellanox card using mlx5.

pveversion
pve-manager/8.0.3/bbf3993334bfa916 (running kernel: 6.2.16-3-pve)

Driver version:
driver: mlx5_core
version: 6.2.16-3-pve
firmware-version: 16.27.1016 (MT_0000000167)

When I tried to write the number of VFs to the "sriov_numvfs" file, I received the same error as above:
echo 8 > /sys/class/net/ens5f1np1/device/sriov_numvfs
-bash: echo: write error: Invalid argument

mlx5_core 0000:3b:00.1: mlx5_cmd_out_err:779:(pid 8554): QUERY_HCA_CAP(0x100) op_mod(0x40) failed, status bad parameter(0x3), syndrome (0x5add>
kernel: mlx5_core 0000:3b:00.1: mlx5_device_enable_sriov:82:(pid 8554): failed to enable eswitch SRIOV (-22)
kernel: mlx5_core 0000:3b:00.1: mlx5_sriov_enable:168:(pid 8554): mlx5_device_enable_sriov failed : -22
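
Before blaming the driver, it can help to rule out the two generic reasons the PCI core itself rejects a write to `sriov_numvfs`: asking for more VFs than `sriov_totalvfs` advertises, or changing a non-zero VF count without resetting it to 0 first. The sketch below is only an illustration; the function name `set_numvfs` is hypothetical, and it takes the PF's sysfs device directory as an argument.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): set_numvfs DEVICE_DIR COUNT
# DEVICE_DIR is the PF's sysfs device directory, for example
# /sys/class/net/ens5f1np1/device
set_numvfs() {
    dev_dir=$1
    want=$2

    # sriov_totalvfs is the maximum number of VFs the PF advertises.
    if [ ! -r "$dev_dir/sriov_totalvfs" ]; then
        echo "no sriov_totalvfs in $dev_dir (SR-IOV not exposed)" >&2
        return 1
    fi
    total=$(cat "$dev_dir/sriov_totalvfs")

    if [ "$want" -gt "$total" ]; then
        echo "requested $want VFs, device supports at most $total" >&2
        return 1
    fi

    # The PCI core refuses to go from one non-zero VF count to another,
    # so reset to 0 before writing the new value.
    current=$(cat "$dev_dir/sriov_numvfs")
    if [ "$current" -ne 0 ] && [ "$current" -ne "$want" ]; then
        echo 0 > "$dev_dir/sriov_numvfs" || return 1
    fi

    echo "$want" > "$dev_dir/sriov_numvfs"
}
```

Usage would be `set_numvfs /sys/class/net/ens5f1np1/device 8` as root. Note that `Invalid argument` is EINVAL (-22), the same code mlx5_core logs after QUERY_HCA_CAP fails, so in this thread the rejection most likely comes from the driver and this check would only confirm that the request itself is sane.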

I checked the link above about replacing lines in the "eswitch.c" file, but that file does not exist under:
/usr/lib/modules/6.2.16-3-pve/kernel/drivers/net/ethernet/mellanox/mlx5/core/


Has anyone managed to solve this issue?

Many Thanks.
 
I have this issue as well. I remember doing this in PVE 7.x, but when I got it working months ago on a test computer with PVE 8 it was buggy and crashy, and I had other priorities, so I moved on. I do have the correct options set with mstconfig.
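
For anyone wanting to double-check the firmware side, a rough sketch follows. It assumes the relevant options are `SRIOV_EN` and `NUM_OF_VFS`, and that `mstconfig query` prints one parameter per line as a name followed by its value; both are assumptions, since the post does not say which options were set. The helper name `fw_param` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `mstconfig ... query` output on stdin and
# print the value column for one parameter name.
# Exits non-zero if the parameter is not listed.
fw_param() {
    awk -v key="$1" '$1 == key { print $2; found = 1 } END { exit !found }'
}
```

Usage would look like `mstconfig -d 01:00.0 query | fw_param NUM_OF_VFS`, where `01:00.0` is an example PCI address. Firmware settings changed with mstconfig only take effect after a reboot, so a value shown here may not be the one the card is currently running with.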

I just tried again and got the same error, `write error: Invalid argument`.
When I checked dmesg immediately afterwards, it showed:
Code:
[< FIRST EVENT>] mlx5_core 0000:01:00.1: E-Switch: Enable: mode(LEGACY), nvfs(8), necvfs(0), active vports(9)
[<    0.108920>] pci 0000:01:01.2: [15b3:1016] type 7f class 0xffffff
[<    0.000003>] pci 0000:01:01.2: unknown header type 7f, ignoring device
[<    1.008057>] mlx5_core 0000:01:00.1: mlx5_sriov_enable:195:(pid 4212): pci_enable_sriov failed : -5
[<    0.001269>] mlx5_core 0000:01:00.1: E-Switch: Unload vfs: mode(LEGACY), nvfs(8), necvfs(0), active vports(9)
[<    0.018664>] mlx5_core 0000:01:00.1: E-Switch: Disable: mode(LEGACY), nvfs(8), necvfs(0), active vports(1)
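
In the log above, the VF at 0000:01:01.2 reports header type 7f and class 0xffffff, which is what an all-ones config space read looks like, and pci_enable_sriov then fails with -5 (EIO). To see whether any VFs are still registered after such a failure, the `virtfnN` symlinks in the PF's sysfs device directory can be listed. This is a minimal sketch; the function name `list_vfs` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list_vfs DEVICE_DIR
# Prints one line per VF: the virtfnN link name and the VF's PCI address.
list_vfs() {
    dev_dir=$1
    for link in "$dev_dir"/virtfn*; do
        # Skip the unexpanded glob when no VF exists.
        [ -L "$link" ] || continue
        # Each virtfnN symlink points at the VF's PCI device directory.
        printf '%s %s\n' "${link##*/}" "$(basename "$(readlink "$link")")"
    done
}
```

For example, `list_vfs /sys/bus/pci/devices/0000:01:00.1` prints nothing if the driver tore all VFs down again, as the "Unload vfs" and "Disable" lines suggest.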

My next steps are:
  1. See if it works when applied at boot time. I know my i915 SR-IOV is not assignable via sysfs. However, mlx5_core doesn't expose a module parameter to do this during init, so I'd still be relying on sysfs, and I'm not sure whether sysfsctl is treated any differently from a user running echo/tee against the driver file.
  2. I see an option in mstconfig to force the card to be a "pci eth" device. I will try that, but I have a feeling that it'll remove the IB/RDMA functionality. I don't know enough to judge what that would mean; I just want SR-IOV to work.
  3. Try to force the card to use mlx4_core, which DOES have a module/kernel parameter for NUM_VFs.
  4. Maybe I'm treating SR-IOV in a very legacy way, and the correct way to do it is through the actual network stack, where the VF is handled as a net device more than as a PCIe device.

I am writing all of this out because I will probably never come back to this post, and I thought someone might be able to solve their problem with some of the steps I plan to try.