[TUTORIAL] Proxmox 8 Mellanox Infiniband and SR-IOV

Apr 20, 2024
Here is how I was able to get Proxmox working with InfiniBand and SR-IOV.

The hardware used is a Mellanox switch (SX6036) and a Mellanox Cx-4 100 Gbps EDR dual-port (or single-port) card. Make sure the firmware is up to date.

As far as I'm aware, this will not work with OpenSM, and you must have a Mellanox switch for SR-IOV. On your switch, enable the SM and virtualization, then restart the SM service by toggling it off and on, or by rebooting the switch.
- ib sm enable
- ib sm virt enable
- configuration write

Once that is configured, you need to set up IOMMU and enable SR-IOV on the hardware. This process varies by hardware, so it is not covered in this tutorial. The Proxmox IOMMU config can be done by following this:
https://pve.proxmox.com/wiki/PCI(e)_Passthrough

Once that is completed, install the following packages:
apt install -y infiniband-diags ibutils rdma-core rdmacm-utils mstflint
Check for link and nodes, and run diagnostics:
  • ibstat - make a note of the HCA name here
  • ibnodes
  • ibdiagnet

Identify the Mellanox card's PCI bus ID and query it:
lspci | grep -i mellanox
mstflint -d <bus id here> q

Enable SR-IOV and 4 VFs (or however many you want). The firmware change takes effect after a reboot:
mstconfig -d <bus id here> set SRIOV_EN=1 NUM_OF_VFS=4
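
To confirm the firmware setting matches what was requested, the query output can be checked with a small filter. This is a rough sketch: `sriov_matches` is a made-up helper name, and the sample text is illustrative rather than captured from real hardware. It assumes the query prints one `NAME VALUE` pair per line, with a value such as `True(1)` for an enabled flag.

```shell
#!/bin/bash
# sriov_matches WANT_VFS (hypothetical helper)
# Reads mstconfig query output on stdin and succeeds only if SRIOV_EN is
# true and NUM_OF_VFS equals WANT_VFS.
sriov_matches() {
  local want_vfs=$1
  awk -v want="$want_vfs" '
    $1 == "SRIOV_EN"   && tolower($2) ~ /^true/ { en = 1 }
    $1 == "NUM_OF_VFS" && $2 == want            { vfs = 1 }
    END { exit !(en && vfs) }
  '
}

# Illustrative sample in the assumed output format (not real hardware output)
sample='         NUM_OF_VFS                          4
         SRIOV_EN                            True(1)'

if printf '%s\n' "$sample" | sriov_matches 4; then
  echo "firmware already set: 4 VFs"
else
  echo "firmware differs: run mstconfig set, then reboot"
fi
```

On a real host the same check would read from the card instead of the sample, e.g. `mstconfig -d <bus id here> q | sriov_matches 4`.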

Next steps are courtesy of Jose-d:
vim /etc/systemd/system/mellanox_initvf.service

Paste in the following, making sure to update the HCA:
Code:
[Unit]
After=network.target

[Service]
Type=oneshot
# note: change according to your hardware:
ExecStart=/bin/bash -c "/usr/bin/echo 4 > /sys/class/infiniband/<HCA HERE>/device/sriov_numvfs"
ExecStart=/usr/local/bin/initIbGuids.sh
StandardOutput=journal
TimeoutStartSec=60
RestartSec=60

[Install]
WantedBy=multi-user.target

Now enable the service:
systemctl enable mellanox_initvf.service

Next we will create the script:
vim /usr/local/bin/initIbGuids.sh

Paste in the following:
Code:
#!/bin/bash

first_dev=$(ibstat --list_of_cas | head -n 1)

node_guid=$(ibstat ${first_dev} | grep "Node GUID" | cut -d ':' -f 2 | xargs | cut -d 'x' -f 2)
port_guid=$(ibstat ${first_dev} | grep "Port GUID" | cut -d ':' -f 2 | xargs | cut -d 'x' -f 2)

echo "first dev: $first_dev"
echo "node guid: $node_guid"
echo "port_guid: $port_guid"

if ip link show $first_dev &> /dev/null ; then
  for vf in {0..3}; do
    vf_guid=$(echo "${port_guid::-5}cafe$((vf+1))" | sed 's/..\B/&:/g')
    echo "vf_guid for vf $vf is $vf_guid"
    ip link set dev ${first_dev} vf $vf port_guid ${vf_guid}
    ip link set dev ${first_dev} vf $vf node_guid ${vf_guid}
    ip link set dev ${first_dev} vf $vf state auto
  done
fi
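
The GUID derivation in the loop above can be tried on its own, without any hardware. The sketch below uses the same expression as the script; `make_vf_guid` is a made-up helper name and the port GUID is an invented example value. It also adds a length check, because a VF index of 9 or higher makes `$((vf+1))` two digits long and the result would no longer be a 16-digit GUID.

```shell
#!/bin/bash
# make_vf_guid PORT_GUID VF_INDEX (hypothetical helper)
# Drops the last 5 hex digits of the port GUID, appends "cafe" plus the
# 1-based VF number, then inserts a colon after every byte (GNU sed).
make_vf_guid() {
  local port_guid=$1 vf=$2
  local raw="${port_guid::-5}cafe$((vf + 1))"
  # A GUID is 64 bits, i.e. exactly 16 hex digits
  if [[ ! $raw =~ ^[0-9a-fA-F]{16}$ ]]; then
    echo "invalid GUID '$raw' (VF index too high?)" >&2
    return 1
  fi
  echo "$raw" | sed 's/..\B/&:/g'
}

# Invented port GUID, for illustration only
port_guid="248a070300a1b2c4"
for vf in {0..3}; do
  echo "vf $vf -> $(make_vf_guid "$port_guid" "$vf")"
done
# first line printed: vf 0 -> 24:8a:07:03:00:ac:af:e1
```

With more than 9 VFs the suffix scheme would need changing (for example a shorter prefix), which is left open here.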

Make the script executable (755 is enough; a script that runs as root should not be world-writable):
chmod 755 /usr/local/bin/initIbGuids.sh

Now SR-IOV is configured and ready to attach to the VM.
[Screenshot: 1714775432382.png]
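
Before attaching a VF to a VM, it can help to confirm on the host that the service really created them. A minimal sketch, assuming the same sysfs path the service writes to; `vf_count_ok` is a made-up helper name, and the demo uses a temporary stand-in directory so it runs without hardware.

```shell
#!/bin/bash
# vf_count_ok DEVICE_DIR WANT (hypothetical helper)
# Compares DEVICE_DIR/sriov_numvfs with WANT.
# Returns 0 on match, 1 on mismatch, 2 if the file cannot be read.
vf_count_ok() {
  local dev_dir=$1 want=$2 have
  have=$(cat "$dev_dir/sriov_numvfs" 2>/dev/null) || return 2
  echo "VFs enabled: $have (wanted $want)"
  [ "$have" -eq "$want" ]
}

# Stand-in directory for the demo; on a real host pass
# /sys/class/infiniband/<HCA HERE>/device instead.
demo_dir=$(mktemp -d)
echo 4 > "$demo_dir/sriov_numvfs"
vf_count_ok "$demo_dir" 4 && echo "OK"
rm -rf "$demo_dir"
```

If the count is 0 after a reboot, `systemctl status mellanox_initvf.service` and the journal are the first places to look.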

From the VM we can now see full link on the SR-IOV device, and it is in an active state:
[Screenshot: 1714775570023.png]
 
Thank you, this is working for me with a Cx6 200 Gbps card. Two comments:

- It works with opensm; the service only needs to be restarted.
- mstconfig -d <bus id here> set SRIOV_EN=1 NUM_OF_VFS=4 (SRIOV_EN=1 is not accepted).

Why are you creating this service, by the way? Wouldn't it be enough to add the echo to the ibs4 "up" hook in interfaces?
 
Glad the opensm service is working. I didn't test it since I already have a switch and couldn't find any confirmation online of whether it works or not.

I checked, and this is working for me:
- mstconfig -d <bus id here> set SRIOV_EN=1 NUM_OF_VFS=4
[Screenshot: 1716868240537.png]

I'm using a service rather than running the command when the interface comes up because of stability issues I had doing it that way: about 1 in 4 reboots just did not work. Since I made it a service, it has worked every time without issue.
 
Scratch that, it must have been a typo on my side; SRIOV_EN=1 is fine.
 
For anyone interested, here's a rough script that does it all.
I made it more dynamic so you can set your desired quantity of VFs.

1. Edit the variable IB_DEVICE_NUM_VFS="8", simply replace the number with what you want.
2. Run the script
3. Reboot
4. Run the script again
5. That's it

The script has to be run twice because the VFs set in the Mellanox card's firmware only take effect after a reboot, and the second part of the script needs them to be present.
Enjoy

Bash:
#!/bin/bash
#
# Simple script to set VFS on Mellanox InfiniBand Network card


# 1x variable to set
# How many VFS do you want, simply change this number.
IB_DEVICE_NUM_VFS="8"



################
# Script Below #
################



# Install IB packages
install_ib_packages()
  {
    apt install -y infiniband-diags ibutils rdma-core rdmacm-utils mstflint
  }



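# Set SR-IOV & VFs in firmware
set_srvio_vfs()
  {
    # PCI address of the first physical (non-virtual) Mellanox function.
    # head -n 1 keeps this a single address on dual-port cards.
    IB_DEVICE_PCI=$(lspci | grep -i mellanox | grep -iv virtual | awk '{print $1}' | head -n 1)

    if
      mstconfig -d "$IB_DEVICE_PCI" q SRIOV_EN | grep -qi true && \
      [[ $(mstconfig -d "$IB_DEVICE_PCI" q NUM_OF_VFS | grep NUM_OF_VFS | awk '{print $2}') == "$IB_DEVICE_NUM_VFS" ]];
    then
      # Firmware already matches; carry on with the rest of the script.
      # ("continue" is only valid inside a loop, so return instead.)
      return 0
    else
      mstconfig -d "$IB_DEVICE_PCI" set SRIOV_EN=1 NUM_OF_VFS="$IB_DEVICE_NUM_VFS"
      echo "Need to reboot, then run script again"
      exit
    fi
  }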



create_service_script()
  {
    # IB device ID
    IB_DEVICE_ID=$(ibstat --list_of_cas | head -n 1)


    # Create service script
    cat <<EOF > /etc/systemd/system/mellanox_initvf.service
[Unit]
After=network.target

[Service]
Type=oneshot
# note: change according to your hardware:
ExecStart=/bin/bash -c "/usr/bin/echo $IB_DEVICE_NUM_VFS > /sys/class/infiniband/$IB_DEVICE_ID/device/sriov_numvfs"
ExecStart=/usr/local/bin/initIbGuids.sh
StandardOutput=journal
TimeoutStartSec=60
RestartSec=60

[Install]
WantedBy=multi-user.target
EOF
  }


enable_service_script()
  {
    # enable service
    systemctl enable mellanox_initvf.service
  }

create_ip_link_script()
  {
# Create init script
    cat <<'EOF' > /usr/local/bin/initIbGuids.sh
#!/bin/bash

first_dev=$(ibstat --list_of_cas | head -n 1)

node_guid=$(ibstat ${first_dev} | grep "Node GUID" | cut -d ':' -f 2 | xargs | cut -d 'x' -f 2)
port_guid=$(ibstat ${first_dev} | grep "Port GUID" | cut -d ':' -f 2 | xargs | cut -d 'x' -f 2)

echo "first dev: $first_dev"
echo "node guid: $node_guid"
echo "port_guid: $port_guid"

if ip link show $first_dev &> /dev/null ; then
  for vf in $(ip link show $first_dev | grep vf | awk {'print $2'}); do
    vf_guid=$(echo "${port_guid::-5}cafe$((vf+1))" | sed 's/..\B/&:/g')
    echo "vf_guid for vf $vf is $vf_guid"
    ip link set dev ${first_dev} vf $vf port_guid ${vf_guid}
    ip link set dev ${first_dev} vf $vf node_guid ${vf_guid}
    ip link set dev ${first_dev} vf $vf state auto
  done
fi
EOF

    # Make the script executable (no need for world-writable 777)
    chmod 755 /usr/local/bin/initIbGuids.sh
  }


initialize_interfaces()
  {
    # Initialize interfaces
    /usr/local/bin/initIbGuids.sh
  }



#################
# Run Functions #
#################
install_ib_packages
set_srvio_vfs
create_service_script
enable_service_script
create_ip_link_script
initialize_interfaces
 
Hi,

I'm having issues getting RoCE to work. I'm using Cx-4 cards without any switches. Ping works fine, but rping or anything else in the RDMA layer doesn't seem to work, and I get a segmentation fault.

root@host02gen10:~# rdma_client: start
Segmentation fault
root@host02gen10:~#

root@host02gen10:~# rdma link show
link rocep134s0f0/1 state ACTIVE physical_state LINK_UP netdev ens5f0np0
link rocep134s0f1/1 state ACTIVE physical_state LINK_UP netdev ens5f1np1
link rocep134s0f0v4/1 state DOWN physical_state DISABLED netdev ens5f0v4
link rocep134s0f1v4/1 state DOWN physical_state DISABLED netdev ens5f1v4
root@host02gen10:~#

I've set up 5 VFs per port and a similar startup service. One thing I see in the VMs is that node_guid is not populated; I tried the above and it still isn't populated. I'd appreciate any advice.

Thanks