40 Gb/s Mellanox InfiniBand

jw6677

Member
Oct 19, 2019
74
4
13
31
www.cayk.ca
Hey, obviously I can't vouch for the below being the right fit for you, and frankly it likely installs a bunch of crud you don't need. However, when I install a new node with a ConnectX-2 card, I run the below as part of my setup, and it just works. YMMV, and you may want to research each of these to determine whether it makes sense for you, but this is straight out of my setup notes.


Code:
apt-get -y install mdadm libibverbs1 ibverbs-utils librdmacm1 rdmacm-utils libdapl2 ibsim-utils ibutils libcxgb3-1 libibmad5 libibumad3 libmlx4-1 libmthca1 libnes1 infiniband-diags mstflint opensm perftest srptools
apt-get -y -t buster-backports install libibverbs1 librdmacm1 libibmad5 libibumad3 ibverbs-providers rdmacm-utils infiniband-diags libfabric1 ibverbs-utils
apt-get -y install libmlx4-1 infiniband-diags ibutils ibverbs-utils rdmacm-utils perftest

#Then add to /etc/network/interfaces:
# Insert your own device name in place of ibs4 all throughout:

auto ibs4
iface ibs4 inet static
        address  192.168.2.4
        netmask  255.255.255.0
        broadcast  192.168.2.255
        #hwaddress ether random
        mtu 65520
        pre-up modprobe ib_ipoib
        pre-up echo connected > /sys/class/net/ibs4/mode
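
A quick way to confirm the stanza took effect after `ifup` is to read the mode and MTU back from sysfs. This is only a minimal sketch and not part of the setup notes above: the function name `check_ipoib` and its optional second argument (an alternate sysfs root, there only so the function can be tried without hardware) are illustrative assumptions.

```shell
#!/bin/sh
# Print the IPoIB mode and MTU of an interface and succeed only if they match
# the stanza above (connected mode, MTU 65520).
# $1 = interface name (e.g. ibs4)
# $2 = sysfs net root (hypothetical override, defaults to /sys/class/net)
check_ipoib() {
    iface=$1
    root=${2:-/sys/class/net}
    if [ ! -r "$root/$iface/mode" ]; then
        echo "$iface: no IPoIB mode file (is ib_ipoib loaded?)"
        return 1
    fi
    mode=$(cat "$root/$iface/mode")
    mtu=$(cat "$root/$iface/mtu")
    echo "$iface: mode=$mode mtu=$mtu"
    [ "$mode" = "connected" ] && [ "$mtu" -eq 65520 ]
}
```

Running `check_ipoib ibs4` prints the current values and returns non-zero if the interface is still in datagram mode or the MTU did not stick.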
 

ndroftheline

Active Member
Jun 17, 2017
33
13
28
35
awesome! can you show me the output of ibstat on one of your systems having followed this? i think there might be something about IPoIB i've totally missed...
 

jw6677

Here you go!

Code:
CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 1
        Firmware version: 2.10.720
        Hardware version: b0
        Node GUID: 0x0002c903000c241a
        System image GUID: 0x0002c903000c241d
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 1
                LMC: 0
                SM lid: 6
                Capability mask: 0x0259086a
                Port GUID: 0x0002c903000c241b
                Link layer: InfiniBand
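
When comparing several nodes, it can help to boil that output down to the three fields that matter here: state, rate and link layer. The following is a rough sketch that assumes the `ibstat` layout shown above; the function name `ibstat_summary` is made up for illustration.

```shell
#!/bin/sh
# Summarise each port in `ibstat` output read from stdin as:
#   <CA> port <N>: <State> <Rate> <Link layer>
ibstat_summary() {
    awk -v q="'" '
        /^CA /                 { gsub(q, "", $2); ca = $2 }
        /^[ \t]*Port [0-9]+:/  { sub(/:/, "", $2); port = $2 }
        /^[ \t]*State:/        { state = $2 }
        /^[ \t]*Rate:/         { rate = $2 }
        /^[ \t]*Link layer:/   { print ca " port " port ": " state " " rate " " $3 }
    '
}
```

Usage would be `ibstat | ibstat_summary`. A port in Ethernet mode should report `Ethernet` as its link layer instead of `InfiniBand`.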
 

ndroftheline

OK! So now I need to try and understand the difference between "Ethernet" mode and IPoIB. I thought they were the same thing, but this indicates they are not. I will follow your steps...

done! works brilliantly. however i didn't see much of an increase in file throughput between my nodes after switching from the 10g "ethernet mode" link and the 40g IPoIB link, just went from ~1.1GB/s to ~1.2 GB/s.
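
To put those figures in link terms: 1.1 GB/s is about 8.8 Gb/s, close to what a 10 Gb/s link can carry, while 1.2 GB/s is 9.6 Gb/s, well under the roughly 32 Gb/s of data a 40 Gb/s QDR link carries after 8b/10b encoding. A small sketch of the arithmetic follows; the function names are illustrative and GB is taken as a decimal gigabyte.

```shell
#!/bin/sh
# Convert a throughput in GB/s (decimal gigabytes) to Gb/s.
gbytes_to_gbits() {
    awk -v x="$1" 'BEGIN { printf "%.1f\n", x * 8 }'
}

# Usable data rate in Gb/s of a link that uses 8b/10b line encoding,
# given its signalling rate in Gb/s.
data_rate_8b10b() {
    awk -v x="$1" 'BEGIN { printf "%.1f\n", x * 8 / 10 }'
}

gbytes_to_gbits 1.1     # file copy over the 10G Ethernet-mode link
gbytes_to_gbits 1.2     # file copy over the 40G IPoIB link
data_rate_8b10b 40      # data-rate ceiling of a QDR 4X link
```

The gap between 9.6 and 32 suggests the bottleneck is somewhere other than the link itself.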

ibdiagnet output shows this warning:

-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps

fun! how to set my subnet group to 40gbps...
 

ndroftheline

OK, so.

https://community.mellanox.com/s/question/0D51T00006s7aL4/how-do-i-set-the-subnet-rate - someone asking a few months ago about how to fix this, pointed me to where to find the opensm configuration and what document to look for the rate specifications, but it's on CentOS so config file is in a different location, and it sets to openSM defaults instead of specifying a particular group rate.

https://manpages.debian.org/jessie/opensm/opensm.8.en.html - showed where the default config file is in debian, which is /etc/opensm/partitions.conf

https://docs.mellanox.com/display/MLNXOFEDv451010/OpenSM#OpenSM-Partitions - confirms that the default opensm group rate is 10gbps (rate=3), and what the syntax is to change the rate

https://cw.infinibandta.org/document/dl/7859 - this is a direct link to the infiniband specification document discussed in the first link, i got it from this page: https://www.infinibandta.org/ibta-specifications-download/ - you're looking for "the Architecture Specification Volume 1", table 224 that shows the rates available.

so I created /etc/opensm/partitions.conf and populated it with this:

Default=0x7fff, ipoib, mtu=5, rate=7, defmember=full : ALL=full, ALL_SWITCHES=full,SELF=full;

(note: i got most of that from the first link and only added rate=7. rate=7 corresponds to that table 224 entry for 40gbps.) a quick systemctl restart opensm, and ibdiagnet showed 40gbps group rate.
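
For anyone double-checking a partitions.conf line before restarting opensm, here is a small sketch that turns the `rate=` code into Gb/s. Codes 3 (10 Gb/s) and 7 (40 Gb/s) are the two confirmed in this thread; the other entries are from memory of the IBTA rate table and should be verified against the spec. The function names are made up for illustration.

```shell
#!/bin/sh
# Map an OpenSM partition "rate=" code to Gb/s.
# Only codes 3 and 7 are confirmed in this thread; treat the rest as
# assumptions to be checked against the IBTA specification.
rate_code_to_gbps() {
    case $1 in
        2) echo 2.5 ;;
        3) echo 10 ;;
        4) echo 30 ;;
        5) echo 5 ;;
        6) echo 20 ;;
        7) echo 40 ;;
        *) echo "unknown rate code: $1" >&2; return 1 ;;
    esac
}

# Extract the rate= code from one partitions.conf line and print it in Gb/s.
partition_rate_gbps() {
    code=$(printf '%s\n' "$1" | sed -n 's/.*rate=\([0-9][0-9]*\).*/\1/p')
    [ -n "$code" ] || { echo "no rate= in line" >&2; return 1; }
    rate_code_to_gbps "$code"
}
```

Passing the line above to `partition_rate_gbps` should print 40, which is the group rate ibdiagnet is expected to report after the restart.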

however my transfer speeds remained the same. so i'm not sure it made a difference. i suspect i'm running into other types of limits, since i'm just doing a simple SMB share from a direct-connected windows 10 client.
 

gb00s

Member
Aug 4, 2017
28
2
23
42
@ndroftheline

How do you test your IB speed? If you go the iperf route you will not see the real IB performance. I'm using the command < ib_send_bw > on the 'server' and < ib_send_bw IP-Server > on the 'client' machine, which gives me this result:
root@pve2:~# ib_send_bw 192.168.1.71
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : mlx4_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 4096
Link type : IB
Max inline data : 0
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x01 QPN 0x0212 PSN 0x7dca80
remote address: LID 0x02 QPN 0x0212 PSN 0xc1ba80
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 2967.297000 != 2432.842000. CPU Frequency is not max.
65536 1000 3756.01 3755.94 0.052095
---------------------------------------------------------------------------------------
root@pve2:~#
By contrast, < iperf3 -c 192.168.1.71 -B 192.168.1.81 --cport 5500 >, for instance, gives me just a normal 1Gb/s. And what is your iblinkinfo saying? Mine just says:
root@pve2:~# iblinkinfo
CA: pve1 mlx4_0:
0x0021280001fc0ab3 2 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 1 1[ ] "pve2 mlx4_0" ( )
CA: pve2 mlx4_0:
0x0021280001fc09eb 1 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 1[ ] "pve1 mlx4_0" ( )
which might explain the 1GB/s only. I'm still confused as both adapter rates are saying 40.
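
One way to read that iblinkinfo line: `4X 10.0 Gbps` is the link width (four lanes) followed by the speed of each lane, so the aggregate signalling rate is 4 x 10 = 40 Gb/s, which would agree with the adapter rate of 40. A minimal sketch of that calculation, assuming the iblinkinfo line format shown above (the function name is made up):

```shell
#!/bin/sh
# Read iblinkinfo output on stdin and print the aggregate rate of each link:
# lane count (e.g. 4X) multiplied by the per-lane speed (e.g. 10.0 Gbps).
link_aggregate_rate() {
    sed -n 's/.*( *\([0-9][0-9]*\)X *\([0-9.][0-9.]*\) Gbps.*/\1 \2/p' |
        awk '{ printf "%dX x %s Gbps = %g Gbps\n", $1, $2, $1 * $2 }'
}
```

Usage would be `iblinkinfo | link_aggregate_rate`. Lines that do not describe a link, such as the `CA:` headers, are skipped.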
 

iamspartacus

New Member
Sep 9, 2020
21
0
1
39
I am a bit discouraged that there is no response of any kind... I hope my post will help someone else get this up and running..

I was able to get things working and will share steps below.

Although there is no official driver for Debian 10, Proxmox VE 6 seems to ship the kernel modules needed to run these Mellanox devices, so you do not need the official driver to make this work.

After you connect the device, it should be detected and show up on the list of network interfaces.
- If you do not see it on the list, try moving the device to another PCIe port, in some cases port configurations cause issues for these.

Although the device has been detected by the host OS, the controller itself runs in "InfiniBand mode" by default rather than "Ethernet mode". We can change that using the following command:

connectx_port_config
It will produce following output:

Code:
ConnectX PCI devices :
|----------------------------|
| 1             0000:XX:00.0 |
|----------------------------|

Before port change:
Ib

|----------------------------|
| Possible port modes:       |
| 1: Infiniband              |
| 2: Ethernet                |
| 3: AutoSense               |
|----------------------------|
Select mode for port 1 (1,2,3):


It shows that device 1 in this case is working in "Ib" mode, and we can switch it to "Ethernet" mode by selecting option 2.

One line version of same command is below, make sure to replace PCIe address with your device address:

Code:
echo eth >/sys/bus/pci/devices/0000:XX:00.0/mlx4_port1
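
If this is going into a script, it may be worth checking the current type first and only writing when it differs. The following is a rough sketch wrapped around the same sysfs file; the function name `set_mlx4_port_type` is made up, and it assumes the file reads back the plain type string (`ib` or `eth`).

```shell
#!/bin/sh
# Switch one mlx4 port to the requested type only if it is not already set.
# $1 = PCI device directory, e.g. /sys/bus/pci/devices/0000:XX:00.0
# $2 = port number
# $3 = wanted type ("eth" or "ib")
set_mlx4_port_type() {
    f="$1/mlx4_port$2"
    if [ ! -w "$f" ]; then
        echo "cannot write $f (wrong PCI address, or not root?)" >&2
        return 1
    fi
    current=$(cat "$f")
    if [ "$current" = "$3" ]; then
        echo "port $2 already $3"
    else
        echo "$3" > "$f"
        echo "port $2: $current -> $3"
    fi
}
```

For example, `set_mlx4_port_type /sys/bus/pci/devices/0000:XX:00.0 1 eth`, with the PCIe address replaced by your own.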

Although the interface is now properly configured, Proxmox will not show it as active in the list of network interfaces. You will need to restart networking; to do so, run:

Code:
/etc/init.d/networking stop && /etc/init.d/networking start

You should now see the Mellanox interface as active.

The next issue is that this change will not persist across reboots, and the configuration file changes I tried had no effect, probably because the driver process does not run, likely due to the driver package not being installed.

So your network interface will default back to "Ib" on reboot, and even if you change it, you then have to restart networking.


For now I created an @reboot crontab that reconfigures the interface and then restarts networking.

Code:
# Run both steps as one job so the port is switched before networking restarts
@reboot echo eth >/sys/bus/pci/devices/0000:XX:00.0/mlx4_port1 && /etc/init.d/networking stop && /etc/init.d/networking start


If someone has a better way to do this, please post. I would love for the controller config to be set before the network interfaces are started.
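
One approach that might cover this, untested on the setup described here: the mlx4_core kernel module has a `port_type_array` parameter (1 for InfiniBand, 2 for Ethernet) that sets the port types when the driver loads, which happens before networking starts. The sketch below only generates the modprobe options line; the function name and the file name `/etc/modprobe.d/mlx4.conf` are assumptions for illustration.

```shell
#!/bin/sh
# Write a modprobe options file that sets the mlx4 port types at driver load.
# $1   = output file (e.g. /etc/modprobe.d/mlx4.conf, a hypothetical name)
# $2.. = one type per port, in port order: "ib" or "eth"
write_mlx4_port_types() {
    out=$1
    shift
    list=
    for t in "$@"; do
        case $t in
            ib)  n=1 ;;
            eth) n=2 ;;
            *)   echo "unknown port type: $t" >&2; return 1 ;;
        esac
        # Join the per-port codes with commas.
        list=${list:+$list,}$n
    done
    [ -n "$list" ] || { echo "no port types given" >&2; return 1; }
    echo "options mlx4_core port_type_array=$list" > "$out"
}
```

For a single-port card this would be `write_mlx4_port_types /etc/modprobe.d/mlx4.conf eth`. If the module is loaded from the initramfs, the initramfs probably needs regenerating (`update-initramfs -u`) before the option takes effect.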

I think it is possible to change the firmware on the device to keep it in "eth" mode, and I may look into that.

What does one need to do if the command 'connectx_port_config' returns ' -bash: connectx_port_config: command not found' ?
 
