[SOLVED] Slow 40GBit Infiniband on Proxmox 8.1.4

Hello folks,

I have a 4-node cluster built on Dell R730 and R740 servers.
I wanted to improve Ceph performance by installing faster network cards, so I bought Mellanox CX314A cards and a Mellanox SX6036 managed switch.
Everything is connected and running, but OMG it's terribly slow; it's slower than the 10GBit Ethernet I had on my Intel cards.
All I can get from iperf3 is 7.6GBit.
I've read a lot of posts but I cannot find anything useful. Some say I should switch to Ethernet, but at the moment I don't have a VPI license on my switch, so only Infiniband works, not Ethernet.
Does anyone have a properly working Infiniband configuration and could share some experience, please?

EDIT: I've changed the MTU to the maximum of 65520; after that I get transfers around 9.6Gbit, still far away from 40.
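For reference, raising the IPoIB MTU that far requires connected mode; a minimal /etc/network/interfaces sketch (interface name and address are only examples, not my actual config):
Code:
# IPoIB interface in connected mode with a large MTU (names/addresses are examples)
auto ibp1s0
iface ibp1s0 inet static
        address 10.10.10.11/24
        pre-up echo connected > /sys/class/net/ibp1s0/mode
        mtu 65520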
 
Make sure iperf3 is the same version on both sides, try -R for reverse mode and -P <n> for n parallel transfers.
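For example (node2 is a placeholder hostname):
Code:
# on the receiving node
iperf3 -s
# on the sending node: 4 parallel streams for 30s, then again with -R for the reverse direction
iperf3 -c node2 -P 4 -t 30
iperf3 -c node2 -P 4 -t 30 -R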
 
I replicated this setup, and on the first go-around I ran into the same numbers no matter how many parallel streams I threw at it. This indicated to me that I had made a mistake in the config.

Code:
[SUM]   0.00-10.00  sec  9.42 GBytes  8.09 Gbits/sec  57324             sender

ibstatus showed no issues at the PHY level:
Code:
Infiniband device 'ibp1s0' port 1 status:
        default gid:     fe80:0000:0000:0000:f452:1403:0091:0061
        base lid:        0x5
        sm lid:          0x3
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand

Finally I ran ibdiagnet and spotted my issue:

Code:
-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps

Oops, minor mistake; I checked it again. SAME issue. Next I checked ibportstate:

Code:
ibportstate 5 1
CA/RT PortInfo:
# Port info: Lid 5 port 1
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................5
SMLid:...........................4
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
LinkSpeedExtSupported:...........0
LinkSpeedExtEnabled:.............0
LinkSpeedExtActive:..............No Extended Speed
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0
# MLNX ext Port info: Lid 5 port 1
StateChangeEnable:...............0x00
LinkSpeedSupported:..............0x00
LinkSpeedEnabled:................0x00
LinkSpeedActive:.................0x00

From here, however, I'm not sure what to do to fix this, as it's 40Gbps in the switch and the OS (via ethtool).
 
I ended up buying an ETH license for the switch and changing the configuration to Ethernet. Now it's faster, and iperf3 shows an average of 33Gbps, which I think is pretty much all I can get from ConnectX-3 cards.
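For anyone following along: besides the switch license, the ConnectX-3 ports themselves have to be switched from IB to Ethernet. A hedged sketch of how that is usually done on mlx4 (the PCI address and mst device name are placeholders):
Code:
# temporary, via the mlx4 driver's sysfs attributes (one per port):
echo eth > /sys/bus/pci/devices/0000:04:00.0/mlx4_port1
echo eth > /sys/bus/pci/devices/0000:04:00.0/mlx4_port2
# persistent, in the card firmware (MFT tools, after 'mst start'); takes effect after a reboot:
mlxconfig -d /dev/mst/mt4099_pci_cr0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2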
That works. One thing I forgot to mention is that you set the MTU on the cards to 65520, but that won't work with your switch; the SX6036 has a max MTU size of 4096. If you correct that, you might be able to get a little more performance out of it if it's not fragmenting.
 
I believe that value applies to Infiniband. I've lowered the MTU on both the switch and the NIC to 9000. Also, my ConnectX-3 CX314A cards are dual-port, so I'm thinking about an LACP LAG; a sketch of such a bond follows below.

Please correct me if I'm wrong
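A rough sketch of what such a bond could look like in /etc/network/interfaces on Proxmox (interface names, bridge name and address are assumptions; the switch side needs a matching LACP port-channel):
Code:
# LACP bond over both 40GbE ports (names/addresses are examples)
auto bond0
iface bond0 inet manual
        bond-slaves enp1s0 enp1s0d1
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        bond-miimon 100
        mtu 9000

auto vmbr1
iface vmbr1 inet static
        address 10.10.10.11/24
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        mtu 9000
Keep in mind that a single TCP stream (e.g. one iperf3 connection without -P) will still only use one member link with LACP hashing.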
 
Yeah, the CX-3 with IPoIB doesn't seem to be working right, or there's some tuning I'm not doing correctly. I plugged in the CX-4 EDR cards and they are doing great at 40Gbps with IPoIB out of the box.

Probably best to leave it with ETH since that's working, and mark it solved.
 
How/where did you buy the ETH license? I thought you can't buy that anymore because the switch is EOL...
 
Well, since I couldn't find any other working solution for faster Infiniband with IPoIB, it seems I have to leave it as Ethernet. Marked as solved? :)
I haven't given up yet.

And I did find out something: the default (group) rate for an IB subnet (at least as provided by my SX6036) seems to be 10Gbps. That's what the ibdiagnet output suggests, and I also found a link corroborating that the default group-rate setting of "3" on the switch corresponds to 10Gbps. Alas, I found no way to change the default subnet's settings.
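As an aside: if the subnet manager runs on a node (opensm) instead of on the switch, the group rate can be set per partition in /etc/opensm/partitions.conf. A hedged sketch, not my switch's config (rate code 3 = 10Gbps, 7 = 40Gbps; mtu code 5 = 4096):
Code:
# default partition with a 40Gbps group rate and 4096 MTU (sketch)
Default=0x7fff, ipoib, rate=7, mtu=5 : ALL=full;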

But I was able to create a new subnet of my own, for which I could specify my own (group) rate. I chose 56Gbps.

Now I needed to have my CX3 cards use that subnet as well. Not sure whether there are other ways to do this, but the one I found was to create a child interface on the CX3 card's (main) interface and assign it the desired new subnet's P_Key. That new child interface then needed its own IP address and after I did the same on one of the other nodes, they were able to communicate via the new subnet.
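For reference, one way to create such a P_Key child interface is via sysfs, as described in the kernel IPoIB documentation (interface name and address are placeholders; the key is usually written in its full-membership form with the high bit set, so 0x7ffe becomes 0xfffe):
Code:
# create a child of ibp1s0 for partition 0x7ffe (full-membership key 0xfffe)
echo 0xfffe > /sys/class/net/ibp1s0/create_child
ip addr add 10.77.0.11/24 dev ibp1s0.fffe
ip link set ibp1s0.fffe up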

Now, for some reason, this did not give me a rate higher than 13Gbps. It may be that I am, after all, limited by my PCIe slot's performance (and I shall investigate this lead) or that a CX3 card doesn't have enough throughput (which seems unlikely given that it delivers 33Gbps when used in ETH mode). But I did break the 10Gbps barrier.

Interestingly, ibdiagnet now lists both subnets as follows:

Code:
-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps
-I- Subnet: IPv4 PKey:0x7ffe QKey:0x00000b1b MTU:2048Byte rate:RRGbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:RRGbps

PKey 7fff is the default subnet, 7ffe is my new subnet (notice the creativity ;)). And its group rate is RRGbps, whatever that means (maybe 13Gbps? :D)

I shall continue to experiment.
 
Just FYI, ChatGPT says it would be better to use RDMA instead of IPoIB. Don't be shy, ask it :-) And good luck. Myself... I gave up a long time ago :-)
 
So I got that Ethernet license as well and switched two of my three cards to Ethernet mode (and, of course, also the switch). With iperf I got a throughput of (close to) 30Gbps. Not bad! Or so I thought. I also switched the last card to Ethernet mode and ... have never been able to reproduce the 30Gbps. Now I'm consistently getting less than 20Gbps. Not that impressive. And I have no clue what changed. It looked so promising and suddenly I'm back at square one.

Just FYI, ChatGPT says it would be better to use RDMA instead of IPoIB. Don't be shy, ask it :-) And good luck.
So, switch back to IB mode and then - somehow - activate RDMA? Or does RDMA also work in Ethernet mode? Probably not.

Are you using RDMA? Do you know how to activate it?
 
apt install rdma-core
modprobe rdma
When using it for NFS, enable it by uncommenting the rdma and rdma-port entries in /etc/nfs.conf, restart the NFS server, and mount with -o rdma.
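For the record, a hedged sketch of what that looks like (server name, export path and mountpoint are placeholders; 20049 is the conventional NFS/RDMA port):
Code:
# /etc/nfs.conf on the NFS server
[nfsd]
rdma=y
rdma-port=20049

# on the client, after restarting the server's nfs-server service:
mount -t nfs -o rdma,port=20049 nfs-server:/export /mnt/export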