Looking for a "cheap" way to upgrade to 40Gbit

And do I need to bring the ports on my switch up individually? The switch GUI shows that the ports with cables plugged in do detect the cables, but that they are "down"/"polling". So, do I need to "up" a port before it will connect?
Nah, it probably means your cards are in Ethernet mode rather than Infiniband mode. When you switch a card to Infiniband mode (and I think it can be done per-port) then the port in question will probably automatically show up on the switch.
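From rough memory, that switch is done with mstconfig (part of the mstflint package) or mlxconfig from Mellanox's MFT, roughly like this - assuming your card actually exposes the LINK_TYPE parameters; the PCI address is a placeholder, and the change only applies after a reboot:

Code:
# query the current config and look for LINK_TYPE_P1 / LINK_TYPE_P2
mstconfig -d <pci-address> query

# set both ports to Infiniband (IB=1, ETH=2, VPI/auto-sense=3), then reboot
mstconfig -d <pci-address> set LINK_TYPE_P1=1 LINK_TYPE_P2=1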

It sounds like your switch has some kind of management interface too, which will probably be useful. That'll mean you probably have a subnet manager running on the switch, which is one less thing to get working. :)
 
Man, this is proving so much more complicated than I anticipated (and I did anticipate issues...)!

So mstconfig tells me

Code:
Device #1:
----------

Device type:    ConnectX3Pro   
Device:         65:00.0         

Configurations:                              Next Boot
         SRIOV_EN                            True(1)         
         NUM_OF_VFS                          32             
         LOG_BAR_SIZE                        3               
         BOOT_OPTION_ROM_EN_P1               True(1)         
         BOOT_VLAN_EN_P1                     False(0)       
         BOOT_RETRY_CNT_P1                   0               
         LEGACY_BOOT_PROTOCOL_P1             PXE(1)         
         BOOT_VLAN_P1                        1               
         BOOT_OPTION_ROM_EN_P2               True(1)         
         BOOT_VLAN_EN_P2                     False(0)       
         BOOT_RETRY_CNT_P2                   0               
         LEGACY_BOOT_PROTOCOL_P2             PXE(1)         
         BOOT_VLAN_P2                        1               
         IP_VER_P1                           IPv4(0)         
         IP_VER_P2                           IPv4(0)

Note how there is no LINK_TYPE line.
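For comparison, from what I can find online, a VPI-capable ConnectX-3 is supposed to list something like this in that same output (so this is what I expected to see, not what I actually got):

Code:
         LINK_TYPE_P1                        VPI(3)
         LINK_TYPE_P2                        VPI(3)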

I am starting to suspect that my Infiniband cards don't understand Infiniband.

mstflint tells me

Code:
Image type:            FS2
FW Version:            2.43.7028
FW Release Date:       12.1.2020
Product Version:       02.43.70.28
Rom Info:              type=PXE version=3.4.662
                       type=UEFI version=14.9.90 cpu=AMD64
Device ID:             4103
Description:           Node             Port1            Port2            Sys image
GUIDs:                 ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
MACs:                                       e41d2ddebde0     e41d2ddebde1
VSD:                   
PSID:                  MT_1090111023

MT_1090111023 seems to refer to a ConnectX-3 Pro EN. And "EN" might mean "no Infiniband"??? Can someone who understands this confirm either way?

I am starting to get the feeling that I would like to take my Infiniband cards out for a round of skeet :mad:
 
MT_1090111023 seems to refer to a ConnectX-3 Pro EN.
Uh oh!

Yep, that's definitely one of their ethernet only models. No infiniband capability on that model of card.

So, in a 3 node cluster you'll be able to hook all three nodes up to each other when using dual port cards. But you'd need to figure something extra out in order to have a spare network port for them to communicate to anything else (ie a backup server).

Sounds like you'll be needing to get some more cards, or perhaps re-organise your architecture for the backup server?
 
Oh FFS! How can there even be something like an IB card that can't do IB?

Anyways, so does the "EN" signify cards that can't do IB? In other words, when I buy new cards, will I be safe if I avoid "EN" cards? Is there anything else that I would want to avoid? Is there any value in buying "VPI" cards (I think my switch doesn't have the VPI profile)?
 
Ahhh. I'll explain the (super simplified) history, as that should help. :)

Mellanox seems to have started off making Infiniband-only cards (various series called "InfiniHost"). After a few years of Infiniband not taking over the world, they figured out how to make "dual protocol" cards which can do both Infiniband and Ethernet, a capability they call Virtual Protocol Interconnect ("VPI"). These are the ones in the ConnectX VPI, ConnectX-2 VPI, ConnectX-3 VPI, (etc) series.

At some point in the ConnectX/ConnectX-2/ConnectX-3/(etc) series Mellanox decided to also release cards that only do Ethernet. These are the ConnectX EN/ConnectX-2 EN/ConnectX-3 EN/(etc) cards.

There seem to be "Pro" models in the ConnectX-3 range too, and yours seems to be ConnectX-3 Pro EN cards. They're actually pretty good cards... if you want to do just Ethernet. ;)
 
Is there any value in buying "VPI" cards (I think my switch doesn't have the VPI profile)?
Yeah, you'll be wanting the VPI cards. Those ones can have their ports switched to Infiniband, and they'll work fine with your switch. :)
 
Just remembered something relevant. At some point in the later ConnectX series(es) they also decided to release Infiniband only cards again.

Those ones are named a bit differently too: "Connect-IB". Those would probably work too (if it's a card with the right Infiniband speed), but they're generally pretty rare, so you'll likely find more VPI adapters around at good prices. The VPI cards should work fine for you, though you definitely want to ensure you buy cards with the same Infiniband speed as your switch. :)
 
Took me a while to get the new cards and then a couple of other things diverted my focus.

But I have been gradually replacing my nodes and each new node got one of the new cards, and now all nodes are equipped with true Infiniband cards. One card is set to link type VPI and the other two are set to ETH. The one card set to VPI is being recognized by the switch, the other two are not. So far so good. Looks like I just need to switch the other two cards to VPI and all is peachy.

But here's the thing: The one set to VPI is not showing up as an ethernet card in PVE. So how do I use it now?
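My guess is that it shows up as an ibXsY-style interface rather than a normal ethX/enpXsY one, and that I just configure it like any other interface in /etc/network/interfaces - something like this (untested, interface name and address made up):

Code:
auto ibs5
iface ibs5 inet static
        address 10.10.10.1/24
        pre-up modprobe ib_ipoib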
 
Okay, so I got all three nodes to talk to each other over the new infiniband network. And my switch reports that all nodes are connected via 56gbps links (although I didn't purchase the more expensive cables :D) So that's a win.

But the performance is underwhelming to say the least: 10gbps - that is what my old 10gbe network delivers as well. So no improvement at all.

The wiki entry mentioned in my previous post describes TCP/IP tuning but the iperf test performed afterwards (in the wiki entry) actually shows even more dismal results than I get without tuning: 7.71 Gbits/sec. What gives?
 
But the performance is underwhelming to say the least: 10gbps - that is what my old 10gbe network delivers as well. So no improvement at all.
It's probably using the software based tcp/ip layer driver (IPoIB <-- "IP over Infiniband") to let things communicate without having to use native Infiniband protocols.

So your win is going to be more like "I have a cheap switch that lets me plug in a bunch of hosts and have them communicate" rather than "I can move much more data than 10GbE".

That being said, performance tuning of some variety is probably going to help. As long as you figure out what the right things to tune are. Someone mentioned that setting "connected mode" to ON isn't a thing with Infiniband any more, but I'd probably double check if that's really the case as it used to help things run at near line rate (back in the day). Well, from rough memory. :)
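If you want to experiment, the knobs I remember were per-interface settings plus testing with several parallel streams - something like this (interface name and address are placeholders, and as above I'm not sure connected mode still applies on current drivers):

Code:
# switch IPoIB from datagram to connected mode (if the driver still supports it)
echo connected > /sys/class/net/ib0/mode
# connected mode allows a much larger MTU than the 2044/4092 of datagram mode
ip link set ib0 mtu 65520
# re-test with a few parallel streams; a single iperf stream often understates IPoIB
iperf3 -c 10.10.10.2 -P 4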
 
So your win is going to be more like "I have a cheap switch that lets me plug in a bunch of hosts and have them communicate" rather than "I can move much more data than 10GbE".

Yeah, well, I already have the expensive switch, so... And while the IB switch has way more ports than my 10gbe switch, I actually do need fewer of them because my IB cards only have two ports whereas my 10gbe cards have four.

And the whole purpose of the exercise was to move data faster. Moving the data at the same speed but with substantially more noise is not going to cut it for me :(

That being said, performance tuning of some variety is probably going to help. As long as you figure out what the right things to tune are. Someone mentioned that setting "connected mode" to ON isn't a thing with Infiniband any more, but I'd probably double check if that's really the case as it used to help things run at near line rate (back in the day). Well, from rough memory.

I will look into that.

It's probably using the software based tcp/ip layer driver (IPoIB <-- "IP over Infiniband") to let things communicate without having to use native Infiniband protocols.
You mean that using IPoIB inflicts a penalty of around 80% (10gbps vs 56gbps; not considering overhead)?

If that is the case, I need to try using IB directly instead of through IPoIB. I want those sweet 56gbps.
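First thing I'll probably try is a raw RDMA bandwidth test with the perftest tools, to see what the fabric itself can do before worrying about IPoIB (hostname is a placeholder):

Code:
# on node A (server side)
ib_write_bw
# on node B (client side), point it at node A
ib_write_bw nodeA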
 
I have two potential explanations for the low throughput of my 40/56gbps IPoIB:

- There might be a bottleneck in the PCIe slots that my IB cards sit in (I have relatively small servers with only a few PCIe lanes, and the available slots may have too few lanes to fully saturate the connections)

- It might be an (artificial) limitation on the part of the subnet manager: ibdiagnet, inter alia, gives this output:

Code:
-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps

So the individual nodes are connected via 40gbps but still, as a group, are limited to 10gbps. I found this on the internet

https://forums.developer.nvidia.com/t/how-do-i-set-the-subnet-rate/206724/2

which details how to set the subnet rate - but only for opensm (which, if I understand correctly, is a subnet manager one can run on a node if the switch doesn't have one; I chose my SX6036 specifically because it does have its own SM, so I would prefer to use that one). Now I need to figure out whether my SX6036's SM defaults to 10gbps and, if it does, how I can change that.

It seems that I cannot change the default rate for the default subnet. But I can create a new subnet and specify a rate of, say, 56gbps. But then the question is how I get the nodes/cards to use that new subnet instead of the default one. Probably via the PKey parameter. Let's see...
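If I am reading the IPoIB docs right, the host side of that would be creating a child interface for the new PKey, roughly like this (0x8001 and the address are just examples):

Code:
# creates ib0.8001 on top of ib0, bound to partition key 0x8001
echo 0x8001 > /sys/class/net/ib0/create_child
ip addr add 10.10.20.1/24 dev ib0.8001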
 
- There might be a bottleneck in the PCIe slots that my IB cards sit in (I have relatively small servers with only few PCI lanes and the available slots may have too few lanes to fully saturate the connections)
Check with "lspci -vvv" what bandwidth the card supports and what bandwidth it actually negotiated in the PCIe slot.
I don't know about the member and group rate, but you should enable and use RDMA access to get more throughput.
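For the lspci check, the relevant lines look something like this (the output shown here is just an illustration of what to look for):

Code:
lspci -s 65:00.0 -vvv | grep -E 'LnkCap|LnkSta'
        LnkCap: ... Speed 8GT/s, Width x8 ...
        LnkSta: ... Speed 8GT/s, Width x8 ...
# 8GT/s x8 (PCIe 3.0 x8) is enough for a 56gbps port; if LnkSta shows a lower
# speed or width than LnkCap, the slot is the bottleneck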
 
...

Code:
-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps

...
Oh, that 10Gbps could indeed be the thing you need to change.

As a data point, installing and running opensm should be trivially simple. I used to have it running on every node (small cluster), as it'll automatically select a "master" node and have the others go into standby mode with no issue. Configuring things like default rates and stuff (from memory) used to be pretty simple too, as it was just changing a few simple values in the default opensm config file.

I used to use CentOS and RHEL with this stuff (back in the day) and opensm was available as a standard package in the default repositories. The config files were under `/etc/something` back then. No idea where Debian puts them though. ;)
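From rough memory again, the IPoIB group rate was set via the partition file, something along these lines (double-check the rate codes against the opensm man page before trusting me):

Code:
# /etc/opensm/partitions.conf (path may differ on Debian)
# mtu=5 means 4096 bytes; rate=7 means 40gbps, rate=12 would be 56gbps
Default=0x7fff, ipoib, mtu=5, rate=7 : ALL=full;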
 
I'm getting about 40gbps on IPoIB with the same hardware; you have a couple of issues at play here.
1) MTU is not 4k
2) your group rate is at 10gbps and not 40.

I'd get the output for:
- ibstat from each host
- ibnetdiscover
- ibdiagnet

and a screenshot of the IB switch SM settings. I do think the default for the subnet manager in IB switches is SDR, which is 10gbps.
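For reference, the lines worth checking in the ibstat output for each port look something like this (illustrative, not your actual output):

Code:
        State: Active
        Physical state: LinkUp
        Rate: 40            <-- should be 40 (or 56), not 10
        Link layer: InfiniBand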
 
Just a side note: depending on which CX-3 card you got, they should support IB all the way up to 56gbps, but if you switch them to Ethernet, some of them only support 10gbps as an artificial firmware limitation. The Pro firmware can be flashed onto the card to unlock the 40gbps Ethernet if you want to go that route.

Something to keep an eye out for if you decide to just switch it over to Ethernet instead of IB.
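For the record, the crossflash itself is done with mstflint, roughly like this (the firmware image name is a placeholder - make sure you grab the right image for your exact board, and that you're comfortable possibly bricking the card):

Code:
# back up the current firmware first
mstflint -d 65:00.0 ri original_fw_backup.bin
# burn the Pro image; -allow_psid_change is needed because the PSID differs
mstflint -d 65:00.0 -i <pro-firmware-image>.bin -allow_psid_change burn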
 
Just a side note: depending on which CX-3 card you got, they should support IB all the way up to 56gbps, but if you switch them to Ethernet, some of them only support 10gbps as an artificial firmware limitation. The Pro firmware can be flashed onto the card to unlock the 40gbps Ethernet if you want to go that route.

Something to keep an eye out for if you decide to just switch it over to Ethernet instead of IB.
Thanks for the hint!

I don't think my cards are limited in that way, because I have them in Ethernet mode and reach throughputs of 15 to 20 gbps that way.