ConnectX-3 RoCE Configuration

sirebral · Member · Feb 12, 2022 · Oregon, USA
Hey all!

I've been looking for a step-by-step to configure RoCE on the latest Proxmox. I am using ConnectX-3 dual-port cards and an interconnect cable that supports both InfiniBand and Ethernet natively, no switch involved. Right now I have it set up for Ethernet on both ends, and it works fine for both 40-gig connections, but I want to take advantage of RDMA to skip the CPU for my shares between nodes. These shares will only be used for files, not for virtual instances, lots of media. I'll be testing out both NFS 4.2 and SMB after this is functional to see what works best on my rigs.

My Google-fu hasn't turned up much. I did get the instructions below from ChatGPT4, but I tend not to trust AI to be 100% correct quite yet. I'm looking for a similar, but confirmed, step-by-step. After looking for hours at 50 different sites, I found 50 different answers. I'd really appreciate an assist so I don't mess up my configs further by following inaccurate how-tos.

From ChatGPT4:

//-------

RoCE (RDMA over Converged Ethernet) allows for efficient communication between nodes in a network using Remote Direct Memory Access (RDMA) over Ethernet. In order to set up RoCE on Proxmox 7.4 using ConnectX-3 cards without a switch, you will need to follow these steps:

1. Install the necessary drivers:
- ConnectX-3 cards require the `mlx4_en` driver to be installed. You can install it using the following command:

```
apt-get install -y linux-modules-extra-$(uname -r) mlx4_en
```

- Verify that the driver has been loaded using the following command:

```
lsmod | grep mlx4_en
```

2. Configure the network interface:
- Identify the network interface that corresponds to the ConnectX-3 card using the following command:

```
ip a
```

- Edit the network interface configuration file `/etc/network/interfaces` and add the following lines to configure the interface:

```
auto enp6s0f0
iface enp6s0f0 inet static
address <IP_ADDRESS>
netmask <NETMASK>
```

Replace `enp6s0f0` with the name of the network interface identified in the previous step, and replace `<IP_ADDRESS>` and `<NETMASK>` with the desired IP address and netmask for the interface.

- Restart the networking service using the following command:

```
systemctl restart networking
```

3. Configure RoCE:
- Edit the RoCE configuration file `/etc/modprobe.d/mlx4.conf` and add the following line:

```
options mlx4_core roce_mode=1
```

- Restart the `rdma` service using the following command:

```
systemctl restart rdma
```

4. Verify the RoCE connection:
- Use the `ibstat` command to verify that the ConnectX-3 card is connected to the RoCE network:

```
ibstat
```

The output should show a RoCE interface with a valid state and link layer.

- Use the `ibping` command to verify that the RoCE connection is working:

```
ibping -c 10 <IP_ADDRESS>
```

Replace `<IP_ADDRESS>` with the IP address of another node in the RoCE network. The output should show successful pings with low latency and jitter.

By following these steps, you should be able to set up RoCE on Proxmox 7.4 using ConnectX-3 cards without a switch.

------\\

Thanks in advance for the help!

Keith
 
I hadn't tried this one, but it is missing steps, which is why I asked for help. I played with IB today with little success; going back to RoCE, which is much more in my wheelhouse.
 
There's an issue with RoCE: it's not an option for the ConnectX-3 cards. NVIDIA has stopped including support for them in their newer driver builds. The latest build that does support them, referred to as LTS, only supports up to Ubuntu 20.04 and won't install. They promised a refresh, but it's about a year late. So, if you're looking at this card, note that it will work well as a cheap interconnect at 40 gig, but you're not (at the moment) going to get the advantage of RDMA/RoCE.
 
There's an issue with RoCE: it's not an option for the ConnectX-3 cards. NVIDIA has stopped including support for them in their newer driver builds. The latest build that does support them, referred to as LTS, only supports up to Ubuntu 20.04 and won't install. They promised a refresh, but it's about a year late. So, if you're looking at this card, note that it will work well as a cheap interconnect at 40 gig, but you're not (at the moment) going to get the advantage of RDMA/RoCE.

Hi, how about ConnectX-3 Pro cards? I take it those are still OK?
As far as I know the standard ones could do RoCE and the Pro cards support RoCEv2.
What are the options? I have the Pro and at the moment I'm looking to start using RoCE and NVMe over RDMA or TCP.
Thanks.
 
I played with IB today with little success; going back to RoCE, which is much more in my wheelhouse.
Sorry for the late reply. Just seeing this now.

What is the issue or the error that you are getting when you are trying to get IB up and running?

Keep in mind that if you are using IB, you also need to be running the OpenSM subnet manager as well.

If you've already installed all of the IB/RDMA (and related) packages, enabled them via modprobe, and assigned at least an IPv4 address to the interface, then at least one of the two nodes will need to run # systemctl start opensm. After that, you should be able to run # ibstat to confirm that the link is up (or just ping each other to see whether the link is up).
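For reference, a minimal sketch of that bring-up sequence, assuming the packages are already in place (the peer address is a placeholder for whatever IPv4 you assigned to the other node's IPoIB interface):

Code:
# on at least one of the two nodes, start the subnet manager
systemctl start opensm

# check that the port shows State: Active and Link layer: InfiniBand
ibstat

# basic reachability check over the IPoIB addresses
ping <peer_ipoib_address>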

I don't know if the Mellanox ConnectX-3 cards have it (I would assume that they should, but I am not 100% sure), but on my Mellanox ConnectX-4 cards, when the link is up, the LED that's on the board itself will change from amber to green, so you might need to pop around to take a peek at the back of your system to see whether the green light is on or not.

From there, if you want to make sure that the IB link is running at the expected line speed, you can use # ib_send_bw to verify that.

I get something like 96.9 Gbps (out of a possible 100 Gbps) on my Mellanox ConnectX-4 100 Gbps IB cards.
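In case it helps, this is roughly how that test is run between two nodes (the -d device name is an assumption; use whatever ibstat reports, e.g. mlx4_0 for ConnectX-3 or mlx5_0 for ConnectX-4):

Code:
# node A (server side): wait for the bandwidth test
ib_send_bw -d mlx5_0

# node B (client side): connect to node A and run the test
ib_send_bw -d mlx5_0 <node_A_address>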
 
So, if you're looking at this card, note that it will work well as a cheap interconnect at 40 gig, but you're not (at the moment) going to get the advantage of RDMA/RoCE.
TL;DR -- IB works if you have a subnet manager running, just not with RDMA or RoCE (so far, I think?)

So....a few things about this:

1) If you are trying to run NFSoRDMA on the Proxmox host itself, the Debian version of the IB driver (rdma) does NOT support NFSoRDMA out of the box.

If you try to port over and/or just use the procedure that is well defined for RHEL and its derivatives (like the now-defunct CentOS), you would execute the following (this is from my CentOS deployment notes):

Code:
host# chkconfig rdma on

host# vi /etc/sysconfig/nfs

change RPCNFSDARGS= to RPCNFSDARGS="--rdma=20049"

save,quit

host# vi /etc/rdma/rdma.conf

XPRTRDMA_LOAD=yes
SVCRDMA_LOAD=yes

client# vi /etc/rdma/rdma.conf
XPRTRDMA_LOAD=yes

host# vi /etc/exports

/path/to/export/folder *(rw,[options])

host# service nfslock stop; service nfs stop; service portmap stop; umount /proc/fs/nfsd; service portmap start; service nfs start; service nfslock start

client# vi /etc/fstab

server:/path/to/nfs /mnt/point nfs defaults,rdma,port=20049,[options] 0 0

This snippet from my deployment notes works for CentOS, but at the time when I wrote about it in this thread, it didn't work for Ubuntu, and it didn't work for Proxmox either, since Proxmox also runs on Debian.

(I did find this thread through: https://forum.proxmox.com/threads/nfsordma.137623/, so this might be useful for you. I haven't tried these instructions at home yet though.)

TL;DR -- other than the last link there (which I haven't tested) -- I haven't been able to get NFSoRDMA working directly on the Debian-based Proxmox host.

2) HOWEVER, you CAN get "vanilla" NFS working, albeit without RDMA/RoCE, and you can still get "decent"-ish performance, depending on what you're using for the storage media underneath.

(I am using all HDDs, so I am limited as it is anyways.)

i.e. I can't hit 100 Gbps speeds with my 100 Gbps NICs because I definitely don't have anywhere close to enough HDDs to handle 12.5 GB/s writes.

3) It also depends on what else you are trying to do with it.

I set one of the ports on my dual-port VPI card to LINK_TYPE = ETH (rather than IB), and then tried to set up a Linux network bridge (Ethernet) so that I could use it for my VMs; the results were quite disappointing (a rough sketch of that setup is below). I was getting ~23 Gbps in a VM <-> VM run of iperf, but that might be because, technically, a 100 Gbps IB link is 4x 25 Gbps lanes (QSFP28), and the bridge didn't know that or couldn't figure it out.

But ib_send_bw showed that I can hit 96.9 Gbps on the interface/link itself.
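For what it's worth, this is roughly what that bridge setup looked like; the mlxconfig device path, interface name, and addresses are placeholders rather than my exact values (and LINK_TYPE changes only take effect after a reboot):

Code:
# flip port 1 of the VPI card from IB to Ethernet (1 = IB, 2 = ETH); device path is an example
mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=2

# /etc/network/interfaces -- bridge that port for VM use (names/addresses are placeholders)
auto vmbr1
iface vmbr1 inet static
        address 10.10.10.1/24
        bridge-ports enp65s0f0np0
        bridge-stp off
        bridge-fd 0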

4) I subscribed to the linux-rdma mailing list for a little while, and they were able to confirm that the Debian version of opensm does NOT support virtualization/virtual functions; therefore, it wasn't like I could even create VFs to hand out to VMs and/or LXCs for them to use.

But that's more of a Debian thing than it is a Mellanox/IB/RDMA/RoCE thing.

(As a result, my compute nodes are now back to running CentOS 7.7.1908, bare metal, because I know that IB/RDMA works with that.)

5) Bonus, if you've made it this far --

I also tried setting up a CentOS VM running inside a Proxmox host, whereby the VM <-> host communication was handled via virtiofs, and then the CentOS VM exported the shared folder as an NFSoRDMA export.

That experiment failed with an NFS lock/stale-handle condition, as it was fighting with said virtiofs to ascertain the state of the shared folder/filesystem. Locally, the CentOS VM can modify the files on the host via virtiofs.

Remotely, however, it ran into lock contention issues, so that got abandoned as well.

Now I am trying to see whether I can use xcp-ng (because it is derived from RHEL), and whether I might have more luck with RDMA on that.

But that's a whole 'nother story, altogether.
 
the Debian version of the IB driver (rdma) does NOT support NFSoRDMA out of the box.
Are you referring to OFED, or the inbuilt drivers? Mellanox/NVIDIA don't support it, true, but I imagine it would be pretty trivial to apt install rdma-core and set the rdma switch in nfsd.conf. I have to imagine you tried that though...

I never had the need to squeeze out those extra 10%, so I never tried (plus I have mixed clients in terms of network interfaces).
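For anyone who wants to try that route, my understanding is that on current Debian the switch lives in the [nfsd] section of /etc/nfs.conf; an untested sketch, with the package names and values as I understand them (not verified on Proxmox):

Code:
# server side -- untested sketch
apt install rdma-core nfs-kernel-server

# /etc/nfs.conf -- ask nfsd to also listen on the RDMA transport
[nfsd]
rdma=y
rdma-port=20049

# restart the NFS server afterwards
systemctl restart nfs-server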
 
Are you referring to OFED, or the inbuilt drivers? Mellanox/NVIDIA don't support it, true, but I imagine it would be pretty trivial to apt install rdma-core and set the rdma switch in nfsd.conf. I have to imagine you tried that though...

I never had the need to squeeze out those extra 10%, so I never tried (plus I have mixed clients in terms of network interfaces).
I've only been testing with the "in-box" driver that you can get from Debian because, at least in 2019, Mellanox REMOVED the ability to use NFSoRDMA from their OFED driver. (Source: https://forums.developer.nvidia.com/t/why-is-nfsordma-in-centos-7-6-1810-limited-to-10-gbps/206720).

Since then, they've added this capability back in, but also since then, I don't trust that Mellanox (read: NVIDIA) won't take the feature away again, which is why I've only stuck with the "in-box" drivers (because the CentOS "in-box" driver supported NFSoRDMA and was therefore more reliable/stable).

Given that, you can configure the host for RDMA, but I have not been able to get my compute nodes to properly mount the NFS export using the RDMA protocol on port 20049 (which is the default port for NFSoRDMA).

Ubuntu can't mount it and, IIRC, neither can my CentOS compute nodes. (This is part of the reason why my VM <-> host communication is handled via virtiofs where/when possible, and via lxcfs for LXCs.)
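For context, the kind of client-side mount that keeps failing for me looks roughly like this (server name and paths are placeholders):

Code:
# client side -- try to mount the export over the RDMA transport (names/paths are placeholders)
mount -t nfs -o rdma,port=20049 server:/path/to/export /mnt/point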

As for the performance -- that can depend on a variety of factors. If your server is generally quite busy with other tasks, then having RDMA enabled means that it can pretty much bypass the entire network stack, which means that it doesn't need to hit the CPU quite as hard. So the performance gain MAY be quite substantial.

(For my HPC/CFD/FEA/CAE applications, within the application, I can see it use around 80 Gbps out of the possible 100 Gbps.)

Conversely, the other way to look at it is: why NOT use the RDMA offload? If there is a way for me to lighten the load on the CPU so that it can be doing other things instead, I can't see why one wouldn't want to leverage that, especially if a) it works, and b) the time it takes to deploy/enable the feature isn't too terribly involved.
 
Are you referring to OFED, or the inbuilt drivers? Mellanox/NVIDIA don't support it, true, but I imagine it would be pretty trivial to apt install rdma-core and set the rdma switch in nfsd.conf. I have to imagine you tried that though...

I never had the need to squeeze out those extra 10%, so I never tried (plus I have mixed clients in terms of network interfaces).
[Attached screenshot: Screenshot 2024-09-13 005516 edited.png]

So....you were wondering whether RDMA was worth it or not.

I haven't tested deploying RDMA yet, but this is on my new Proxmox cluster (AMD Ryzen 9 5950X, AMD Ryzen 9 5950X, and AMD Ryzen 9 7950X), where each node has 128 GB of RAM (DDR4, DDR4, and DDR5 respectively), and they all have a Mellanox ConnectX-4 (MCX456A-ECAT) dual-port VPI 100 Gbps IB NIC that's connected to my Mellanox 36-port externally managed 100 Gbps IB switch (MSB-7890), with my main Proxmox server running the opensm that comes with Debian.

All three nodes were deployed about 3 hours ago, running Proxmox 8.2.4 and Ceph Quincy (17.2.7).

The two 5950X systems each have a Silicon Power US70 1 TB NVMe 4.0 x4 SSD in them, whilst the 7950X has an Intel 670p 2 TB NVMe 3.0 (I think) x4 SSD in it.

I ran the same test again, but instead of a 10 GiB file, I tested it with a 50 GiB file, and that ended up with a 20.1 GB/s (160.8 Gbps) read back rather than the 20.4 GB/s (163.2 Gbps) of the 10 GiB run.

These tests were performed while I was installing a Windows 11 VM in the background (as I am prepping for a live-migration speed test).

*edit*
I set up an Ubuntu 24.04 VM with 64 GiB of RAM provisioned to it. With the 100 Gbps IB, I am able to live-migrate that VM to my other node in 10 seconds, at an average migration rate of 12.8 GiB/s (102.4 Gbps).

[Attached screenshot: Screenshot 2024-09-13 030605.png]

*edit #2*
Here is a live migration of a Win11 VM with 64 GiB of RAM provisioned to it. (16 GiB/s = 128 Gbps)

[Attached screenshot: Screenshot 2024-09-13 033226.png]
 
Thanks for the data, but I'm not certain how this addresses the question "was wondering whether RDMA was worth it."

parenthetically-
I set up an Ubuntu 24.04 VM with 64 GiB of RAM provisioned to it. With the 100 Gbps IB, I am able to live-migrate that VM to my other node in 10 seconds, at an average migration rate of 12.8 GiB/s (102.4 Gbps).
Compression. Unless you have a way to fully load the VM RAM with incompressible data, this isn't all that impressive.
 
Thanks for the data, but I'm not certain how this addresses the question "was wondering whether RDMA was worth it."

parenthetically-

Compression. Unless you have a way to fully load the VM RAM with incompressible data, this isn't all that impressive.
I don't think that there's much in the way of compression.

The VM itself has only seconds of uptime before I start the migration request, so it doesn't have time to compress what's in RAM.

Besides, we also know that in both instances, Windows, and DEFINITELY Ubuntu (or any Linux, really), will start caching frequently accessed files to RAM unless you add the vm.vfs_cache_pressure parameter in /etc/sysctl.conf to minimise and/or prevent it from doing so. In other words, caching to RAM (which still has a much faster interface than even 100 Gbps IB) still means that you have to push the data onto the other node over said 100 Gbps IB.
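(If anyone wants to experiment with that knob, it's a one-liner; the value below is only an example, and higher values make the kernel reclaim its dentry/inode caches more aggressively than the default of 100.)

Code:
# example only -- raise the cache reclaim pressure (kernel default is 100)
echo "vm.vfs_cache_pressure = 200" >> /etc/sysctl.conf
sysctl -p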

To the point or question about RDMA -- I found a source from Nvidia that talks about how to enable RDMA with Ceph, but then there are other forum posts here that say that Ceph does NOT support nor work with IB.

So it would appear that the jury is still out on this one.

Curiously enough though -- are you running HDR IB? 200 GbE at home? NDR IB/400 GbE? XDR IB/800 GbE? What are you running at home?
 
I don't think that there's much in the way of compression.
Compression in flight. If you just booted the guest, RAM is mostly zeros.
To the point or question about RDMA -- I found a source from Nvidia that talks about how to enable RDMA with Ceph, but then there are other forum posts here that say that Ceph does NOT support nor work with IB.
Yeah, I remember around 2018 there was discussion on the subject, and some folks managed to get it working, but it had to be compiled from source. The results were not encouraging at the time, but that has more to do with the inefficient mechanism Ceph uses for OSDs. When Crimson finally ships, that may be a different story.

As for IB... it depends on what you mean. Ceph works just fine over IPoIB; that is how I originally built my original service cluster back in 2014. If you meant RDMA, see above.
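(In case it helps anyone reading along: IPoIB just shows up as a regular ibX network interface, so a minimal sketch of that kind of setup, with hypothetical interface names, addresses, and subnets, is simply:)

Code:
# /etc/network/interfaces -- give the IPoIB interface an address (names/addresses hypothetical)
auto ib0
iface ib0 inet static
        address 10.20.20.1/24
        pre-up modprobe ib_ipoib

# /etc/ceph/ceph.conf -- point Ceph's networks at that subnet
[global]
public_network = 10.20.20.0/24
cluster_network = 10.20.20.0/24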

Curiously enough though -- are you running HDR IB?
Mix of QDR, FDR, and EDR. All Ethernet now; sunsetted all my IB years ago.

What are you running at home?
I don't bring work home. Ever heard the joke: "What does the gynecologist say to his wife on their wedding night? Honey, I can't even look at you for less than $300."
 
Compression in flight. If you just booted the guest, RAM is mostly zeros.
If that were the case, then live-migrating the VM when it only has 16 GB of RAM should be significantly faster than live-migrating it when there's 64 GB of RAM provisioned to it.

But that's not the case.


As for IB... it depends on what you mean. Ceph works just fine over IPoIB; that is how I originally built my original service cluster back in 2014. If you meant RDMA, see above.
i.e., using IB verbs for data transfers


Mix of QDR, FDR, and EDR. All Ethernet now; sunsetted all my IB years ago.
I've had mixed successes, but mostly failures, when it comes to 100 GbE (i.e. trying to share the 100 GbE connection between VMs/LXCs didn't come anywhere close to the 100 Gbps capability/capacity).

I don't know if there is a way to create a Linux ethernet bridge where it will be able to recognise that the 100 Gbps link actually consists of 4x 25 Gbps links.


I don't bring work home.
This isn't work for me.

This is just for fun/shits and giggles.
 
