I figured that I might share this with the community here, regarding getting 100 Gbps InfiniBand running:
(If you are doing this for < 100 Gbps, YMMV, but the basic idea should be very similar.)
Background:
There are two ways of deploying this -- you can either deploy it on the Proxmox VE host system itself (depending on how you plan on using your system), or you can pass your Mellanox IB card through to ONE* of your VMs, which then has access to it -- so it's up to you how you might want to use this.
(*This assumes that for dual-port cards like the Mellanox ConnectX-5 dual-port VPI (MCX456A-ECAT), when you pass the card through, you pass the whole card through. I haven't tried passing through individual ports yet.)
See edit below.
My hardware setup:
Proxmox VE host:
HP Z420 workstation
Intel Xeon E5-2690 (8 cores, 16 threads, 2.9 GHz base, max all core turbo 3.3 GHz, max turbo 3.6 GHz)
128 GB DDR3-1600 ECC Registered RAM
1x Samsung 850 EVO 1 TB SATA 6 Gbps SSD
1x HGST 3 TB SATA 7200 rpm HDD
Mellanox ConnectX-5 dual port 100 Gbps IB NIC
IB Switch:
Mellanox MSB-7890 36-port 100 Gbps externally managed switch
VM Client:
CentOS 7.7.1908
On your Proxmox VE host:
Code:
# apt install -y infiniband-diags opensm ibutils rdma-core rdmacm-utils
# modprobe ib_umad
# modprobe ib_ipoib
This should enable your device.
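One note here: the MSB-7890 is an externally managed switch, so there is no embedded subnet manager, and something on the fabric has to run one before the ports will come up as Active. If you want the Proxmox VE host to do that job, something along these lines should work (this assumes the opensm service that the Debian package ships); ibstat (from infiniband-diags, installed above) then lets you confirm that the port shows State: Active / Physical state: LinkUp:
Code:
# systemctl enable --now opensm
# ibstat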
Check that an IP address can be defined on it:
Code:
# ip a
You should see it show up something like this:
Code:
3: ibs5f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 256
link/infiniband 00:00:10:87:fe:80:00:00:00:00:00:00:24:8a:07:03:00:2b:1e:ce brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
altname ibp4s0f0
inet 10.0.1.160/24 scope global ibs5f0
valid_lft forever preferred_lft forever
inet6 fe80::268a:703:2b:1ece/64 scope link
valid_lft forever preferred_lft forever
If you see that, then you can assign an IPv4 address directly via the GUI (or do whatever it is that you need to).
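If you'd rather do it from the shell than the GUI, the host uses ifupdown-style config, so a stanza in /etc/network/interfaces roughly like this should do the same thing (the interface name and address are just the ones from the output above -- adjust to your setup):
Code:
auto ibs5f0
iface ibs5f0 inet static
        address 10.0.1.160/24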
If you need to pass the entire card through to your VM, assuming that you have IOMMU enabled (cf. GPU passthrough for instructions on how to make sure that you have that enabled), you should be able to add the PCI Device to your VM.
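For the CLI-inclined, a rough sketch of that step: the first command finds the card's PCI address, and the second attaches the whole device to the VM (04:00 is just the address from my system, per the edit below -- substitute your own address and VM ID):
Code:
# lspci -nn | grep -i mellanox
# qm set <VMID> -hostpci0 04:00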
Here is my GRUB_CMDLINE_LINUX_DEFAULT line:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pcie_acs_override=downstream nofb nomodeset textonly video=vesafb:off video=efifb:off"
The video stuff may not be necessary if you aren't doing GPU passthrough. (My GTX 980 passthrough didn't work anyway, but that's beside the point here.)
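Also remember that after editing /etc/default/grub, the GRUB config has to be regenerated and the host rebooted before the new command line takes effect (this assumes your host actually boots via GRUB and not systemd-boot):
Code:
# update-grub
# reboot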
Here's my /etc/modules:
Code:
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
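# VFIO modules needed for PCI(e) passthrough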
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
My /etc/modprobe.d/blacklist.conf:
Code:
blacklist nvidiafb
blacklist nvidia
blacklist nouveau
blacklist radeon
My /etc/modprobe.d/iommu_unsafe_interrupts.conf:
Code:
options vfio_iommu_type1 allow_unsafe_interrupts=1
My /etc/modprobe.d/kvm.conf:
Code:
options kvm ignore_msrs=1
My /etc/modprobe.d/pve-blacklist.conf:
Code:
# This file contains a list of modules which are not supported by Proxmox VE
# nidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb
blacklist nvidia
blacklist nouveau
blacklist radeon
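After changing /etc/modules and the files under /etc/modprobe.d/, regenerate the initramfs and reboot so that everything takes effect at boot:
Code:
# update-initramfs -u -k all
# reboot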
When I run the ib_send_bw benchmark, these are the results that I get:

Peak of around 96.27 Gbps out of 100 Gbps -- not bad.
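For reference, ib_send_bw is part of the perftest package and runs as a server/client pair -- something roughly like this, with the first command on one node and the second on the other (10.0.1.160 is just the example IPoIB address from earlier; --report_gbits makes it report Gbps instead of MB/s):
Code:
# ib_send_bw -a --report_gbits
# ib_send_bw -a --report_gbits 10.0.1.160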
In reality though, if you don't have RDMA enabled, it is unlikely that you're going to hit that with any kind of non-NVMe storage device, and even with NVMe, actually hitting 100 Gbps is quite a tall order.
Sidebar:
I DID try to see if I could "share" the 100 Gbps bandwidth between VMs using the virtio network adapter (which Windows sees as a 10 Gbps NIC), but it didn't really seem to make a difference whether the traffic was going through the IB NIC or not.
iperf3 going from Windows -> CentOS was about 6.5 Gbps (and it was the same regardless of whether the virtio NIC was tied to the network bridge using the IB card or not), and going from CentOS -> Windows was about 8.5 Gbps.
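For anyone wanting to reproduce that, those are plain iperf3 runs -- the first command on the receiving VM, the second on the sending VM, pointed at the receiver's IP:
Code:
# iperf3 -s
# iperf3 -c <receiving VM's IP>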
I didn't test any of the more "advanced" features like SR-IOV, iSER, RDMA, RoCE, etc.

I suspect that if you can get those to work in the OS of your choice, then with PCIe passthrough there is little reason why they wouldn't work in your VM as well.
*edit*
I DID get around to testing passing through only ONE of the two ports on my dual-port card, and it DOES work.
You will need to edit the /etc/pve/qemu-server/<<VMID>>.conf file so that instead of passing through, for example, 04:00 (i.e. the entire PCIe device), you pass through the specific port's PCIe device ID (e.g. 04:00.0). That way you can pass one port through to one VM and the second port through to another VM.
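Purely as an illustration (assuming the card sits at 04:00 and its two ports show up as PCI functions .0 and .1), the relevant line in each VM's config file would then look something like this:
Code:
# in the first VM's <<VMID>>.conf -- gets port 1 only:
hostpci0: 04:00.0

# in the second VM's <<VMID>>.conf -- gets port 2 only:
hostpci0: 04:00.1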
Interestingly enough though, and I can't tell if this is a limitation of the SSD that I was using or what, even with that I wasn't able to copy a 10 GB 7-Zip file back and forth between the two VMs (each connected, via its passed-through port, to my 36-port 100 Gbps IB switch) at a rate any faster than about 2.5 Gbps or so.

That was part of the reason why I ran the ib_send_bw benchmark/test: plain file copies weren't able to show the system using the 100 Gbps line rate, and neither was iperf3 (presumably because they have to go through the entire network stack rather than RDMA).
But... bottom line... it DOES work.
Yay!!!