Mellanox InfiniBand passthrough to an LXC container

luciandf

Member
Feb 20, 2020
Hello all,

I was wondering if it is possible to pass through (I know this is not the right term, since there is no real passthrough for LXCs) a Mellanox InfiniBand card to an LXC container and maintain the low-latency communication in fiber mode?

Or, to ask the question a different way: since the LXC shares the host kernel, will latency be affected when using the InfiniBand card inside the LXC?

Cheers
 
Or, to ask the question a different way: since the LXC shares the host kernel, will latency be affected when using the InfiniBand card inside the LXC?
Latency should not be affected. The question I see is how you want to interact with the card.
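For context, a rough sketch of what "interacting with the card" could look like: since the container shares the host kernel, the usual approach is to bind-mount the host's /dev/infiniband device nodes into the container rather than passing PCI hardware through. Something like the following in the container's /etc/pve/lxc/<vmid>.conf (the device major number 231 is an assumption here; verify it with ls -l /dev/infiniband, and note rdma_cm is usually a misc device, major 10):

Code:
# allow the InfiniBand character devices (verify majors with: ls -l /dev/infiniband)
lxc.cgroup2.devices.allow: c 231:* rwm
lxc.cgroup2.devices.allow: c 10:* rwm
# bind the host's device nodes into the container
lxc.mount.entry: /dev/infiniband dev/infiniband none bind,create=dir 0 0

The host keeps the IB drivers loaded; the container just gets access to the device nodes.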
 
Latency should not be affected. The question I see is how you want to interact with the card.
I admit I didn't think this far. What would the complication be? Will I not be able to "see" it inside the LXC container and use it?
 
Or, to ask the question a different way: since the LXC shares the host kernel, will latency be affected when using the InfiniBand card inside the LXC?
If you're not using RDMA it isn't likely to affect anything. Hell, if you ARE using RDMA it probably won't either; LXC isn't a virtualized environment in the first place.
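If you want numbers rather than guesses, the perftest tools can be run once against a bare-metal peer and once from inside the container to compare. A sketch, where the device name mlx5_0 and the peer hostname are placeholders:

Code:
# on a bare-metal peer, start the latency server
ib_send_lat -d mlx5_0
# inside the container, run the client against that peer
ib_send_lat -d mlx5_0 peer-node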

Out of curiosity, what is the use case that so concerns you about latency? What's on the other side?
 
If you're not using RDMA it isn't likely to affect anything. Hell, if you ARE using RDMA it probably won't either; LXC isn't a virtualized environment in the first place.

Out of curiosity, what is the use case that so concerns you about latency? What's on the other side?
We have a computing cluster that we maintain ourselves (we are researchers, so not much sysadmin experience!). We currently have 9 nodes. At the moment we run Proxmox on all of them, and on each we created a VM to which we passed through (not a word, I know!) the node's IB card. Jobs are sent to the nodes by a master node using Slurm, and as it is now it works. Parallel calculations work well because the nodes make use of the IB cards.

The downside of this setup (not critical, but it is nagging me) is that for the passthrough to work the VM cannot make use of the full RAM of the node (I know a little must be reserved by the host). I have to give the VM about 14 GB less than the total memory of 256 GB (currently; we will add more). If I don't use the passthrough, the VM can use the entire memory no problem.

This is not the only problem: using a passthrough means that the VM has sole access to the IB card, and I was hoping that in the future we could use the setup to organize some workshops (for HPC and research) where we reorganize the VMs on the spot to accommodate different users and uses.

I was hoping that using LXCs would be a better route. Also, if this were possible and the latency of the IB cards were not affected, would it be possible to use the same IB card for more than one LXC? I am out of my depth, I know, but we have no other support for this.

Cheers
 
I was hoping that using LXCs would be a better route. Also, if this were possible and the latency of the IB cards were not affected,
If the card is passed through, native performance is achieved, subject to the host kernel's limitations.

edit: PCI performance is impacted by any CPU limitations placed on the container, so best results are achieved if the container has access to all cores; using it in this manner raises the question of why bother virtualizing at all.
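To illustrate the CPU point, the relevant knobs in the container's config would be something like the following (illustrative values; leaving "cores" unset lets the container use all host cores):

Code:
# /etc/pve/lxc/<vmid>.conf -- illustrative values
# match the host's core count, or omit "cores" entirely to allow all host cores
cores: 32
# 0 = no CPU time cap (the default)
cpulimit: 0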

using a passthrough means that the VM has sole access to the IB card, and I was hoping that in the future we could use the setup to organize some workshops (for HPC and research) where we reorganize the VMs on the spot to accommodate different users and uses.
SO... it is not possible to bridge the HCAs if they're in IB mode; IB lacks much of the semantics required to route natively (requiring the use of an external subnet manager). HOWEVER, if you can swap the entire network to RoCE you will retain access to RDMA verbs AND the routability of Ethernet, opening the possibility of using RDMA behind a bridge. I have never done this before, but it should be possible (at least on paper).

Maybe someone here has done this/attempted this and has some light to shed.
 
If the card is passed through, native performance is achieved, subject to the host kernel's limitations.

edit: PCI performance is impacted by any CPU limitations placed on the container, so best results are achieved if the container has access to all cores; using it in this manner raises the question of why bother virtualizing at all.


SO... it is not possible to bridge the HCAs if they're in IB mode; IB lacks much of the semantics required to route natively (requiring the use of an external subnet manager). HOWEVER, if you can swap the entire network to RoCE you will retain access to RDMA verbs AND the routability of Ethernet, opening the possibility of using RDMA behind a bridge. I have never done this before, but it should be possible (at least on paper).

Maybe someone here has done this/attempted this and has some light to shed.
Sorry, I forgot to add that we have an IB switch with SM capabilities, so that is covered by the switch. Incidentally, the cards only work in IB mode. The switch does not have (or we have not found!) an Ethernet mode.

The reason we virtualise is the easy backups and the fact that we can experiment with different setups without affecting our main setup too much. I agree that virtualization seems like too much, but we had an older system that was installed on bare metal and it was a nightmare to install anything on it.
 
Sorry, I forgot to add that we have an IB switch with SM capabilities.
That was taken for granted ;)

Incidentally, the cards only work in IB mode.
Unless they are ConnectX-2, don't be so sure. It's possible they're VPI cards, which can operate in either mode; assuming Mellanox, you can use MFT to find out and to set the mode (https://network.nvidia.com/products/adapter-software/firmware-tools/). If they are VPI, however, you would need to replace the switch to take advantage.
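For example (a sketch; the /dev/mst device path comes from mst status and will differ per host), MFT's mlxconfig can query and change the port mode:

Code:
mst start        # load the MFT kernel modules
mst status       # note the device path, e.g. /dev/mst/mt4119_pciconf0
mlxconfig -d /dev/mst/mt4119_pciconf0 query | grep LINK_TYPE
# LINK_TYPE_P1 = 1 means IB, 2 means Ethernet; a change takes effect after a reboot
mlxconfig -d /dev/mst/mt4119_pciconf0 set LINK_TYPE_P1=2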
 
The cards are ConnectX-5. So you are saying that if we turn them to Ethernet mode the switch won't have a problem?

Also, why should we switch to Ethernet? I am a little bit confused, to be honest.
 
