giving LXC direct GPU access for host/lxc CUDA+vGPU?

zenowl77

Okay, so I have set up a merged driver with both KVM/vGPU and standard features, so the nvidia-modeset, nvidia-uvm, and nvidia-uvm-tools modules are available on the Proxmox host. I have tried every guide, tutorial, and help post I can find online, for every type of GPU, to make this work. The best result I have gotten so far is that the processes showed up in the host's nvidia-smi list, but they errored out with an unknown CUDA error...

The GPU is an NVIDIA Tesla P4. I have tried the 17.2/550.90.05 and 16.4/535.161.05 merged drivers and am currently on 16.4, since 17+ causes issues in Linux VMs with vGPU: nvidia-smi claims a driver mismatch on every driver version I try. (Windows is fine on 535, but I don't want to be stuck with Windows VMs only.)

I am guessing this might be a permissions issue, possibly solvable with lxc.idmap: entries and/or permission corrections on the host, but everything I have tried so far hasn't worked...
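For example, this is the sort of thing I mean by permission corrections (the GID values below are illustrative only; the real render/video GIDs come from /etc/group on the host, and the idmap lines only apply to unprivileged containers):

Code:
# on the Proxmox host: check who may open the device nodes
ls -al /dev/nvidia* /dev/dri/*

# example idmap for an unprivileged container, passing the container's
# render group (assumed to be GID 104 here) straight through to the host's:
lxc.idmap: u 0 100000 65536
lxc.idmap: g 0 100000 104
lxc.idmap: g 104 104 1
lxc.idmap: g 105 100105 65431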

Does anyone have LXC CUDA/encoding in Docker/Jellyfin working at the same time as vGPU on an NVIDIA GPU? What did you have to do to get it working?
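(For reference, the one Docker-side change every guide I followed agrees on is telling the NVIDIA container runtime not to manage cgroups itself, since the device cgroup is already opened up from the LXC config. Sketch assuming the stock nvidia-container-toolkit config path:)

Code:
# /etc/nvidia-container-runtime/config.toml, inside the docker LXC
[nvidia-container-cli]
no-cgroups = true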

Docker lxc.conf:
Code:
arch: amd64
cores: 16
features: mknod=1,nesting=1
hostname: docker
memory: 8192
mp1: /mnt/10TB-2,mp=/mnt/10TB-2
mp2: /mnt/8TB,mp=/mnt/8TB
nameserver: 10.0.0.1
net0: name=eth0,bridge=vmbr0,gw=10.0.0.1,hwaddr=BC:24:11:15:95:AD,ip=10.0.0.220/24,type=veth
onboot: 1
ostype: debian
rootfs: local-lvm:vm-118-disk-0,size=640G
swap: 0
tags: proxmox-helper-scripts
lxc.cgroup2.devices.allow: a
lxc.cap.drop: 
lxc.cgroup2.devices.allow: c 188:* rwm
lxc.cgroup2.devices.allow: c 189:* rwm
lxc.cgroup2.devices.allow: c 29:0 rwm
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
lxc.mount.entry: /dev/net dev/net none bind,create=dir
lxc.hook.pre-start: sh -c '[ -e /dev/nvidia0 ] || /usr/bin/nvidia-modprobe -c0 -u'
lxc.environment: NVIDIA_VISIBLE_DEVICES=all
lxc.environment: NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
lxc.cgroup2.devices.allow: c 10:* rwm
lxc.cgroup2.devices.allow: c 508:* rwm
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 506:* rwm
lxc.cgroup2.devices.allow: c 507:* rwm
lxc.cgroup2.devices.allow: c 510:* rwm
lxc.cgroup2.devices.allow: c 128:* rwm
lxc.cgroup2.devices.allow: c 129:* rwm
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
lxc.mount.entry: /dev/nvram dev/nvram none bind,optional,create=file
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
lxc.mount.entry: /dev/dri/renderD129 dev/dri/renderD129 none bind,optional,create=file
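The majors in the 500s in the allow list above (nvidia-uvm and friends) are assigned dynamically, so they can change between reboots or driver reloads; if they drift away from the devices.allow lines, the container loses the GPU, which would match the "randomly stops seeing it" behaviour below. A quick host-side check (output values are examples):

Code:
# compare the real majors against the lxc.cgroup2.devices.allow lines
ls -al /dev/nvidia* /dev/dri
grep nvidia /proc/devices
# e.g. /dev/nvidia0 stays at c 195:0, but /dev/nvidia-uvm might be
# c 508:0 today and c 511:0 after the next reboot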

The docker LXC sees the GPU with nvidia-smi (although it randomly stops seeing it and has issues, etc.):


Code:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       Off | 00000000:17:00.0 Off |                  Off |
| N/A   33C    P0              22W /  75W |      0MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
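For anyone trying to reproduce the "unknown cuda error", the quickest test I know of is a throwaway CUDA container (the image tag is just an example; any CUDA base image should do):

Code:
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
# or, if docker is set up with the nvidia runtime via /etc/docker/daemon.json:
docker run --rm --runtime=nvidia nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi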
 
Hi,
I'm currently having the same issue. According to all the manuals and NVIDIA documentation, CUDA should "just work", but that's not really the case.
I've narrowed the issue down to the missing libcuda1 package: it normally ships with the distro driver package, but installing that breaks the vGPU driver.
I can install CUDA 12.4 with 550.127.06 on the host and in LXC containers, and nvcc works, but I can't install libcuda1, which holds libcuda.so.1.
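The workaround I'm considering (untested, and the file name/version are just examples) is to pull libcuda straight out of the same merged .run driver the host uses, so the repo packages never enter the picture:

Code:
# extract the installer payload without installing anything
sh NVIDIA-Linux-x86_64-550.127.06-merged.run --extract-only
cd NVIDIA-Linux-x86_64-550.127.06-merged
# copy the CUDA driver library into the container, then let ldconfig
# create the libcuda.so.1 symlink from the library's SONAME
cp libcuda.so.550.127.06 /usr/lib/x86_64-linux-gnu/
ldconfig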

By default, when you try to install libcuda from the network repo, it will pull in the 570.xx packages. The 550 packages are there too, but I'm failing to install them:

Code:
The following packages have unmet dependencies:
 libcuda1 : Depends: nvidia-alternative (= 550.127.05-1)
            Depends: libnvidia-ptxjitcompiler1 (= 550.127.05-1) but 570.86.15-1 is to be installed
            Depends: libnvidia-nvvm4 (= 550.127.05-1) but 570.86.15-1 is to be installed
            Recommends: nvidia-kernel-dkms (= 550.127.05-1) but it is not going to be installed or
                        nvidia-kernel-550.127.05
            Recommends: nvidia-smi
            Recommends: libnvidia-cfg1 (= 550.127.05-1) but it is not going to be installed
            Recommends: nvidia-persistenced but it is not going to be installed
            Recommends: libcuda1-i386 (= 550.127.05-1) but it is not installable
E: Unable to correct problems, you have held broken packages.
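One way around this might be pinning everything NVIDIA to the 550 build before installing, so apt stops preferring 570 (my assumption; package names and versions are taken from the error above). In /etc/apt/preferences.d/nvidia-550:

Code:
Explanation: prefer the 550 build over the newer 570 packages
Package: *nvidia*
Pin: version 550.127.05-1
Pin-Priority: 1001

Package: libcuda*
Pin: version 550.127.05-1
Pin-Priority: 1001

Then install with explicit versions:

Code:
apt update
apt install libcuda1=550.127.05-1 libnvidia-ptxjitcompiler1=550.127.05-1 libnvidia-nvvm4=550.127.05-1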
 