[SOLVED] Sharing GPU to LXC container - Failed to initialize NVML: Unknown Error

ntblade

Hi all,
I'm trying to share a GPU with a Debian Bullseye (11) container. I installed the NVIDIA driver using NVIDIA-Linux-x86_64-390.144.run on the Proxmox host and then in the container.
Host:
Code:
nvidia-smi
Sat Oct 30 22:27:21 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.144                Driver Version: 390.144                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro 600          Off  | 00000000:05:00.0 Off |                  N/A |
| 30%   62C    P0    N/A /  N/A |      0MiB /   963MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

/etc/modules-load.d/modules.conf:
Code:
# Nvidia modules
nvidia
nvidia_uvm

/etc/udev/rules.d/70-nvidia.rules:
Code:
KERNEL=="nvidia", RUN+="/bin/bash -c '/usr/bin/nvidia-smi -L && /bin/chmod 666 /dev/nvidia*'"
KERNEL=="nvidia_uvm", RUN+="/bin/bash -c '/usr/bin/nvidia-modprobe -c0 -u && /bin/chmod 0666 /dev/nvidia-uvm*'"

Here's the device list and container config on the host:
Code:
ls -la /dev/nvid*
crw-rw-rw- 1 root root 195,   0 Oct 30 22:16 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Oct 30 22:16 /dev/nvidiactl
crw-rw-rw- 1 root root 239,   0 Oct 30 22:16 /dev/nvidia-uvm
crw-rw-rw- 1 root root 239,   1 Oct 30 22:16 /dev/nvidia-uvm-tools

lxc.cgroup.devices.allow: c 195:* rw
lxc.cgroup.devices.allow: c 239:* rw
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
And on the container:
Code:
ls -la /dev/nvidia*
crw-rw-rw- 1 root root 239,   0 Oct 30 21:16 /dev/nvidia-uvm
crw-rw-rw- 1 root root 239,   1 Oct 30 21:16 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Oct 30 21:16 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Oct 30 21:16 /dev/nvidiactl

However, when I run nvidia-smi on the container:
Code:
nvidia-smi
Failed to initialize NVML: Unknown Error
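
To rule out a mismatch between the device major numbers and the allow rules, here is a minimal sketch that derives the `lxc.cgroup.devices.allow` lines from an `ls -l` listing like the one above. The function name `nvidia_allow_lines` is hypothetical and not part of any NVIDIA or LXC tooling. It assumes the standard `ls -l` column layout for character devices. The nvidia-uvm major (239 here) is assigned dynamically, so it may differ on another host or after a reboot.

```shell
#!/bin/sh
# Hypothetical helper: read `ls -l /dev/nvidia*` output on stdin and print one
# lxc.cgroup.devices.allow line per distinct character-device major number.
nvidia_allow_lines() {
    awk '
        /^c/ {                      # only character-device entries
            major = $5              # fifth column holds "MAJOR,"
            sub(/,$/, "", major)    # strip the trailing comma
            if (!(major in seen)) { # emit each major number once
                seen[major] = 1
                printf "lxc.cgroup.devices.allow: c %s:* rw\n", major
            }
        }'
}

# Usage on the host (assumes the device nodes already exist):
#   ls -l /dev/nvidia* | nvidia_allow_lines
```

If the printed lines differ from what is in the container config, the major numbers have changed and the allow rules need updating.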

Is anyone able to help please?

Thanks

NTB
 