CUDA in LXC Container

dlasher

Wondering if anyone has been able to make nvidia-smi/cuda/etc work in an LXC container.

Feels like I'm close... the configs are added correctly to the LXC container config:

lxc.mount.entry = /dev/nvidia0 dev/nvidia0 none bind,optional,create=file,uid=65534,gid=65534
lxc.mount.entry = /dev/nvidiactl dev/nvidiactl none bind,optional,create=file,uid=65534,gid=65534
lxc.mount.entry = /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file,uid=65534,gid=65534

devs are present, perms are set right:

root@plex1:~# ls -al /dev/nvidia*
crw-rw-rw- 1 nobody nogroup 241, 0 Oct 6 14:02 /dev/nvidia-uvm
crw-rw-rw- 1 nobody nogroup 195, 0 Oct 4 19:23 /dev/nvidia0
crw-rw-rw- 1 nobody nogroup 195, 255 Oct 4 19:23 /dev/nvidiactl
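
A device-cgroup whitelist is usually needed in addition to the bind mounts; a minimal sketch using the major numbers from the ls output above (195 for nvidia0/nvidiactl, 241 for nvidia-uvm; confirm against your own host):

lxc.cgroup.devices.allow = c 195:* rwm
lxc.cgroup.devices.allow = c 241:* rwm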


I followed these guides, among others:

http://sqream.com/setting-cuda-linux-containers-2/
https://stackoverflow.com/questions/25185405/using-gpu-from-a-docker-container

But nvidia-smi, hellocuda, deviceQuery, etc. all error out.

strace nvidia-smi -a

SNIP:
munmap(0x7fce2b7f6000, 4096) = 0
stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 255), ...}) = 0
open("/dev/nvidiactl", O_RDWR) = -1 EPERM (Operation not permitted)
open("/dev/nvidiactl", O_RDONLY) = -1 EPERM (Operation not permitted)
fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 3), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fce2b7f6000
write(1, "Failed to initialize NVML: Unkno"..., 41Failed to initialize NVML: Unknown Error


It even shows up correctly in dmesg inside the LXC container:
root@plex1:~# dmesg | grep -i nvidia
[ 23.727998] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:02.0/0000:04:00.1/sound/card0/input8
[ 23.728107] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:02.0/0000:04:00.1/sound/card0/input9
[ 23.728365] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:02.0/0000:04:00.1/sound/card0/input10
[ 23.728857] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:02.0/0000:04:00.1/sound/card0/input11
[ 1402.275320] nvidia 0000:04:00.0: enabling device (0006 -> 0007)
[ 1402.275884] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[ 1402.276120] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 367.44 Wed Aug 17 22:24:07 PDT 2016
[ 1402.353444] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 241
[ 1402.361508] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 367.44 Wed Aug 17 21:54:40 PDT 2016
[ 1402.372654] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[ 1402.377787] [drm] [nvidia-drm] [GPU ID 0x00000400] Unloading driver
[ 1402.417083] nvidia-modeset: Unloading
[ 1402.444136] nvidia-uvm: Unloaded the UVM driver in 8 mode
[ 1402.473375] nvidia-nvlink: Unregistered the Nvlink Core, major device number 242
[ 1508.371671] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[ 1508.371771] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 367.44 Wed Aug 17 22:24:07 PDT 2016
[ 1508.381148] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 367.44 Wed Aug 17 21:54:40 PDT 2016
[ 1508.391240] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[178725.828295] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 241


Also mentioned here: https://forum.proxmox.com/threads/kernel-sources-for-driver-module-compilation.27063/

Any suggestions?
 
open("/dev/nvidiactl", O_RDWR) = -1 EPERM (Operation not permitted)

From this I would suppose a permission problem...

As which user are you executing the nvidia-smi -a command inside the container?

root@plex1:~# ls -al /dev/nvidia*
crw-rw-rw- 1 nobody nogroup 241, 0 Oct 6 14:02 /dev/nvidia-uvm
crw-rw-rw- 1 nobody nogroup 195, 0 Oct 4 19:23 /dev/nvidia0
crw-rw-rw- 1 nobody nogroup 195, 255 Oct 4 19:23 /dev/nvidiactl

Is this the output of ls on the host or in the container?
 
Last night I actually fixed this issue; I'll post more details later. I read somewhere that it's important for the container template to be the same operating system as the host, so I rebuilt my container using Debian 8.6 instead of Ubuntu 16.04. The main difference is that the nvidia /dev/ nodes now have permissions of 'nobody:nogroup', whereas before they were 'root:root' and I couldn't change them. Now nvidia-smi works.
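
A quick way to confirm the container's userspace driver matches the host's kernel module (a sketch; output format and paths vary by distro):

# inside the container: the kernel module version always comes from the host
cat /proc/driver/nvidia/version
# the userspace libraries installed in the container must report the same version
nvidia-smi --query-gpu=driver_version --format=csv,noheader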

I'm now struggling to get ffmpeg to use the h264_nvenc codec; it's currently throwing segmentation faults (on both the host and in the container) and I'm trying to figure out what's missing. But at least the guest seems to have full access to the GPU.
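
For anyone testing along, a minimal NVENC smoke test looks roughly like this (input.mp4 is just a placeholder; it assumes an ffmpeg build with NVENC support compiled in):

# confirm the encoder is actually present in this build
ffmpeg -encoders 2>/dev/null | grep nvenc
# simple transcode through the hardware encoder
ffmpeg -i input.mp4 -c:v h264_nvenc -b:v 5M output.mp4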
 
open("/dev/nvidiactl", O_RDWR) = -1 EPERM (Operation not permitted)

from this I would suppose a permission problem ....

as which user are you executing the nvidia-smi -a command inside the container ?

root@plex1:~# ls -al /dev/nvidia*
crw-rw-rw- 1 nobody nogroup 241, 0 Oct 6 14:02 /dev/nvidia-uvm
crw-rw-rw- 1 nobody nogroup 195, 0 Oct 4 19:23 /dev/nvidia0
crw-rw-rw- 1 nobody nogroup 195, 255 Oct 4 19:23 /dev/nvidiactl

is this output of ls in the host or in the container ?

That was from the container.
 
Is your template the same OS as your host? I mean, is your container template based on Debian 8.6? Switching to that fixed the issue for me.
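
If it helps, roughly how to pull a matching Debian 8 template on the Proxmox host (the template filename and container ID below are examples; check pveam available for the exact name):

pveam update
pveam available | grep debian-8
pveam download local debian-8.0-standard_8.6-1_amd64.tar.gz
pct create 101 local:vztmpl/debian-8.0-standard_8.6-1_amd64.tar.gz --hostname plex1 --storage local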
 
