CUDA in LXC Container

dlasher

Wondering if anyone has been able to make nvidia-smi/cuda/etc work in an LXC container.

Feels like I'm close... the configs are added correctly to the LXC container config:

lxc.mount.entry = /dev/nvidia0 dev/nvidia0 none bind,optional,create=file,uid=65534,gid=65534
lxc.mount.entry = /dev/nvidiactl dev/nvidiactl none bind,optional,create=file,uid=65534,gid=65534
lxc.mount.entry = /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file,uid=65534,gid=65534

devs are present, perms are set right:

root@plex1:~# ls -al /dev/nvidia*
crw-rw-rw- 1 nobody nogroup 241, 0 Oct 6 14:02 /dev/nvidia-uvm
crw-rw-rw- 1 nobody nogroup 195, 0 Oct 4 19:23 /dev/nvidia0
crw-rw-rw- 1 nobody nogroup 195, 255 Oct 4 19:23 /dev/nvidiactl
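
A device-cgroup whitelist is usually needed in addition to the bind mounts; a minimal sketch using the major numbers from the ls output above (195 for nvidia0/nvidiactl, 241 for nvidia-uvm; confirm against your own host):

lxc.cgroup.devices.allow = c 195:* rwm
lxc.cgroup.devices.allow = c 241:* rwm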


I followed these guides, among others:

http://sqream.com/setting-cuda-linux-containers-2/
https://stackoverflow.com/questions/25185405/using-gpu-from-a-docker-container

But nvidia-smi, hellocuda, deviceQuery, etc. all error out.

strace nvidia-smi -a

SNIP:
munmap(0x7fce2b7f6000, 4096) = 0
stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 255), ...}) = 0
open("/dev/nvidiactl", O_RDWR) = -1 EPERM (Operation not permitted)
open("/dev/nvidiactl", O_RDONLY) = -1 EPERM (Operation not permitted)
fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 3), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fce2b7f6000
write(1, "Failed to initialize NVML: Unkno"..., 41Failed to initialize NVML: Unknown Error


It even shows up correctly in dmesg inside the LXC container:
root@plex1:~# dmesg | grep -i nvidia
[ 23.727998] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:02.0/0000:04:00.1/sound/card0/input8
[ 23.728107] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:02.0/0000:04:00.1/sound/card0/input9
[ 23.728365] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:02.0/0000:04:00.1/sound/card0/input10
[ 23.728857] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:02.0/0000:04:00.1/sound/card0/input11
[ 1402.275320] nvidia 0000:04:00.0: enabling device (0006 -> 0007)
[ 1402.275884] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[ 1402.276120] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 367.44 Wed Aug 17 22:24:07 PDT 2016
[ 1402.353444] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 241
[ 1402.361508] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 367.44 Wed Aug 17 21:54:40 PDT 2016
[ 1402.372654] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[ 1402.377787] [drm] [nvidia-drm] [GPU ID 0x00000400] Unloading driver
[ 1402.417083] nvidia-modeset: Unloading
[ 1402.444136] nvidia-uvm: Unloaded the UVM driver in 8 mode
[ 1402.473375] nvidia-nvlink: Unregistered the Nvlink Core, major device number 242
[ 1508.371671] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[ 1508.371771] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 367.44 Wed Aug 17 22:24:07 PDT 2016
[ 1508.381148] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 367.44 Wed Aug 17 21:54:40 PDT 2016
[ 1508.391240] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[178725.828295] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 241


Also mentioned here: https://forum.proxmox.com/threads/kernel-sources-for-driver-module-compilation.27063/

Any suggestions?
 
open("/dev/nvidiactl", O_RDWR) = -1 EPERM (Operation not permitted)

From this I would suppose a permission problem...

As which user are you executing the nvidia-smi -a command inside the container?

root@plex1:~# ls -al /dev/nvidia*
crw-rw-rw- 1 nobody nogroup 241, 0 Oct 6 14:02 /dev/nvidia-uvm
crw-rw-rw- 1 nobody nogroup 195, 0 Oct 4 19:23 /dev/nvidia0
crw-rw-rw- 1 nobody nogroup 195, 255 Oct 4 19:23 /dev/nvidiactl

Is this the output of ls on the host or in the container?
 
Last night I actually fixed this issue; I'll post more details later. I read somewhere that it's important for the container template to be the same operating system as the host, so I rebuilt my container using Debian 8.6 instead of Ubuntu 16.04. The main difference is that the nvidia /dev/ nodes now have permissions of 'nobody:nogroup', whereas before they were 'root:root' and I couldn't change them. Now nvidia-smi works.
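
A quick way to confirm the container's userspace driver matches the host's kernel module (a sketch; output format and paths vary by distro):

# inside the container: the kernel module version always comes from the host
cat /proc/driver/nvidia/version
# the userspace libraries installed in the container must report the same version
nvidia-smi --query-gpu=driver_version --format=csv,noheader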

I'm now struggling to get ffmpeg to use the h264_nvenc codec; it's currently throwing segmentation faults (on both the host and in the container) and I'm trying to figure out what's missing. But at least the guest seems to have full access to the GPU.
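
For anyone testing along, a minimal NVENC smoke test looks roughly like this (input.mp4 is just a placeholder; it assumes an ffmpeg build with NVENC support compiled in):

# confirm the encoder is actually present in this build
ffmpeg -encoders 2>/dev/null | grep nvenc
# simple transcode through the hardware encoder
ffmpeg -i input.mp4 -c:v h264_nvenc -b:v 5M output.mp4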
 
open("/dev/nvidiactl", O_RDWR) = -1 EPERM (Operation not permitted)

from this I would suppose a permission problem ....

as which user are you executing the nvidia-smi -a command inside the container ?

root@plex1:~# ls -al /dev/nvidia*
crw-rw-rw- 1 nobody nogroup 241, 0 Oct 6 14:02 /dev/nvidia-uvm
crw-rw-rw- 1 nobody nogroup 195, 0 Oct 4 19:23 /dev/nvidia0
crw-rw-rw- 1 nobody nogroup 195, 255 Oct 4 19:23 /dev/nvidiactl

is this output of ls in the host or in the container ?

That was from the container.
 
Is your template the same OS as your host? I mean, is your container template based on Debian 8.6? Switching to that fixed the issue for me.
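
If it helps, roughly how to pull a matching Debian 8 template on the Proxmox host (the template filename and container ID below are examples; check pveam available for the exact name):

pveam update
pveam available | grep debian-8
pveam download local debian-8.0-standard_8.6-1_amd64.tar.gz
pct create 101 local:vztmpl/debian-8.0-standard_8.6-1_amd64.tar.gz --hostname plex1 --storage local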
 
