Following the upgrade from PVE8 to PVE9, I can no longer start my LXCs that use the NVIDIA Container Toolkit.
I have a variety of containers that attach to the same GPU for things like transcoding, rendering, stable diffusion, LLMs, etc. -- so passthrough/dedicated assignment is a nuisance, and the container toolkit provides an elegant way to share the GPU between containers seamlessly and with low administrative overhead.
Some details:
- was previously on the optional 6.14 kernel on PVE8, no issues there
- using NVIDIA Driver 570.181, upgraded from 570.172.08 during troubleshooting
- using NVIDIA Container Toolkit 1.18.0~rc.2, upgraded from 1.17.8 during troubleshooting
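For reference, the version details above were pulled straight from the host with standard commands (output is obviously machine-specific):

uname -r
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-container-cli --version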
The relevant NVIDIA entries in the container's LXC config:

lxc.hook.pre-start: sh -c '[ ! -f /dev/nvidia-uvm ] && /usr/bin/nvidia-modprobe -c0 -u'
lxc.environment: NVIDIA_VISIBLE_DEVICES=all
lxc.environment: NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
lxc.hook.mount: /usr/share/lxc/hooks/nvidia
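On the toolkit side, my understanding is that the usual LXC arrangement is to let LXC manage device cgroups and disable the toolkit's own cgroup handling. A sketch of the relevant section of /etc/nvidia-container-runtime/config.toml, consistent with the flags the hook passes below (no-cgroups and ldconfig are the stock keys; everything else left at defaults):

[nvidia-container-cli]
no-cgroups = true
ldconfig = "@/usr/sbin/ldconfig"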
In my LXC debug logs during container startup I see entries like the following:
- DEBUG utils - ../src/lxc/utils.c:run_buffer:560 - Script exec /usr/share/lxc/hooks/nvidia 117 lxc mount produced output: + exec nvidia-container-cli --user configure --no-cgroups --ldconfig=@/usr/sbin/ldconfig --device=all --compute --utility --video /usr/lib/x86_64-linux-gnu/lxc/rootfs
- DEBUG utils - ../src/lxc/utils.c:run_buffer:560 - Script exec /usr/share/lxc/hooks/nvidia 117 lxc mount produced output: nvidia-container-cli: mount error: open failed: /usr/lib/x86_64-linux-gnu/lxc/rootfs/proc/1/ns/mnt: permission denied
- DEBUG utils - ../src/lxc/utils.c:run_buffer:560 - Script exec /usr/share/lxc/hooks/nvidia 117 lxc mount produced output: + exec nvidia-container-cli --user configure --ldconfig=@/usr/sbin/ldconfig --device=all --compute --utility --video /usr/lib/x86_64-linux-gnu/lxc/rootfs
- DEBUG utils - ../src/lxc/utils.c:run_buffer:560 - Script exec /usr/share/lxc/hooks/nvidia 117 lxc mount produced output: nvidia-container-cli: container error: failed to get device cgroup mount path: relative path in mount prefix: /../../..
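The second error ("relative path in mount prefix") reads like the CLI failing to resolve the device cgroup mount path under PVE9's pure cgroup2 layout. These are the plain diagnostics I've been running on the host to inspect that layout (standard tools, nothing NVIDIA-specific):

findmnt -t cgroup2
cat /proc/self/cgroup   # on a pure cgroup2 host this is a single "0::/..." line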
The '/usr/share/lxc/hooks/nvidia' hook can only be run in the unprivileged context.

I am at something of a loss here, and hoping not to have to revert to PVE8. Is there a convenient way to embed appropriate cgroup2 support into the LXC so it remains usable with the NVIDIA Container Toolkit?
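To be concrete about the alternative: my understanding is that the manual fallback would be explicit per-container device rules along these lines, which is exactly the per-container bookkeeping I'm hoping to avoid (sketch only; major 195 is the usual NVIDIA character device major, while the nvidia-uvm major is allocated dynamically -- check ls -l /dev/nvidia* on your own host):

# 195 covers /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-modeset
lxc.cgroup2.devices.allow: c 195:* rwm
# 508 is only an example; /dev/nvidia-uvm's major can change across boots
lxc.cgroup2.devices.allow: c 508:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file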