Problem with Stability / Containers using GPU

sirebral
Feb 12, 2022
Hey all!

I'm on the latest updates to the community edition of Proxmox. Everything had been running fine for quite some time, but now all of a sudden I'm having a major issue. I have three containers that need access to my NVIDIA GPU. They're configured like this:

Code:

arch: amd64
cores: 6
hostname: plex.cotb.local.lan
memory: 8048
mp0: /hybrid/media,mp=/media,mountoptions=noatime
net0: name=eth0,bridge=vmbr0,gw=172.16.25.1,hwaddr=C2:59:0C:A9:2E:74,ip=172.16.25.24/24,type=veth
onboot: 1
ostype: ubuntu
protection: 1
rootfs: flash:subvol-102-disk-1,mountoptions=noatime,size=70G
swap: 512
unprivileged: 0
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 509:* rwm
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir

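(For reference, the 195/509/226 values in the cgroup2 allow rules are the character-device major numbers of the NVIDIA and DRI nodes on this host; the check below is just how I'd confirm them, and as I understand it the nvidia-uvm major is allocated dynamically, so it can change between hosts or driver loads.)

Code:
# Confirm the major numbers referenced by the lxc.cgroup2.devices.allow lines;
# the two comma-separated numbers before the timestamp are "major, minor"
ls -l /dev/nvidia* /dev/dri/*
# e.g. crw-rw-rw- 1 root root  195, 0 ... /dev/nvidia0     -> major 195 (nvidia)
#      crw-rw-rw- 1 root root  509, 0 ... /dev/nvidia-uvm  -> major 509 (dynamic)
#      crw-rw---- 1 root video 226, 0 ... /dev/dri/card0   -> major 226 (drm)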

As mentioned, this was fine for quite some time. Now with the latest updates (actually the last 2 kernel releases) I'm having these problems.

1. When the host boots, the containers using the GPU don't start by themselves. They're set to auto-boot, yet they don't. I have to start them manually.
– Once started manually, they run fine and seem to have no issues accessing the card. The host and guests are running the latest NVIDIA Linux drivers; I upgraded again today to see if it made any difference, and it did not.

2. If I try to reboot or shut down the containers in question, the entire host dies. I lose all connectivity, and both the web UI and the local text console freeze completely. My only path to recovery is a hard reboot, which is not something I ever want to have to do.

The one error message I've been able to catch relates to a CPU pause error; it appeared a few times on the containers' console. I'm still picking up Linux, as my professional background (25 years) has been in Microsoft environments, so if there are particular logs you'd like to see, please let me know and I'll add them to the post.
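For example, I could grab something like this right after a host boot and post the output (assuming the device nodes and the pve-guests autostart unit are the right places to look):

Code:
# Do the device nodes the bind mounts expect already exist right after boot?
ls -l /dev/nvidia* /dev/dri 2>/dev/null

# What did the autostart service log while trying to bring the containers up?
journalctl -b -u pve-guests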

As a side note (hint?): if I don't touch the containers that have the graphics card in their config, everything works as expected with no weird issues, lockups, etc. That used to be true of all of my containers, and I'm really hoping to resolve this so I can get back to reliable systems.

Thanks in advance for anything you can offer!
 
Hi,

have you already checked the logs?
You can list the boots with
Code:
journalctl --list-boots


then you can print all the logs for the boot before the current one with this

Code:
journalctl -b -1

Just select the one where the system crashed. :)

You can also print only the logs with log level error or higher with:
Code:
journalctl -b -1 -p 0..3

Check if you see some errors in there.
 
Cool. My post showed back up! :)

So, since my last post I've pinned the kernel at 5.13.19-6-pve and the issues went away. I know the community edition is considered less stable, yet there seems to be some sort of bug in the latest releases. Would you still be interested in the syslog? I can archive it and post it; perhaps we can figure out the bug. I've seen chatter from other folks with similar problems, so perhaps we'd be doing the community as a whole a favor...?
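For anyone who wants to stay on the older kernel in the meantime, one way to pin it on a host that boots with GRUB looks roughly like this (the menu entry string below is only an example; take the exact one from your own grub.cfg):

Code:
# Find the exact menu entry for the kernel you want to keep booting
grep "menuentry '" /boot/grub/grub.cfg | grep 5.13.19-6-pve

# In /etc/default/grub, point GRUB_DEFAULT at that entry, e.g.
#   GRUB_DEFAULT="Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.13.19-6-pve"
# then regenerate the GRUB config
update-grub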
 
I'm going to walk that back a bit: I do still have one issue. The host no longer crashes when I reboot a guest container, but it still isn't auto-starting any container that has the NVIDIA configuration in it. So something is still not quite right. I also noticed that when I check the kernel modules, I see this:

Code:
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

This seems odd, as I have an active blacklist entry in modprobe.d. Does this just mean the modules are present but not loaded? I believe those are built into the kernel, yet I also see that Proxmox ships a pre-defined blacklist for nvidiafb, so perhaps the blacklist isn't working? It looks like this:

/etc/modprobe.d/pve-blacklist.conf:

Code:
# This file contains a list of modules which are not supported by Proxmox VE

# nidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
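If it matters, I can also run something like this to confirm whether those modules are actually loaded (my understanding is that lspci's "Kernel modules:" line only lists candidate drivers, while "Kernel driver in use:" is the one actually bound):

Code:
# Are nouveau or nvidiafb actually loaded?
lsmod | grep -Ei 'nouveau|nvidiafb'

# Does modprobe see the blacklist entries from /etc/modprobe.d?
modprobe --showconfig | grep -Ei 'blacklist (nouveau|nvidiafb)'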

More clues, or am I barking up the wrong tree?

Cheers!
 
