Hey all!
I am on the latest updates to the community edition of Proxmox. Everything had been running fine for quite some time, but now, all of a sudden, I'm having a major issue. I have 3 containers that need access to my NVIDIA GPU. They're configured like this:
-----
arch: amd64
cores: 6
hostname: plex.cotb.local.lan
memory: 8048
mp0: /hybrid/media,mp=/media,mountoptions=noatime
net0: name=eth0,bridge=vmbr0,gw=172.16.25.1,hwaddr=C2:59:0C:A9:2E:74,ip=172.16.25.24/24,type=veth
onboot: 1
ostype: ubuntu
protection: 1
rootfs: flash:subvol-102-disk-1,mountoptions=noatime,size=70G
swap: 512
unprivileged: 0
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 509:* rwm
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
-----
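From what I've read, major 195 in the cgroup lines covers the NVIDIA character devices and 226 covers /dev/dri, while the nvidia-uvm major (509 here) is assigned dynamically, so that value is specific to my host and can change across driver or kernel updates. The current values can be checked on the host with:
-----
# List the GPU device nodes on the Proxmox host; for character devices,
# ls prints "major, minor" where the file size would normally appear.
ls -al /dev/nvidia* /dev/dri
-----
(If that dynamic major ever drifts away from 509, my understanding is that the allow rule would quietly stop matching.)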
As mentioned, this was fine for quite some time. Now, with the latest updates (specifically, the last 2 kernel releases), I'm having these problems:
1. When the host boots, the containers using the GPU don't start on their own. They're set to start at boot (onboot: 1), yet they don't, and I have to start them manually (see the command sketch after this list).
– They run fine and seem to have no issues accessing the card after the manual start. The host and guest are running the latest NVIDIA Linux drivers; I upgraded again today to see if that made any difference, and it did not.
2. If I try to reboot or shut down the containers in question, the entire host dies: I lose all connectivity, and both the web UI and the text console freeze completely. My only recovery option is a hard reset, which is not something I ever want to do.
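If output from either case would help, this is roughly what I plan to run; I'm assuming VMID 102 here, going by the rootfs line in the config above:
-----
# Case 1, after a fresh host boot: see why the onboot guests didn't start.
# pve-guests is the Proxmox service that starts containers marked onboot.
journalctl -b -u pve-guests
pct start 102                 # manual start, watching for errors

# Case 2, before reproducing the shutdown hang: follow the kernel log live
# in case anything is printed before the host locks up.
dmesg -wH &
pct shutdown 102
-----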
The one error message I've been able to see relates to a "CPU pause" error, which appeared in the containers' console a few times. I'm still picking up Linux; my professional background (25 years) is in Microsoft environments. So if there are particular logs you'd like to see, please let me know and I'll add them to the post.
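Since the box freezes hard, I'm guessing the useful kernel messages may not survive the reset unless the journal is persistent; from what I've read, with the default Storage=auto setting, creating the directory below is enough to enable that:
-----
# Enable persistent journaling so kernel logs survive a hard reset.
mkdir -p /var/log/journal
systemctl restart systemd-journald

# After the next crash and reset: kernel messages from the previous boot.
journalctl -k -b -1
-----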
As a side note (hint?): if I don't touch the containers that have the graphics card in their config, everything works as expected with no lockups or other weirdness. That used to be true of all of my containers, and I'm really hoping to resolve this so I can get back to a reliable system.
Thanks in advance for anything you can offer!