Hello Proxmox Community,
I'm seeking assistance with a persistent issue getting NVIDIA RTX A4000 GPU passthrough to work reliably for an LXC container (Ubuntu 24.04) intended for Docker workloads on my Proxmox VE host. The core problem appears to be inconsistent creation of all the necessary /dev/nvidia* and /dev/dri/* device nodes on the Proxmox host, most likely related to udev rule processing or nvidia-modprobe behavior.
My Goal:
Successfully pass my NVIDIA RTX A4000 through to a privileged LXC container for Plex hardware transcoding and/or AI workloads (Ollama, etc.).
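For context, the end state I'm aiming for inside the container (once the host-side device nodes are reliable) is the usual Docker + NVIDIA Container Toolkit setup, roughly along these lines (illustrative only, taken from the Ollama Docker instructions, and not yet working for me):
<pre> # inside the LXC, after installing Docker and the NVIDIA Container Toolkit
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama </pre>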
System Configuration:
- Proxmox VE Version: Proxmox VE 8.4.1
- Kernel Version: 6.8.12-10-pve
- NVIDIA GPU: NVIDIA RTX A4000
- NVIDIA Driver Version (Host): 535.216.01 initially; the diagnostic output further down was captured with 550.54.14 installed
- Driver Installation Method: currently installing via the official NVIDIA .run file (e.g., NVIDIA-Linux-x86_64-535.216.01.run) with DKMS, after previous unsuccessful attempts with apt (the exact invocation is sketched just below this list)
- LXC OS: Ubuntu 24.04
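For completeness, the driver install command I have been using on the host is just the standard .run invocation with DKMS (sketch from memory, so treat the exact flags as approximate):
<pre> # on the Proxmox host, with the pve-headers package for the running kernel installed
chmod +x NVIDIA-Linux-x86_64-535.216.01.run
./NVIDIA-Linux-x86_64-535.216.01.run --dkms </pre>
LXC container configuration (/etc/pve/lxc/103.conf):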
<pre> arch: amd64
cores: 4
dev0: /dev/nvidia0
dev1: /dev/nvidiactl
dev2: /dev/nvidia-modeset
dev3: /dev/nvidia-uvm
dev4: /dev/dri/renderD128
features: keyctl=1,nesting=1
hostname: ollama
memory: 4096
net0: name=eth0,bridge=vmbr0,hwaddr=BC:24:11:FB:64:04,ip=dhcp,type=veth
onboot: 1
ostype: ubuntu
rootfs: nvme-pool:subvol-103-disk-0,size=35G
swap: 512
tags: ai;community-script
unprivileged: 0
lxc.log.level: TRACE
lxc.log.file: /var/log/lxc/lxc-103.log # Log file for container 103 </pre>
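For reference, once the container is up, this is how I check device visibility from inside it (container ID 103 as above; quoting so the glob expands inside the container rather than on the host):
<pre> pct exec 103 -- sh -c 'ls -l /dev/nvidia* /dev/dri/'
pct exec 103 -- nvidia-smi </pre>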
Problem Description:
The primary issue is that not all of the required NVIDIA device nodes are consistently created on the Proxmox host after installing the NVIDIA drivers and configuring udev. /dev/nvidia-uvm is most often the one missing, and sometimes other /dev/nvidia* nodes as well. While /dev/dri/card1 and /dev/dri/renderD128 (the nodes linked to the A4000) might appear, the absence of /dev/nvidia-uvm prevents the container from using the GPU correctly. nvidia-smi on the host does eventually work; it is the device nodes needed for passthrough that are unreliable.
Summary of Troubleshooting Steps Taken:
- Initial Setup & Problem:
- Attempted GPU passthrough to the LXC; the container either failed to start or nvidia-smi inside it failed, usually citing a missing /dev/nvidia-uvm.
- Confirmed IOMMU is enabled in BIOS and Proxmox kernel command line.
- Driver Installation Attempts:
- Tried installing drivers via apt (nvidia-driver package). This often resulted in nvidia-smi working on the host but /dev/nvidia-uvm and other necessary device files for passthrough not being created.
- Switched to installing drivers via the official NVIDIA .run file (NVIDIA-Linux-x86_64-535.216.01.run) with the --dkms option.
- Udev Rule Debugging:
- Initially, no specific NVIDIA udev rules were present beyond 60-nvidia-kernel-common.rules (which only handles ACLs).
- Manually created /etc/udev/rules.d/70-nvidia.rules.
- Experimented with various udev rule contents. A key finding was that the nvidia-modprobe command on my system does not support the -d flag.
- The following nvidia-modprobe commands were found to work when run manually for specific device creation:
- nvidia-modprobe -c 0 -c 255 (for /dev/nvidia0 and /dev/nvidiactl)
- nvidia-modprobe -m (for /dev/nvidia-modeset)
- nvidia-modprobe -u (for /dev/nvidia-uvm and /dev/nvidia-uvm-tools)
- The current /etc/udev/rules.d/70-nvidia.rules content I am testing is below (the udevadm reload/test sequence I run after editing it is sketched right after this list):
<pre> # UDEV rules for NVIDIA devices
SUBSYSTEM=="module", KERNEL=="nvidia", ACTION=="add", RUN+="/usr/bin/nvidia-modprobe -c 0 -c 255"
SUBSYSTEM=="module", KERNEL=="nvidia_modeset", ACTION=="add", RUN+="/usr/bin/nvidia-modprobe -m"
SUBSYSTEM=="module", KERNEL=="nvidia_uvm", ACTION=="add", RUN+="/usr/bin/nvidia-modprobe -u"
KERNEL=="nvidia[0-9]*|nvidiactl|nvidia-modeset|nvidia-uvm|nvidia-uvm-tools", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="card*", DRIVERS=="nvidia", TAG+="master-of-seat", TAG+="keepme-for-hotplug", MODE="0660", GROUP="video"
SUBSYSTEM=="drm", KERNEL=="renderD*", DRIVERS=="nvidia", MODE="0660", GROUP="render" </pre>
- Nouveau Blacklisting: Confirmed nouveau is blacklisted (/etc/modprobe.d/blacklist-nouveau.conf).
- System State Checks:
- lsmod | grep nvidia confirms nvidia, nvidia_modeset, nvidia_drm, nvidia_uvm modules load.
- dkms status shows the NVIDIA module as built/installed for the current kernel.
- nvidia-smi on the host works and shows the A4000.
- Reviewed journalctl -k | grep -i "drm\|nvidia\|nouveau\|01:00.0" (the A4000's PCI address), which shows the modules loading.
- Reviewed journalctl --no-pager -u systemd-udevd for errors related to NVIDIA rules.
- Cleanup and Reinstallation: Performed thorough purges of NVIDIA drivers (apt autoremove --purge 'nvidia-*', NVIDIA*.run --uninstall), removed manual udev/modprobe configs, and rebooted before attempting clean reinstallations with the .run file and the custom udev rules above.
- Followed External Guide: Adapted steps from https://digitalspaceport.com/proxmox-lxc-gpu-passthru-setup-guide/, particularly for the .run file installation method and general structure, but customized the udev rules based on my nvidia-modprobe findings.
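For reference, the reload/re-trigger sequence I use after editing the rules file (standard udevadm usage; the syspath below is just the nvidia_uvm module on my host):
<pre> # pick up rule changes, then replay "add" uevents for kernel modules (including the nvidia ones)
udevadm control --reload-rules
udevadm trigger --action=add --subsystem-match=module

# dry-run the rule evaluation for the uvm module to see which RUN+= entries would fire
udevadm test --action=add /sys/module/nvidia_uvm 2>&1 | grep -i nvidia </pre>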
Current Diagnostic Output (captured after the latest clean reinstall via the .run file, with the custom udev rules above in place):
- Output of ls -l /dev/nvidia* /dev/dri/* (note that /dev/nvidia-uvm and /dev/nvidia-uvm-tools are absent):
- <pre> crw-rw---- 1 root video 226, 0 May 17 17:07 /dev/dri/card0
crw-rw---- 1 root video 226, 1 May 17 17:07 /dev/dri/card1
crw-rw---- 1 root render 226, 128 May 17 17:07 /dev/dri/renderD128
crw-rw-rw- 1 root root 195, 0 May 17 17:07 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 May 17 17:07 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 May 17 17:07 /dev/nvidia-modeset
/dev/dri/by-path:
total 0
lrwxrwxrwx 1 root root 8 May 17 17:07 pci-0000:01:00.0-card -> ../card1
lrwxrwxrwx 1 root root 13 May 17 17:07 pci-0000:01:00.0-render -> ../renderD128
lrwxrwxrwx 1 root root 8 May 17 17:07 pci-0000:7a:00.0-platform-simple-framebuffer.0-card -> ../card0 </pre>
- Content of /etc/udev/rules.d/70-nvidia.rules (as listed above).
- Output of nvidia-smi on the host.
- <pre> +-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A4000 On | 00000000:01:00.0 Off | Off |
| 41% 26C P8 3W / 140W | 1MiB / 16376MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+ </pre>
- Relevant recent lines from journalctl -u systemd-udevd --no-pager | grep nvidia
- <pre> May 17 15:38:17 proxmox (udev-worker)[783]: nvidia: Process '/bin/bash -c '/usr/bin/nvidia-smi -L && /bin/chmod 666 /dev/nvidia\*'' failed with exit code 1.
May 17 15:52:25 proxmox systemd-udevd[20388]: /etc/udev/rules.d/70-nvidia.rules:12 Invalid key/value pair, ignoring. </pre>
- Relevant recent lines from journalctl -k --no-pager | grep -Ei "nvidia|uvm|drm|01:00.0" (the A4000's PCI address)
- <pre> May 17 17:07:31 proxmox kernel: pci 0000:01:00.0: [10de:24b0] type 00 class 0x030000 PCIe Legacy Endpoint
May 17 17:07:31 proxmox kernel: pci 0000:01:00.0: BAR 0 [mem 0xd9000000-0xd9ffffff]
May 17 17:07:31 proxmox kernel: pci 0000:01:00.0: BAR 1 [mem 0xf000000000-0xf7ffffffff 64bit pref]
May 17 17:07:31 proxmox kernel: pci 0000:01:00.0: BAR 3 [mem 0xf800000000-0xf801ffffff 64bit pref]
May 17 17:07:31 proxmox kernel: pci 0000:01:00.0: BAR 5 [io 0xf000-0xf07f]
May 17 17:07:31 proxmox kernel: pci 0000:01:00.0: ROM [mem 0xda000000-0xda07ffff pref]
May 17 17:07:31 proxmox kernel: pci 0000:01:00.0: PME# supported from D0 D3hot
May 17 17:07:31 proxmox kernel: pci 0000:01:00.0: 126.024 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x8 link at 0000:00:01.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
May 17 17:07:31 proxmox kernel: pci 0000:01:00.0: vgaarb: setting as boot VGA device
May 17 17:07:31 proxmox kernel: pci 0000:01:00.0: vgaarb: bridge control possible
May 17 17:07:31 proxmox kernel: pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
May 17 17:07:31 proxmox kernel: pci 0000:01:00.1: D0 power state depends on 0000:01:00.0
May 17 17:07:31 proxmox kernel: pci 0000:01:00.0: Adding to iommu group 14
May 17 17:07:31 proxmox kernel: ACPI: bus type drm_connector registered
May 17 17:07:31 proxmox kernel: [drm] Initialized simpledrm 1.0.0 20200625 for simple-framebuffer.0 on minor 0
May 17 17:07:31 proxmox kernel: simple-framebuffer simple-framebuffer.0: [drm] fb0: simpledrmdrmfb frame buffer device
May 17 17:07:32 proxmox systemd[1]: Starting modprobe@drm.service - Load Kernel Module drm...
May 17 17:07:32 proxmox systemd[1]: modprobe@drm.service: Deactivated successfully.
May 17 17:07:32 proxmox systemd[1]: Finished modprobe@drm.service - Load Kernel Module drm.
May 17 17:07:32 proxmox kernel: nvidia: loading out-of-tree module taints kernel.
May 17 17:07:32 proxmox kernel: nvidia: module license 'NVIDIA' taints kernel.
May 17 17:07:32 proxmox kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
May 17 17:07:32 proxmox kernel: nvidia: module license taints kernel.
May 17 17:07:32 proxmox kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 235
May 17 17:07:32 proxmox kernel: nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=nonewns=none
May 17 17:07:32 proxmox kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.54.14 Thu Feb 22 01:44:30 UTC 2024
May 17 17:07:32 proxmox kernel: [drm] amdgpu kernel modesetting enabled.
May 17 17:07:32 proxmox kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 550.54.14 Thu Feb 22 01:25:25 UTC 2024
May 17 17:07:32 proxmox kernel: [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x13C0 0x1043:0x8877 0xC1).
May 17 17:07:32 proxmox kernel: [drm] register mmio base: 0xDBD00000
May 17 17:07:32 proxmox kernel: [drm] register mmio size: 524288
May 17 17:07:32 proxmox kernel: [drm] add ip block number 0 <nv_common>
May 17 17:07:32 proxmox kernel: [drm] add ip block number 1 <gmc_v10_0>
May 17 17:07:32 proxmox kernel: [drm] add ip block number 2 <navi10_ih>
May 17 17:07:32 proxmox kernel: [drm] add ip block number 3 <psp>
May 17 17:07:32 proxmox kernel: [drm] add ip block number 4 <smu>
May 17 17:07:32 proxmox kernel: [drm] add ip block number 5 <dm>
May 17 17:07:32 proxmox kernel: [drm] add ip block number 6 <gfx_v10_0>
May 17 17:07:32 proxmox kernel: [drm] add ip block number 7 <sdma_v5_2>
May 17 17:07:32 proxmox kernel: [drm] add ip block number 8 <vcn_v3_0>
May 17 17:07:32 proxmox kernel: [drm] add ip block number 9 <jpeg_v3_0>
May 17 17:07:32 proxmox kernel: [drm:amdgpu_device_init [amdgpu]] *ERROR* early_init of IP block <psp> failed -19
May 17 17:07:32 proxmox kernel: [drm:amdgpu_device_init [amdgpu]] *ERROR* early_init of IP block <dm> failed -19
May 17 17:07:32 proxmox kernel: [drm:amdgpu_device_init [amdgpu]] *ERROR* early_init of IP block <gfx_v10_0> failed -19
May 17 17:07:32 proxmox kernel: [drm:amdgpu_device_init [amdgpu]] *ERROR* early_init of IP block <sdma_v5_2> failed -19
May 17 17:07:32 proxmox kernel: [drm] VCN(0) decode is enabled in VM mode
May 17 17:07:32 proxmox kernel: [drm] VCN(0) encode is enabled in VM mode
May 17 17:07:32 proxmox kernel: [drm:amdgpu_device_init [amdgpu]] *ERROR* early_init of IP block <vcn_v3_0> failed -19
May 17 17:07:32 proxmox kernel: [drm] JPEG decode is enabled in VM mode
May 17 17:07:32 proxmox kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
May 17 17:07:33 proxmox kernel: audit: type=1400 audit(1747490853.166:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1382 comm="apparmor_parser"
May 17 17:07:33 proxmox kernel: audit: type=1400 audit(1747490853.166:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1382 comm="apparmor_parser"
May 17 17:07:33 proxmox kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
May 17 17:07:33 proxmox kernel: nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
May 17 17:07:33 proxmox kernel: nvidia-uvm: Loaded the UVM driver, major device number 506.
</pre>
Specific Questions for the Community:
- Despite the NVIDIA modules (including nvidia_uvm) loading correctly and nvidia-smi working on the host, why might the /dev/nvidia* device nodes (especially /dev/nvidia-uvm) still fail to be created by my custom udev rules which use nvidia-modprobe commands that work when tested manually?
- Are there known quirks with nvidia-modprobe or udev interactions on Proxmox VE 8.x (Debian 12 Bookworm) with NVIDIA driver series 535.x that could explain this behavior?
- Is there a more robust or recommended way to ensure all necessary NVIDIA device nodes (including the UVM and DRM nodes for the correct GPU) are created for LXC passthrough on Proxmox, especially when using the .run file installer? (A oneshot-service fallback I am considering is sketched after these questions.)
- Could the AMDGPU errors seen in my kernel log ( [drm:amdgpu_device_init [amdgpu]] *ERROR* early_init of IP block <...> failed -19 ), presumably for an integrated GPU, be interfering with the NVIDIA udev rule processing, even if the NVIDIA kernel modules themselves load?
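Regarding the third question, the fallback I keep coming back to (though I would prefer to understand why the udev route is unreliable) is a small oneshot unit that simply runs the nvidia-modprobe invocations that already work for me by hand. Rough sketch only; the unit name and ordering are my own guesses:
<pre> # /etc/systemd/system/nvidia-dev-nodes.service (hypothetical unit, not something official)
[Unit]
Description=Create NVIDIA device nodes for LXC passthrough (workaround)
After=systemd-modules-load.service

[Service]
Type=oneshot
RemainAfterExit=yes
# the same invocations that work when I run them manually
ExecStart=/usr/bin/nvidia-modprobe -c 0 -c 255
ExecStart=/usr/bin/nvidia-modprobe -m
ExecStart=/usr/bin/nvidia-modprobe -u

[Install]
WantedBy=multi-user.target </pre>
It would be enabled with systemctl daemon-reload && systemctl enable --now nvidia-dev-nodes.service, but I would treat that as a workaround rather than a fix.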
Thank you!