Hello,
we have a server with two NVIDIA GPUs (an A100 and an L40S). After creating a new Ubuntu 22.04 virtual machine, adding both GPUs and installing the NVIDIA drivers, I realized that only one of them shows up in nvidia-smi. Below you can see the output of lspci | grep -i nvidia and sudo dmesg -T | grep -i nvidia:
Bash:
lukasmetzner@node:~$ lspci | grep -i nvidia
01:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
02:00.0 3D controller: NVIDIA Corporation Device 26b9 (rev a1)
Bash:
lukasmetzner@node:~$ sudo dmesg -T | grep -i nvidia
[Mon Mar 4 16:47:02 2024] nvidia: loading out-of-tree module taints kernel.
[Mon Mar 4 16:47:02 2024] nvidia: module license 'NVIDIA' taints kernel.
[Mon Mar 4 16:47:02 2024] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
[Mon Mar 4 16:47:02 2024] nvidia 0000:02:00.0: enabling device (0000 -> 0002)
[Mon Mar 4 16:47:02 2024] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[Mon Mar 4 16:47:02 2024] nvidia: probe of 0000:02:00.0 failed with error -1
[Mon Mar 4 16:47:02 2024] NVRM: The NVIDIA probe routine failed for 1 device(s).
[Mon Mar 4 16:47:02 2024] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.54.14 Thu Feb 22 01:44:30 UTC 2024
[Mon Mar 4 16:47:02 2024] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 550.54.14 Thu Feb 22 01:25:25 UTC 2024
[Mon Mar 4 16:47:02 2024] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[Mon Mar 4 16:47:04 2024] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[Mon Mar 4 16:47:05 2024] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[Mon Mar 4 16:47:05 2024] nvidia-uvm: Loaded the UVM driver, major device number 510.
[Mon Mar 4 16:47:05 2024] audit: type=1400 audit(1709570826.252:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=844 comm="apparmor_parser"
[Mon Mar 4 16:47:05 2024] audit: type=1400 audit(1709570826.252:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=844 comm="apparmor_parser"
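In case it helps with the "invalid PCI I/O region" message above, the BAR layout of the failing device can be checked inside the VM with the command below (02:00.0 is the device that fails to probe in the dmesg output); I can post that output as well if it is useful:
Bash:
# inspect the memory regions/BARs of the device that fails to probe
sudo lspci -vv -s 02:00.0 | grep -i region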
After setting up an additional VM, I assigned each GPU to a separate VM, and both GPUs functioned correctly. Subsequently, I attempted to allocate both GPUs to this newly created VM. However, during this process the PCI addresses were swapped, and again only one of the GPUs is recognized by the system. Interestingly, the GPU that was previously undetected is now the one that is recognized.
So far I have tried setting the additional kernel parameters pci=realloc and pci=realloc=off, but without success. I am using Proxmox VE 8.1.4, and the Ubuntu 22.04.4 LTS VM runs kernel version 5.15.0-97-generic.
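In case the method matters, the parameters were added via GRUB inside the VM, roughly like this (the quiet splash entries are just the stock Ubuntu defaults; pci=realloc=off was tried the same way):
Bash:
# /etc/default/grub inside the VM
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc"
# apply and reboot
sudo update-grub
sudo reboot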
I am adding both GPUs as raw devices with All Functions enabled and the ROM-Bar and PCI-Express checkboxes set. The Primary GPU checkbox is disabled for both GPUs.
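For reference, the passthrough entries in the VM configuration (/etc/pve/qemu-server/<vmid>.conf) look roughly like the following; the host slot addresses below are placeholders, and q35 is the machine type required for the PCI-Express flag:
Bash:
# relevant lines from /etc/pve/qemu-server/<vmid>.conf (host addresses are placeholders)
machine: q35
hostpci0: 0000:xx:00,pcie=1,rombar=1
hostpci1: 0000:yy:00,pcie=1,rombar=1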
Thank you in advance
Best Regards
Lukas