Hello,
we have a server with two NVIDIA GPUs (an A100 and an L40S). After creating a new Ubuntu 22.04 virtual machine, adding both GPUs and installing the NVIDIA drivers, I realized that only one of them shows up in nvidia-smi. Below you can see the output of lspci | grep -i nvidia and sudo dmesg -T | grep -i nvidia:
Bash:
lukasmetzner@node:~$ lspci | grep -i nvidia
01:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
02:00.0 3D controller: NVIDIA Corporation Device 26b9 (rev a1)
Bash:
lukasmetzner@node:~$ sudo dmesg -T | grep -i nvidia
[Mon Mar 4 16:47:02 2024] nvidia: loading out-of-tree module taints kernel.
[Mon Mar 4 16:47:02 2024] nvidia: module license 'NVIDIA' taints kernel.
[Mon Mar 4 16:47:02 2024] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
[Mon Mar 4 16:47:02 2024] nvidia 0000:02:00.0: enabling device (0000 -> 0002)
[Mon Mar 4 16:47:02 2024] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[Mon Mar 4 16:47:02 2024] nvidia: probe of 0000:02:00.0 failed with error -1
[Mon Mar 4 16:47:02 2024] NVRM: The NVIDIA probe routine failed for 1 device(s).
[Mon Mar 4 16:47:02 2024] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.54.14 Thu Feb 22 01:44:30 UTC 2024
[Mon Mar 4 16:47:02 2024] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 550.54.14 Thu Feb 22 01:25:25 UTC 2024
[Mon Mar 4 16:47:02 2024] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[Mon Mar 4 16:47:04 2024] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[Mon Mar 4 16:47:05 2024] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[Mon Mar 4 16:47:05 2024] nvidia-uvm: Loaded the UVM driver, major device number 510.
[Mon Mar 4 16:47:05 2024] audit: type=1400 audit(1709570826.252:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=844 comm="apparmor_parser"
[Mon Mar 4 16:47:05 2024] audit: type=1400 audit(1709570826.252:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=844 comm="apparmor_parser"
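In case it helps with the "invalid PCI I/O region" message above, the BAR layout of the failing device can be checked inside the VM with the command below (02:00.0 is the device that fails to probe in the dmesg output); I can post that output as well if it is useful:
Bash:
# inspect the memory regions/BARs of the device that fails to probe
sudo lspci -vv -s 02:00.0 | grep -i region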
After setting up an additional VM, I assigned each GPU to a separate VM, and both GPUs functioned correctly. Subsequently, I attempted to allocate both GPUs to this newly created VM. However, during this process the PCI addresses were swapped, and again only one of the GPUs is recognized by the system. Interestingly, the GPU that was previously undetected is now the one that is recognized.
So far I have tried setting the additional kernel parameters pci=realloc and pci=realloc=off, but without success. I am using Proxmox VE 8.1.4, and the Ubuntu 22.04.4 LTS VM runs kernel version 5.15.0-97-generic.
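In case the method matters, the parameters were added via GRUB inside the VM, roughly like this (the quiet splash entries are just the stock Ubuntu defaults; pci=realloc=off was tried the same way):
Bash:
# /etc/default/grub inside the VM
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc"
# apply and reboot
sudo update-grub
sudo reboot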
I am adding both GPUs as raw devices with All Functions enabled and the ROM-Bar and PCI-Express checkboxes set. The Primary GPU checkbox is disabled for both GPUs.
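For reference, the passthrough entries in the VM configuration (/etc/pve/qemu-server/<vmid>.conf) look roughly like the following; the host slot addresses below are placeholders, and q35 is the machine type required for the PCI-Express flag:
Bash:
# relevant lines from /etc/pve/qemu-server/<vmid>.conf (host addresses are placeholders)
machine: q35
hostpci0: 0000:xx:00,pcie=1,rombar=1
hostpci1: 0000:yy:00,pcie=1,rombar=1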
Thank you in advance
Best Regards
Lukas