MOBO: Aorus B550I AX WiFi
CPU: Ryzen 7 3700X
GPU: 2x RTX 2080 (HP OEM), one with a screen attached (HDMI) and one with a dummy HDMI plug
A friend inspired me to set up all my server and homelab needs within a Proxmox environment, and I thought I'd follow in his footsteps. Most of the time when I'm not home I'm running an Ubuntu VM with CoCalc (and possibly other data science tools next), so I can access them over the network. However, sometimes I need to edit videos or photos, or (mostly during breaks) game a bit, and for that I sadly need Windows.
I have two 2080s, NVLinked for the pooled memory. This works perfectly with CoCalc.
Ideally, the CoCalc Ubuntu VM and the Windows VM would switch back and forth so they never collide over the cards, and the few extra services I'd run (NAS, Nginx, mail, my own website, etc.) would live in a third and fourth VM.
I've installed Proxmox 7.4 and followed a guide from my friend, which consists of the following:
edit /etc/default/grub
# to enable PCI passthrough
intel_iommu=on iommu=pt pcie_acs_override=downstream,multifunction
nofb nomodeset video=vesafb:off video=efifb:off

edit /etc/modules
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

edit /etc/modprobe.d/pve-blacklist.conf
blacklist nvidiafb
blacklist nvidia
blacklist radeon
blacklist nouveau

run update-grub
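After rebooting, I've been checking the IOMMU groups with this generic sysfs one-liner (not from the guide) to confirm the two cards and their audio functions sit in their own groups:

Code:
#!/bin/bash
# Print every device per IOMMU group; 08:00.x and 09:00.x should
# ideally end up in separate groups (or at least in groups that
# contain nothing else the host needs).
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
        lspci -nns "${d##*/}"
    done
done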
After this I created a VM with the q35 machine type and OVMF enabled (EFI disk and TPM disabled), pulled in my old Windows install and my old Ubuntu install, and the following is happening:
My two cards are 08:00.0 and 09:00.0, and by default they are added with the PCI-Express, All Functions, and ROM-Bar options enabled.
If I add only the 08:00.0 2080, both VMs boot up fine but no GPU is detected.
When only the 09:00.0 is added, that GPU is detected by nvidia-smi and the NVIDIA Control Panel respectively.
When both of them are enabled on a given VM, I get a black screen.
If I enable the Primary GPU option for either one of them, I can no longer connect to the VNC console.
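To rule out a binding problem, the host driver in use for each card can be checked like this (it should report vfio-pci once the blacklisting works):

Code:
# Show all functions of each card and the kernel driver bound to
# them; "Kernel driver in use: vfio-pci" is the desired state.
lspci -nnk -s 08:00
lspci -nnk -s 09:00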
The last time I tried to boot the Windows machine with both GPUs given to the VM, I got the following dmesg output:
Code:
[ 1025.585814] vfio-pci 0000:08:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[ 1025.585842] vfio-pci 0000:08:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[ 1025.587176] vfio-pci 0000:08:00.0: BAR 3: can't reserve [mem 0xe0000000-0xe1ffffff 64bit pref]
[ 1025.587378] vfio-pci 0000:08:00.0: No more image in the PCI ROM
[ 1025.753804] vfio-pci 0000:09:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[ 1025.753831] vfio-pci 0000:09:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[ 1030.676765] vfio-pci 0000:08:00.0: No more image in the PCI ROM
[ 1030.676789] vfio-pci 0000:08:00.0: No more image in the PCI ROM
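My reading of the "BAR 3: can't reserve" line is that something on the host still claims that memory window; if it's the boot framebuffer, it should show up in /proc/iomem (run as root, otherwise the addresses are zeroed out):

Code:
# Check what currently owns the region vfio-pci failed to reserve.
# If "BOOTFB" or "efifb" appears around e0000000, the host console
# is still sitting on that card.
grep -i -e efifb -e bootfb -e 'e0000000' /proc/iomem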
I've also found a tutorial (here) from which I've adopted a few extra lines, with no luck.
I added extra parameters to GRUB_CMDLINE_LINUX_DEFAULT:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,efifb:off"
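To confirm these actually made it into the running kernel after update-grub and a reboot, the live command line can be checked:

Code:
# Should echo back the iommu/video parameters set above.
cat /proc/cmdline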
Edit the modules file (VFIO = Virtual Function I/O):
nano /etc/modules
Add these lines:
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
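After a reboot, the four modules should show up as loaded:

Code:
# Expect vfio, vfio_iommu_type1, vfio_pci and vfio_virqfd here.
lsmod | grep vfio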
IOMMU interrupt remapping (some systems don't handle interrupt remapping well; this option works around it):
nano /etc/modprobe.d/iommu_unsafe_interrupts.conf
options vfio_iommu_type1 allow_unsafe_interrupts=1
nano /etc/modprobe.d/kvm.conf
options kvm ignore_msrs=1
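Whether the option took effect can be verified at runtime:

Code:
# Prints Y once the kvm module picked up ignore_msrs=1.
cat /sys/module/kvm/parameters/ignore_msrs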
Adding the GPU to VFIO:
lspci -v
Look for your GPU and take note of the first set of numbers; this is your PCI card address.
Then run this command:
lspci -n -s (PCI card address)
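For one of my 2080s the output shape is roughly the following (class code, then vendor:device IDs; the IDs match the ones in my vfio.conf below, and a 2080's USB-C controller may add .2/.3 functions):

Code:
lspci -n -s 08:00
# 08:00.0 0300: 10de:1e82   <- VGA controller
# 08:00.1 0403: 10de:10f8   <- HDMI audio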
This gives us the GPU's vendor and device IDs; those numbers go into the next file:
nano /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:1e82,10de:10f8 disable_vga=1 # HERE I couldn't follow it exactly, as both my GPUs share the same IDs (see the note below)
Run this command to update everything:
update-initramfs -u
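A side note on the identical IDs: since ids= matches by vendor:device, that single pair already binds both cards to vfio-pci, which is what I want here anyway. If someone needed to grab only one of two identical cards, the usual alternative I've seen is binding by PCI address with driver_override instead of by ID, from a script that runs at early boot (a sketch, not something from the tutorial):

Code:
#!/bin/sh
# Bind one specific card (by address) to vfio-pci instead of
# matching every 10de:1e82 in the system. Must run before the
# nvidia/nouveau modules load.
DEV=0000:08:00.0
echo vfio-pci > /sys/bus/pci/devices/$DEV/driver_override
# Unbind from whatever driver grabbed it first, if any
[ -e /sys/bus/pci/devices/$DEV/driver ] && \
    echo $DEV > /sys/bus/pci/devices/$DEV/driver/unbind
# Re-probe so vfio-pci picks it up
echo $DEV > /sys/bus/pci/drivers_probe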
It is very likely that I've thrown a few too many extra commands at it, and at the same time I might be missing something. I do not need a physical display for Proxmox as long as the web interface works and I can still SSH into it. From what I understand, especially given the BAR message, the host is not releasing the GPU properly, hence the addressing collision(?). My friend worked on his setup for quite some time, although he was using a single 3060 on an Intel platform, where the host could fall back on integrated graphics.
Still, it's already a very fun thing to work with, but I don't know what else I should check. I've found a few clues on this forum, and now I'm trying to narrow down whether it's a platform-specific issue (Ryzen + dual NVIDIA cards) or a settings issue.