Missing GPU

Ryan_Malone

Member
Mar 31, 2024
I have two GPUs, an Intel Arc A310 and an A380, installed in a Dell R730. One GPU is successfully passed through to a VM, but the other GPU has disappeared. It was working fine until I added the second GPU and passed it through to the VM. The A310 was being used by Jellyfin and Plex, and I didn't even need to configure the LXC config file. As soon as I installed it, it showed up in Plex and Jellyfin, and it added itself to the config file. Somehow the second GPU screwed it up.
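For reference, the container config picked up device entries along these lines (illustrative only; the exact device nodes and group IDs on my system may differ):
Code:
# /etc/pve/lxc/<CTID>.conf (illustrative example, not my exact file)
# gid 44 = video group, gid 104 = render group on Debian
dev0: /dev/dri/card0,gid=44
dev1: /dev/dri/renderD128,gid=104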

If I look in /dev/dri, I don't see the device nodes I'd expect:
[screenshot of /dev/dri: 1724786241342.png]

When I run lspci I can see both cards. I don't know where the GPU has gone or how I can give access to it again. What am I missing?

When I run vainfo I get some unusual errors:
root@pve2:~# vainfo
Trying display: wayland
Trying display: x11
error: can't connect to X server!
Trying display: drm
libva info: VA-API version 1.22.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/simpledrm_drv_video.so
libva info: va_openDriver() returns -1
vaInitialize failed with error code -1 (unknown libva error),exit
 
Check your IOMMU groups: https://pve.proxmox.com/wiki/PCI_Passthrough#Verify_IOMMU_isolation
You cannot share devices from the same group between VMs and/or the Proxmox host due to isolation constraints. There are lots of threads about this on the forum as well.

EDIT: Also check with lspci -nnk whether both GPUs have the same numeric ID and are maybe both bound to vfio-pci (which excludes them from the Proxmox host). You don't share any technical details about what you did or what your motherboard make and model is.
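For example, this shell loop prints every PCI device together with its IOMMU group number (a quick sketch, assuming IOMMU is enabled so that /sys/kernel/iommu_groups is populated):
Code:
# print each PCI device with the IOMMU group it belongs to
for d in /sys/kernel/iommu_groups/*/devices/*; do
    g=${d%/devices/*}; g=${g##*/}
    printf 'group %s: %s\n' "$g" "$(lspci -nns "${d##*/}")"
done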
 
Check your IOMMU groups: https://pve.proxmox.com/wiki/PCI_Passthrough#Verify_IOMMU_isolation
You cannot share devices from the same group between VMs and/or the Proxmox host due to isolation constraints. There are lots of threads about this on the forum as well.

EDIT: Also check with lspci -nnk whether both GPUs have the same numeric ID and are maybe both bound to vfio-pci (which excludes them from the Proxmox host). You don't share any technical details about what you did or what your motherboard make and model is.
Thanks for the response. I'm running an old Supermicro server with an X10 mobo.

When I run lspci -nnk I get this. I think you're onto something, as both devices seem to have the same ID. Not sure how I did that. Any idea how to change this?

07:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A310] [8086:56a6] (rev 05)
Subsystem: Device [172f:4019]
Kernel modules: i915, xe
08:00.0 Audio device [0403]: Intel Corporation DG2 Audio Controller [8086:4f92]
Subsystem: Device [172f:4019]
Kernel modules: snd_hda_intel
85:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A380] [8086:56a5] (rev 05)
Subsystem: ASRock Incorporation DG2 [Arc A380] [1849:6006]
Kernel driver in use: vfio-pci
Kernel modules: i915, xe
86:00.0 Audio device [0403]: Intel Corporation DG2 Audio Controller [8086:4f92]
Subsystem: ASRock Incorporation DG2 Audio Controller [1849:6006]
Kernel modules: snd_hda_intel

It looks like they're in different IOMMU groups.
─────────┬────────┬──────────────┬────────────┬────────┬──────────────────────────────────────────────────────────
│ class │ device │ id │ iommugroup │ vendor │ device_name
╞══════════╪════════╪══════════════╪════════════╪════════╪══════════════════════════════════════════════════════════
│ 0x010400 │ 0x005d │ 0000:04:00.0 │ 55 │ 0x1000 │ MegaRAID SAS-3 3108 [Invader]
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x010601 │ 0x8d62 │ 0000:00:11.4 │ 44 │ 0x8086 │ C610/X99 series chipset sSATA Controller [AHCI mode]
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x010601 │ 0x8d02 │ 0000:00:1f.2 │ 50 │ 0x8086 │ C610/X99 series chipset 6-Port SATA Controller [AHCI mode
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x010802 │ 0x540a │ 0000:81:00.0 │ 17 │ 0xc0a9 │ P2 NVMe PCIe SSD
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x010802 │ 0x5415 │ 0000:82:00.0 │ 18 │ 0xc0a9 │
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x020000 │ 0x1521 │ 0000:01:00.0 │ 51 │ 0x8086 │ I350 Gigabit Network Connection
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x020000 │ 0x1521 │ 0000:01:00.1 │ 52 │ 0x8086 │ I350 Gigabit Network Connection
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x020000 │ 0x1563 │ 0000:02:00.0 │ 53 │ 0x8086 │ Ethernet Controller X550
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x020000 │ 0x1563 │ 0000:02:00.1 │ 54 │ 0x8086 │ Ethernet Controller X550
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x030000 │ 0x56a6 │ 0000:07:00.0 │ 59 │ 0x8086 │ DG2 [Arc A310]
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x030000 │ 0x0534 │ 0000:0f:00.0 │ 63 │ 0x102b │ G200eR2
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x030000 │ 0x56a5 │ 0000:85:00.0 │ 22 │ 0x8086 │ DG2 [Arc A380]
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x040300 │ 0x4f92 │ 0000:08:00.0 │ 60 │ 0x8086 │ DG2 Audio Controller
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x040300 │ 0x4f92 │ 0000:86:00.0 │ 23 │ 0x8086 │ DG2 Audio Controller
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
<I truncated the rest of the output>
 
When I run lspci -nnk I get this. I think you're onto something, as both devices seem to have the same ID. Not sure how I did that. Any idea how to change this?
There is nothing you can do about that. Maybe don't do early binding to vfio-pci? But you don't seem to be doing that, since the GPUs are not both bound to vfio-pci.
07:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A310] [8086:56a6] (rev 05)
Subsystem: Device [172f:4019]
Kernel modules: i915, xe
08:00.0 Audio device [0403]: Intel Corporation DG2 Audio Controller [8086:4f92]
Subsystem: Device [172f:4019]
Kernel modules: snd_hda_intel
85:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A380] [8086:56a5] (rev 05)
Subsystem: ASRock Incorporation DG2 [Arc A380] [1849:6006]
Kernel driver in use: vfio-pci
Kernel modules: i915, xe
86:00.0 Audio device [0403]: Intel Corporation DG2 Audio Controller [8086:4f92]
Subsystem: ASRock Incorporation DG2 Audio Controller [1849:6006]
Kernel modules: snd_hda_intel
There is no driver in use for one of them. Did you blacklist the i915 and/or xe driver? Don't do that, because you need those drivers for one of those GPUs (the one that you pass to the container).
It looks like they're in different IOMMU groups.
─────────┬────────┬──────────────┬────────────┬────────┬──────────────────────────────────────────────────────────
│ class │ device │ id │ iommugroup │ vendor │ device_name
╞══════════╪════════╪══════════════╪════════════╪════════╪══════════════════════════════════════════════════════════
│ 0x010400 │ 0x005d │ 0000:04:00.0 │ 55 │ 0x1000 │ MegaRAID SAS-3 3108 [Invader]
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x010601 │ 0x8d62 │ 0000:00:11.4 │ 44 │ 0x8086 │ C610/X99 series chipset sSATA Controller [AHCI mode]
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x010601 │ 0x8d02 │ 0000:00:1f.2 │ 50 │ 0x8086 │ C610/X99 series chipset 6-Port SATA Controller [AHCI mode
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x010802 │ 0x540a │ 0000:81:00.0 │ 17 │ 0xc0a9 │ P2 NVMe PCIe SSD
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x010802 │ 0x5415 │ 0000:82:00.0 │ 18 │ 0xc0a9 │
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x020000 │ 0x1521 │ 0000:01:00.0 │ 51 │ 0x8086 │ I350 Gigabit Network Connection
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x020000 │ 0x1521 │ 0000:01:00.1 │ 52 │ 0x8086 │ I350 Gigabit Network Connection
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x020000 │ 0x1563 │ 0000:02:00.0 │ 53 │ 0x8086 │ Ethernet Controller X550
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x020000 │ 0x1563 │ 0000:02:00.1 │ 54 │ 0x8086 │ Ethernet Controller X550
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x030000 │ 0x56a6 │ 0000:07:00.0 │ 59 │ 0x8086 │ DG2 [Arc A310]
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x030000 │ 0x0534 │ 0000:0f:00.0 │ 63 │ 0x102b │ G200eR2
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x030000 │ 0x56a5 │ 0000:85:00.0 │ 22 │ 0x8086 │ DG2 [Arc A380]
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x040300 │ 0x4f92 │ 0000:08:00.0 │ 60 │ 0x8086 │ DG2 Audio Controller
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
│ 0x040300 │ 0x4f92 │ 0000:86:00.0 │ 23 │ 0x8086 │ DG2 Audio Controller
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────────────────────────────────
<I truncated the rest of the output>
That's good, then you won't lose one GPU when you pass through the other. But if you are using pcie_acs_override (check with cat /proc/cmdline), then all bets are off!
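If you want to rule out early binding anyway, a quick check could look like this (a sketch; any file in /etc/modprobe.d/ or /etc/modules could contain such an entry):
Code:
# look for early vfio-pci binding by device ID
grep -rn 'vfio' /etc/modprobe.d/ /etc/modules 2>/dev/null
# show the driver currently bound to each Arc GPU
lspci -nnks 07:00.0 | grep -i 'in use'
lspci -nnks 85:00.0 | grep -i 'in use'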
 
There is nothing you can do about that. Maybe don't do early binding to vfio-pci? But you don't seem to be doing that, since the GPUs are not both bound to vfio-pci.
So the 2nd GPU has been rendered useless?
There is no driver in use for one of them. Did you blacklist the i915 and/or xe driver? Don't do that, because you need those drivers for one of those GPUs (the one that you pass to the container).
I didn't blacklist any drivers. In fact, when I originally installed the A310 I didn't need to install any drivers to begin using it, as they were already included in the kernel.
That's good, then you won't lose one GPU when you pass through the other. But if you are using pcie_acs_override (check with cat /proc/cmdline), then all bets are off!
Here is the output of cat /proc/cmdline:

BOOT_IMAGE=/boot/vmlinuz-6.8.12-1-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,efifb:off

I can't tell whether this means acs override is on or not. I don't recall ever enabling acs override since I had no idea what that is until I just looked it up!
 
So the 2nd GPU has been rendered useless?
No, because you don't appear to be doing early binding, so this is not the problem.
I didn't blacklist any drivers. In fact, when I originally installed the A310 I didn't need to install any drivers to begin using it, as they were already included in the kernel.
Still, it looks like you blacklisted drivers. There is no driver in use for the not-passed-through GPU. You might want to double check the files in the /etc/modprobe.d/ directory and/or recreate your initramfs and reboot (in case the initramfs does not correspond to the contents of that directory).
Here is the output of cat /proc/cmdline:

BOOT_IMAGE=/boot/vmlinuz-6.8.12-1-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,efifb:off
I can't tell whether this means acs override is on or not. I don't recall ever enabling acs override since I had no idea what that is until I just looked it up!
video=vesafb:off,efifb:off has been invalid/incorrect for some time now and has no effect. You are using pcie_acs_override, so the IOMMU group information is invalid.

There are two things that might interfere with the GPU for your containers: driver blacklisting and pcie_acs_override. You claim not to know anything about either of them, but neither is done by Proxmox itself; they must have been applied deliberately at some point.

Maybe ask the person who installed and set up your Proxmox, or ask the person who provided the scripts that you ran without knowing what they did. Or maybe check your house for carbon-monoxide poisoning that might affect your memory. I don't know how to help if your system has changes that are not automatic but also were not made by you.
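A minimal sketch of those two checks (assuming a standard Debian/Proxmox layout; update-initramfs is the stock tool):
Code:
# look for any blacklist or vfio entries that could affect the GPUs
grep -rn 'blacklist\|vfio' /etc/modprobe.d/
# rebuild the initramfs for all installed kernels so it matches that directory, then reboot
update-initramfs -u -k all
reboot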
 
No, because you don't appear to be doing early binding, so this is not the problem.

Still, it looks like you blacklisted drivers. There is no driver in use for the not-passed-through GPU. You might want to double check the files in the /etc/modprobe.d/ directory and/or recreate your initramfs and reboot (in case the initramfs does not correspond to the contents of that directory).

video=vesafb:off,efifb:off has been invalid/incorrect for some time now and has no effect. You are using pcie_acs_override, so the IOMMU group information is invalid.

There are two things that might interfere with the GPU for your containers: driver blacklisting and pcie_acs_override. You claim not to know anything about either of them, but neither is done by Proxmox itself; they must have been applied deliberately at some point.

Maybe ask the person who installed and set up your Proxmox, or ask the person who provided the scripts that you ran without knowing what they did. Or maybe check your house for carbon-monoxide poisoning that might affect your memory. I don't know how to help if your system has changes that are not automatic but also were not made by you.
I definitely didn't blacklist any drivers. In fact, I wanted to try blacklisting them when I was having trouble passing through the GPU to the VM, came to this forum to ask how I could do that, and never got an answer, so I gave up on that. I'll wipe the server, reinstall Proxmox, restore the VMs, and go through the passthrough steps again and maybe that will work.
 
I'll wipe the server, reinstall Proxmox, restore the VMs, and go through the passthrough steps again and maybe that will work.
Or just check the files in /etc/modprobe.d/ for blacklisting? Or maybe find out why there are changes to your Proxmox configuration that you did not make yourself?
 
There is nothing in modprobe.d, so I guess it must be ACS override. Not sure exactly how to remove that other than just deleting "pcie_acs_override=downstream,multifunction" from the boot config file and updating, but I doubt that will work. I'll give it a go though, and reinstall and start from scratch if no joy.

Here is what I see in modprobe:

GNU nano 7.2 /etc/modprobe.d/pve-blacklist.conf
# This file contains a list of modules which are not supported by Proxmox VE

Thanks for figuring out what the actual issue is. PCIe passthrough is a minefield.
 
Here is what I see in modprobe:

GNU nano 7.2 /etc/modprobe.d/pve-blacklist.conf
# This file contains a list of modules which are not supported by Proxmox VE
That file should contain two more lines:
Code:
# nvidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb
What happens when you load the drivers manually with modprobe xe and modprobe snd_hda_intel? What does lspci -nnks 07:00 show then? Any relevant information in the system log?
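In other words, something like this (a sketch of the checks I mean):
Code:
# try loading the drivers by hand
modprobe xe
modprobe snd_hda_intel
# see whether a 'Kernel driver in use' line appears now
lspci -nnks 07:00
# look at the most recent kernel messages for errors
dmesg | tail -n 50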
 
There is nothing in modprobe.d, so I guess it must be ACS override. Not sure exactly how to remove that other than just deleting "pcie_acs_override=downstream,multifunction" from the boot config file and updating, but I doubt that will work.
Well, you also need to run the commands to activate the change (see the manual: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysboot_edit_kernel_cmdline).
Maybe it's the nomodeset and/or nofb, as they might prevent any graphics drivers from loading or something?
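For reference, activating a kernel command line change depends on the bootloader (a sketch based on the linked manual section):
Code:
# GRUB: edit GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
update-grub

# systemd-boot (e.g. ZFS on UEFI): edit /etc/kernel/cmdline, then:
proxmox-boot-tool refresh

# after a reboot, verify the parameter is gone:
cat /proc/cmdline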
 
That file should contain two more lines:
Code:
# nvidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb
What happens when you load the driver: modprobe xe and modprobe snd_hda_intel? What does lspci -nnks 07:00 show then? Any relevant information in the system log?
Are you suggesting adding those to the modprobe file to blacklist them?

root@pve2:~# lspci -nnks 07:00
07:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A310] [8086:56a6] (rev 05)
Subsystem: Device [172f:4019]
Kernel modules: i915, xe

Regarding the logs, I can't find any sort of syslog file to review. I installed a syslog daemon since one wasn't installed.
 
Are you suggesting adding those to the modprobe file to blacklist them?
No, I'm asking you to run those commands and show the output, and then check the driver in use.
Regarding the logs, I can't find any sort of syslog file to review. I installed a syslog daemon since one wasn't installed.
Use journalctl -b 0 with Proxmox (or Debian or other modern Linux distros) and scroll with the arrow keys.
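For example, to narrow the current boot down to kernel messages about the GPU drivers (a sketch; the grep pattern is only a suggestion):
Code:
# kernel messages from the current boot, filtered for GPU/driver related lines
journalctl -b 0 -k | grep -iE 'i915|vfio|drm'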
 
Because they show up in the output of cat /proc/cmdline, they are active.
root@pve2:/# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.12-1-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,efifb:off
 
No, I'm asking you to run those commands and show the output, and then check the driver in use.

Use journalctl -b 0 with Proxmox (or Debian or other modern Linux distros) and scroll with the arrow keys.
There's not much in there, but this is the only relevant output I could see.

Aug 27 14:00:03 pve2 kernel: Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA
 
BOOT_IMAGE=/boot/vmlinuz-6.8.12-1-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,efifb:off
It's the same as your previous post. pcie_acs_override is active as are nofb and nomodeset. I don't know what you intended to say or ask here. (But please use (inline) code tags for such outputs.)
Try disabling/removing nofb and nomodeset by removing them and running the commands necessary to activate the change (on the next reboot), as described in the manual I linked. You or someone else must have added those kernel parameters at some point, as they are definitely not standard or present out of the box on Proxmox.

There's not much in there, but this is the only relevant output I could see.

Aug 27 14:00:03 pve2 kernel: Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA
pcie_acs_override is insecure and I'm happy it informs you about possible security risks (which I often warn about on this forum). But I still don't think it's relevant to your issue.
 
It's the same as your previous post. pcie_acs_override is active as are nofb and nomodeset. I don't know what you intended to say or ask here. (But please use (inline) code tags for such outputs.)
Try disabling/removing nofb and nomodeset by removing them and running the commands necessary to activate the change (on the next reboot), as described in the manual I linked. You or someone else must have added those kernel parameters at some point, as they are definitely not standard or present out of the box on Proxmox.
Yeah I haven't gotten the time to do this yet today.
pcie_acs_override is insecure and I'm happy it informs you about possible security risks (which I often warn about on this forum). But I still don't think it's relevant to your issue.
I understood that ACS override would collapse the IOMMU groups and therefore only one GPU could be supported. Hence, the reason why it was only allowing one to be used and not the other. Did I misunderstand the concept?
 
I understood that ACS override would collapse the IOMMU groups and therefore only one GPU could be supported.
I would say the override "breaks" the groups (into many) so you can pass through devices more independently, with the security risk that devices from the same (original) group can communicate behind your back.
Hence, the reason why it was only allowing one to be used and not the other. Did I misunderstand the concept?
I think you misunderstood. And I still think it is completely unrelated to the issue of the driver not loading automatically for your other GPU.
 
