Passthrough GPU (RX5500 XT) causes VM to lock up host.

Whattteva

Member
Feb 16, 2023
Hi all. I have followed this guide for setting up my GPU passthrough and it is either locking up the VM or locking up the entire host when the VM boots. This happens on both the macOS VM and the Windows VM; I really only tried the Windows VM to see if the problem was specific to macOS.

On the macOS VM, the VM just gets killed shortly after booting and showing the Apple logo. On the Windows VM, it actually works and boots fine... until I install the AMD drivers, which then cause the entire host to lock up. If I uninstall the AMD drivers, it works fine again, but obviously graphics performance sucks. What am I missing here?

System specs:
Supermicro X11SPi-TF
Intel Xeon Silver 4210T (10c/20t)
224 GB ECC LRDIMM
ASpeed AST2500 BMC VGA
Gigabyte RX5500 XT OC 4G

Code:
root@pve1:~# pveversion -v
proxmox-ve: 8.3.0 (running kernel: 6.8.12-8-pve)
pve-manager: 8.3.3 (running version: 8.3.3/f157a38b211595d6)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-8
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-5-pve-signed: 6.8.12-5
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.2.0
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.4
pve-cluster: 8.0.10
pve-container: 5.2.3
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-3
pve-ha-manager: 4.0.6
pve-i18n: 3.3.3
pve-qemu-kvm: 9.0.2-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.7
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve1

Here are some outputs that I think are relevant:
Code:
root@pve1:~# cat /etc/default/grub | grep GRUB_CMDLINE_LINUX_DEFAULT
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt nomodeset"

root@pve1:~# dmesg | grep -E DMAR
[ 0.018785] ACPI: DMAR 0x000000006D311108 000148 (v01 SUPERM SMCI--MB 00000001 INTL 20091013)
[ 0.018842] ACPI: Reserving DMAR table memory at [mem 0x6d311108-0x6d31124f]
[ 0.600029] DMAR: IOMMU enabled
[ 1.470096] DMAR: Host address width 46
[ 1.470097] DMAR: DRHD base: 0x000000c5ffc000 flags: 0x0
[ 1.470111] DMAR: dmar0: reg_base_addr c5ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.470115] DMAR: DRHD base: 0x000000e0ffc000 flags: 0x0
[ 1.470120] DMAR: dmar1: reg_base_addr e0ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.470122] DMAR: DRHD base: 0x000000fbffc000 flags: 0x0
[ 1.470126] DMAR: dmar2: reg_base_addr fbffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.470128] DMAR: DRHD base: 0x000000aaffc000 flags: 0x1
[ 1.470132] DMAR: dmar3: reg_base_addr aaffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.470134] DMAR: RMRR base: 0x0000006f3f6000 end: 0x0000006f406fff
[ 1.470138] DMAR: ATSR flags: 0x0
[ 1.470141] DMAR: RHSA base: 0x000000aaffc000 proximity domain: 0x0
[ 1.470143] DMAR: RHSA base: 0x000000c5ffc000 proximity domain: 0x0
[ 1.470144] DMAR: RHSA base: 0x000000e0ffc000 proximity domain: 0x0
[ 1.470146] DMAR: RHSA base: 0x000000fbffc000 proximity domain: 0x0
[ 1.470148] DMAR-IR: IOAPIC id 12 under DRHD base 0xfbffc000 IOMMU 2
[ 1.470150] DMAR-IR: IOAPIC id 11 under DRHD base 0xe0ffc000 IOMMU 1
[ 1.470152] DMAR-IR: IOAPIC id 10 under DRHD base 0xc5ffc000 IOMMU 0
[ 1.470153] DMAR-IR: IOAPIC id 8 under DRHD base 0xaaffc000 IOMMU 3
[ 1.470155] DMAR-IR: IOAPIC id 9 under DRHD base 0xaaffc000 IOMMU 3
[ 1.470156] DMAR-IR: HPET id 0 under DRHD base 0xaaffc000
[ 1.470158] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 1.471238] DMAR-IR: Enabled IRQ remapping in x2apic mode
[ 2.153979] DMAR: No SATC found
[ 2.153982] DMAR: dmar2: Using Queued invalidation
[ 2.153989] DMAR: dmar1: Using Queued invalidation
[ 2.153991] DMAR: dmar0: Using Queued invalidation
[ 2.153994] DMAR: dmar3: Using Queued invalidation
[ 2.159552] DMAR: Intel(R) Virtualization Technology for Directed I/O
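(Side note: these flags come from /etc/default/grub, so any later change there only shows up after the boot configuration is regenerated and the host rebooted. A minimal sketch, assuming a GRUB-booted host; ZFS/systemd-boot installs would use proxmox-boot-tool refresh instead:)

Code:
# Regenerate the GRUB config so an edited kernel command line is actually used on the next boot
root@pve1:~# update-grub
root@pve1:~# reboot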

Code:
root@pve1:~# dmesg | grep -e IOMMU
[    0.599898] DMAR: IOMMU enabled
[    1.471154] DMAR-IR: IOAPIC id 12 under DRHD base  0xfbffc000 IOMMU 2
[    1.471156] DMAR-IR: IOAPIC id 11 under DRHD base  0xe0ffc000 IOMMU 1
[    1.471158] DMAR-IR: IOAPIC id 10 under DRHD base  0xc5ffc000 IOMMU 0
[    1.471159] DMAR-IR: IOAPIC id 8 under DRHD base  0xaaffc000 IOMMU 3
[    1.471161] DMAR-IR: IOAPIC id 9 under DRHD base  0xaaffc000 IOMMU 3

Code:
root@pve1:~# dmesg | grep -i vfio
[    5.538938] VFIO - User Level meta-driver version: 0.3
[    5.547771] vfio-pci 0000:67:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=none
[    5.548012] vfio_pci: add [1002:7340[ffffffff:ffffffff]] class 0x000000/00000000
[    5.643586] vfio_pci: add [1002:ab38[ffffffff:ffffffff]] class 0x000000/00000000
[    5.643618] vfio_pci: add [1002:1478[ffffffff:ffffffff]] class 0x000000/00000000
[    5.643629] vfio_pci: add [1002:1479[ffffffff:ffffffff]] class 0x000000/00000000

Code:
root@pve1:~# dmesg | grep "remapping"
[    1.471164] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    1.472246] DMAR-IR: Enabled IRQ remapping in x2apic mode

Code:
root@pve1:~# lspci -nn | grep "AMD"
65:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c5)
66:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
67:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 14 [Radeon RX 5500/5500M / Pro 5500M] [1002:7340] (rev c5)
67:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio [1002:ab38]

Code:
root@pve1:~# dmesg | grep "vendor_reset"
[    5.563920] vendor_reset: loading out-of-tree module taints kernel.
[    5.563925] vendor_reset: module verification failed: signature and/or required key missing - tainting kernel
[    5.585777] vendor_reset_hook: installed

Code:
root@pve1:~# cat /etc/modprobe.d/pve-blacklist.conf
# This file contains a list of modules which are not supported by Proxmox VE

# nvidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb
blacklist nouveau
blacklist nvidia
blacklist nvidiafb
blacklist snd_hda_codec_hdmi
blacklist snd_hda_intel
blacklist snd_hda_codec
blacklist snd_hda_core
blacklist radeon
blacklist amdgpu

Code:
root@pve1:~# cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=1002:7340,1002:ab38,1002:1478,1002:1479
softdep radeon pre: vfio-pci
softdep amdgpu pre: vfio-pci
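(Side note: the blacklist entries and vfio-pci ids above live under /etc/modprobe.d/, so any change there only applies after the initramfs is rebuilt and the host rebooted. A minimal sketch:)

Code:
# Rebuild the initramfs for all installed kernels so the modprobe.d changes are included at boot
root@pve1:~# update-initramfs -u -k all
root@pve1:~# reboot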
 
Hello Whattteva! Unfortunately, some of your output is missing (empty Code blocks).

I did not follow the tutorial you linked to, but I just want to mention that there's a special chapter for PCIe passthrough in the Proxmox VE documentation, and another wiki page with further information. Make sure to read them as well.

Could you please provide us with:
  1. The hardware configuration of the host.
  2. The VM configuration of the Windows and macOS VMs.
  3. The output of pveversion -v
 
Sorry about that. Not sure what happened there, but I've edited the post to include the rest of the information. I've condensed the really long outputs in spoiler tags for better readability/navigability.
 
Thanks for the additional info. I would still need the following info:
  1. The VM configuration of the Windows and macOS VMs. In other words, the output of qm config <VMID> --current
  2. The output of lsmod | grep vfio
  3. The output of lspci -nnk
  4. Do you have any logs from the machines that crash? E.g. from the Event Viewer in Windows. Also, are you able to reproduce the same issues with a Linux machine? We may get better logs from Linux.
 

Thanks for the reply. There isn't much of value in the Windows VM Event Viewer other than an unexpected reboot with no further details. I haven't tried a Linux VM because passing the card through to Linux isn't my use case, but I could do it if that will help troubleshoot the issue. Which distro would be best for this?

Code:
root@pve1:~# lsmod | grep vfio
vfio_pci               16384  1
vfio_pci_core          86016  1 vfio_pci
irqbypass              12288  97 vfio_pci_core,kvm
vfio_iommu_type1       49152  1
vfio                   65536  8 vfio_pci_core,vfio_iommu_type1,vfio_pci
iommufd                94208  1 vfio

root@pve1:~# qm config 209 --current
agent: 1
args: -device isa-applesmc,osk="ourhardworkbythesewordsguardedpleasedontsteal(c)AppleComputerInc" -smbios type=2 -device usb-kbd,bus=ehci.0,port=2 -device usb-mouse,bus=ehci.0,port=3 -cpu host,kvm=on,vendor=GenuineIntel,+kvm_pv_unhalt,+kvm_pv_eoi,+hypervisor,+invtsc -global ICH9-LPC.acpi-pci-hotplug-with-bridge-support=off
autostart: 0
bios: ovmf
boot: order=ide0;virtio0
cores: 4
cpu: host
description: Hackintosh VM - Ventura%0Avga%3A vmware
efidisk0: tank1-vm:vm-209-disk-0,size=1M
hostpci0: 0000:67:00,pcie=1
ide0: tank1-vm:vm-209-disk-1,cache=unsafe,size=80M
ide2: tank1-vm:vm-209-disk-2,cache=unsafe,size=800M
machine: q35
memory: 16384
meta: creation-qemu=7.1.0,ctime=1677062112
name: MacOS
net0: vmxnet3=06:C5:25:97:F7:75,bridge=vmbr0
numa: 0
onboot: 0
ostype: other
scsihw: virtio-scsi-pci
smbios1: uuid=a41181a0-b20d-4821-881a-c18b84cd4b5d
sockets: 1
tablet: 1
usb0: host=046d:c52b,usb3=1
vga: none
virtio0: tank1-vm:vm-209-disk-3,cache=none,discard=on,size=64G
vmgenid: f80f15bf-418f-4827-90db-94f365a93e57

I have to add this as a file as it is too large:
 

Attachments

I haven't tried a Linux VM because passing the card through to Linux isn't my use case, but I could do it if that will help troubleshoot the issue. Which distro would be best for this?
No need to install or create a new VM. Just boot your VM with the Ubuntu 24.04 LTS installer ISO and see if it boots to a graphical desktop (without installing), to check whether passthrough works in principle. I cannot help with MacOS, sorry.

EDIT: Make sure to enable device_specific when using vendor-reset with recent Linux kernels.
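(For the live-ISO test above, something along these lines should do; the VMID, storage name and ISO file name are placeholders and would need to be adapted:)

Code:
# Attach the Ubuntu live ISO to the test VM and boot from it (example names only)
root@pve1:~# qm set <VMID> --ide2 local:iso/ubuntu-24.04-desktop-amd64.iso,media=cdrom
root@pve1:~# qm set <VMID> --boot order=ide2
root@pve1:~# qm start <VMID>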
 
Thanks for the info! Some ideas:
  1. Do you see any errors in the host's journal? Try to reproduce such a freeze, then after rebooting, look at the journal of the last boot with journalctl -b -1. Please post the output (as an attachment; see the sketch after this list).
  2. Please do your experiments on Windows or Linux (preferred) first. Since macOS is not really made to run on non-Apple hardware, I think it's easier to experiment with Linux and Windows. After getting it to work on one of these, you can still experiment with macOS, if you really want. For Linux, you can use any distro, e.g. Debian or Ubuntu.
  3. Maybe also worth a try: update the motherboard's BIOS, as newer versions sometimes offer improved support for GPU passthrough / IOMMU. Keep in mind that you will probably need to reconfigure your BIOS settings afterwards.
  4. There are some AMD-specific issues, and it seems that your GPU model might also be affected. The guide I linked to explains what can be done about them, but the journal log collected in step 1 above should first tell us whether that is actually the issue.
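(A minimal sketch for point 1; the output file name is just an example:)

Code:
# Save the previous boot's journal to a file that can be attached to the thread
root@pve1:~# journalctl -b -1 > previous-boot-journal.txt
# Optionally pre-filter for passthrough-related messages to spot the interesting part faster
root@pve1:~# journalctl -b -1 | grep -iE 'vfio|vendor_reset|amdgpu|DMAR'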
 
No need to install or create a new VM. Just boot your VM with the Ubuntu 24.04 LTS installer ISO and see if it boots to a graphical desktop (without installing), to check whether passthrough works in principle. I cannot help with MacOS, sorry.
Seems like the Linux VM locks up the host just like Windows. MacOS is the only one that crashes in a benign way (it just kills itself rather than locking up the host).
EDIT: Make sure to enable device_specific when using vendor-reset with recent Linux kernels.
Yep, it is using device_specific option:
Code:
root@pve1:~# cat /etc/systemd/system/vreset.service  
[Unit]
Description=AMD GPU reset method to 'device_specific'
After=multi-user.target
[Service]
ExecStart=/usr/bin/bash -c 'echo device_specific > /sys/bus/pci/devices/0000:67:00.0/reset_method'
[Install]
WantedBy=multi-user.target
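(One way to double-check that this actually applied after boot; the PCI address matches the unit above, and reading the sysfs reset_method attribute should then show device_specific:)

Code:
# Confirm the service ran and the GPU's reset method is set as intended
root@pve1:~# systemctl status vreset.service
root@pve1:~# cat /sys/bus/pci/devices/0000:67:00.0/reset_method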
 
Seems like the Linux VM locks up the host just like Windows.
This is typically because of the IOMMU groups, but I thought you had already checked that: https://pve.proxmox.com/wiki/PCI_Passthrough#Verify_IOMMU_isolation . And you don't appear to be using the pcie_acs_override. Do you see the expected vendor-reset messages that mention Navi 14 in the journal when starting the VM (i.e. when the GPU is reset)? Maybe try a different PCIe slot? Maybe update the motherboard BIOS?
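(For reference, a quick sketch of both checks mentioned above; untested here, but it only uses standard sysfs paths and journalctl:)

Code:
# List every PCI device with its IOMMU group, to confirm the GPU and its HDMI audio are isolated
for d in /sys/kernel/iommu_groups/*/devices/*; do
    g=${d#*/iommu_groups/}; g=${g%%/*}
    printf 'IOMMU group %s: ' "$g"
    lspci -nns "${d##*/}"
done

# Look for vendor-reset / Navi 14 messages around the time the VM starts
journalctl -b | grep -iE 'vendor_reset|navi'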
 
Thanks for the info! Some ideas:
  1. Do you see any errors in the host's journal? Try to reproduce such a freeze, then after rebooting, look at the journal of the last boot with journalctl -b -1. Please post the output (as an attachment).
This looks like it may have something to do with it (file attached).
  2. Please do your experiments on Windows or Linux (preferred) first. Since macOS is not really made to run on non-Apple hardware, I think it's easier to experiment with Linux and Windows. After getting it to work on one of these, you can still experiment with macOS, if you really want. For Linux, you can use any distro, e.g. Debian or Ubuntu.
The journal log attached below is for an Ubuntu VM.
  3. Maybe also worth a try: update the motherboard's BIOS, as newer versions sometimes offer improved support for GPU passthrough / IOMMU. Keep in mind that you will probably need to reconfigure your BIOS settings afterwards.
I'll see if I can do this tonight.
  4. There are some AMD-specific issues, and it seems that your GPU model might also be affected. The guide I linked to explains what can be done about them, but the journal log collected in step 1 above should first tell us whether that is actually the issue.
Yeah, I've already installed the vendor_reset kernel module, with the service file listed in my earlier post above.
Code:
Feb 20 11:27:12 pve1 kernel: Code: Unable to access opcode bytes at 0x77dadb8deeec.
Feb 20 11:27:12 pve1 kernel: RSP: 002b:000077dad73faf60 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
Feb 20 11:27:12 pve1 kernel: RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 000077dadb8def16
Feb 20 11:27:12 pve1 kernel: RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00005ce9013c1098
Feb 20 11:27:12 pve1 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff
Feb 20 11:27:12 pve1 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00005ce8dd082c00
Feb 20 11:27:12 pve1 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 00005ce9013c1098
Feb 20 11:27:12 pve1 kernel: </TASK>
 

Attachments

This is typically because of the IOMMU groups, but I thought you had already checked that: https://pve.proxmox.com/wiki/PCI_Passthrough#Verify_IOMMU_isolation . And you don't appear to be using the pcie_acs_override. Do you see the expected vendor-reset messages that mention Navi 14 in the journal when starting the VM (i.e. when the GPU is reset)? Maybe try a different PCIe slot? Maybe update the motherboard BIOS?
I think they're all isolated. The IOMMU group list is attached in the file below. I don't see anything else in the group with the Navi 14 (my GPU), which is in group 5, or in group 6 with its HDMI audio. Am I looking at this correctly? I don't have the override set because, if I understand correctly, it's not required when the devices are already in separate groups natively.
 

Attachments

I think they're all isolated. The IOMMU group list is attached in the file below. I don't see anything else in the group with the Navi 14 (my GPU), which is in group 5, or in group 6 with its HDMI audio. Am I looking at this correctly? I don't have the override set because, if I understand correctly, it's not required when the devices are already in separate groups natively.
It does not make sense to me, sorry. I assume that if you used some other (older, like AM4 X570S) platform, it would work fine. Then again, some GPUs just don't work properly with passthrough and can take down the host (as they are real hardware connected to the central PCIe bus). Out of ideas, sorry.
 
It does not make sense to me, sorry. I assume that if you used some other (older, like AM4 X570S) platform, it would work fine. Then again, some GPUs just don't work properly with passthrough and can take down the host (as they are real hardware connected to the central PCIe bus). Out of ideas, sorry.
No worries. Thanks for the help, I really appreciate it. I just hope it's not the GPU, because I bought it specifically for passthrough.

For what it's worth, the system already passes a PCIe LSI 9205-8i card through to a TrueNAS VM, and that has been working flawlessly for over a year, so I know it supports at least some form of passthrough, though I'm aware GPU passthrough is generally more complicated.