Another Post About Passing Through A 7900 XTX - PCI Errors - IO_PAGE_FAULT

dizzydre21

New Member
Apr 10, 2023
Hello,

I have been struggling for a couple of days to get a new 7900 XTX to work in an EndeavourOS VM. I've read through many posts on this forum as well as on Reddit.

I also have a 4070 Ti Super in my system that has been working for months. I got tired of the Nvidia issues in EndeavourOS, so I added the 7900 XTX to use with it instead.

Anyways, at first I was getting tons of correctable PCI errors in the Proxmox syslog and my IPMI logs. The errors only occurred when booting the VM. When EndeavourOS did boot, it would not load amdgpu at all and was only accessible via SSH. Other times it would not boot and would eventually time out with exit code 1. After adding the romfile to my VM config and rebooting, the PCI errors seem to have gone away, but now I get these errors instead:

pve kernel: vfio-pci 0000:84:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x006a address=0xf7900747000 flags=0x0020]
pve kernel: vfio-pci 0000:84:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x006a address=0xf7900767000 flags=0x0020]
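
To check whether these faults keep coming back across reboots, it helps to pull the fields out of the log lines. This is only a sketch: the regular expression assumes the exact message format quoted above (the `vfio-pci <device>:` prefix is treated as optional), and `parse_fault` is a hypothetical helper name, not part of any tool.

```python
import re

# Assumed message format, copied from the log lines above. The
# "vfio-pci <device>: " prefix is optional so shorter lines still match.
FAULT_RE = re.compile(
    r"(?:vfio-pci (?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): )?"
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT "
    r"domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<address>0x[0-9a-f]+) "
    r"flags=(?P<flags>0x[0-9a-f]+)\]",
    re.IGNORECASE,
)


def parse_fault(line):
    """Return the fault fields as a dict, or None if the line is not a fault."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    return {
        "device": m.group("device"),  # None when the prefix is missing
        "domain": int(m.group("domain"), 16),
        "address": int(m.group("address"), 16),
        "flags": int(m.group("flags"), 16),
    }


# Sample input: the two lines from this post.
LOG = """\
pve kernel: vfio-pci 0000:84:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x006a address=0xf7900747000 flags=0x0020]
pve kernel: vfio-pci 0000:84:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x006a address=0xf7900767000 flags=0x0020]
"""

faults = [f for f in map(parse_fault, LOG.splitlines()) if f is not None]
for f in faults:
    print(f"{f['device']} domain={f['domain']:#06x} "
          f"address={f['address']:#x} flags={f['flags']:#06x}")
```

On the host, the same function could be fed the output of `journalctl -k` instead of the hard-coded sample.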

EndeavourOS now boots and the 7900 XTX is recognized, with amdgpu being the driver used. I can remote in via Sunshine/Moonlight and it seems happy. I just want to resolve the errors and make sure I'm good to go.

I have CSM, Above 4G decoding, and Re-Size BAR Support enabled in my BIOS.

My Grub Params:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt initcall_blacklist=acpi_cpufreq_init amd_pstate.shared_mem=1 amd_pstate=active"

VM Config:
Code:
affinity: 0-11
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0
cores: 12
cpu: host
efidisk0: tank_nvme:vm-104-disk-0,efitype=4m,size=1M
hostpci0: 0000:42:11.2,pcie=1
hostpci1: 0000:84:00,pcie=1,romfile=7900xtx.rom
machine: q35
memory: 16384
meta: creation-qemu=8.1.2,ctime=1707322107
name: EndeavourOS-AMD
numa: 0
ostype: l26
scsi0: tank_nvme:vm-104-disk-1,discard=on,iothread=1,size=32G,ssd=1
scsi1: tank_nvme:vm-104-disk-2,discard=on,iothread=1,size=64G,ssd=1
scsi2: storage_nvme:vm-104-disk-0,backup=0,discard=on,iothread=1,size=512G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=2b620f67-9099-476f-b8a4-6b845bc8524d
sockets: 1
vga: none
vmgenid: 9fb59840-4171-437f-b034-d459120f4f11
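
Two details of the `hostpci1` line are easy to miss: the GPU address is given without a function number (`0000:84:00` rather than `0000:84:00.0`), which in Proxmox means all functions of the card are passed through (typically the GPU plus its audio function), and the `qm` documentation says a `romfile` must be located in /usr/share/kvm/. The sketch below reads those details out of the config text. It is illustrative only: `parse_hostpci` and `describe_passthrough` are hypothetical names, and only the single-device `hostpciN` syntax seen above is handled.

```python
import re

# Sample input: the hostpci lines (plus one other key) from the VM config above.
CONFIG = """\
hostpci0: 0000:42:11.2,pcie=1
hostpci1: 0000:84:00,pcie=1,romfile=7900xtx.rom
machine: q35
"""

# An address without the trailing ".function" means "pass all functions".
NO_FUNCTION = re.compile(r"^(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}$", re.IGNORECASE)


def parse_hostpci(value):
    """Split a hostpciN value into (host address, {option: value})."""
    host = None
    options = {}
    for item in value.split(","):
        if "=" in item:
            key, val = item.split("=", 1)
            options[key] = val
        elif host is None:
            host = item  # the unkeyed item is the host PCI address
    if host is None:
        host = options.pop("host", None)  # fall back to an explicit host=...
    return host, options


def describe_passthrough(config_text):
    """Summarize every hostpciN entry in a Proxmox VM config."""
    devices = {}
    for line in config_text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep or not key.startswith("hostpci"):
            continue
        host, options = parse_hostpci(value.strip())
        romfile = options.get("romfile")
        devices[key] = {
            "host": host,
            "all_functions": bool(host and NO_FUNCTION.match(host)),
            "pcie": options.get("pcie") == "1",
            # Per the qm docs, the ROM file must live in /usr/share/kvm/.
            "romfile": f"/usr/share/kvm/{romfile}" if romfile else None,
        }
    return devices


for name, info in describe_passthrough(CONFIG).items():
    print(name, info)
```
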

OS and Hardware:
Proxmox 8.1.4
Motherboard - Asrock Rack Rome8d-2t
CPU - Epyc 7443p
RAM - 256GB 3200MHz ECC
GPU1 - Asus TUF RTX 4070 Ti Super, PCIe7
GPU2 - Sapphire Pulse 7900 XTX, PCIe1
OS Drive - Samsung 980 Pro 500GB
VM OS Drives - ZFS Mirror 2x960GB Samsung P9A3, Oculink 1/2
VM Storage - ZFS Striped Mirror 4x1TB WD SN850, bifurcation card PCIe5
TrueNAS Drives - 4x12TB Seagate x16, SATA 4-7
PCIe NIC - 82599ES 10GbE - passed through to TrueNAS, PCIe4
 
Resizable BAR is not (yet) supported, and the current advice is to disable it.
amd_iommu=on is nonsense, as the IOMMU is on by default (and officially, on is not a valid value for amd_iommu).
 
Thanks for the quick response. I have since learned that amd_iommu=on doesn't do anything, though I didn't know that when I added it. I just removed it, ran update-grub, and also disabled Re-Size BAR Support in my BIOS.
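
Removing the parameter just means deleting one token from the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub before running update-grub. The sketch below shows only that token filtering on the line as a string; it does not read or write the file, and `drop_params` is a hypothetical helper.

```python
# The GRUB line from the first post (sample input).
LINE = ('GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt '
        'initcall_blacklist=acpi_cpufreq_init amd_pstate.shared_mem=1 amd_pstate=active"')


def drop_params(line, unwanted):
    """Return the GRUB_CMDLINE_* assignment with the unwanted parameters removed."""
    name, _, value = line.partition("=")        # split at the first "=" only
    params = value.strip().strip('"').split()   # unquote, then split on spaces
    kept = [p for p in params if p not in unwanted]
    joined = " ".join(kept)
    return f'{name}="{joined}"'


print(drop_params(LINE, {"amd_iommu=on"}))
# GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt initcall_blacklist=acpi_cpufreq_init amd_pstate.shared_mem=1 amd_pstate=active"
```

After update-grub and a reboot, `cat /proc/cmdline` shows the parameters the running kernel was actually booted with, which is a quick way to confirm the change took effect.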

I did not get any errors on boot and haven't for several reboots. I ran a couple of games and the GPU seems to be doing its thing. I should mention, though, that one reboot just after my first post also had no errors, so perhaps this is a coincidence.

Is there any indication of when Resizable BAR will be supported? I've had it enabled in my BIOS since I built the machine. At one point I did have some PCIe errors on the same PCIe slot the GPU is in now, but at the time it held a bifurcation card. It was a cheap card, so I replaced it with a nice one and the errors went away. Do you think there could be any relation?
 
You got me to search the internet for this and I found: https://angrysysadmins.tech/index.p...eable-bar-rebar-in-your-vfio-virtual-machine/ and https://gitlab.com/qemu-project/qemu/-/issues/703 . Maybe there is more on this Proxmox forum as well?

EDIT: The script from the first link seems to work (with a 6950 XT in a Linux VM).
I will have to read up on those.

I don't know whether I missed it earlier or it only came back after rebooting EndeavourOS a few times, but I still have this error:

AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x006a address=0xf7900746f40 flags=0x0020]

My config is still the same as in my previous post, so I don't know what's causing it.
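
Every one of these faults carries flags=0x0020. As far as I can tell, the Linux AMD IOMMU driver defines EVENT_FLAG_RW as 0x020 (write request) and EVENT_FLAG_I as 0x008 (interrupt request), and the AMD IOMMU specification places a "present" (PR) bit at 0x010; treat that last value as an assumption. Under those assumptions, 0x0020 reads as a DMA write from the GPU to an address with no IOMMU mapping, rather than a permission problem on an existing mapping. A small decoder sketch (the FLAG_* names and `decode_flags` are made up for illustration):

```python
# Flag bits of an AMD-Vi IO_PAGE_FAULT event, as printed in "flags=0x....".
# RW and I mirror EVENT_FLAG_RW / EVENT_FLAG_I in the Linux AMD IOMMU driver;
# PR is an assumption based on the AMD IOMMU specification's event layout.
FLAG_I = 0x008   # set: interrupt request; clear: ordinary memory access
FLAG_PR = 0x010  # set: a translation was present (assumed meaning)
FLAG_RW = 0x020  # set: write access; clear: read access


def decode_flags(flags):
    """Decode the three IO_PAGE_FAULT flag bits discussed above."""
    return {
        "write": bool(flags & FLAG_RW),
        "interrupt": bool(flags & FLAG_I),
        "page_present": bool(flags & FLAG_PR),
    }


print(decode_flags(0x0020))
# {'write': True, 'interrupt': False, 'page_present': False}
```
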
 
