Issues moving from NVIDIA RTX3080 GPU passthrough to AMD RX 9070 XT

zacthedev

New Member
Dec 3, 2025
This is a continuation from this post:

I am in the process of migrating from NVIDIA to all-AMD hardware and have been unable to get even basic passthrough to Windows and Linux guests working; I am also hitting the usual reset bug. I tried the script from the original thread, to no avail. I've attached the output of the following commands.


Code:
lspci -tv
lspci
cat /etc/kernel/cmdline
cat /etc/default/grub
cat /etc/modules
cat /etc/modprobe.d/*
cat /etc/pve/qemu-server/923.conf
    for d in $(find /sys/kernel/iommu_groups/ -type l | sort -n -k5 -t/); do
        n=${d#*/iommu_groups/*}; n=${n%%/*}
        printf 'IOMMU Group %s ' "$n"
        lspci -nns "${d##*/}"
    done;
lsmod | grep vfio
dmesg | grep -e DMAR -e IOMMU -e AMD-Vi
dmesg -e
 


AMD Ryzen 7 3700X 8-Core Processor
AORUS ELITE WIFI/X570 AORUS ELITE WIFI
0000:0c:00.0 = 252.048 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x16 link at 0000:00:03.1 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)

Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"
Code:
[  +0.000025] AMD-Vi: Unknown option - 'on'

`amd_iommu=on` is not a valid parameter, as the kernel log above shows, and it is unnecessary. Remove it.
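As a rough sketch of that cleanup, the helper below drops the `amd_iommu=on` token from a kernel command line string and leaves everything else alone. The function name `strip_amd_iommu_on` is made up for illustration; you would still paste the result back into /etc/default/grub (or /etc/kernel/cmdline) yourself.

```shell
#!/bin/bash
# Hypothetical helper: print the given kernel command line without the
# invalid "amd_iommu=on" token. All other tokens are kept in order.
strip_amd_iommu_on() {
    local toks out=() tok
    read -ra toks <<< "$1"            # split on whitespace, no globbing
    for tok in "${toks[@]}"; do
        if [ "$tok" != "amd_iommu=on" ]; then
            out+=("$tok")
        fi
    done
    echo "${out[*]}"
}
```

For example, `strip_amd_iommu_on "quiet amd_iommu=on iommu=pt"` prints `quiet iommu=pt`.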

Code:
blacklist radeon
blacklist amdgpu
blacklist snd_hda_codec_hdmi

You don't need any of these. Remove them.

Code:
options kvm ignore_msrs=1 report_ignored_msrs=0
options kvm-intel nested=Y

Remove this.
I believe it was added for nested virtualization, and `kvm-intel` is the module for Intel CPUs, which does not apply to your Ryzen. If you don't use nesting, please delete it.

Code:
echo "options kvm ignore_msrs=1 report_ignored_msrs=0" > /etc/modprobe.d/kvm.conf
echo "options kvm-amd nested=1" > /etc/modprobe.d/kvm-amd.conf

If you need the AMD version, use the commands above instead.
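If you are not sure which of the two module option files applies to a host, one way to decide is from the CPU vendor string in /proc/cpuinfo. A minimal sketch, assuming the usual `AuthenticAMD` / `GenuineIntel` vendor IDs; `kvm_module_for_vendor` is a hypothetical helper name, not a Proxmox tool:

```shell
#!/bin/bash
# Hypothetical helper: map a CPU vendor ID to the matching KVM module name.
# Returns non-zero for vendors it does not know about.
kvm_module_for_vendor() {
    case "$1" in
        AuthenticAMD) echo "kvm-amd" ;;
        GenuineIntel) echo "kvm-intel" ;;
        *) return 1 ;;
    esac
}

# Example use on a real host (left commented out so nothing runs on paste):
# vendor=$(awk -F': ' '/^vendor_id/ {print $2; exit}' /proc/cpuinfo)
# echo "options $(kvm_module_for_vendor "$vendor") nested=1"
```

On the Ryzen 7 3700X in this thread the vendor ID is `AuthenticAMD`, so the helper would print `kvm-amd`.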

Code:
update-initramfs -u -k all
proxmox-boot-tool refresh
update-grub
reboot

After removing these settings, run the commands above and test again.

* I can only vouch for a configuration equivalent to the one I'm using; I can't guarantee it will work in every environment. If this doesn't fix it, I'm at a loss.
 
No change. I can see this in the output of the start command:

Code:
error writing '1' to '/sys/bus/pci/devices/0000:0c:00.0/reset': Inappropriate ioctl for device
failed to reset PCI device '0000:0c:00.0', but trying to continue as not all devices need a reset
kvm: vfio: Unable to power on device, stuck in D3
kvm: vfio: Unable to power on device, stuck in D3
TASK OK

I'm seeing a lot of conflicting information online about which BIOS settings to enable or disable, though. Can you share whether you have Above 4G Decoding and Resizable BAR enabled or disabled?
 


Yes, the terminal login screen looks fine, but once the VM with the passed-through GPU starts, the display freezes entirely until the host is restarted. Are you on Proxmox 9+ in your environment? If so, did you ever have this working on 8.4? I also saw mention of having to use the 6.11 kernel instead of the 6.14 kernel; I've tried both. One other oddity I've noticed is that the amdgpu driver never seems to be loaded. rocm-smi is installed but does not see the amdgpu driver, and in your hookscript the lines:

Code:
# Unbind gpu from amdgpu
echo "0000:04:00.0" > /sys/bus/pci/drivers/amdgpu/unbind 2>/dev/null

fail with "not found". I've looked through all of /sys/bus/pci for any mention of amdgpu and found nothing, although the kernel module list does show amdgpu, just not as in use. Am I supposed to install some kind of driver for this? I thought I had when doing the ROCm setup, but I have seen many posts saying it should be turnkey with 6.11+ kernels.
 
the terminal login screen looks fine but once the vm with the passed through gpu starts the display freezes entirely until host system restart.

Since it's pass-through, it's normal for the console to disappear once the virtual machine starts up.

never any amdgpu driver that seems to be loaded?

Please provide the output of `lspci -vvvs 0000:0c:00.0` and `dmesg -e` after rebooting.

* Save each as a text file and attach it.

I have not installed rocm-smi or similar software.

* You changed the PCI ID in the hookscript to 0000:0c:00.0, right?

If so, did you ever have this working in 8.4?

Yes. It has been confirmed to work on 8.4. Since 9 didn't exist at the time, many people will have verified it.

/sys/bus/pci ...

It exists in my environment. I've installed several times, and it has never been missing.

Try a fresh installation.
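As a quick check of which kernel driver (if any) currently owns the card, something like the following sketch may help. It only reads the standard sysfs `driver` symlink of the device; the function name `bound_driver` and the optional sysfs-root argument (there so it can be tried against a fake directory tree) are just for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the name of the driver bound to a PCI device,
# or "none" if no driver is bound. $1 = PCI address, $2 = sysfs root
# (defaults to /sys).
bound_driver() {
    local addr="$1" root="${2:-/sys}"
    local link="$root/bus/pci/devices/$addr/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Example on a real host (commented out):
# bound_driver 0000:0c:00.0
```

On a host where the VM is stopped you would expect `amdgpu` here, and `vfio-pci` while the VM is running; `none` would match the missing /sys/bus/pci/drivers/amdgpu entry described above.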
 
Since it's pass-through, it's normal for the console to disappear once the virtual machine starts up.
Yes, this is how it worked with the several NVIDIA GPUs that have been passed through on this machine, except that the display would switch to the guest OS. With the AMD GPU, the display stays on, showing the host's output with a frozen cursor.

Please provide the output of `lspci -vvvs 0000:0c:00.0` and `dmesg -e` after rebooting.

* Save each as a text file and attach it.

I have not installed rocm-smi or similar software.

* You changed the PCI ID in the hookscript to 0000:0c:00.0, right?
The output is attached below. I did update the script to use my device's PCI ID.
It exists in my environment. I've installed it multiple times, and it has never been absent.

try a fresh installation
This host is part of a 7-node cluster. If I do a fresh install of Proxmox 9.x, would this host be able to join the existing cluster while the other machines are still on 8.4, so that I can do rolling upgrades of the other six once I confirm the upgrade fixed my GPU passthrough issue?
 


On a freshly installed PVE 8 or PVE 9 system, the amdgpu driver is normally loaded at boot unless you take steps to prevent it.

If your log is current and you have removed amdgpu from the blacklist but still see this, then I suspect the amdgpu driver is malfunctioning.

If this were a system under my control, I would try to fix it without reinstalling.

If the system is important and has to be kept as it is, the cause needs to be investigated. I'm just an ordinary user commenting out of curiosity (and not particularly knowledgeable), so I don't want to go that far.

I have seen problems like this clear up after a fresh installation that drops everyone's custom settings and then reapplies only what is needed, which is why I suggested that approach.

This host is part of a 7-node cluster. If I do a fresh install of Proxmox 9.x, would this host be able to join the existing cluster while the other machines are still on 8.4, so that I can do rolling upgrades of the other six once I confirm the upgrade fixed my GPU passthrough issue?

I think all you need to do is perform a fresh installation of 8.4 and upgrade the kernel to 6.14.

Whether it's 7 nodes or not is not relevant to this discussion.
 
Ahh, that's a good idea (if you can't tell, I haven't gotten much sleep trying to resolve this). I'll migrate the VMs, do a fresh install, and report back with my findings.
 
If you have done a fresh installation, please configure only the following and test it.

Code:
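apt update && apt upgrade -y
# /etc/kernel/cmdline is only present on some bootloader setups; skip it if missing
[ -f /etc/kernel/cmdline ] && sed -i '1s/$/ quiet iommu=pt/' /etc/kernel/cmdline
sed -i '/GRUB_CMDLINE_LINUX_DEFAULT=/c GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"' /etc/default/grub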


cat << EOF > /etc/modules
vfio
vfio_iommu_type1
vfio_pci
EOF

echo "options kvm ignore_msrs=1 report_ignored_msrs=0" > /etc/modprobe.d/kvm.conf
echo "options kvm-amd nested=1" > /etc/modprobe.d/kvm-amd.conf


# rx90xx (rx9070xt/rx9070/rx9060xt) PCI pass through


mkdir -p /var/lib/vz/snippets
nano /var/lib/vz/snippets/rx9070_reset.sh
---
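#!/bin/bash
# Proxmox hookscript: $1 is the VMID, $2 is the phase.

GPU="0000:0c:00.0"   # PCI address of the GPU; change to match your host
phase="$2"
echo "Phase is $phase"
if [ "$phase" == "pre-start" ]; then
    # Unbind the GPU from amdgpu, if it is currently bound to it
    if [ -e "/sys/bus/pci/drivers/amdgpu/$GPU" ]; then
        echo "$GPU" > /sys/bus/pci/drivers/amdgpu/unbind
    fi
    sleep 2
    # Resize the GPU's BAR2 memory region (useful for PCI passthrough)
    # 256MB = 8, 8MB = 3, Windows 11 256MB OK
    echo 3 > "/sys/bus/pci/devices/$GPU/resource2_resize"
    sleep 2
elif [ "$phase" == "post-stop" ]; then
    sleep 5
    # Unbind the GPU from vfio-pci, if it is still bound to it
    if [ -e "/sys/bus/pci/drivers/vfio-pci/$GPU" ]; then
        echo "$GPU" > /sys/bus/pci/drivers/vfio-pci/unbind
    fi
    sleep 2
    # Rebind to amdgpu; skipped when the amdgpu driver is not loaded
    if [ -e /sys/bus/pci/drivers/amdgpu/bind ]; then
        echo "$GPU" > /sys/bus/pci/drivers/amdgpu/bind 2>/dev/null
    fi
    sleep 2
fi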
---

chmod +x /var/lib/vz/snippets/rx9070_reset.sh
cat /var/lib/vz/snippets/rx9070_reset.sh

qm set VMID -args '-cpu host,hv_passthrough,level=16,-hypervisor,+svm'
qm set VMID -hookscript local:snippets/rx9070_reset.sh
qm set VMID -hostpci0 0000:0c:00,pcie=1,rombar=0
qm set VMID -bios ovmf
qm set VMID -cpu host

update-initramfs -u -k all
proxmox-boot-tool refresh
update-grub
reboot
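The `resource2_resize` comment in the hookscript above (256MB = 8, 8MB = 3) follows a simple pattern: the value written is log2 of the BAR size in MB. A small sketch that computes it, going only by the two values given in the script; `rebar_value_for_mb` is a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: convert a BAR size in MB (a power of two) into the
# value to write to resourceN_resize, i.e. log2(size in MB).
# Returns non-zero if the input is not a positive power of two.
rebar_value_for_mb() {
    local mb="$1" n=0
    [[ "$mb" =~ ^[0-9]+$ ]] || return 1
    mb=$(( 10#$mb ))                       # force base 10 (e.g. "08")
    if [ "$mb" -lt 1 ] || [ $(( mb & (mb - 1) )) -ne 0 ]; then
        return 1
    fi
    while [ "$mb" -gt 1 ]; do
        mb=$(( mb >> 1 ))
        n=$(( n + 1 ))
    done
    echo "$n"
}
```

`rebar_value_for_mb 8` prints `3` and `rebar_value_for_mb 256` prints `8`, matching the script's comment. Which sizes the card actually accepts still has to be checked on the host.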
 
Alright, I reinstalled and confirmed that the /sys/bus/pci/drivers/amdgpu/ path now exists. With your script, though, I was seeing this error before the reinstall:
sed: can't read /etc/kernel/cmdline: No such file or directory

I went ahead and created a fresh CachyOS VM to run this script against, and it does appear to detect my GPU and send a video signal to the attached monitor (it's a black screen, but progress). After I shut down the VM, the monitor lost signal and turned off, and in the system logs I can see this message repeated:

Code:
Dec 04 16:15:53 white-castle kernel: xgpu_nv_mailbox_trans_msg: 2462 callbacks suppressed
Dec 04 16:15:53 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:53 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:53 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:53 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:53 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:53 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:53 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:53 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:53 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:53 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:58 white-castle kernel: xgpu_nv_mailbox_trans_msg: 2456 callbacks suppressed
Dec 04 16:15:58 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:58 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:58 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:58 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:58 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:58 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:58 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:58 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !
Dec 04 16:15:58 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !

I am also now unable to start the VM with the GPU attached; it fails with this message:

Code:
trying to acquire lock...
TASK ERROR: can't lock file '/var/lock/qemu-server/lock-140.conf' - got timeout
 
With your script though, I was seeing this error before the reinstall:
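Which file is used depends on the bootloader on your machine, so the script writes the options to /etc/kernel/cmdline as well as to the GRUB configuration. If one of the two files does not exist, that error can be ignored.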
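To avoid the `sed: can't read` error altogether, the edit can be made conditional on the file existing. A minimal sketch, assuming GNU sed; `add_cmdline_option` is a hypothetical helper name, and the second argument exists only so the function can be tried on a scratch file first:

```shell
#!/bin/bash
# Hypothetical helper: append a kernel option to the first line of a
# cmdline file, but only if the file exists and the option is not
# already there. $1 = option, $2 = file (default /etc/kernel/cmdline).
# Note: the option is inserted into a sed replacement, so it should not
# contain "|" or "&".
add_cmdline_option() {
    local opt="$1" file="${2:-/etc/kernel/cmdline}"
    if [ ! -f "$file" ]; then
        echo "skip: $file not present"
        return 0
    fi
    if grep -qwF -- "$opt" "$file"; then
        echo "already set: $opt"
        return 0
    fi
    sed -i "1s|\$| $opt|" "$file"
}

# Example on a real host (commented out):
# add_cmdline_option iommu=pt && proxmox-boot-tool refresh
```

The GRUB side (/etc/default/grub followed by `update-grub`) is unchanged by this and still needs to be handled as in the commands earlier in the thread.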

kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !

The amdgpu driver probably failed to bind because of an error logged before these messages appeared (during cleanup on the host after the VM shut down).
 
I went ahead and created a fresh CachyOS VM for this script to run against and it looks like it does detect my gpu and sends a video signal out to the attached monitor (it's a black screen but progress).
I have only tested it on Windows 11, so I don't know whether it works with other guest operating systems.

Did the screen display and work correctly? I won't be testing other operating systems, so I can't say what will happen.
 
OK, after further testing I was able to get the display to connect reliably using a custom EDID and kernel arguments in the guest OS. I haven't seen this error reappear:

Code:
Dec 04 16:15:53 white-castle kernel: xgpu_nv_mailbox_trans_msg: 2462 callbacks suppressed
Dec 04 16:15:53 white-castle kernel: amdgpu 0000:0c:00.0: amdgpu: trn=2 ACK should not assert! wait again !

However, I believe the reset bug still happens intermittently, with this error and the guest OS no longer able to see the GPU (syslog attached below). Would it help to change the post-stop branch of the script so that it loops, checking up to N times that each post-stop action has completed, instead of relying on fixed sleeps?
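To make the question concrete, this is roughly the kind of loop I have in mind: a generic retry helper that runs a check up to N times with a delay in between, and gives up with a non-zero status otherwise. It is only a sketch; `retry` is a name I made up, and nothing here is taken from the original hookscript.

```shell
#!/bin/bash
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts. Returns 0 as soon as the command succeeds, 1 if it
# never does. Usage: retry ATTEMPTS DELAY COMMAND [ARGS...]
retry() {
    local attempts="$1" delay="$2" i
    shift 2
    for (( i = 1; i <= attempts; i++ )); do
        if "$@"; then
            return 0
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

In the post-stop branch this could replace a fixed `sleep`, for example `retry 10 1 test ! -e /sys/bus/pci/drivers/vfio-pci/0000:0c:00.0` to wait until the card is no longer bound to vfio-pci. Whether that actually avoids the reset bug is untested.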
 


the display to connect reliably using a custom edid and kernel args on the guest os.

For future readers of this thread, it would be good to document how you fixed this part.

I haven't encountered this issue with Ryzen 7000 or Core Ultra, so I don't think it's specific to the RX 90xx or PVE.

After the problem occurs, how does the driver appear in the output of `lspci -vvvs`?

Also, can you recover without restarting the PVE host after executing the following?

Code:
echo 1 > /sys/bus/pci/devices/0000\:0c\:00.0/remove
echo 1 > /sys/bus/pci/rescan
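The same two writes can be wrapped in a small function that first checks the device is present, which makes it easier to call from a shell or a hookscript. This is only a sketch: `pci_remove_rescan` is a hypothetical name, and the optional sysfs-root argument is there purely so it can be tried against a scratch directory.

```shell
#!/bin/bash
# Hypothetical helper: remove a PCI device from the bus and rescan.
# $1 = PCI address, $2 = sysfs root (defaults to /sys). Needs root on a
# real host. Returns non-zero if the device is not present.
pci_remove_rescan() {
    local addr="$1" root="${2:-/sys}"
    local dev="$root/bus/pci/devices/$addr"
    if [ ! -e "$dev/remove" ]; then
        echo "no such device: $addr" >&2
        return 1
    fi
    echo 1 > "$dev/remove"
    sleep 1                      # give the kernel a moment before rescanning
    echo 1 > "$root/bus/pci/rescan"
}

# Example on a real host (commented out):
# pci_remove_rescan 0000:0c:00.0 && lspci -k -s 0c:00.0
```

The backslashes in the path above are not required, since `:` needs no escaping in a shell path, but they are harmless.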