Passthrough GPU Navi 22

masterkaffe

Member
Feb 9, 2020
I sincerely apologize for opening another thread with this topic.
I have tried many times to pass the GPU through. Unfortunately, I either get a black screen or the VM does not survive a reboot.
So far it has worked better with Proxmox 7.4 than with Proxmox 8.1.
Would it make sense to switch to version 8.x?

At the moment I have a black screen when starting the VM.
Can someone help me, please?

Bash:
root@peter:~# efibootmgr -v
BootCurrent: 0005
Timeout: 1 seconds
BootOrder: 0005,0002,0004,0001,0000
Boot0000* proxmox       VenHw(99e275e7-75a0-4b37-a2e6-c5385e6c00cb)
Boot0001* Linux Boot Manager    VenHw(99e275e7-75a0-4b37-a2e6-c5385e6c00cb)
Boot0002* Linux Boot Manager    HD(2,GPT,480a5cd2-e94c-4d97-b452-b00d00301296,0x800,0x200000)/File(\EFI\SYSTEMD\SYSTEMD-BOOTX64.EFI)
Boot0004* Linux Boot Manager    VenHw(99e275e7-75a0-4b37-a2e6-c5385e6c00cb)
Boot0005* UEFI OS       HD(2,GPT,480a5cd2-e94c-4d97-b452-b00d00301296,0x800,0x200000)/File(\EFI\BOOT\BOOTX64.EFI)..BO

Bash:
root@peter:~# cat  /etc/default/grub
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="iommu=pt initcall_blacklist=sysfb_init amd_iommu=on"
GRUB_CMDLINE_LINUX=""

Bash:
root@peter:~# cat  /etc/kernel/cmdline
root=ZFS=rpool/ROOT/pve-1 boot=zfs iommu=pt initcall_blacklist=sysfb_init amd_iommu=on

Bash:
root@peter:~# dmesg | grep -e IOMMU
[    0.595502] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    0.596768] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[    0.597015] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).

Bash:
root@peter:~# dmesg | grep -i vfio
[    4.209264] VFIO - User Level meta-driver version: 0.3
[    4.211725] vfio-pci 0000:0c:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[    4.228404] vfio_pci: add [1002:73df[ffffffff:ffffffff]] class 0x000000/00000000
[    4.248416] vfio_pci: add [1002:ab28[ffffffff:ffffffff]] class 0x000000/00000000
[   10.524967] vfio-pci 0000:0c:00.0: Unsupported reset method 'device_specific'

Bash:
root@peter:~# dmesg | grep 'remapping'
[    0.369002] x2apic: IRQ remapping doesn't support X2APIC mode
[    0.596776] AMD-Vi: Interrupt remapping enabled

Bash:
0c:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [1002:73df] (rev c1)
0c:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:ab28]

Bash:
root@peter:~# cat /etc/modules
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.

vfio
vfio_iommu_type1
vfio_pci
vendor-reset

# Generated by sensors-detect on Wed Feb 14 14:38:01 2024
# Chip drivers
nct6775

Bash:
root@peter:~# dmesg | grep vendor_reset
[    4.298070] vendor_reset_hook: installed


Bash:
root@peter:~# systemctl status vreset.service
● vreset.service - AMD GPU reset method to 'device_specific'
     Loaded: loaded (/etc/systemd/system/vreset.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sat 2024-02-17 14:41:39 CET; 12min ago
    Process: 3079 ExecStart=/usr/bin/bash -c echo device_specific > /sys/bus/pci/devices/0000:0c:00.0/reset_method (code=exited, status=1/FAILU>
   Main PID: 3079 (code=exited, status=1/FAILURE)
        CPU: 1ms

Feb 17 14:41:39 peter systemd[1]: Started AMD GPU reset method to 'device_specific'.
Feb 17 14:41:39 peter bash[3079]: /usr/bin/bash: line 1: echo: write error: Invalid argument
Feb 17 14:41:39 peter systemd[1]: vreset.service: Main process exited, code=exited, status=1/FAILURE
Feb 17 14:41:39 peter systemd[1]: vreset.service: Failed with result 'exit-code'.
 
There is really no point in using vendor-reset, as it does not support Navi 22. Troubleshooting your problems with vendor-reset is therefore pointless.
Maybe you have a Navi 22 GPU that does not work with passthrough: https://www.reddit.com/r/VFIO/comments/tq9j5v/need_help_compiling_a_list_of_amd_6000_series/ ?

Did you try adding the following to one of the .conf files in /etc/modprobe.d/ (and then running update-initramfs -u and rebooting to apply it)?
Code:
options vfio-pci ids=1002:73df,1002:ab28
softdep amdgpu pre: vfio-pci
softdep snd_hda_intel pre: vfio-pci
This makes sure that the GPU is not touched by its drivers before the VM starts.
Also boot the Proxmox host with another (integrated) GPU, so that the Navi 22 stays untouched before the VM starts.
If you follow the steps above without having another (integrated) GPU, you won't be able to fix problems from the Proxmox host console.
The initcall_blacklist=sysfb_init suggests that you only have the Navi 22 and are using it during the boot process (in which case it is also needed).
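To see whether the modprobe options took effect after a reboot, one option is to read the driver symlink of each function in sysfs. This is only a minimal sketch: bound_driver is a helper name made up for this example (not a Proxmox tool), the addresses are the ones from the lspci output earlier in the thread, and the optional second argument exists only so the function can be pointed at a different sysfs root for a dry run.

```shell
#!/usr/bin/env bash
# Print the kernel driver bound to a PCI device, or "none" if unbound.
# bound_driver is a hypothetical helper name, not part of any package.
bound_driver() {
    local addr="$1"
    local sysfs_root="${2:-/sys}"   # overridable root, handy for dry runs
    local link="$sysfs_root/bus/pci/devices/$addr/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Addresses assumed from the lspci output in this thread; adjust as needed.
for dev in 0000:0c:00.0 0000:0c:00.1; do
    echo "$dev: $(bound_driver "$dev")"
done
```

If both functions report vfio-pci right after boot, amdgpu and snd_hda_intel did not get to the card first.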

PS: amd_iommu=on really does nothing and is not even a valid value: https://www.kernel.org/doc/html/v6.5/admin-guide/kernel-parameters.html
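For illustration, here is a small sketch of dropping that parameter from a command-line string. strip_param is a made-up helper name, and the input string is the /etc/kernel/cmdline content posted above. It only prints the cleaned line and does not modify any file.

```shell
#!/usr/bin/env bash
# Remove every exact occurrence of one parameter from a kernel command line.
# strip_param is a hypothetical helper name used only in this example.
strip_param() {
    local param="$1" out="" word
    shift
    # Unquoted $* is intentional: split the command line on whitespace.
    for word in $*; do
        [ "$word" = "$param" ] || out="${out:+$out }$word"
    done
    echo "$out"
}

strip_param amd_iommu=on \
    "root=ZFS=rpool/ROOT/pve-1 boot=zfs iommu=pt initcall_blacklist=sysfb_init amd_iommu=on"
```

Assuming the host boots through proxmox-boot-tool (which the ZFS root suggests), the edited /etc/kernel/cmdline is applied with proxmox-boot-tool refresh followed by a reboot; on a GRUB-booted host it would be /etc/default/grub and update-grub instead.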
 
Thank you for your answers. Since I no longer need the vendor-reset module, I have switched to Proxmox 8.1.

Bash:
root@peter:~# uname -a
Linux peter 6.5.11-8-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-8 (2024-01-30T12:27Z) x86_64 GNU/Linux

I have changed something else for debugging. The RX 6700 XT GAMING OC 12G is in the first PCIe slot. The GeForce GT 1030 is in the second slot.

Code:
root@peter:~# lspci -nnk | grep -i VGA -A7
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP108 [GeForce GT 1030] [10de:1d01] (rev a1)
        Subsystem: Gigabyte Technology Co., Ltd GP108 [GeForce GT 1030] [1458:3799]
        Kernel driver in use: nouveau
        Kernel modules: nvidiafb, nouveau
04:00.1 Audio device [0403]: NVIDIA Corporation GP108 High Definition Audio Controller [10de:0fb8] (rev a1)
        Subsystem: Gigabyte Technology Co., Ltd GP108 High Definition Audio Controller [1458:3799]
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
--
0d:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] [1002:73df] (rev c1)
        Subsystem: Gigabyte Technology Co., Ltd Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] [1458:232d]
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu
0d:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel


Code:
root@peter:/etc/systemd/system# cat  /etc/kernel/cmdline
root=ZFS=rpool/ROOT/pve-1 boot=zfs iommu=pt initcall_blacklist=sysfb_init


I own the RX 6700 XT GAMING OC 12G graphics card. As far as I can tell, the graphics card "should" work.

Bash:
root@peter:/etc/modules-load.d# cat /etc/modules-load.d/modules.conf
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
# Passthrough GPU Navi 22
options vfio-pci ids=1002:73df,1002:ab28
softdep amdgpu pre: vfio-pci
softdep snd_hda_intel pre: vfio-pci

# Generated by sensors-detect on Wed Feb 14 14:38:01 2024
# Chip drivers
nct6775


Bash:
root@peter:/etc# cat modules
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.

#vfio
#vfio_iommu_type1
#vfio_pci
#vendor-reset

# Generated by sensors-detect on Wed Feb 14 14:38:01 2024
# Chip drivers
nct6775

Now I can start the VM, with the graphics card as it should be. Unfortunately, the VM does not survive a reboot.
Code:
[ 6653.556880] pcieport 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 6653.559275] pcieport 0000:0c:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 6655.987356] vfio-pci 0000:0d:00.0: not ready 1023ms after resume; waiting
[ 6657.043299] vfio-pci 0000:0d:00.0: not ready 2047ms after resume; waiting
[ 6659.219289] vfio-pci 0000:0d:00.0: not ready 4095ms after resume; waiting
[ 6663.571221] vfio-pci 0000:0d:00.0: not ready 8191ms after resume; waiting
[ 6672.019156] vfio-pci 0000:0d:00.0: not ready 16383ms after resume; waiting
[ 6688.658988] vfio-pci 0000:0d:00.0: not ready 32767ms after resume; waiting
[ 6723.474541] vfio-pci 0000:0d:00.0: not ready 65535ms after resume; giving up
[ 6723.475710] vfio-pci 0000:0d:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 6723.476507] vfio-pci 0000:0d:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 6723.476520] vfio-pci 0000:0d:00.1: Unable to change power state from D3cold to D0, device inaccessible
[ 6723.476533] vfio-pci 0000:0d:00.0: No device request channel registered, blocked until released by user
[ 6723.476535] vfio-pci 0000:0d:00.0: Device is currently in use, task "bash" (44318) blocked until device is released
[ 6723.479490] vfio-pci 0000:0d:00.1: Unable to change power state from D3cold to D0, device inaccessible
[ 6723.626262] vfio-pci 0000:0d:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[ 6723.654967] pcieport 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 6723.656581] pcieport 0000:0b:00.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
[ 6723.658027] pcieport 0000:0c:00.0: Unable to change power state from D3cold to D0, device inaccessible
 
I have deleted the vendor-reset service:
Bash:
root@peter:~# rm /etc/systemd/system/vreset.service

Now I have the following errors after restarting the VM:

Code:
[ 2489.904166] pcieport 0000:0c:00.0: broken device, retraining non-functional downstream link at 2.5GT/s
[ 2490.912161] pcieport 0000:0c:00.0: retraining failed
[ 2492.164129] pcieport 0000:0c:00.0: broken device, retraining non-functional downstream link at 2.5GT/s
[ 2493.164132] pcieport 0000:0c:00.0: retraining failed
[ 2493.165347] vfio-pci 0000:0d:00.0: not ready 1023ms after bus reset; waiting
[ 2494.212025] vfio-pci 0000:0d:00.0: not ready 2047ms after bus reset; waiting
[ 2496.356027] vfio-pci 0000:0d:00.0: not ready 4095ms after bus reset; waiting
[ 2500.707985] vfio-pci 0000:0d:00.0: not ready 8191ms after bus reset; waiting
[ 2509.155897] vfio-pci 0000:0d:00.0: not ready 16383ms after bus reset; waiting
[ 2526.307694] vfio-pci 0000:0d:00.0: not ready 32767ms after bus reset; waiting
[ 2561.123308] vfio-pci 0000:0d:00.0: not ready 65535ms after bus reset; giving up
[ 2561.263311] vfio-pci 0000:0d:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 2561.287353] vfio-pci 0000:0d:00.0: vfio_bar_restore: reset recovery - restoring BARs

The task viewer for "VM 100 - Start" reports:

Code:
()
swtpm_setup: Not overwriting existing state file.
kvm: vfio: Unable to power on device, stuck in D3
kvm: vfio: Unable to power on device, stuck in D3
TASK OK
 
Since I no longer need the vendor-reset module, I have switched to Proxmox 8.1.
Those two things are unrelated and I don't see what one has to do with the other.
Now I can start the VM, with the graphics card as it should be. Unfortunately, the VM does not survive a reboot.
That means PCIe passthrough works fine, except that your particular (GIGABYTE?) 6700XT GAMING OC 12G does not (FLR) reset properly. You could try one of the other reset methods; use cat /sys/bus/pci/devices/0000:0d:00.0/reset_method to list the supported methods and echo the one you want to try to the same file.
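As a rough sketch of that, the snippet below reads the methods the kernel lists for the device and only writes one back if it is listed, since writing an unsupported name fails with "Invalid argument" (as the vreset.service log earlier in the thread shows). try_reset_method is a made-up helper name; the second argument defaults to the device path mentioned above and can be pointed at another directory for a dry run.

```shell
#!/usr/bin/env bash
# Select a PCI reset method only if the kernel lists it for the device.
# try_reset_method is a hypothetical helper name used for illustration.
try_reset_method() {
    local method="$1"
    local devdir="${2:-/sys/bus/pci/devices/0000:0d:00.0}"
    local file="$devdir/reset_method"
    local m

    [ -r "$file" ] || { echo "no reset_method file in $devdir" >&2; return 2; }

    # The file holds a space-separated list of method names.
    for m in $(cat "$file"); do
        if [ "$m" = "$method" ]; then
            echo "$method" > "$file"
            return 0
        fi
    done
    echo "'$method' is not listed for this device (listed: $(cat "$file"))" >&2
    return 1
}

# Example (run as root on the host): try_reset_method flr
```

If the file lists only a single method, there is nothing else to switch to and the helper will refuse every other name.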

Either find someone who knows a work-around for this specific issue and GPU, change it for another GPU that does reset properly (with or without a work-around) or live with having to restart the Proxmox host each time you shut down the VM.
 
Those two things are unrelated and I don't see what one has to do with the other.
Yes, that's right, it had to do with vendor reset. Thanks for the answer, the graphics card is from GIGABYTE.
That means PCIe passthrough works fine, except that your particular (GIGABYTE?) 6700XT GAMING OC 12G does not (FLR) reset properly. You could try one of the other reset methods; use cat /sys/bus/pci/devices/0000:0d:00.0/reset_method to list the supported methods and echo the one you want to try to the same file.

root@peter:/etc/systemd/system# cat /sys/bus/pci/devices/0000:0d:00.0/reset_method
bus

I tried:
Code:
echo 1 > /sys/bus/pci/devices/0000\:0d\:00.0/remove
echo 1 > /sys/bus/pci/rescan

Unfortunately, it no longer shows me the device.
Code:
root@peter:~# ls /sys/bus/pci/devices/0000:0d:00.0/
ls: cannot access '/sys/bus/pci/devices/0000:0d:00.0/': No such file or directory
Either find someone who knows a work-around for this specific issue and GPU, change it for another GPU that does reset properly (with or without a work-around) or live with having to restart the Proxmox host each time you shut down the VM.

I think I'll have to live with the restart problem.
 
root@peter:/etc/systemd/system# cat /sys/bus/pci/devices/0000:0d:00.0/reset_method
bus

I tried:
Code:
echo 1 > /sys/bus/pci/devices/0000\:0d\:00.0/remove
echo 1 > /sys/bus/pci/rescan
That's about the same as vfio-pci does, since "bus" appears to be the only reset mechanism for that GPU.
I think I'll have to live with the restart problem.
If you don't need the physical output, you could keep the 6700XT on the Proxmox host, install some libraries and use it for virtio-gl/VirGL: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#qm_display . Then you can assign it to multiple Linux VMs (each with at most 512 MB of graphics memory) for OpenGL acceleration.
 
Hello friend, I'm wondering if you found a solution. I have an RX 6400 (Navi 24) passed through to a VM running redroid, but from time to time the VM dies with internal-error, and trying to restart the VM returns:
kvm: ../hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
So I have to restart the host every time this happens...
 
I have a Sapphire 7800XT that suffers from the same problem. Vendor-reset also did not work for me, but I did find a solution that did work.

https://github.com/inga-lovinde/RadeonResetBugFix

The only downside of this solution is that it seems to make desktop icons shrink after restarting the VM. A simple icon cache reset fixes that, though. Reset and shutdown both work fine with this solution.
 
I am done with AMD GPU passthrough, not worth the effort.
I have quite the opposite experience. NVIDIA seems to discourage passthrough and won't work with the latest kernels. I only buy AMD GPUs that are known to work with passthrough (6800+ and anything supported by vendor-reset).
 
