[SOLVED] AMD GPU inaccessible after VM poweroff: Unable to change power state from D3cold to D0, device inaccessible

houiin

Member
Dec 28, 2020
Hi Mates!

My AMD GPU can't be accessed after the VM powers off.

When I first power on the host and start the VM, the passed-through GPU (a 5700 XT in my case) works normally, and I can see the VM's screen on my monitor.

But after I shut down the VM, I see this error: vfio-pci 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible. Because of this I can't start the VM again until I reboot the host; the next start attempt fails with:

kvm: ../hw/pci/pci.c:1613: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.

I have googled for weeks but can't solve it. Maybe you can provide some ideas and help :)


I have tried a lot, but nothing worked :(
1. [DID NOT WORK] Manually releasing/attaching the PCIe device
2. [DID NOT WORK] Installing vendor-reset on Proxmox - Working around the AMD GPU Reset bug on Proxmox using vendor-reset
3. [DID NOT WORK] Turning off Resize BAR: Successfully Passthrough Sapphire Pulse RX 6700XT (12GB) to win 11 on Proxmox 7.2 (also fixes error 43 on windows while installing drivers) : r/VFIO (reddit.com)
4. [DID NOT WORK] Adding initcall_blacklist=sysfb_init to the kernel parameters
5. [DID NOT WORK] Adding disable_idle_d3=1 and amdgpu.runpm=0 to the kernel parameters

Those problems looked similar to mine, but I think mine is different. What I need to solve is the PCIe device becoming inaccessible after the VM shuts down.

Looking forward to your answers :)

Hardware Info:

CPU: I5-10400
Motherboard: Gigabyte B460M AORUS PRO
GPU: XFX 5700XT Ultra & UHD630

BIOS Settings:​

BIOS Version: F7, BIOS Date 06/27/2023 (the latest version)
Above 4G Decoding: Enabled
Resize BAR Support: Disabled
ErP: Disabled
CSM Support: Enabled
Internal Graphics: Enabled
Platform Power Management: Enabled
-- PEG ASPM: Enabled
-- PCH ASPM: Enabled
-- DMI ASPM: Enabled
Initial Display Output: IGFX (iGPU)
RC6 (Render Standby): Enabled


PVE Version​

Code:
proxmox-ve: 8.0.1 (running kernel: 6.2.16-4-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.3
pve-kernel-6.2.16-4-pve: 6.2.16-5
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph: 17.2.6-pve1+3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.6
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.4
libpve-rs-perl: 0.8.4
libpve-storage-perl: 8.0.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
openvswitch-switch: 3.1.0-2
proxmox-backup-client: 3.0.1-1
proxmox-backup-file-restore: 3.0.1-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.6
pve-cluster: 8.0.2
pve-container: 5.0.4
pve-docs: 8.0.4
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.5
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1

PCIe Passthrough Settings​

/etc/modprobe.d/blacklist.conf
blacklist nvidiafb
blacklist nouveau
blacklist nvidia
blacklist radeon
blacklist amdgpu
/etc/modprobe.d/vfio.conf
options vfio-pci ids=1002:1478,1002:1479,1002:731f,1002:ab38 disable_vga=1 disable_idle_d3=1
/etc/default/grub :
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt i915.enable_gvt=1 amdgpu.runpm=0 initcall_blacklist=sysfb_init"
/etc/modprobe.d/kvm.conf
options kvm ignore_msrs=1 report_ignored_msrs=0
/etc/modprobe.d/dkms.conf
# Nothing, just an empty file

lspci -kk

Code:
00:00.0 Host bridge: Intel Corporation Comet Lake-S 6c Host Bridge/DRAM Controller (rev 03)
    DeviceName: Onboard - Other
    Subsystem: Gigabyte Technology Co., Ltd Comet Lake-S 6c Host Bridge/DRAM Controller
    Kernel driver in use: skl_uncore
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 03)
    Subsystem: Gigabyte Technology Co., Ltd 6th-10th Gen Core Processor PCIe Controller (x16)
    Kernel driver in use: pcieport
00:02.0 VGA compatible controller: Intel Corporation CometLake-S GT2 [UHD Graphics 630] (rev 03)
    DeviceName: Onboard - Video
    Subsystem: Gigabyte Technology Co., Ltd CometLake-S GT2 [UHD Graphics 630]
    Kernel driver in use: i915
    Kernel modules: i915
00:14.0 USB controller: Intel Corporation Comet Lake PCH-V USB Controller
    DeviceName: Onboard - Other
    Subsystem: Gigabyte Technology Co., Ltd Comet Lake PCH-V USB Controller
    Kernel driver in use: xhci_hcd
    Kernel modules: xhci_pci
00:14.2 Signal processing controller: Intel Corporation Comet Lake PCH-V Thermal Subsystem
    DeviceName: Onboard - Other
    Subsystem: Intel Corporation Comet Lake PCH-V Thermal Subsystem
00:16.0 Communication controller: Intel Corporation Comet Lake PCH-V HECI Controller
    DeviceName: Onboard - Other
    Subsystem: Gigabyte Technology Co., Ltd Comet Lake PCH-V HECI Controller
    Kernel driver in use: mei_me
    Kernel modules: mei_me
00:17.0 SATA controller: Intel Corporation 400 Series Chipset Family SATA AHCI Controller
    DeviceName: Onboard - SATA
    Subsystem: Gigabyte Technology Co., Ltd 400 Series Chipset Family SATA AHCI Controller
    Kernel driver in use: ahci
    Kernel modules: ahci
00:1b.0 PCI bridge: Intel Corporation Device a3e9 (rev f0)
    Subsystem: Gigabyte Technology Co., Ltd Device 5001
    Kernel driver in use: pcieport
00:1b.4 PCI bridge: Intel Corporation Comet Lake PCI Express Root Port #21 (rev f0)
    Subsystem: Gigabyte Technology Co., Ltd Comet Lake PCI Express Root Port
    Kernel driver in use: pcieport
00:1c.0 PCI bridge: Intel Corporation Comet Lake PCI Express Root Port #05 (rev f0)
    Subsystem: Gigabyte Technology Co., Ltd Comet Lake PCI Express Root Port
    Kernel driver in use: pcieport
00:1d.0 PCI bridge: Intel Corporation Comet Lake PCI Express Root Port 9 (rev f0)
    Subsystem: Gigabyte Technology Co., Ltd Comet Lake PCI Express Root Port 9
    Kernel driver in use: pcieport
00:1f.0 ISA bridge: Intel Corporation B460 Chipset LPC/eSPI Controller
    DeviceName: Onboard - Other
    Subsystem: Gigabyte Technology Co., Ltd B460 Chipset LPC/eSPI Controller
00:1f.2 Memory controller: Intel Corporation Cannon Lake PCH Power Management Controller
    DeviceName: Onboard - Other
    Subsystem: Gigabyte Technology Co., Ltd Cannon Lake PCH Power Management Controller
00:1f.3 Audio device: Intel Corporation Comet Lake PCH-V cAVS
    DeviceName: Onboard - Sound
    Subsystem: Gigabyte Technology Co., Ltd Comet Lake PCH-V cAVS
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel, snd_soc_avs, snd_sof_pci_intel_cnl
00:1f.4 SMBus: Intel Corporation Comet Lake PCH-V SMBus Host Controller
    DeviceName: Onboard - Other
    Subsystem: Gigabyte Technology Co., Ltd Comet Lake PCH-V SMBus Host Controller
    Kernel driver in use: i801_smbus
    Kernel modules: i2c_i801
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (12) I219-V
    DeviceName: Onboard - Ethernet
    Subsystem: Gigabyte Technology Co., Ltd Ethernet Connection (12) I219-V
    Kernel driver in use: e1000e
    Kernel modules: e1000e
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev c1)
    Kernel driver in use: pcieport
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
    Kernel driver in use: pcieport
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] (rev c1)
    Subsystem: XFX Pine Group Inc. RX 5700 XT RAW II
    Kernel driver in use: vfio-pci
    Kernel modules: amdgpu
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
    Kernel driver in use: vfio-pci
    Kernel modules: snd_hda_intel
05:00.0 Non-Volatile memory controller: ADATA Technology Co., Ltd. XPG SX8200 Pro PCIe Gen3x4 M.2 2280 Solid State Drive (rev 03)
    Subsystem: ADATA Technology Co., Ltd. XPG SX8200 Pro PCIe Gen3x4 M.2 2280 Solid State Drive
    Kernel driver in use: nvme
    Kernel modules: nvme
07:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961/SM963
    Subsystem: Samsung Electronics Co Ltd SM963 2.5" NVMe PCIe SSD
    Kernel driver in use: nvme
    Kernel modules: nvme




Full dmesg output: see the attached file.


Start the VM with AMD GPU passthrough, then shut the VM down:
[ 247.873508] device tap1502i0 entered promiscuous mode
[ 247.890081] vmbr0: port 2(tap1502i0) entered blocking state
[ 247.890084] vmbr0: port 2(tap1502i0) entered disabled state
[ 247.890174] vmbr0: port 2(tap1502i0) entered blocking state
[ 247.890176] vmbr0: port 2(tap1502i0) entered forwarding state
[ 248.781578] vfio-pci 0000:03:00.0: enabling device (0000 -> 0003)
[ 248.781802] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[ 248.781810] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[ 248.781813] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
[ 248.781814] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[ 248.781815] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
[ 248.802828] vfio-pci 0000:03:00.1: enabling device (0000 -> 0002)
[ 282.585658] vmbr0: port 2(tap1502i0) entered disabled state
[ 284.800841] pcieport 0000:02:00.0: Data Link Layer Link Active not set in 1000 msec
[ 284.922295] vfio-pci 0000:03:00.1: Unable to change power state from D0 to D3hot, device inaccessible
[ 284.922990] vfio-pci 0000:03:00.0: Unable to change power state from D0 to D3hot, device inaccessible

Then try to start the VM again:
Code:
root@pve:~# qm start 1502
WARN: no efidisk configured! Using temporary efivars disk.
kvm: ../hw/pci/pci.c:1613: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
start failed: QEMU exited with code 1
[ 368.198631] vfio-pci 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 368.199455] vfio-pci 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 368.202901] vfio-pci 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 368.871106] device tap1502i0 entered promiscuous mode
[ 368.889539] vmbr0: port 2(tap1502i0) entered blocking state
[ 368.889542] vmbr0: port 2(tap1502i0) entered disabled state
[ 368.889621] vmbr0: port 2(tap1502i0) entered blocking state
[ 368.889622] vmbr0: port 2(tap1502i0) entered forwarding state
[ 369.756871] vfio-pci 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 369.756905] vfio-pci 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 369.756965] vfio-pci 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 369.759152] vfio-pci 0000:03:00.0: vfio_cap_init: hiding cap 0xff@0xff
[ 369.759155] vfio-pci 0000:03:00.0: vfio_cap_init: hiding cap 0xff@0xff
[ 369.759156] vfio-pci 0000:03:00.0: vfio_cap_init: hiding cap 0xff@0xff
------------A lot of "vfio-pci 0000:03:00.0: vfio_cap_init: hiding cap 0xff@0xff" output--------------

[ 369.759212] vfio-pci 0000:03:00.0: vfio_cap_init: hiding cap 0xff@0xff
[ 369.759213] vfio-pci 0000:03:00.0: vfio_cap_init: hiding cap 0xff@0xff
[ 369.759215] vfio-pci 0000:03:00.0: vfio_cap_init: hiding cap 0xff@0xff
[ 369.759216] vfio-pci 0000:03:00.0: vfio_cap_init: hiding cap 0xff@0xff
[ 369.759218] vfio-pci 0000:03:00.0: vfio_cap_init: hiding cap 0xff@0xff
[ 369.759219] vfio-pci 0000:03:00.0: vfio_cap_init: hiding cap 0xff@0xff
[ 369.759221] vfio-pci 0000:03:00.0: vfio_cap_init: hiding cap 0xff@0xff
[ 369.759222] vfio-pci 0000:03:00.0: vfio_cap_init: hiding cap 0xff@0xff
[ 369.759224] vfio-pci 0000:03:00.0: vfio_cap_init: hiding cap 0xff@0xff
[ 369.759226] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0xffff@0x100
[ 369.759227] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0xffff@0xffc
[ 369.759228] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0xffff@0xffc
[ 369.759230] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0xffff@0xffc
------------A lot of " vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0xffff@0xffc" output--------------
[ 369.760176] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0xffff@0xffc
[ 369.760177] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0xffff@0xffc
[ 369.760257] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0xffff@0xffc
[ 369.760258] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0xffff@0xffc
[ 369.790707] vmbr0: port 2(tap1502i0) entered disabled state
[ 369.790862] vmbr0: port 2(tap1502i0) entered disabled state
[ 369.978904] vfio-pci 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 369.979869] vfio-pci 0000:03:00.1: Unable to change power state from D3cold to D0, device inaccessible
[ 369.980686] vfio-pci 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 369.981348] vfio-pci 0000:03:00.1: Unable to change power state from D3cold to D0, device inaccessible
[ 372.046793] pcieport 0000:02:00.0: Data Link Layer Link Active not set in 1000 msec
[ 372.062717] vfio-pci 0000:03:00.1: Unable to change power state from D3cold to D0, device inaccessible
[ 372.062753] vfio-pci 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
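
For long logs like the one above, it can help to count the power-state failures per device and per transition. The following is a minimal sketch, not part of the original post: the function name is made up for illustration, and it only assumes that the kernel messages have the same wording as the lines shown here.

```shell
#!/usr/bin/env bash
# Summarize "Unable to change power state" kernel messages per PCI device.
# Reads kernel log text on stdin, so it works with `dmesg` or a saved file.
# summarize_power_state_errors is a hypothetical helper name.
summarize_power_state_errors() {
    awk '
        /Unable to change power state from/ {
            dev = ""; trans = ""
            for (i = 1; i <= NF; i++) {
                # PCI address as printed by the kernel, e.g. "0000:03:00.0:"
                if ($i ~ /^[0-9a-f]+:[0-9a-f]+:[0-9a-f]+\.[0-7]:$/) {
                    dev = substr($i, 1, length($i) - 1)
                }
                # "from D3cold to D0," becomes "D3cold->D0,"
                if ($i == "from") { trans = $(i + 1) "->" $(i + 3) }
            }
            sub(/,$/, "", trans)          # drop the trailing comma
            if (dev != "") count[dev " " trans]++
        }
        # awk iterates in no fixed order, so the output is sorted afterwards
        END { for (k in count) print count[k], k }
    ' | sort -k2,3
}
```

Typical use would be `dmesg | summarize_power_state_errors` or `summarize_power_state_errors < Full_dmesg.txt`. Each output line has the form "count address transition".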
 

Attachments: Full_dmesg.txt (249.2 KB), lspci_kk.txt (4.7 KB)
I found a workaround: run
echo "0" | tee /sys/bus/pci/devices/0000\:03\:00.0/d3cold_allowed
to prevent the device from entering the D3cold state.

But it is not perfect: because the GPU can no longer enter D3cold, it will consume more power.
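
The d3cold_allowed workaround is a runtime sysfs setting, so it has to be applied again after every host reboot. A small helper along these lines may make that less error-prone. This is only a sketch: the function name and the SYSFS_ROOT override (useful for a dry run against a fake directory tree) are assumptions, not something from the post.

```shell
#!/usr/bin/env bash
# Write 0 to d3cold_allowed for each PCI address given as an argument.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
disable_d3cold() {
    local root="${SYSFS_ROOT:-/sys}" addr attr
    for addr in "$@"; do
        attr="$root/bus/pci/devices/$addr/d3cold_allowed"
        if [ ! -e "$attr" ]; then
            echo "skip $addr: $attr not found" >&2
            continue
        fi
        echo 0 > "$attr"
        # Read the value back so the result is visible in the output
        echo "$addr d3cold_allowed=$(cat "$attr")"
    done
}
```

Run as root, for example `disable_d3cold 0000:03:00.0`. Whether the audio function (0000:03:00.1) needs the same treatment was not tested in this thread.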
 
Are you sure vendor-reset does not fix this when you activate it for the GPU? You probably did this correctly, but sometimes people forget that (or newer kernels).
Thanks a lot !!!! It works!!!
I forgot to activate that module for my AMD GPU
AFTER loading the module (once dmesg says vendor_reset_hook: installed),
run echo 'device_specific' > /sys/bus/pci/devices/<pci_device_id_here>/reset_method as root.
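
A sketch of that step as a small function that also shows the value before and after, so it is obvious whether the write took effect. The function name and the SYSFS_ROOT override are assumptions for illustration; the sysfs path and the value 'device_specific' are the ones from this post.

```shell
#!/usr/bin/env bash
# Select the device_specific reset method for one PCI device.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
set_device_specific_reset() {
    local root="${SYSFS_ROOT:-/sys}" addr="$1"
    local attr="$root/bus/pci/devices/$addr/reset_method"
    if [ ! -e "$attr" ]; then
        echo "no reset_method attribute for $addr" >&2
        return 1
    fi
    echo "before: $(cat "$attr")"
    # The write fails if the kernel rejects the method, e.g. when
    # vendor-reset is not loaded yet
    echo device_specific > "$attr" || return 1
    echo "after:  $(cat "$attr")"
}
```

Run as root after the module is loaded, for example `set_device_specific_reset 0000:03:00.0`.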
 

The GPU consumes around 80 W in the suspend state, but when I pass it through to the VM it is in the active state and consumes only 30 W. When I turn off the VM, the GPU switches back to the suspend state and consumes more power again.

Strange...
 
The generic PCIe bus does not know enough about the GPU to turn off parts of it. You need to have a driver loaded that knows how to do that. It has been discussed on this forum before and people usually start a small Linux VM to power down the GPU or just let their VMs run idle to reduce power instead.
 
I have the same problem, with basically the same error codes, on my 5700 XT. How did you resolve it? Can you repost the full code that worked?
 
Updated 1/10/2024: vendor-reset does not work on kernel versions > 6.2.x


### Step1 Clone vendor-reset to local disk

Bash:
git clone https://github.com/gnif/vendor-reset.git


### Step2 Install the module

cd into the folder you cloned and run dkms install .
Once it has finished, add vendor-reset as the first line of /etc/modules so that it is loaded first.
Then run update-grub && update-initramfs -k all -u

### Step3 Check and set flag
Reboot the PVE machine, open the console and run dmesg | grep "vendor". You should see something like:
root@pve-yin:~# dmesg | grep "vendor"
[ 5.826563] vendor_reset: loading out-of-tree module taints kernel.
[ 5.826618] vendor_reset: module verification failed: signature and/or required key missing - tainting kernel
[ 5.869834] vendor_reset_hook: installed <------------------ vendor-reset was loaded successfully.
Set the flag to enable vendor-reset for the devices you want:
Bash:
# My GPU Bus Address is 0000:03:00.0
echo 'device_specific' > /sys/bus/pci/devices/0000\:03\:00.0/reset_method
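
Because reset_method is a runtime sysfs value, it is gone after a host reboot. One way to re-apply it automatically might be a Proxmox hookscript, which Proxmox calls with the VM ID and the phase name as arguments. The following is a sketch under assumptions: the script name, the GPU_ADDR default and the SYSFS_ROOT override are hypothetical, and it has not been tested on the setup from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical Proxmox hookscript: re-apply reset_method before the VM starts.
# GPU_ADDR is taken from the post above; adjust it to your own device.
GPU_ADDR="${GPU_ADDR:-0000:03:00.0}"
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"   # override only for dry runs

on_phase() {
    local vmid="$1" phase="$2"
    local attr="$SYSFS_ROOT/bus/pci/devices/$GPU_ADDR/reset_method"
    case "$phase" in
        pre-start)
            if [ -w "$attr" ]; then
                echo device_specific > "$attr"
                echo "VM $vmid: reset_method for $GPU_ADDR is now $(cat "$attr")"
            else
                echo "VM $vmid: $attr not writable, leaving it alone" >&2
            fi
            ;;
        *)
            ;;  # post-start, pre-stop, post-stop: nothing to do
    esac
}

# Proxmox invokes hookscripts as: <script> <vmid> <phase>
if [ "$#" -ge 2 ]; then
    on_phase "$1" "$2"
fi
```

Assuming the script is saved as an executable file in the snippets directory of a storage (for example /var/lib/vz/snippets/gpu-reset-hook.sh on the default "local" storage), it could be attached with `qm set 1502 --hookscript local:snippets/gpu-reset-hook.sh`.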
 
Please don't use kernel version 6.2 (or 5.19) because they get no more updates and security fixes (for a long time now). Use the supported Proxmox kernel 6.5 instead.
Having to set reset_method to device_specific for each device (the "Set flag" step above) is normal behavior since kernel 5.15: https://github.com/gnif/vendor-reset/issues/46#issuecomment-992282166
 
I have this issue with an RTX 3060 Ti: the GPU does not reset once the VM is disconnected. Any help?
 
I have a similar issue with a GTX 1050Ti. Vendor-reset seems to only work with AMD GPUs. If you find a solution to the problem, could you please let me know how to make it work?
 
Kernel 6.5.11-7-pve
The same error with my 3070Ti
Code:
[  693.378357] vfio-pci 0000:08:00.0: Unable to change power state from D3cold to D0, device inaccessible
 
Code:
root@tomi:~# dmesg | grep "vendor"
[    1.368300] [Hardware Error]:   vendor_id: 0x8086, device_id: 0x6f00

I got this when I ran dmesg | grep "vendor". How do I resolve it?
 
This is not related to vendor-reset. It looks like you have a problem with an Intel device (vendor 0x8086). Read the whole log around that time to see what it might be, and then search the internet (this does not look Proxmox-specific).
EDIT: Maybe it's your Xeon CPU: https://devicehunt.com/view/type/pci/vendor/8086/device/6F00
 
Thanks a lot! It worked!
Here are all the commands that worked for me (5700 XT, dual 2680 v4).

Step 0: Allow your Proxmox to PCI passthrough.​

I used the guide from Craft Computing's YouTube video to set up my PCIe passthrough.
Here is the written version of the video.

One command from the guide confused me, so I'll address it here: the vendor and device IDs must be separated by a colon, not a dot.
Code:
echo "options vfio-pci ids=####.####,####.#### disable_vga=1"> /etc/modprobe.d/vfio.conf

should be

echo "options vfio-pci ids=####:####,####:#### disable_vga=1"> /etc/modprobe.d/vfio.conf

like

echo "options vfio-pci ids=1002:731f,1002:ab38 disable_vga=1"> /etc/modprobe.d/vfio.conf
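
To avoid typing the IDs by hand, the ids= value can be extracted from lspci -nn output, which prints each device's vendor:device pair in square brackets. A minimal sketch, with a made-up function name; it assumes plain `lspci -nn` output (without -k or -v, which add subsystem IDs in the same bracket format).

```shell
#!/usr/bin/env bash
# Build the comma-separated ids= value from `lspci -nn` text on stdin.
# vfio_ids_from_lspci is a hypothetical helper name.
vfio_ids_from_lspci() {
    # [1002:731f] matches; class codes like [0300] have no colon and do not
    grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' \
        | sed -e 's/^\[//' -e 's/]$//' \
        | sort -u \
        | paste -sd, -
}
```

For a GPU at bus 03:00 this could be used as `lspci -nn -s 03:00 | vfio_ids_from_lspci`, and the result pasted after ids= in /etc/modprobe.d/vfio.conf.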

All commands can be entered and run over SSH; if you don't know how that works, follow this video. The terminal on the PC you use for configuration should look something like:
Code:
tomi@Tomis-MacBook-Pro ~ % ssh root@192.168.0.56 (replace this ip with your proxmox ip)

root@192.168.0.56's password: (your proxmox password, will not be shown on screen)
Linux tomi 6.5.11-8-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-8 (2024-01-30T12:27Z) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Sat Feb 10 22:56:59 2024 from 192.168.0.162
root@tomi:~# (you type commands here, press enter to run)

Step 1: Update Your System​

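First, ensure your system is up to date. This helps avoid potential conflicts and ensures you have the latest versions of essential tools and libraries. On Proxmox VE, use dist-upgrade rather than upgrade so that updates with changed dependencies are installed too.

Code:
apt-get update
apt-get dist-upgrade -y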

Step 2: Install Necessary Tools​

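Install git and dkms, which are needed to fetch the vendor-reset source and to build and manage the module. The -y flag confirms the installation prompt automatically. DKMS also needs the kernel headers for your running kernel; if the build in Step 6 fails, install them first.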

Code:
apt-get install git dkms -y

Step 3: Clone the vendor-reset Repository from GitHub​

Clone the vendor-reset repository to your local system.

Code:
git clone https://github.com/gnif/vendor-reset.git

Step 4: Navigate to the Cloned Directory​

Change into the directory containing the cloned vendor-reset source code.

Code:
cd vendor-reset

Step 5: Add the Module to DKMS​

Add the vendor-reset module to DKMS. This step prepares DKMS to manage building and installing the module, unlike running dkms install . directly as in the earlier post (which sadly didn't work for me).

Code:
dkms add .

Step 6: Build and Install the Module​

Build and install the vendor-reset module for your current kernel version. This step compiles the module and adds it to your system.

Code:
dkms build vendor-reset/0.1.1 -k $(uname -r)
dkms install vendor-reset/0.1.1 -k $(uname -r)

Replace 0.1.1 with the actual version of vendor-reset if it differs.
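
Instead of checking the version by eye, it can be read from the dkms.conf in the cloned directory. This is a sketch with a made-up function name; it assumes dkms.conf contains a line of the form PACKAGE_VERSION="x.y.z".

```shell
#!/usr/bin/env bash
# Print the PACKAGE_VERSION value from a dkms.conf, with quotes stripped.
# dkms_package_version is a hypothetical helper name.
dkms_package_version() {
    sed -n 's/^PACKAGE_VERSION=//p' "${1:-dkms.conf}" | tr -d "\"'"
}

# Example use inside the cloned vendor-reset directory:
#   ver=$(dkms_package_version dkms.conf)
#   dkms build "vendor-reset/$ver" -k "$(uname -r)"
#   dkms install "vendor-reset/$ver" -k "$(uname -r)"
```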

Step 7: Ensure the Module Loads on Boot​

To automatically load the vendor-reset module at boot time, add it to the /etc/modules file.

Code:
echo "vendor-reset" >> /etc/modules
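
Note that running the echo above more than once appends the line again each time. A variant that only appends when the line is missing could look like this; the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Append a module name to a modules file only if it is not listed yet.
# ensure_module_listed is a hypothetical helper name.
ensure_module_listed() {
    local module="$1" file="${2:-/etc/modules}"
    # -x: match the whole line, -F: treat the name as a fixed string
    if grep -qxF -- "$module" "$file" 2>/dev/null; then
        echo "$module already listed in $file"
    else
        echo "$module" >> "$file"
        echo "added $module to $file"
    fi
}
```

Run as root, for example `ensure_module_listed vendor-reset`.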

Step 8: Update Initramfs​

Update the initial RAM filesystem to ensure the module is available at boot time.

Code:
update-initramfs -u

Step 9: Set flag to enable vendor reset for devices you want​

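To make the kernel use the vendor-reset routine for a specific device, set that device's reset_method in sysfs. This only works once the vendor-reset module is loaded, and the setting is not kept across a host reboot, so run it again after the reboot in Step 10.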

Code:
# My GPU Bus Address is 0000:04:00.0
echo 'device_specific' > /sys/bus/pci/devices/0000\:04\:00.0/reset_method

If you don't know the GPU bus address for your GPU, use

Code:
lspci -kk

and find something like this

Code:
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] (rev c1)
    Subsystem: XFX Pine Group Inc. RX 5700 XT RAW II
    Kernel driver in use: vfio-pci
    Kernel modules: amdgpu

# The bus address here is 04:00.0. It differs from system to system, so check your own.

Step 10: Reboot Your System​

Reboot your system to apply all changes and ensure the vendor-reset module loads correctly.

Code:
reboot

Step 11: Verify the Installation​

After rebooting, verify that the vendor-reset module is loaded and functioning.
  1. Check if the module is loaded:
    Code:
    lsmod | grep vendor_reset

    Positive result:
    Code:
    vendor_reset (and some number after it)
  2. Inspect kernel messages for vendor-reset:
    Code:
    dmesg | grep vendor_reset

    Positive result:
    Code:
    [   10.798456] vendor_reset_hook: installed
If you see references to vendor_reset in the output of these commands, the module has been successfully loaded.
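
The two checks above only confirm that the module is loaded. To also confirm that the device is actually set to use it, the reset_method value can be read back. A sketch that combines both checks follows; the function name and the PROC_ROOT / SYSFS_ROOT overrides (for dry runs) are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Check that vendor_reset is loaded and that the given PCI device has
# device_specific selected as a reset method. Returns non-zero otherwise.
# PROC_ROOT and SYSFS_ROOT are hypothetical overrides for dry runs.
check_vendor_reset() {
    local addr="$1"
    local modules="${PROC_ROOT:-/proc}/modules"
    local attr="${SYSFS_ROOT:-/sys}/bus/pci/devices/$addr/reset_method"
    local status=0
    if grep -q '^vendor_reset ' "$modules" 2>/dev/null; then
        echo "module: vendor_reset loaded"
    else
        echo "module: vendor_reset NOT loaded"
        status=1
    fi
    if [ -r "$attr" ] && grep -qw device_specific "$attr"; then
        echo "reset_method: $(cat "$attr")"
    else
        echo "reset_method: device_specific not active for $addr"
        status=1
    fi
    return "$status"
}
```

For example: `check_vendor_reset 0000:04:00.0 && echo "ready for VM start"`.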

Error codes & symptoms you might encounter if you don't do vendor reset:​

Symptom:
Your GPU PCI passthrough fails the second time you start any VM with the same graphics card, with error codes like:
Code:
swtpm_setup: Not overwriting existing state file.
kvm: ../hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
stopping swtpm instance (pid 17041) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1
Code:
vfio-pci 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible

Other posts related to this topic that I came across, in case you want to read more:
https://forum.proxmox.com/threads/help-me-boot-into-a-pci-passthrough-gpu-5700xt.138713/
https://github.com/gnif/vendor-reset/issues/46#issuecomment-992282166
https://www.nicksherlock.com/2020/11/working-around-the-amd-gpu-reset-bug-on-proxmox/

Shoutout to @houiin for the original solution posted here
 