GPU passthrough with Radeon VII

erickgruis

New Member
Apr 7, 2023
I'm somewhat new to Proxmox but fairly skilled with Linux overall. I set up Proxmox on a retired mining machine build. I have researched and tried many, many things but cannot get the vendor-reset to work with the Radeon VII.

Has anyone actually got this to work?

The issue is this: the VM starts and runs perfectly after the first boot of the host, but once the VM has been shut down, restarting it results in an unusable card because the GPU's power state cannot be reset.

Host machine:
Asus Crosshair VII Hero X470 (BIOS 5003)
AMD Ryzen 7 3800XT CPU
AMD Radeon VII x2 (vBIOS 106)

I'm attempting to pass through the second GPU (not used for boot), so it's not an issue of the boot process grabbing the GPU. I have tried passing both cards, with the exact same results.
Both vendor-reset and "echo 'device_specific' > /sys/bus/pci/devices/0000:xx:00.0/reset_method" are in place.
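For reference, you can read the attribute back to confirm the override took effect (a quick check, using the 0f:00.0 address from the dmesg output below):

Code:
# Show the reset method(s) currently enabled for the GPU function
cat /sys/bus/pci/devices/0000:0f:00.0/reset_method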

Here is output from dmesg:
[ 1824.819222] vfio-pci 0000:0f:00.0: AMD_VEGA20: version 1.0
[ 1824.819230] vfio-pci 0000:0f:00.0: AMD_VEGA20: performing pre-reset
[ 1824.819363] vfio-pci 0000:0f:00.0: AMD_VEGA20: performing reset
[ 1824.842702] vfio-pci 0000:0f:00.1: Refused to change power state from D0 to D3hot
[ 1825.338703] vfio-pci 0000:0f:00.0: AMD_VEGA20: psp mode1 reset succeeded
[ 1825.338709] vfio-pci 0000:0f:00.0: AMD_VEGA20: performing post-reset
[ 1825.358698] vfio-pci 0000:0f:00.1: Refused to change power state from D0 to D3hot
[ 1825.378698] vfio-pci 0000:0f:00.0: Refused to change power state from D0 to D3hot
[ 1825.378701] vfio-pci 0000:0f:00.0: AMD_VEGA20: reset result = 0
[ 1825.398695] vfio-pci 0000:0f:00.0: Refused to change power state from D0 to D3hot

Attempted:
Various kernels (5.13, 5.15, 5.19, 6.0)
Both GPU vBIOS versions (105, 106)
Various motherboard BIOS versions

I'm close to giving up, but I hate to do that considering this build is so perfect for this use case. The W11 VM (on first boot) runs amazingly, with near-bare-metal performance. So I'm looking for any advice I can get and really hoping that SOMEONE has actually had a Radeon VII passthrough that functions for more than one start of a VM.
 
gnif himself reported that this issue (VEGA20 only working once) cannot be solved by vendor-reset, unfortunately.
I did see that note on the GitHub page. But I've seen a few other posts where people are using a Radeon VII and seemed to MAYBE have things working.
Often people stop posting once they have their system functioning, so the final solution doesn't get relayed.
 
So, I have my solution and will share it for anyone else running into the same roadblock.

I have dual GPUs in this system, so the files reflect that, but I can confirm the solution works for both of them, so a single-GPU system should be able to employ this workaround.

Some of the details of my system setup may not matter (e.g. kernel version); I did not fully re-test on the stock kernel after finding success. If you change your kernel to 6.0.9-edge, be sure to also install the kernel headers and add the vendor-reset module to the new kernel (dkms). If you are unsure the module is loaded, run 'uname -a' and then 'dkms status' to confirm it is built for your current kernel (see my 'dkms status' below showing the module installed for all kernels).
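As a quick sanity check (assuming the dkms module is named vendor-reset, as in my setup):

Code:
# Show the running kernel and confirm vendor-reset is built for it
uname -r
dkms status | grep vendor-reset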

SUMMARY
I installed and set everything up in various ways based on lots of scouring of this and other forums. I was finally able to get GPU passthrough working for a Windows 11 VM and a MacOS VM. My fun stopped when I realized vendor-reset was not working for my GPUs.

The VM would run perfectly after a fresh host boot and a single VM run, but if the VM was stopped and then re-started, GPU passthrough was broken because the GPU refused to reset its power state. 'dmesg -w' would reveal multiple errors stating that the reset failed ('reset result = 0') and that the GPU 'Refused to change power state from D0 to D3hot'.

The system is still employing the 'vendor-reset' kernel module along with the 'device_specific' reset_method override needed for newer kernels. I don't know if this is actually key to things working with the manual reset script; I haven't tried removing it to test and I likely will not bother for now.

The manual reset solution was presented elsewhere, but the exact script I saw did not work for me; the order of the commands had to be modified. My version does not require pressing the power button to wake the system; it comes back automatically. This isn't an ideal solution, but I believe it's the only one due to an actual hardware/firmware flaw in the Radeon VII. The good news is that this reset works great! I can even run it while another TrueNAS VM is active and running, and it doesn't seem to crash it or cause any problems. However, it will affect access to that VM for a few seconds, so don't run 'gpu-reset' if anything important is currently being accessed from another running VM. Network connectivity and disk activity are halted for 8-12 seconds, so data loss can occur if you aren't careful about other running VMs.

Small bonus: I have OpenRGB installed on this server (a dual-GPU, fully water-cooled beast, retired from ETH mining) and use the OpenRGB server/client feature to control the lighting. This gpu-reset process does not reset the RGB lighting on the system! I know it's a little corny, but the PC is very visible and quite good to look at, and that default rainbow cycling makes me nauseous, lol.

I hope this helps someone else out there! The Radeon VIIs have a special place in my heart, as they really cranked out the Ethereum hash rates for me, and since these have water blocks, changing GPUs just to play with VMs isn't an option.


Neofetch output:

OS: Proxmox VE 7.4-3 x86_64
Kernel: 6.0.9-edge
Uptime: 20 hours, 58 mins
Packages: 1040 (dpkg)
Shell: bash 5.1.4
Terminal: /dev/pts/0
CPU: AMD Ryzen 7 3800XT (16) @ 3.900GHz
GPU: AMD ATI Radeon VII
GPU: AMD ATI Radeon VII
Memory: 13671MiB / 64211MiB

Motherboard: Asus Crosshair VII Hero (bios 5003) (x470 platform/chipset)

Important MB settings:

Advanced\CPU:

NX Mode - Enabled
SVM Mode - Enabled
Advanced\PCI Subsystem Settings:
Above 4G Decoding - Enabled *
Resize Bar Support - Disabled *
SR-IOV Support - Enabled *
Advanced\AMD CBS\NBIO Common Options:
IOMMU - Enabled
ACS Enable - Enable
PCIe ARI Support - Enable *
PCIe ARI Enumeration - Enable *
PCIe Ten Bit Tag Support - Enable *

* This setting may or may not be critical, but this was its value on a fully functioning system.

Important PVE files and settings:

root@bde-waterserver:~# cat /etc/default/grub

# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
# info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on iommu=pt"
GRUB_CMDLINE_LINUX=""

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal (grub-pc only)
#GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"

root@bde-waterserver:/etc/modprobe.d# cat blacklist.conf
blacklist amdgpu
blacklist radeon
blacklist nouveau
blacklist nvidia

root@bde-waterserver:/etc/modprobe.d# cat dkms.conf
# modprobe information used for DKMS modules
#
# This is a stub file, should be edited when needed,
# used by default by DKMS.

root@bde-waterserver:/etc/modprobe.d# cat iommu_unsafe_interrupts.conf
options vfio_iommu_type1 allow_unsafe_interrupts=1

root@bde-waterserver:/etc/modprobe.d# cat kvm.conf
options kvm ignore_msrs=1

root@bde-waterserver:/etc/modprobe.d# cat pve-blacklist.conf
# This file contains a list of modules which are not supported by Proxmox VE

# nidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb

root@bde-waterserver:/etc/modprobe.d# cat snd-hda-intel.conf

root@bde-waterserver:/etc/modprobe.d# cat vfio.conf

options vfio-pci ids=1002:66af,1002:ab20 disable_vga=1
softdep amdgpu pre: vfio vfio_pci

**Vendor-reset installed using this guide: https://www.nicksherlock.com/2020/11/working-around-the-amd-gpu-reset-bug-on-proxmox/

root@bde-waterserver:/etc/modprobe.d# dkms status

vendor-reset, 0.1.1, 5.13.9-1-edge, x86_64: installed
vendor-reset, 0.1.1, 5.15.104-1-pve, x86_64: installed
vendor-reset, 0.1.1, 6.0.9-edge, x86_64: installed

**A systemd service is created to make sure the vendor-reset module is loaded and to echo 'device_specific' into each GPU's 'reset_method' attribute. Create the file at /etc/systemd/system/vendor-specific.service, then run 'systemctl enable vendor-specific.service' and 'systemctl start vendor-specific.service'.

root@bde-waterserver:/etc/systemd/system# cat vendor-specific.service

[Unit]
Description=Set the AMD GPU reset method to 'device_specific'
After=multi-user.target

[Service]
ExecStart=/usr/bin/bash -c '/usr/sbin/modprobe vendor-reset && /usr/bin/echo device_specific > /sys/bus/pci/devices/0000:0f:00.0/reset_method && /usr/bin/echo device_specific > /sys/bus/pci/devices/0000:0c:00.0/reset_method'

[Install]
WantedBy=multi-user.target

**Create a bash script to manually reset the GPU(s) after the VM is shut down. IMPORTANT: the VM seems to require a 'Hard Stop' command from the console to fully release the GPU from the host. I may look for a way to automate this later. A shutdown from within the VM, or a 'Shutdown' command, is not sufficient for the reset to work. I usually shut down within the VM and then 'Hard Stop' from the PVE console.
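One possible way to automate this later would be a Proxmox guest hookscript that calls the reset script after the VM stops. This is just an untested sketch (the snippet path and file name are placeholders, and it does not replace the manual 'Hard Stop' step described above):

Code:
#!/bin/bash
# Hypothetical hookscript, e.g. saved as /var/lib/vz/snippets/gpu-reset-hook.sh
# Attach it with: qm set <vmid> --hookscript local:snippets/gpu-reset-hook.sh
# Proxmox calls the hookscript with two arguments: the VMID and the phase.

vmid="$1"
phase="$2"

if [ "$phase" = "post-stop" ]; then
    # Run the manual GPU reset once the VM has fully stopped.
    /usr/local/bin/gpu-reset
fi

exit 0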

**Place the script at /usr/local/bin/gpu-reset. After the VM is shut down correctly, enter the command 'gpu-reset'. The script pauses briefly between commands just for safety/caution: a system wake-up timer is set for 8 seconds, the system is sent into a suspend state, after wake-up the GPU(s) are removed from the PCI bus (I believe unbound from the driver), then the PCI bus is rescanned and the GPUs are added back. At that point a power_state reset has occurred.

root@bde-waterserver:~# cat /usr/local/bin/gpu-reset

#!/bin/bash
# Set a wake timer of 8 seconds, suspend the system, then remove GPUs 0f:00 and 0c:00 from the PCI bus and rescan

sleep 2
rtcwake -m no -s 8 && systemctl suspend
sleep 3
echo 1 > /sys/bus/pci/devices/0000:0f:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:0f:00.1/remove
echo 1 > /sys/bus/pci/devices/0000:0c:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:0c:00.1/remove
sleep 3
echo 1 > /sys/bus/pci/rescan

**The GPUs are attached to the VMs as a PCI Device, with different settings for W11 and MacOS. This may work with settings other than the ones I have here, but 'All Functions' and 'PCI-Express' set to on seemed to be mandatory for me. The W11 machine has 'Primary GPU' on but 'ROM-Bar' off. The MacOS VM has 'ROM-Bar' on but 'Primary GPU' off.
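As a side note, those GUI options map to the hostpci entries shown in the configs below and could also be set from the CLI. A sketch for the Windows VM (ID 103), covering everything in its hostpci0 line except the romfile option:

Code:
# Attach GPU 0f:00 (all functions) as a PCIe device, primary GPU, ROM-Bar off
qm set 103 --hostpci0 0000:0f:00,pcie=1,x-vga=1,rombar=0

The romfile=R7-105-UEFI.rom option in the config below points at a vBIOS dump, which Proxmox expects to find under /usr/share/kvm/.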

Windows 11 VM:

root@bde-waterserver:/etc/modprobe.d# cat /etc/pve/qemu-server/103.conf
agent: 1
args: -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=proxmox,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,hv_synic,hv_stimer,hv_tlbflush,hv_ipi,kvm=off'
balloon: 0
bios: ovmf
boot: order=virtio0
cores: 8
cpu: host
efidisk0: local-lvm:vm-103-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:0f:00,pcie=1,rombar=0,romfile=R7-105-UEFI.rom,x-vga=1
ide0: ext4-storage:iso/virtio-win-0.1.229.iso,media=cdrom,size=522284K
machine: pc-q35-7.2
memory: 8192
meta: creation-qemu=7.2.0,ctime=1681232752
name: Win11-02
net0: e1000=0A:AE:B6:DE:65:0C,bridge=vmbr0,firewall=1
numa: 0
ostype: win11
scsihw: virtio-scsi-single
smbios1: uuid=ba8e2087-da2b-4fa7-bb75-a29096423deb
sockets: 1
tpmstate0: local-lvm:vm-103-disk-1,size=4M,version=v2.0
usb0: host=1b1c:1b35
usb1: host=320f:5000
vga: none
virtio0: local-lvm:vm-103-disk-2,discard=on,iothread=1,size=80G
vmgenid: 78701035-a047-4d35-83f7-65152d3b3679

Mac OS Monterey VM:
root@bde-waterserver:/etc/modprobe.d# cat /etc/pve/qemu-server/104.conf
agent: 1
args: -device isa-applesmc,osk="ourhardworkbythesewordsguardedpleasedontsteal(c)AppleComputerInc" -smbios type=2 -device usb-kbd,bus=ehci.0,port=2 -global nec-usb-xhci.msi=off -global ICH9-LPC.acpi-pci-hotplug-with-bridge-support=off -cpu Penryn,vendor=GenuineIntel,+invtsc,+hypervisor,kvm=on,vmware-cpuid-freq=on
balloon: 0
bios: ovmf
boot: order=virtio0;net0
cores: 8
cpu: Penryn
efidisk0: local-lvm:vm-104-disk-0,efitype=4m,size=4M
hostpci0: 0000:0f:00,pcie=1
machine: q35
memory: 8192
meta: creation-qemu=7.2.0,ctime=1681331224
name: MacOSMonterey
net0: vmxnet3=0A:87:67:5F:A1:29,bridge=vmbr0,firewall=1
numa: 0
ostype: other
scsihw: virtio-scsi-pci
smbios1: uuid=ad50f489-e055-47f5-9fcb-ce7ee62bfc15
sockets: 1
usb0: host=1b1c:1b35,usb3=1
usb1: host=320f:5000,usb3=1
vga: none
virtio0: local-lvm:vm-104-disk-1,cache=unsafe,discard=on,size=64G
vmgenid: 59a24833-92e8-4e30-bc9f-4199f90c113d




 
I've been struggling with this same thing, and the results have only been black screens on boot and reinstalling everything from scratch. If you would like to help me with this same problem, I sure would appreciate it. Radeon VII on an Aorus Master X570 motherboard, no GRUB.
 
I have a Radeon VII too, but the rest didn't work for me. The guest system (Windows 10) reports code 43.
 
Did you try my reset script? Nothing else really works but that does. You just have to manually run it after a VM is stopped.

You place this script into /usr/local/bin/gpu-reset

Then, when you shut down a VM that was using the GPU, go to the Proxmox command line/shell and type "gpu-reset".

This will put the PC to sleep briefly (the script sets a 10-second RTC wake timer) and then wake it back up. You may need to configure your BIOS for "wake from real time clock = enabled" or something like that.
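If you want to confirm RTC wake works on your board before relying on the script, a standalone test along these lines should do it (the 15-second value is arbitrary):

Code:
# Suspend to RAM and let the RTC wake the machine ~15 seconds later.
# Run this from a local console; the host is unreachable while asleep.
rtcwake -m mem -s 15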

I have 2 GPUs, so I have 2 devices listed in that script. If you only have one, then obviously you only need the device addresses for that one. They will likely be different from mine; they will be something like 0000:xx:xx.0 and 0000:xx:xx.1.

You can find your device addresses with the "lspci" command in the Shell. Look for the VGA compatible controller... Vega 20, and the Audio device (right below it).
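For example (the bus addresses are from my system and will differ on yours; the [1002:66af] / [1002:ab20] IDs match the vfio.conf earlier in the thread):

Code:
# List the Radeon VII GPU and its HDMI audio function with [vendor:device] IDs
lspci -nn | grep -i 'vega 20'
# Expect two entries per card, roughly like:
#   0f:00.0 VGA compatible controller ... Vega 20 [Radeon VII] [1002:66af]
#   0f:00.1 Audio device ... Vega 20 HDMI Audio [1002:ab20]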

I found that the original script listed above was a little unreliable, so I updated it to the following. I now have 3 scripts:
gpu-reset: put system to sleep and remove/rescan all pcie devices
gpu-scan: rescan the pcie devices (again, if needed)
gpu-ps: list the GPU power state. If the reset worked, they will show as D3hot. If they show D0, then it's not reset.

/usr/local/bin/gpu-reset

Code:
#!/bin/bash

# Set a wake timer of 10 seconds, suspend the system, then remove GPUs 0f:00 and 0c:00 from the PCI bus and rescan (twice)

rtcwake -m no -s 10 && systemctl suspend
sleep 6
echo 1 > /sys/bus/pci/devices/0000:0f:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:0f:00.1/remove
echo 1 > /sys/bus/pci/devices/0000:0c:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:0c:00.1/remove
sleep 2
echo 1 > /sys/bus/pci/rescan
sleep 2
echo 1 > /sys/bus/pci/devices/0000:0f:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:0f:00.1/remove
echo 1 > /sys/bus/pci/devices/0000:0c:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:0c:00.1/remove
sleep 2
echo 1 > /sys/bus/pci/rescan

/usr/local/bin/gpu-scan

Code:
#!/bin/bash

# rescan pci bus devices

echo 1 > /sys/bus/pci/devices/0000:0f:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:0f:00.1/remove
echo 1 > /sys/bus/pci/devices/0000:0c:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:0c:00.1/remove
sleep 2
echo 1 > /sys/bus/pci/rescan

/usr/local/bin/gpu-ps

Code:
#!/bin/bash

# display current power states of pcie devices

echo "device 0000:0c:00.0"
cat /sys/bus/pci/devices/0000:0c:00.0/power_state
echo "device 0000:0c:00.1"
cat /sys/bus/pci/devices/0000:0c:00.1/power_state
echo "device 0000:0f:00.0"
cat /sys/bus/pci/devices/0000:0f:00.0/power_state
echo "device 0000:0f:00.1"
cat /sys/bus/pci/devices/0000:0f:00.1/power_state

After you shut down a VM that had the GPU passed through, I also hit the "Hard Stop" button on the console for that VM. Then go to the main server Shell and run each command in turn, pressing Enter after each one:

gpu-reset
gpu-scan
gpu-ps

Then, you can start a new VM with the GPU passed through successfully!
 
Thanks, I will try it. When I try it I will reply to you. Thank you!
 
It works... thank you, you are my god!
This post also reminds you to run:
```
apt install pve-headers-$(uname -r)
apt install git dkms build-essential
git clone https://github.com/gnif/vendor-reset.git
cd vendor-reset
dkms install .
echo "vendor-reset" >> /etc/modules
update-initramfs -u
shutdown -r now
```
 
OH MY GOD
YOUR SCRIPT REALLY WORKS
vendor-reset only works once; after the VM reboots it's code 43 again, and dmesg shows 'Failed to send message 0x25: return 0x0' from the AMD reset.
With your script, the dmesg output is:
[ 2456.918748] vfio-pci 0000:09:00.0: enabling device (0400 -> 0403)
[ 2456.918901] vfio-pci 0000:09:00.0: AMD_VEGA20: version 1.0
[ 2456.918903] vfio-pci 0000:09:00.0: AMD_VEGA20: performing pre-reset
[ 2456.919008] vfio-pci 0000:09:00.0: AMD_VEGA20: performing reset
[ 2457.190361] vfio-pci 0000:09:00.0: AMD_VEGA20: no SOL, not attempting BACO reset
[ 2457.190363] vfio-pci 0000:09:00.0: AMD_VEGA20: performing post-reset
[ 2457.202499] vfio-pci 0000:09:00.0: AMD_VEGA20: reset result = 0
[ 2457.245968] vfio-pci 0000:09:00.0: AMD_VEGA20: version 1.0
[ 2457.245971] vfio-pci 0000:09:00.0: AMD_VEGA20: performing pre-reset
[ 2457.246093] vfio-pci 0000:09:00.0: AMD_VEGA20: performing reset
[ 2457.522227] vfio-pci 0000:09:00.0: AMD_VEGA20: no SOL, not attempting BACO reset
[ 2457.522229] vfio-pci 0000:09:00.0: AMD_VEGA20: performing post-reset
[ 2457.534374] vfio-pci 0000:09:00.0: AMD_VEGA20: reset result = 0
So I guess that means your script fixes vendor-reset's bug.
 
I had to do a lot of searching for help to get mine working too. I'm very happy I could pass the help along!
 
Hi, I'm joining the thread because I unfortunately lost 2 Radeon Pro VII cards.
I use an HP DL580 G9 server at home (Virtual Environment 8.4.1). I recently purchased 3 Radeon Pro VII cards. The same reset bug was solved for me by the vendor-reset project, but it didn't work at first (dmesg simply didn't show any reset messages). To make it work I used '/usr/local/bin/gpu-reset' with the addresses of my cards; after running this script, vendor-reset actually worked every time.
I installed Debian and the ROCm driver for computing with LLM models, and everything worked despite the i2c errors reported by the card driver. That was until an idle card tried to go into power-saving mode: I only saw a yellow triangle on the VM, after which the Pro VII card died.
Unfortunately I lost 2 cards this way while trying to figure out what was going on. Probably something goes wrong when entering power-save mode and it kills the GPU core.
Maybe someone has had a similar experience, or is it worth disabling power-saving modes for Vega 20?

P.S.
Sorry for the quality of the screenshots, but they are the only trace I have left of what happened. Unfortunately I don't have the last error message from when the card tried to go into power-saving mode and then died for good. The errors in the screenshots show what I was getting while the cards worked for ROCm after the gpu-reset script and vendor-reset.

Greetings
 

Attachments: two screenshots (preview.webp, preview (1).webp)