GPU Reset script not working on new kernel

paulmorabi

Mar 30, 2019
Hi,

I have a Ryzen 2700 system with both an RX580 and an NVIDIA 1050 Ti passed through to separate VMs. I was having issues when VMs were stopped or rebooted from within the guest. I found this script online and set a cron job to run it on every reboot:


Code:
echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind

I've recently upgraded to kernel "Linux pve 5.4.65-1-pve" and now the script is giving errors:

Code:
/root/nvidia.sh: line 2: /sys/class/vtconsole/vtcon1/bind: No such file or directory
/root/nvidia.sh: line 3: echo: write error: No such device

These devices seem to no longer exist. Has anything changed recently in the kernel that would affect this? Anywhere else I can look for these devices? Within /sys/class/vtconsole/ there is now only vtcon0.

Also, has this issue possibly been fixed or partially addressed? I can reboot my RX580, but usually if I try to actually use the GPU, e.g. running a game, I get blank screens or the whole system hangs.
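In case it helps anyone hitting the same errors: a guarded variant of the script (untested on every kernel, and written on the assumption that the set of vtcon devices simply changed between kernel versions) would only touch what actually exists:

```shell
#!/bin/bash
# Unbind whatever virtual consoles are present instead of
# hard-coding vtcon0/vtcon1; newer kernels may register fewer.
for vtcon in /sys/class/vtconsole/vtcon*; do
    [ -e "$vtcon/bind" ] && echo 0 > "$vtcon/bind"
done

# Only unbind the EFI framebuffer if the driver is actually bound.
if [ -e /sys/bus/platform/drivers/efi-framebuffer/efi-framebuffer.0 ]; then
    echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind
fi
```

That way a missing vtcon1 is silently skipped instead of producing a "No such file or directory" error.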
 
FOR AMD GPU only

Run "lspci | grep VGA -A 1" to find the correct PCI IDs.

Use this script, replacing XX: with the aforementioned IDs:

Code:
#!/bin/bash
#
# Replace XX with the PCI bus number of your GPU; the .0 and .1
# functions are the graphics device and its audio counterpart.
#
echo "disconnecting amd graphics"
echo "1" | tee -a /sys/bus/pci/devices/0000\:XX\:00.0/remove
echo "disconnecting amd sound counterpart"
echo "1" | tee -a /sys/bus/pci/devices/0000\:XX\:00.1/remove
echo "entered suspended state, press power button to continue"
echo -n mem > /sys/power/state
echo "reconnecting amd gpu and sound counterpart"
echo "1" | tee -a /sys/bus/pci/rescan
echo "AMD graphics card successfully reset"

Do note: this will suspend the Proxmox host and you need to press the power button to resume; a remote machine will no longer respond.
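If you'd rather not hard-code the IDs, something like this could find them for you (a hypothetical helper, not part of the original script; it assumes lspci's usual output format and that the audio function sits at .1 of the same device):

```shell
#!/bin/bash
# Find the PCI bus address (domain:bus:dev.fn) of the first
# AMD/ATI VGA controller, e.g. 0000:26:00.0.
GPU_ID=$(lspci -D | awk '/VGA compatible controller.*(AMD|ATI)/ {print $1; exit}')
echo "GPU found at ${GPU_ID}"

# The audio counterpart is usually the same address with .1 instead of .0:
AUDIO_ID="${GPU_ID%.0}.1"
echo "audio counterpart assumed at ${AUDIO_ID}"
```

You could then use ${GPU_ID} and ${AUDIO_ID} in the remove/rescan script instead of editing it by hand.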
 
Thanks for this. When should this be run? When the VM is shut down? And if it suspends the host, then other VMs will also be suspended?
 
Ah, yes. Run the script between stopping and starting the VM with passthrough.

FYI: yesterday I wrote an angry mail to AMD for not clarifying how to pinpoint or resolve this, since two explanations exist: one blaming the AMD CPU (first-gen Ryzen only) and the other blaming the AMD GPU (unknown which models), but no definitive answer from AMD on how to tell them apart, AFAIK.

Know that upgrading the BIOS on your motherboard is key and may in some cases remove the need for the reset.
 
OK, so firstly, to be 100% sure: if I run the above script, I need to do so after shutting down the VM with the AMD GPU passed through. And when I do run it, it will put the whole PC to sleep, so any other running VMs will also be suspended? I am not sure about this last part.

And also, this won't address reboots from within the VM. If a reboot is needed, it's best to shut down instead?

Longer background - Previously, I had the 1050ti and RX580 in opposite slots. At that point, everything worked fine, no reset bug. I'm using an MSI B450M Mortar (can't remember BIOS revision off hand). I swapped the RX580 to the primary slot (x16) instead of (x8) and this is when the problem began. The script I pasted originally seemed to fix it.

Overall, until this issue, I've maybe been very lucky. I also have an RX570 in another X570 board with a 3600 and it works 100% fine also.

I know this is very much an edge case for AMD but I really do hope they fix it as virtualization is not going to get less important.
 
Coincidentally I also have an MSI motherboard (X370); the BIOS is up to date.
Make sure a BIOS with AGESA 1.0.0.6 is present to enable IOMMU groups, which are essential for passthrough, if I understand correctly.

To my understanding, powering down the VM is required.
Resetting the GPU (or GPUs) will suspend the Proxmox host and the running VMs.
Suspend is a 'mechanical' operation, so no data should be lost: the system remains powered but lets go of some controls.

AMD knows this very well and wants to monopolise it, I guess. Such bugs should really be quite easy to fix if people can come up with working patches (essentially workarounds for a developer mistake) in their spare time. Or would that be marketing?
 
You should really consider upgrading the BIOS:


Version 7B89v1E Release Date 2020-06-16 File Size 9.28 MB

Description

- Updated AMD AGESA ComboAm4PI 1.0.0.6
- Fix HDMI audio lost issue when use AMD RX570 vga card.
 
FWIW, I now have my Radeon VII working in passthrough again for a Linux VM. Feel free to set up a private chat to discuss your setup.
Paired with NX (nomachine.com) it works at extreme speed and high quality, even with the free version.
 
@Joris L. Did the BIOS upgrade and now unable to boot Proxmox. I'm getting the following error and boot is hanging:

Code:
vfio-pci 0000:26:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem

I re-enabled SVM and IOMMU in the BIOS but no luck. I tried a rescue boot from the Proxmox install ISO but it too hangs shortly after some vfio-pci logging. I guess I need to disable the vfio-pci driver completely, but I need to get to the root file system first to edit it? Any ideas what else I could try, or how to get it to at least boot so I can start my other VMs?

EDIT 2: Booted an Ubuntu live CD and mounted the root partition. I moved /etc/modprobe.d/vfio.conf to another directory. Boot now seems to proceed further until it gets to loading the PVE firewall, API etc., then hangs. It does the same thing when trying a rescue boot. It looks like no network is detected, as it stalls trying to mount an NFS share before proceeding. However, networking is fine in the Ubuntu live CD.
 

Ouch, sorry to read this is happening.
Looking in my documentation, I found "vfio-pci 0000:26:00.0: vgaarb: ...." to be normal during boot; it does not state an error anyway.

Check whether vfio.conf contains the right PCI IDs; these may change after a BIOS upgrade or reconfiguration.
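A quick way to compare the IDs in vfio.conf against what the hardware reports now (a sketch; lspci -nn prints the [vendor:device] pairs that the vfio-pci ids= option expects):

```shell
#!/bin/bash
# What vfio.conf currently claims:
grep -h 'ids=' /etc/modprobe.d/vfio.conf

# What the hardware actually reports now; the bracketed
# vendor:device pairs are what belongs in the ids= option:
lspci -nn | grep -Ei 'vga|audio' | grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]'
```

If the two sets differ, vfio-pci is grabbing the wrong (or no) devices at boot.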

Please share the NFS hang message; I think the boot command is simply incomplete and not set to mount the root filesystem correctly.
 
I'm guessing the PCI IDs and the network device have changed as well. I am not sure, though, because I can't get to a terminal. When I had vfio.conf in place, it hung on that exact line every time. Once removed, it hangs at different stages, usually when the PVE services/daemons are starting. Thinking it was the network, I reverted /etc/network/interfaces to loopback only, but it's still hanging on boot.

Maybe I need to regenerate the initramfs? If so, I'm not quite sure how to do this without being able to get to a terminal. Possibly I can boot with a live CD, mount the root partition and chroot into it to try to regenerate the initramfs?
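If it comes to that, I assume the procedure would look roughly like this (untested; /dev/sda3 is just a placeholder for wherever the Proxmox root actually lives):

```shell
#!/bin/bash
# From the Ubuntu live CD: mount the Proxmox root, bind the
# pseudo-filesystems the tooling needs, chroot in, and rebuild
# the initramfs for all installed kernels.
mount /dev/sda3 /mnt            # placeholder device, adjust to your layout
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
chroot /mnt update-initramfs -u -k all
```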

I can get the boot parameters from GRUB, but they're unchanged from before the BIOS upgrade. I tried editing them before boot, removing the PCI override, IOMMU settings etc., just to see how it goes. No change :(
 

Well, it is unfortunate, that is sure.

Know that when GPU passthrough works, it can look like the system is hanging; you should still be able to access it over SSH.

Typically a BIOS upgrade does not require configuration changes on the OS/Proxmox side.

When booting, press ESC at the boot screen and review the boot prompt: does it show the mount option for the filesystem? Remove any or all parameters beyond that so it boots in a clean state.

If the boot parameters for the filesystem(s) are missing, you should fix that first.
 
I updated my AMD AGESA Combo-AM4 1.0.0.4 Patch B to AMD AGESA Combo-AM4 PI 1.0.0.6, and my working Ryzen 2700X with 2x GPU and USB passthrough broke completely.
Could not get it to work again and reverted back to the previous working version on my ASRock X470 Master SLI. Note that the five(!) versions previous to that one also broke PCI passthrough.
Sounds like you upgraded to a similar version of AMD AGESA, and upgrading a BIOS can severely affect a Proxmox system. I do hope you get it to work again, but I could not.
 
@avw what were the symptoms of your issue? Could you boot? Did you try to reinstall?

I'm not sure I can revert as I remember MSI or AMD had blocked BIOS downgrading but I'll have to recheck.
 

Thanks for sharing. Initially I had similar issues when arriving at AGESA 1.0.0.6 on an X370 motherboard. What I did was simply reset the BIOS using the jumper on the motherboard, then reconfigure it.

I'd try again and rebuild the cmdline and passthrough configuration from scratch.
 
I could boot, but the system froze or rebooted when starting the previously working VMs with PCI passthrough. I reset the CMOS, loaded defaults and went over all settings. Some settings about IOMMU etc. were no longer available in the BIOS. I did not try reinstalling Proxmox and getting passthrough working again, because the system did not give me any error messages. Without a clue as to what was broken, and given that more AGESA versions did NOT work than did, I just reverted. But I have a different brand and chipset than you.

@Joris L. Which 1.0.0.6 version? AMD AGESA 1.0.0.6 from the end of 2018, or AMD AGESA Combo-AM4 PI 1.0.0.6 from summer 2020? They restarted version numbering several times... I wanted to use the first one, but my motherboard came with a later version (support for Zen 2), which would not let me downgrade, and I had to use a beta version to get passthrough to work.
 
