AMD Reset Bug on Linux VMs Only

NLay · New Member · Aug 12, 2021
Good day,

I have a weird issue that I cannot seem to understand. I have a Hades Canyon NUC [NUC8i7HVK], which has both an Intel iGPU and an AMD Radeon RX Vega M GH GPU (identified as Polaris 22 XT). I have the latest BIOS from Intel (version 67, if I remember correctly).

I installed Proxmox 6.4-13 and everything went smoothly with passing through the AMD GPU to a Windows 10 VM: I can see output on the screen, and once I installed the AMD drivers I can reboot/shut down the VM without any hiccups.

However, whenever I try to pass the AMD GPU through to any Linux flavor (I tried Ubuntu 18.04/20.04 and Fedora 32, 33, and 34), I only get output on the screen once per host power cycle: if I reboot the VM, the host freezes and I have to reboot it manually. If I shut the VM down instead, everything keeps working until I try to start it up again, at which point the host freezes as well.
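The once-per-power-cycle pattern is what makes me suspect the reset bug. One way to check whether the card advertises any software reset mechanism (a diagnostic sketch; 01:00.0 is the GPU's address on this system, run as root):

Code:
# Look for "FLReset" in the DevCap line. "FLReset-" means the card
# offers no function-level reset, which matches the reset-bug behavior.
lspci -vv -s 01:00.0 | grep -i flreset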

Here is what I have tried so far:
1- Installed vendor-reset by gnif: resulted in no change.
2- Tried the older Navi reset kernel patch, both the 1st and 2nd versions: still no change.
3- When I created the VM, I made the passed-through GPU the primary GPU: once I removed that from the VM conf file, the host wouldn't freeze, but I get no output on the screen and can only see the desktop on the console. If I run lspci -v -s 01:00 (the PCI address of the GPU), I get "!!! Unknown header type 7f" both inside the VM and on the host.
4- I tried providing Proxmox with the vBIOS ROM of the GPU: first I tried to dump it on the Proxmox host, but since the AMD GPU is the primary GPU of the NUC, the ROM seems to be shadowed and I always end up with an input/output error when writing it to a file (see the sysfs sketch after this list). So I installed Windows, the AMD drivers, and then GPU-Z, and successfully extracted the ROM file that way. Still, I end up with the same result: the ROM file works fine for a Windows 10 VM but not for any Linux VM.
5- I even tried plain KVM on Fedora 33/34 and then ESXi 6.7/7, and I get more or less the same results.
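For completeness, here is roughly what the sysfs dump attempt in point 4 looked like; it fails with the input/output error exactly because the ROM of the boot GPU is shadowed. The file name is made up for illustration:

Code:
# Attempt to dump the vBIOS through sysfs (run as root):
cd /sys/bus/pci/devices/0000:01:00.0
echo 1 > rom                          # enable reading the ROM
cat rom > /usr/share/kvm/vega-m.rom   # hypothetical file name
echo 0 > rom                          # disable again

# A working ROM (e.g. the GPU-Z dump) placed in /usr/share/kvm/ can
# then be referenced in the VM config:
#   hostpci0: 01:00,pcie=1,romfile=vega-m.rom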


I am at my wits' end here and would appreciate any pointers, even if just to explain this weird behavior.

All the best,
 
Just my thoughts on your questions:
1. vendor-reset does wonders for supported AMD GPUs, but the Vega M (which is indeed not a Vega but a Polaris type) does not seem to be supported.
2. The older reset patches were never perfect, failed for various cards and were not meant for Polaris, if I remember correctly.
3. The Primary GPU option does a lot for NVIDIA cards but, in my experience, should not be used for AMD ones. Just pass through the GPU and set Display to None (or add a virtual serial console); see the config sketch after this list.
4. Indeed, you need to make sure the card is not initialized before reading the ROM, which means you cannot do it while it is used during boot. Sometimes you can find a compatible ROM file at TechPowerUp. In my experience, you don't need the ROM for AMD cards.
5. Your particular choice of hardware for virtualization is probably uncommon and not well supported.
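A minimal sketch of what point 3 would look like in /etc/pve/qemu-server/<vmid>.conf, assuming your GPU's 01:00 address:

Code:
# Pass the GPU through without marking it as the primary GPU.
# (pcie=1 assumes a q35 machine type)
hostpci0: 01:00,pcie=1
# Display = None in the GUI:
vga: none
# Optional virtual serial console as a fallback:
serial0: socket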

I think that the Linux drivers inside the VM leave the Vega M, on shutdown, in a hardware state that those same drivers cannot handle on the next start, and that the Proxmox host cannot properly reset the device from. It would appear that the Windows drivers either can handle that situation or don't leave the device in a state that requires a power cycle to reset. You could try starting the Windows VM after shutting down the Linux VM and see if that properly resets the GPU.

It is not uncommon for devices not to reset properly, and it appears that PCI passthrough is not something hardware vendors test for. Often this can be worked around by not using the device on the host and binding it to vfio-pci before any other driver can touch it (see the sketch below). However, your GPU is used during POST/BIOS/boot and therefore needs to be reset, and it does do that well enough at least once. Is there no way to change the primary GPU in the BIOS?
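For completeness, early binding would look roughly like this, although it will not fully help while the card is the boot GPU. The device ID shown is an assumption; verify it with lspci on your own system:

Code:
# Find the vendor:device ID of the GPU first:
lspci -nn -s 01:00
# Then claim it with vfio-pci before amdgpu can bind (1002:694c is
# an assumption -- use the ID from the command above):
echo "options vfio-pci ids=1002:694c" >  /etc/modprobe.d/vfio.conf
echo "softdep amdgpu pre: vfio-pci"   >> /etc/modprobe.d/vfio.conf
update-initramfs -u   # then reboot the host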

So your problem is probably a Linux driver that cannot handle the state it finds the device in after a shutdown. Please check whether the Windows VM can be used as a workaround. If you can make the system boot from the Intel iGPU, you can try the ROM file and/or prevent the device from being used by the host.
 

Thank you @avw for sharing your experience. I appreciate these points. I will definitely keep them in mind in future projects.

avw said: Is there no way to change the primary GPU in the BIOS?

I tried that approach. However, the only option there is to enable/disable the Intel iGPU. Moreover, the HDMI output is only connected to the AMD GPU.

I totally agree with you that the main issue is how the Linux drivers leave the GPU in an unreadable state (maybe they send it to sleep in D3 and it cannot be woken up). Regarding TechPowerUp, I found three ROM versions for this AMD GPU and tried all three of them. Unfortunately, that didn't work; I got the same results as with the ROM I dumped using GPU-Z on Windows.
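If the D3 theory is right, the state should be visible in the PCI power-management registers. A diagnostic sketch on the host (run as root; the low two bits of PMCSR are the power state, 0 = D0, 3 = D3hot):

Code:
# Read the GPU's PMCSR register (offset 4 in the PM capability):
setpci -s 01:00.0 CAP_PM+4.b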
 
I figured out that running setpci on the host can recover the GPU, and I am then able to start up the Ubuntu VM.

Since the Windows AMD drivers can recover the GPU from within the VM (i.e., no host scripts are required), how would I run setpci inside the Linux VM before the amdgpu kernel driver binds? Once amdgpu binds to the rebooted GPU, the host and VM freeze.

In a nutshell, is it possible to run setpci at boot, before the amdgpu kernel module attaches itself to the GPU?
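Something like the following is what I have in mind, with amdgpu blacklisted on the guest kernel command line so nothing touches the GPU at boot. The setpci line is a placeholder for whatever command works:

Code:
# Inside the Linux guest, booted with "modprobe.blacklist=amdgpu",
# e.g. from a oneshot systemd unit or rc.local.
# 01:00.0 is the guest-side address; check with lspci in the guest.
setpci -s 01:00.0 CAP_PM+4.b=0   # placeholder: force device to D0
modprobe amdgpu                  # bind the driver only afterwards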
 
Since Proxmox is also Linux, why not just run the command in the pre-start phase of a hookscript (see the sketch below)?

PS: Can you share the exact setpci command that fixes this issue (for other future users)?
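A sketch of what that hookscript could look like, with VM ID 100 assumed and a placeholder for the recovery command (since the exact setpci invocation hasn't been shared yet). Note that the "snippets" content type must be enabled on the storage:

Code:
cat > /var/lib/vz/snippets/gpu-reset.sh <<'EOF'
#!/bin/bash
# Proxmox calls hookscripts with $1 = VM ID and $2 = phase
# (pre-start / post-start / pre-stop / post-stop)
if [ "$2" = "pre-start" ]; then
    # placeholder: the actual recovery command found by the OP
    setpci -s 01:00.0 CAP_PM+4.b=0
fi
EOF
chmod +x /var/lib/vz/snippets/gpu-reset.sh
qm set 100 --hookscript local:snippets/gpu-reset.sh   # VM ID 100 assumed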
 
Could you include the steps you followed to solve the problem mentioned in your post, and close it?
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!