[SOLVED] Non-simultaneous GPU Passthrough to different VMs requires proxmox reboot

firobad66

Member
Dec 17, 2021
Hi,

I have successfully passed through a PCIe GPU (Radeon R9 285) to a Windows 10 VM and to an Ubuntu 20.04 VM.
Maybe noteworthy: I run Proxmox on top of Debian (with a GNOME desktop on another GPU), but of course with the Proxmox kernel (I just followed the wiki instructions).
The host is a fairly old 6-core AMD Phenom II 1090.
For remote desktop access I use NoMachine on both VMs.

While I completely understand that I can't share this GPU with both of them at the same time, I assumed I could at least use it in one VM at a time: when that one is switched off, the other VM could use it.

However, it turns out that this seems to require rebooting the Debian/Proxmox host. I used the Windows VM successfully with the GPU, switched it off, then turned on the Ubuntu VM, but Ubuntu won't come up:
- Not reachable with NoMachine
- Not reachable via ping
Looking at the summary stats of the VM in the web UI, RAM usage is quite high and one of the cores is fully busy (both VMs have 3 cores assigned). htop shows 99.x percent usage.

When I reboot the Proxmox host (as it's stuck, this takes some time), everything works again as before. The other way around (i.e. starting the Ubuntu VM with the GPU, shutting it down, then starting the Windows VM) works without a reboot.

Any idea where to start looking for this issue?
Let me know what files you need,

Thanks in advance for helping me out.
 
Maybe the AMD GPU reset bug? https://www.nicksherlock.com/2020/11/working-around-the-amd-gpu-reset-bug-on-proxmox/
Most modern AMD GPUs suffer from the AMD reset bug: the card cannot be reset properly, so it can only be used once per host power-on. The second time the card is used, Linux will attempt to reset it and fail, causing the VM launch to fail, or the guest, host, or both to hang.
Edit:
Maybe also interesting: https://hardforum.com/threads/fix-for-older-amd-gpus-reset-bug.2011343/
 
I have tried the solution from the second link, as it seems appropriate for my old GPU (Tonga chip); unfortunately, it doesn't work:

Once the Windows VM is shut down, there is no chance to get it working again or to start the Ubuntu VM.

In addition, when using the Ubuntu VM first after a host reboot, I was again able to shut the Ubuntu VM down and start the Windows VM successfully.

However, as soon as I turn off the Windows VM, I can turn on neither the Windows VM nor the Ubuntu one. Both VMs will use one core at nearly 100% and have high RAM usage. They don't come up, and no picture is shown on the screen attached to the HDMI output of the card. Only a host reboot helps then.

Maybe noteworthy: both VMs use SeaBIOS.

Any other ideas on how to fix this, as it's truly annoying?


Some more info from my side:
dmesg output while trying to start the VM:
Code:
[ 3665.651113] device tap100i0 entered promiscuous mode
[ 3665.697410] vmbr0: port 2(fwpr100p0) entered blocking state
[ 3665.697419] vmbr0: port 2(fwpr100p0) entered disabled state
[ 3665.697554] device fwpr100p0 entered promiscuous mode
[ 3665.697611] vmbr0: port 2(fwpr100p0) entered blocking state
[ 3665.697614] vmbr0: port 2(fwpr100p0) entered forwarding state
[ 3665.703115] fwbr100i0: port 1(fwln100i0) entered blocking state
[ 3665.703122] fwbr100i0: port 1(fwln100i0) entered disabled state
[ 3665.703312] device fwln100i0 entered promiscuous mode
[ 3665.703429] fwbr100i0: port 1(fwln100i0) entered blocking state
[ 3665.703432] fwbr100i0: port 1(fwln100i0) entered forwarding state
[ 3665.711044] fwbr100i0: port 2(tap100i0) entered blocking state
[ 3665.711052] fwbr100i0: port 2(tap100i0) entered disabled state
[ 3665.711263] fwbr100i0: port 2(tap100i0) entered blocking state
[ 3665.711267] fwbr100i0: port 2(tap100i0) entered forwarding state
[ 3670.655588] vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[ 3670.655604] vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[ 3740.562633] fwbr100i0: port 2(tap100i0) entered disabled state
[ 3740.583125] fwbr100i0: port 1(fwln100i0) entered disabled state
[ 3740.583356] vmbr0: port 2(fwpr100p0) entered disabled state
[ 3740.583498] device fwln100i0 left promiscuous mode
[ 3740.583501] fwbr100i0: port 1(fwln100i0) entered disabled state
[ 3740.611842] device fwpr100p0 left promiscuous mode
[ 3740.611852] vmbr0: port 2(fwpr100p0) entered disabled state

my /etc/modules:
Code:
vfio
vfio_iommu_type1
vfio_pci ids=1002:6939,1002:aad8
vfio_virqfd

my /etc/default/grub:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on vfio-pci.ids=1002:6939,1002:aad8"

my /etc/modprobe.d/vfio_pci.conf:
Code:
options vfio-pci ids=1002:6939,1002:aad8
softdep amdgpu pre: vfio-pci
 
Don't put ids=1002:6939,1002:aad8 in /etc/modules; either put them on the kernel command line (as you also did) or put them in /etc/modprobe.d/vfio_pci.conf (as you also did). Also, you don't need amd_iommu=on, which isn't even a supported parameter, as AMD IOMMU is on by default.
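For example, roughly like this (untested, and keep your own IDs of course; having the IDs both on the kernel command line and in modprobe.d is redundant but harmless, so pick whichever place you prefer):

Code:
# /etc/modules - module names only, no options here
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

# /etc/default/grub - amd_iommu=on dropped
GRUB_CMDLINE_LINUX_DEFAULT="vfio-pci.ids=1002:6939,1002:aad8"

# /etc/modprobe.d/vfio_pci.conf - unchanged
options vfio-pci ids=1002:6939,1002:aad8
softdep amdgpu pre: vfio-pci

Afterwards run update-initramfs -u -k all and update-grub, then reboot.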
I remember people talking about "ejecting the GPU device on Windows with a script" before shutting down the VM as a work-around for issues like yours. Unfortunately, I have no experience with Windows or your issue myself, but maybe you can try the work-around explained here?
 
Update:
I have found a manual way to shut down the Windows VM and to be able to start it again afterwards. So that's at least a partial success! No more host reboots necessary!

I connect to the Windows VM via MS Remote Desktop (the RDP plugin for Remmina) and eject the Radeon card and its HDMI audio device via the USB icon in the tray.
Then I shut it down and can start it again.
With NoMachine this doesn't work (as it uses the GPU, you basically lock yourself out the moment you eject the Radeon card).

So it's probably an issue with those start/stop scripts in the VM (I had tested them and they seemed successful, at least the output didn't complain). No idea why they don't work...

As Ubuntu isn't affected by this 'reset bug', I wonder if there is any software solution to automate this?
I'm not really a Windows pro, but isn't there something similar to SSH'ing into the machine, running commands to eject the GPU and then shutting it down?
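What I have in mind is something like this, run from the Proxmox host (just an idea, untested; it assumes the optional OpenSSH server is enabled on the Windows guest, and the user, hostname and script path are made-up placeholders):

Code:
# run on the Proxmox host; the script on the guest would eject the GPU and then shut Windows down
ssh gamer@win10-vm 'C:\scripts\eject-gpu-and-shutdown.bat'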

And why does the bug not appear with the Ubuntu VM? I can shut it down and then start the Windows VM...
 
leesteken said:
Don't put ids=1002:6939,1002:aad8 in /etc/modules; either put them on the kernel command line (as you also did) or put them in /etc/modprobe.d/vfio_pci.conf (as you also did). Also, you don't need amd_iommu=on, which isn't even a supported parameter, as AMD IOMMU is on by default.
I remember people talking about "ejecting the GPU device on Windows with a script" before shutting down the VM as a work-around for issues like yours. Unfortunately, I have no experience with Windows or your issue myself, but maybe you can try the work-around explained here?
Is changing my config on the host more of a cosmetic thing, or is it truly necessary? Do you think it has to do with my issues?
I am a bit afraid of touching it at the moment; you know the saying, 'never change a running system'.

My start/shutdown scripts are exactly as in the post that you linked. For some reason that I don't understand, they don't work.
If I run them manually in the Windows cmd as admin, they output something that sounds OK: 'device will be disabled when service bla bla'.
 
firobad66 said:
Is changing my config on the host more of a cosmetic thing, or is it truly necessary? Do you think it has to do with my issues?
I am a bit afraid of touching it at the moment; you know the saying, 'never change a running system'.
It's just a suggestion to clean up your configuration. Although I think your /etc/modules is actually incorrect, it appears to work fine. This is indeed not related to your problem, sorry.

Your problem is most likely that the AMD Windows drivers leave the GPU in a state from which it cannot properly reset, while the open-source Linux drivers leave the hardware in a state it can recover from. I do believe there is more active development (especially for older GPUs) on the open-source drivers than on Windows.
firobad66 said:
My start/shutdown scripts are exactly as in the post that you linked. For some reason that I don't understand, they don't work.
If I run them manually in the Windows cmd as admin, they output something that sounds OK: 'device will be disabled when service bla bla'.
It seems you already found the same work-around that I just did. I think that website explains how to do this from the command line, which you can probably put in a Windows script that ends with a shutdown command. I'm sorry, but I have no recent experience with Windows scripting and I don't know how to find you a working script on the internet right now.
 
Thanks a lot for your reply and help. Much appreciated!

I think we can mark this issue as solved, as it's related more to the Windows VM itself than to anything in the Proxmox world.

Not sure if I really want to bother with Windows scripting; the solution is manual, but the Windows VM is mostly for gaming anyway...
 
I finally managed to shut down the VM with a script.
In the end I had to create a shutdown script that disables and removes the device, i.e. something like this:

Code:
devcon64.exe disable "<deviceid>"
devcon64.exe remove "<deviceid>"
shutdown.exe /s /t 05
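
For anyone else trying this: <deviceid> is the hardware ID of your own card, so it will differ. If I remember correctly, devcon can also list the candidates for you, something like this (1002 being the AMD vendor ID; the filter may need adjusting):

Code:
devcon64.exe find "PCI\VEN_1002*"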

So we can mark this thread as solved.
 
