General Protection fault with GPU Passthrough since recent update

BrunoC

Nov 26, 2019
Dear community,
I would like to consolidate all my computers into one machine. I was able to configure passthrough, but with recent versions of Proxmox it no longer works for me. There is a general protection fault in Proxmox when I launch the VM with the graphics card. Of course, the VM does not actually launch.

A bit of context: I am using a dual Xeon motherboard (Supermicro X9DRi-LN4F+), two Xeon E5-2690 v2 CPUs and 56 GB of RAM. I am dedicating an Asus Radeon Vega 56 Strix to a Windows VM, using passthrough. The card is connected to a separate screen via HDMI, and Proxmox uses the integrated Matrox video card. More details about the config below.
Several other VMs are or will be installed, but they only use the CPU, with no passthrough. ZFS will be used for managing storage (spinning disks and NVMe SSDs).
The problem is, obviously, the gaming part of the build. I was (and still am) able to achieve very satisfying performance under native Windows. However, when moving to virtualization, I saw a big dip in FPS and also heavy stuttering, which is what bothers me. Of course, I ran my tests without any other active VM, container or user process running at the same time.

Issue: for at least the last 2 minor versions of Proxmox, I have not been able to run passthrough at all. The VM systematically crashes at launch, before Windows even starts (no OVMF/BIOS screen is displayed), with a General Protection fault message displayed in the PVE console (followed by a long screen of information that you can find in the dmesg.txt file).

Maybe both of my problems (performance and crash) are linked, and recent versions of Proxmox are not as tolerant as they used to be? Whatever the problem is, I am not able to tune the VM performance, as it simply does not work anymore.

Please note that, otherwise, the system is stable and the Windows VM works perfectly fine as long as the dedicated graphics card is not passed through.

What I tried so far:
  • I reset the BIOS to default, to no avail (it is still on default settings)
  • I changed the GPU slot (using slots from CPU 1 and CPU 2) and reconfigured passthrough. I followed the official Passthrough wiki => same issue. You will find attached the current
  • BIOS and Proxmox are up to date
  • I ran MemTest86+ => no error detected
  • The issue is not related to the PCIe identifier needing 0000 as a prefix (like here: https://forum.proxmox.com/threads/pci-passthrough-not-working-after-update.60580/).

Thank you in advance for your help. Please let me know if you need more


Configuration :

CPU(s) 40 x Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz (2 Sockets)
Kernel Version Linux 5.0.21-5-pve #1 SMP PVE 5.0.21-10 (Wed, 13 Nov 2019 08:27:10 +0100)
PVE Manager Version pve-manager/6.0-15/52b91481
56 GB RAM
LSI SATA / SAS card with 4 HDDs
4 NVMe drives in a PCIe x16 card in 4x4x4x4 mode.
Asus AREZ Strix AMD Vega 56 with 8 GB of VRAM
Some SSDs used for the "bare metal" Windows. I ran them in dual boot, for testing purposes.
Proxmox is installed on a dedicated SATA SSD (plugged into the integrated Intel RST controller) and was reinstalled several times

BIOS and system are up to date, as of the 22nd of November 2019

Syslog : attached
Dmesg : attached
VM Config : attached
 

Attachments

  • dmesg.txt (121.3 KB)
  • syslog.txt (44.8 KB)
  • 100.conf.txt (623 bytes)

Looking at your dmesg.txt, the line
Code:
[  296.986151] [drm:amdgpu_pci_remove [amdgpu]] *ERROR* Device removal is currently not supported outside of fbcon
seems to indicate your problem. The AMD GPU must not be in use by your Proxmox host when you attempt passthrough.

Try blacklisting the amdgpu driver on your host, or manually binding vfio-pci in early boot phases to pre-claim the device. (See our docs)
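As a rough illustration of the second option, here is a minimal sketch of a modprobe configuration that lets vfio-pci claim the card before amdgpu can. The helper name write_vfio_conf, the file name vfio.conf and the IDs in the example are assumptions made for illustration; the real vendor:device IDs have to be read from lspci -nn on the host, and the official docs remain the reference for the full procedure.

```shell
#!/bin/sh
# Sketch: pre-claim a GPU for vfio-pci so that amdgpu never binds to it.
# write_vfio_conf and vfio.conf are hypothetical names chosen for this example.

write_vfio_conf() {
    conf_dir="$1"   # normally /etc/modprobe.d
    ids="$2"        # comma-separated vendor:device IDs taken from `lspci -nn`
    {
        # Tell vfio-pci which devices it should claim.
        printf 'options vfio-pci ids=%s\n' "$ids"
        # Ask modprobe to load vfio-pci before amdgpu.
        printf 'softdep amdgpu pre: vfio-pci\n'
    } > "$conf_dir/vfio.conf"
}

# Example (as root); the IDs are placeholders, not real values:
#   write_vfio_conf /etc/modprobe.d "1002:xxxx,1002:yyyy"
#   update-initramfs -u -k all
#   reboot
```

A GPU usually exposes an audio function as well, so both IDs are typically listed.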

You can check if the amdgpu driver is currently using the card with lspci -nnk.
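To make that check easier to read, the lspci -nnk output can be filtered down to the display devices and the driver bound to each. The function name gpu_drivers is made up for this sketch, and it assumes the usual lspci layout (one unindented line per device, followed by indented detail lines).

```shell
#!/bin/sh
# Sketch: print "<slot> <driver>" for every display device in `lspci -nnk` output.
# gpu_drivers is a hypothetical helper; it reads the lspci output on stdin.

gpu_drivers() {
    awk '
        # Unindented lines start a new device; remember its slot only if it is a GPU.
        /^[0-9a-f]/ {
            dev = ($0 ~ /VGA compatible controller|Display controller|3D controller/) ? $1 : ""
        }
        # Indented detail line naming the bound driver for the current device.
        /Kernel driver in use:/ && dev != "" { print dev, $NF }
    '
}

# Usage on the host:
#   lspci -nnk | gpu_drivers
```

If the passthrough card shows amdgpu here, the host is still using it; after a successful setup it should show vfio-pci, or no driver line at all.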
 
Hello Stefan, many thanks for your quick reply.
I actually forgot to blacklist the driver this time. For your information, when it was working previously, I did notice that it was necessary.

However, I had blacklisted the radeon and nouveau drivers using the commands from the wiki.
Code:
echo "blacklist radeon" >> /etc/modprobe.d/blacklist.conf 
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf

Result :
Code:
root@pve:~# lspci -nnk | grep amd
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu

So I also added the line
blacklist amdgpu
to the pve-blacklist file, then ran
Code:
update-initramfs -k all -u
and rebooted.
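For anyone repeating these steps: the echo ... >> commands append a duplicate line every time they are rerun. A small sketch of a variant that only appends when the line is missing is shown below. The function name blacklist_module is hypothetical, and the file path in the example is the one from the wiki commands above.

```shell
#!/bin/sh
# Sketch: add "blacklist <module>" to a modprobe config file only once.
# blacklist_module is a hypothetical helper name.

blacklist_module() {
    module="$1"
    file="$2"
    line="blacklist $module"
    # Make sure the file exists so grep does not fail on a missing file.
    touch "$file"
    # -x: match the whole line, -F: fixed string, -q: quiet.
    grep -qxF "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example (as root):
#   for m in radeon nouveau amdgpu; do
#       blacklist_module "$m" /etc/modprobe.d/blacklist.conf
#   done
#   update-initramfs -k all -u
#   reboot
```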

Hurrah! It works. Thank you Stefan. I will now run some benchmarks to check that everything is alright.
 
