[SOLVED] GPU Passthrough Causing System Crash (PVE 5.0)

eBell

Member
Jun 11, 2017
Hello, I have been trying to get GPU passthrough working on my GTX 760, and I'm having issues with PVE system crashes.
I have followed the steps outlined in the Proxmox wiki and sshaikh's tutorial thread.
Everything seems to be performing as expected, but as soon as the VM loads a GPU driver, my Proxmox system log is spammed with PCIe Bus Errors and AER corrected errors:
Code:
pcieport 0000:00:02.0: AER: Corrected error received: id=0010
pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0010(Transmitter ID)
pcieport 0000:00:02.0:   device [8086:3c04] error status/mask=00001000/00002000
pcieport 0000:00:02.0:    [12] Replay Timer Timeout
The system will run for a few hours before the entire PVE system locks up, and a hard reboot is required to restore the system.
I've tried a few things, such as making sure the GPU is in an isolated IOMMU group (which cleared up a lot of performance issues I was having), and adding the 'ignore_msrs=1' option to kvm.conf, but this didn't seem to do anything.
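For anyone following along, this is roughly what those two steps looked like on my host (the exact file names are just what I used; adjust for your setup):
Code:
# list IOMMU groups to check the GPU (and its HDMI audio function) sit in their own group
find /sys/kernel/iommu_groups/ -type l | sort -V

# the ignore_msrs option goes in a modprobe.d file, e.g. /etc/modprobe.d/kvm.conf
echo "options kvm ignore_msrs=1" >> /etc/modprobe.d/kvm.conf
update-initramfs -u    # then reboot so the module picks it up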
My server specs are:
  • CPU: 2x Xeon E5-2670 V1
  • Mobo: ASRock EP2C602-4L/D16
  • GPU: MSI GTX 760 2GB
  • OS: PVE 5.0
 
After doing a lot more digging I've made some progress, and I'm going to document it here so it can help others.

Firstly, I found a few RAM errors after digging through the syslog that led me to a faulty RAM module, but this was only exacerbating the crashes, not causing them.

In my OP I missed a key piece of information, and that's one of the syslog errors:
Code:
pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0010(Receiver ID)
pcieport 0000:00:02.0:   device [8086:3c04] error status/mask=00000040/00002000
pcieport 0000:00:02.0:    [ 6] Bad TLP

The '[ 6] Bad TLP' is what led me to the fix.
From what I've read, 'Bad TLP' errors are the result of corruption in the packet encapsulation between the PCIe device and the controller, and these errors are common with NVIDIA GPUs (possibly AMD GPUs too) on X99 & C210 chipset motherboards.

My ASRock EP2C602-4L/D16 uses the C602 chipset, likely an earlier version of the C210 architecture, so it can behave in a similar fashion.
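If anyone wants to check whether their root port is logging the same thing, the AER status bits can be read with lspci (needs root; 00:02.0 is the port from my log above, yours may differ):
Code:
lspci -vvv -s 00:02.0 | grep -i -A6 'Advanced Error Reporting'
# the CESta line shows which corrected errors have been flagged, e.g. BadTLP+ or Timeout+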

After sifting through several threads on the NVIDIA forums, I found a fix that has worked (so far) for me:
adding 'pcie_aspm=off' to the kernel command line in GRUB suppresses both the messages and, so far, the issue.
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on efifb:off pcie_aspm=off"
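For completeness (assuming a standard GRUB-based PVE install), the line goes in /etc/default/grub and only takes effect after regenerating the config and rebooting:
Code:
nano /etc/default/grub    # set GRUB_CMDLINE_LINUX_DEFAULT as above
update-grub               # regenerates /boot/grub/grub.cfg
reboot
cat /proc/cmdline         # after the reboot, check that pcie_aspm=off made it onto the command line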

I've been testing it pretty heavily for a few days and the only crashes have been contained to the VM, and only occur when launching a specific game.
 
Great that you found a solution :)

I've been testing it pretty heavily for a few days and the only crashes have been contained to the VM, and only occur when launching a specific game.
May I ask what crashes and what game?
I noticed that I need to enable the ignore_msrs option for the kvm module for Assassin's Creed Origins, for example.
 
It was Mirror's Edge Catalyst.

I think I got it to launch properly last night, but I haven't checked thoroughly.
I think adding 'ignore_msrs=1' to the kvm.conf might have alleviated the issue, but I'll test the game later to make sure.

EDIT: I also noticed that my 'efifb:off' setting in GRUB is wrong, and I have since changed it to
Code:
video=efifb:off
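For reference, the full line in /etc/default/grub on my system now reads as below (followed by another update-grub and reboot):
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on video=efifb:off pcie_aspm=off"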
 
After some further testing, I am still getting PVE system lockups that are caused by the GPU VM (they do not occur when the VM is offline), but they are less common than they were.
I have been unable to find anything in the syslog that shows any errors, so disabling ASPM may be suppressing the TLP errors rather than circumventing them.
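In case it helps anyone else, this is roughly how I've been grepping the kernel log for the AER/TLP messages after each test run (the -b -1 form needs persistent journalling):
Code:
dmesg -T | grep -iE 'aer|bad tlp|pcie bus error'
journalctl -k -b -1 | grep -iE 'aer|bad tlp|pcie bus error'   # kernel log from the previous boot, useful after a hard reset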

EDIT: I have managed to provoke a crash while the GPU VM was offline, but without any errors in the syslog I am unsure where I should be looking.
It looks like I might have to plan out a reinstall of PVE.

EDIT2: After reinstalling PVE I was able to reproduce the crash pretty easily, so I'm now looking into hardware faults.
My motherboard's BMC logs several critical low-voltage warnings on the PSU's 5V rail, so I'm going to upgrade my old Delta to a modern PSU and continue testing.
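For anyone else chasing this sort of thing, the BMC readings and event log can be pulled from the PVE host with ipmitool (sensor names vary by board):
Code:
apt install ipmitool
ipmitool sdr type Voltage    # live voltage readings, including the 5V rail
ipmitool sel elist           # the BMC event log where the low-voltage criticals are recorded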

EDIT3: Turns out I've been pushing my old 800W Delta PSU a little too hard with my new setup, and one of the ATX power connector pins had some pretty severe scorch marks on it.
I've replaced the PSU with a Corsair RMX 850, and the system is now completely stable.
 
