GPU passthrough crashing kernel and soft-locking CPU

ola

Hi!

I have a bit of a problem: when there is load on my AMD RX Vega 64 GPU (Adrenalin 19.10.1 driver), which is passed through to a Windows 10 (1903) VM, the kernel on the host crashes after a while.
If the Windows machine is just idling and not rendering more than the Windows desktop, it is stable for a long time.

This is present on both VE 5.4 and VE 6.0.

I managed to grab this from the syslog on the host (too long to post here, so it is on Pastebin):
https://pastebin.com/TjnzGP4c
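
For anyone wanting to pull the same kind of excerpt from their own host, here is a minimal sketch. The function name `filter_passthrough_log` and the default path `/var/log/syslog` are assumptions for illustration; the pattern simply matches the IOMMU fault lines quoted in this thread plus vfio and soft-lockup messages.

```shell
# Hypothetical helper: print only the IOMMU-fault, vfio, and soft-lockup
# lines from a log file. The log path is passed as the first argument;
# /var/log/syslog is only an assumed default.
filter_passthrough_log() {
    logfile="${1:-/var/log/syslog}"
    grep -E 'DMAR|vfio|soft lockup' "$logfile"
}
```

Called as `filter_passthrough_log /var/log/syslog`, it prints the matching lines in their original order, which makes it easier to see what was logged just before the host locked up.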


The host is a Supermicro quad-Xeon machine.

Some configuration of the host:

root@pmox1:~# cat /etc/modprobe.d/blacklist.conf
blacklist radeon
blacklist nouveau
blacklist nvidia
blacklist amdgpu
blacklist snd_hda_intel

root@pmox1:~# lspci -nnv
04:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c1) (prog-if 00 [VGA controller])
Subsystem: Micro-Star International Co., Ltd. [MSI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1462:3680]
Flags: bus master, fast devsel, latency 0, IRQ 97, NUMA node 0
Memory at a0000000 (64-bit, prefetchable) [size=256M]
Memory at b0000000 (64-bit, prefetchable) [size=2M]
I/O ports at 4000 [size=256]
Memory at bb900000 (32-bit, non-prefetchable) [size=512K]
Expansion ROM at bb980000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: vfio-pci
Kernel modules: amdgpu

04:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
Flags: bus master, fast devsel, latency 0, IRQ 94, NUMA node 0
Memory at bb9a0000 (32-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel

I have tried to google "DMAR: DRHD: handling fault status reg 40", but it seems that this error has never been written about before.

Does anyone have any input on how to solve this or where to look?

Thanks a lot for reading!
 
I will try it, but I'm not sure it will work, because I had exactly the same issue on VE 5.4 before I upgraded to 6.0 (I upgraded to 6.0 to see whether a newer kernel would solve the issue). So the issue is not new in 6.0, it's just still there :)
 
Tried the method "qm set ID -args '-machine type=q35,kernel_irqchip=on'", but as soon as I start the VM with that setting I get "DMAR: DRHD: handling fault status reg 20", so it is a different error, and all CPU cores soft-lock.

Without the ",kernel_irqchip=on" part I can now run for about 22 hours without a hard reboot.
 
Could be a hardware error, or related to the somewhat unique architecture of a quad-Xeon board.

I'd also check the BIOS for related options; PCI ACS, AER, and IOMMU are usually good candidates to try.
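
After changing ACS or IOMMU options, one way to see their effect from the host is to list the IOMMU groups. The sketch below assumes the usual sysfs location `/sys/kernel/iommu_groups`; the function name `list_iommu_groups` and its optional root argument are hypothetical, added only so the path can be overridden.

```shell
# Hypothetical helper: list each IOMMU group and the PCI addresses in it.
# The sysfs root is a parameter so the function can be pointed elsewhere;
# /sys/kernel/iommu_groups is the usual location on Linux.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # nothing matched: no groups exposed
        group=${dev#"$root"/}          # e.g. 12/devices/0000:04:00.0
        group=${group%%/*}             # keep only the group number
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done
}
```

For the card in this thread, the lines of interest would be the ones ending in `0000:04:00.0` and `0000:04:00.1`. If other devices share their group, the ACS setting is a plausible thing to revisit. If the function prints nothing, the IOMMU is probably not enabled at all.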
 
I have talked with Supermicro support, and the BIOS settings are the ones they recommend for PCIe passthrough, or that are even required for it to work at all.

I think it would help a bit just to know what "DMAR: DRHD: handling fault status reg 20" and "DMAR: DRHD: handling fault status reg 40" even mean.

Thanks for taking the time to try to help!
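
As a starting point for reading those two numbers, here is a small sketch. It assumes the value after "reg" is the hexadecimal content of the Intel VT-d Fault Status Register and uses the bit names from the `DMA_FSTS_*` definitions in the Linux kernel's Intel IOMMU header. The function name `decode_dmar_fsts` is hypothetical, and the output is a reading aid rather than a diagnosis.

```shell
# Sketch: decode the hex value printed after "handling fault status reg".
# Bit masks and names follow the DMA_FSTS_* definitions in the Linux
# kernel's Intel IOMMU header; only the bits defined there are listed.
decode_dmar_fsts() {
    val=$((0x$1))                      # argument is hex without the 0x prefix
    out=""
    for entry in \
        "1:PFO (primary fault overflow)" \
        "2:PPF (primary pending fault)" \
        "16:IQE (invalidation queue error)" \
        "32:ICE (invalidation completion error)" \
        "64:ITE (invalidation time-out error)" \
        "128:PRO (page request overflow)"
    do
        mask=${entry%%:*}
        name=${entry#*:}
        if [ $((val & mask)) -ne 0 ]; then
            out="${out:+$out, }$name"  # comma-separate when several bits are set
        fi
    done
    printf '%s\n' "${out:-no known bits set}"
}
```

Under that assumption, `decode_dmar_fsts 20` reports ICE and `decode_dmar_fsts 40` reports ITE. Both bits concern invalidation requests sent by the IOMMU, which would point at the passed-through device not completing those requests correctly or in time, rather than at an ordinary DMA address fault.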
 
