GPU Passthrough - Host Crash

Temtaime

Hello.
I've been using GPU passthrough for a while, but today my host crashed.

There were just two lines logged on the host before the crash:
Oct 23 11:52:02 pve kernel: [1034431.681561] vfio-pci 0000:07:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0xebf0c35000 flags=0x0030]
Oct 23 11:52:02 pve kernel: [1034431.681573] vfio-pci 0000:07:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0xebf0c34800 flags=0x0030]
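
To see at a glance which device, domain and addresses the IOMMU complained about, lines like the two above can be reduced to their fields. This is only a minimal sketch that assumes the exact message layout shown in this post; `parse_io_page_faults` is a made-up helper name, not a Proxmox or kernel tool.

```shell
# Read kernel log lines on stdin and print "device domain address flags"
# for every AMD-Vi IO_PAGE_FAULT event. Lines that do not match are dropped.
# Assumption: the message format is the one quoted in this thread.
parse_io_page_faults() {
    sed -n -E 's/.* ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): AMD-Vi: Event logged \[IO_PAGE_FAULT domain=(0x[0-9a-f]+) address=(0x[0-9a-f]+) flags=(0x[0-9a-f]+)\]/\1 \2 \3 \4/p'
}
```

On the host this could be fed from the kernel log, for example `journalctl -k | parse_io_page_faults`, which makes it easier to tell whether the faults always come from the same device.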

I'm using an R7 240 with a Ryzen 3600 and ECC memory.
VM config:
bios: ovmf
hostpci0: 0000:07:00,pcie=1,x-vga=1
machine: q35
vga: none

# cat /proc/cmdline
initrd=\EFI\proxmox\5.11.22-5-pve\initrd.img-5.11.22-5-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs iommu=pt video=efifb:off mitigations=off

# find /sys/kernel/iommu_groups/ -type l
/sys/kernel/iommu_groups/7/devices/0000:00:08.0
/sys/kernel/iommu_groups/5/devices/0000:00:07.0
/sys/kernel/iommu_groups/13/devices/0000:09:00.1
/sys/kernel/iommu_groups/3/devices/0000:00:04.0
/sys/kernel/iommu_groups/11/devices/0000:08:00.0
/sys/kernel/iommu_groups/1/devices/0000:00:02.0
/sys/kernel/iommu_groups/8/devices/0000:00:08.1
/sys/kernel/iommu_groups/6/devices/0000:00:07.1
/sys/kernel/iommu_groups/14/devices/0000:09:00.3
/sys/kernel/iommu_groups/4/devices/0000:00:05.0
/sys/kernel/iommu_groups/12/devices/0000:09:00.0
/sys/kernel/iommu_groups/2/devices/0000:00:03.1
/sys/kernel/iommu_groups/2/devices/0000:07:00.0
/sys/kernel/iommu_groups/2/devices/0000:00:03.0
/sys/kernel/iommu_groups/2/devices/0000:07:00.1
/sys/kernel/iommu_groups/10/devices/0000:00:18.3
/sys/kernel/iommu_groups/10/devices/0000:00:18.1
/sys/kernel/iommu_groups/10/devices/0000:00:18.6
/sys/kernel/iommu_groups/10/devices/0000:00:18.4
/sys/kernel/iommu_groups/10/devices/0000:00:18.2
/sys/kernel/iommu_groups/10/devices/0000:00:18.0
/sys/kernel/iommu_groups/10/devices/0000:00:18.7
/sys/kernel/iommu_groups/10/devices/0000:00:18.5
/sys/kernel/iommu_groups/0/devices/0000:03:00.0
/sys/kernel/iommu_groups/0/devices/0000:02:00.2
/sys/kernel/iommu_groups/0/devices/0000:02:00.0
/sys/kernel/iommu_groups/0/devices/0000:00:01.0
/sys/kernel/iommu_groups/0/devices/0000:01:00.0
/sys/kernel/iommu_groups/0/devices/0000:00:01.3
/sys/kernel/iommu_groups/0/devices/0000:02:00.1
/sys/kernel/iommu_groups/0/devices/0000:00:01.1
/sys/kernel/iommu_groups/0/devices/0000:05:00.0
/sys/kernel/iommu_groups/0/devices/0000:03:01.0
/sys/kernel/iommu_groups/0/devices/0000:03:04.0
/sys/kernel/iommu_groups/9/devices/0000:00:14.3
/sys/kernel/iommu_groups/9/devices/0000:00:14.0
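
VFIO treats an IOMMU group as its unit of isolation, so it can help to list what shares a group with the GPU instead of scanning the whole tree. This is a rough sketch: `iommu_group_members` and its optional second argument are hypothetical, and only the sysfs layout is taken from the listing above.

```shell
# Print the IOMMU group of a PCI device and every device in that group.
# $1: PCI address, e.g. 0000:07:00.0
# $2: optional root of the group tree (defaults to /sys/kernel/iommu_groups);
#     it is only there so the function can be tried against a test directory.
iommu_group_members() {
    dev=$1
    root=${2:-/sys/kernel/iommu_groups}
    for link in "$root"/*/devices/"$dev"; do
        # If the glob matched nothing, the pattern is left as-is; skip it.
        [ -e "$link" ] || [ -L "$link" ] || continue
        group_dir=${link%/devices/*}
        echo "group ${group_dir##*/}:"
        for member in "$group_dir"/devices/*; do
            echo "${member##*/}"
        done
    done
}
```

With the listing above, `iommu_group_members 0000:07:00.0` should report group 2 containing 0000:00:03.0, 0000:00:03.1, 0000:07:00.0 and 0000:07:00.1.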

07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Oland PRO [Radeon R7 240/340] (rev 87) (prog-if 00 [VGA controller])
Subsystem: Micro-Star International Co., Ltd. [MSI] Oland PRO [Radeon R7 240/340]
Flags: bus master, fast devsel, latency 0, IRQ 72, IOMMU group 2
Memory at e0000000 (64-bit, prefetchable) [size=256M]
Memory at fce00000 (64-bit, non-prefetchable) [size=256K]
I/O ports at e000
Expansion ROM at fce40000 [disabled] [size=128K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Capabilities: [58] Express Legacy Endpoint, MSI 00
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150] Advanced Error Reporting
Capabilities: [200] Physical Resizable BAR
Capabilities: [270] Secondary PCI Express
Kernel driver in use: vfio-pci
Kernel modules: radeon, amdgpu
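
The `Kernel driver in use` line can also be read straight from sysfs, which makes it easy to check the GPU's second function (0000:07:00.1 in the group listing) as well. This is a small sketch: `pci_driver_of` is a hypothetical helper, and its second argument exists only so it can be pointed at a test directory.

```shell
# Print the name of the kernel driver bound to a PCI device, or "none".
# $1: PCI address, e.g. 0000:07:00.0
# $2: optional devices root (defaults to /sys/bus/pci/devices)
pci_driver_of() {
    link="${2:-/sys/bus/pci/devices}/$1/driver"
    if [ -L "$link" ]; then
        # The symlink points at the driver directory; keep its last component.
        target=$(readlink "$link")
        echo "${target##*/}"
    else
        echo "none"
    fi
}
```

Running it for both 0000:07:00.0 and 0000:07:00.1 shows whether each function is bound to vfio-pci rather than radeon, amdgpu or an audio driver.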


VM syslog: https://pastebin.com/Dtumhyst

What's wrong? This is a production environment, so what can I do to confine such a crash to the VM rather than the host?
 
I'm afraid you can't fully isolate a VM with passthrough, but I would love to be proven wrong (with an explanation on how to do it).

Looking at the GPU lockup messages from inside the VM, I would say some software did something to the GPU that caused it to lock up, and the driver could not get it to recover. It then tried a hard reset of the GPU, which did not succeed, and worse: the GPU did something (via the PCIe bus?) that caused the Proxmox host to crash. I've had better luck with AMD GPUs without x-vga=1, but I don't think that is related to this problem. Sometimes sites like ShaderToy, which can be very GPU intensive, crash my GPU/browser/VM and sometimes also take the Proxmox host with them.

I can only advise keeping your software and drivers/kernel inside the VM as up to date as possible, because an update may fix this issue, or reverting to a known good version and updating only after extensive testing.
 
