GPU Passthrough - Host Crash

Temtaime · Oct 24, 2021

Hello.
I have been using GPU passthrough for a while, but today my host crashed.

There were just two lines in the host log before the crash.

Oct 23 11:52:02 pve kernel: [1034431.681561] vfio-pci 0000:07:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0xebf0c35000 flags=0x0030]
Oct 23 11:52:02 pve kernel: [1034431.681573] vfio-pci 0000:07:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0xebf0c34800 flags=0x0030]

I'm using an R7 240 with a Ryzen 3600 and ECC memory.
VM config:

bios: ovmf
hostpci0: 0000:07:00,pcie=1,x-vga=1
machine: q35
vga: none

# cat /proc/cmdline
initrd=\EFI\proxmox\5.11.22-5-pve\initrd.img-5.11.22-5-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs iommu=pt video=efifb:off mitigations=off

# find /sys/kernel/iommu_groups/ -type l
/sys/kernel/iommu_groups/7/devices/0000:00:08.0
/sys/kernel/iommu_groups/5/devices/0000:00:07.0
/sys/kernel/iommu_groups/13/devices/0000:09:00.1
/sys/kernel/iommu_groups/3/devices/0000:00:04.0
/sys/kernel/iommu_groups/11/devices/0000:08:00.0
/sys/kernel/iommu_groups/1/devices/0000:00:02.0
/sys/kernel/iommu_groups/8/devices/0000:00:08.1
/sys/kernel/iommu_groups/6/devices/0000:00:07.1
/sys/kernel/iommu_groups/14/devices/0000:09:00.3
/sys/kernel/iommu_groups/4/devices/0000:00:05.0
/sys/kernel/iommu_groups/12/devices/0000:09:00.0
/sys/kernel/iommu_groups/2/devices/0000:00:03.1
/sys/kernel/iommu_groups/2/devices/0000:07:00.0
/sys/kernel/iommu_groups/2/devices/0000:00:03.0
/sys/kernel/iommu_groups/2/devices/0000:07:00.1
/sys/kernel/iommu_groups/10/devices/0000:00:18.3
/sys/kernel/iommu_groups/10/devices/0000:00:18.1
/sys/kernel/iommu_groups/10/devices/0000:00:18.6
/sys/kernel/iommu_groups/10/devices/0000:00:18.4
/sys/kernel/iommu_groups/10/devices/0000:00:18.2
/sys/kernel/iommu_groups/10/devices/0000:00:18.0
/sys/kernel/iommu_groups/10/devices/0000:00:18.7
/sys/kernel/iommu_groups/10/devices/0000:00:18.5
/sys/kernel/iommu_groups/0/devices/0000:03:00.0
/sys/kernel/iommu_groups/0/devices/0000:02:00.2
/sys/kernel/iommu_groups/0/devices/0000:02:00.0
/sys/kernel/iommu_groups/0/devices/0000:00:01.0
/sys/kernel/iommu_groups/0/devices/0000:01:00.0
/sys/kernel/iommu_groups/0/devices/0000:00:01.3
/sys/kernel/iommu_groups/0/devices/0000:02:00.1
/sys/kernel/iommu_groups/0/devices/0000:00:01.1
/sys/kernel/iommu_groups/0/devices/0000:05:00.0
/sys/kernel/iommu_groups/0/devices/0000:03:01.0
/sys/kernel/iommu_groups/0/devices/0000:03:04.0
/sys/kernel/iommu_groups/9/devices/0000:00:14.3
/sys/kernel/iommu_groups/9/devices/0000:00:14.0

07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Oland PRO [Radeon R7 240/340] (rev 87) (prog-if 00 [VGA controller])
Subsystem: Micro-Star International Co., Ltd. [MSI] Oland PRO [Radeon R7 240/340]
Flags: bus master, fast devsel, latency 0, IRQ 72, IOMMU group 2
Memory at e0000000 (64-bit, prefetchable) [size=256M]
Memory at fce00000 (64-bit, non-prefetchable) [size=256K]
I/O ports at e000
Expansion ROM at fce40000 [disabled] [size=128K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Capabilities: [58] Express Legacy Endpoint, MSI 00
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150] Advanced Error Reporting
Capabilities: [200] Physical Resizable BAR
Capabilities: [270] Secondary PCI Express
Kernel driver in use: vfio-pci
Kernel modules: radeon, amdgpu

VM syslog: https://pastebin.com/Dtumhyst

What's wrong? This is a production environment, so what can I do to confine such a crash to the VM and keep the host running?

leesteken · Oct 24, 2021

I'm afraid you can't fully isolate a VM with passthrough, but I would love to be proven wrong (with an explanation on how to do it).
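
One thing that can at least be checked is which devices share the GPU's IOMMU group, since VFIO treats the group as its unit of isolation. The following is a minimal sketch, not a definitive tool: the function name `iommu_group_peers` and the `SYSFS_ROOT` override are made up for illustration, and it assumes the usual `/sys/kernel/iommu_groups` layout shown in the `find` output above.

```shell
#!/bin/sh
# List every device that shares an IOMMU group with the given PCI address.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
iommu_group_peers() {
    dev="$1"
    root="${SYSFS_ROOT:-/sys}"
    for group in "$root"/kernel/iommu_groups/*; do
        # Skip groups that do not contain the requested device.
        [ -e "$group/devices/$dev" ] || continue
        echo "group ${group##*/}:"
        for peer in "$group"/devices/*; do
            echo "  ${peer##*/}"
        done
    done
}

# Example (address taken from the hostpci0 line in the VM config):
# iommu_group_peers 0000:07:00.0
```

On the host in this thread, this should list the same four entries that the `find` output shows under group 2. It only reports the grouping; it does not change how the devices are isolated.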

Looking at the GPU lockup messages from inside the VM, I would say some software did something to the GPU that caused it to lock up, and the driver could not get it to recover. The driver then tried a hard reset of the GPU, which did not succeed, and worse: the GPU did something (via the PCIe bus?) that caused the Proxmox host to crash. I've had better luck with AMD GPUs without x-vga=1, but I don't think that is related to this problem. Sometimes sites like ShaderToy, which can be very GPU intensive, crash my GPU/browser/VM and sometimes also take the Proxmox host down with them.
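
To see how often the host logged such faults, and for which device, the IO_PAGE_FAULT lines can be counted per PCI address. This is a rough sketch that assumes the log format quoted in the first post. The function name `count_io_page_faults` is hypothetical, and the log file path in the example comment is an assumption about where the host keeps its kernel messages.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print "<count> <pci-address>" for every
# device that logged an AMD-Vi IO_PAGE_FAULT event.
count_io_page_faults() {
    awk '/AMD-Vi: Event logged \[IO_PAGE_FAULT/ {
        # The PCI address is the field just before "AMD-Vi:", with a trailing colon.
        for (i = 2; i <= NF; i++) {
            if ($i == "AMD-Vi:") {
                addr = $(i - 1)
                sub(/:$/, "", addr)
                faults[addr]++
            }
        }
    }
    END {
        for (addr in faults) print faults[addr], addr
    }'
}

# Example (log path is an assumption; adjust to the host's log location):
# count_io_page_faults < /var/log/syslog
```

Fed the two lines from the first post, it prints `2 0000:07:00.0`. A count that grows before each crash would point at the passed-through GPU, but the sketch only counts events and says nothing about their cause.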

I can only advise keeping your software and drivers/kernel inside the VM as up to date as possible, because an update may fix this issue, or reverting to a known good version and updating only after extensive testing.
