Host crashes when GPU is under load

Prefix4138

New Member
Aug 14, 2023
I'm running Proxmox VE 8.1.3 with single-GPU passthrough to a Windows 11 VM. Passthrough works, but it isn't stable: whenever I play a game such as GTA V, the host crashes after about 10 minutes. I've tried several things suggested on the Proxmox Wiki:
  • Verified the GPU only shares its IOMMU group with an audio device and PCI bridge
  • Passed the device IDs of the GPU and audio device to the vfio-pci module in '/etc/modprobe.d/vfio.conf'
  • Blacklisted the nvidia, nouveau, and snd_hda_intel drivers at '/etc/modprobe.d/blacklist.conf'
  • Used the romfile option to give the VM a patched ROM/vbios, since the GPU will always be initialized by the host
  • Enabled IRQ remapping in x2apic mode rather than xapic mode by adding 'intremap=no_x2apic_optout' to '/etc/default/grub'
    • Note: This causes an additional message to appear in dmesg: "DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping."
  • Verified the GPU is UEFI (OVMF) compatible through rom-parser
    • GPU-Z and TechPowerUp both say that the ROM/vbios of this GPU is not UEFI-compatible, even though the Proxmox Wiki states it is compatible if rom-parser displays 'type 3' in its results
    • I tried to make it "more" UEFI-compatible with GOPUpd, but GPU-Z still showed it as not UEFI-compatible (tested in a VM using the romfile option to avoid flashing)
Despite this, the host continues to crash when the GPU is under load while in the VM.

Hardware
CPU: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
GPU: GeForce GTX 1070 Mobile

Grub
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt intremap=no_x2apic_optout"
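In case it helps anyone checking the same thing: editing '/etc/default/grub' only takes effect after the bootloader config is regenerated and the host is rebooted, so it can be worth confirming that the flags reached the running kernel. A minimal Python sketch, assuming the booted command line is read from /proc/cmdline; the function name `missing_params` and the `EXPECTED` list are made up for illustration, with the flags copied from the line above.

```python
def missing_params(cmdline: str, expected: list[str]) -> list[str]:
    """Return the expected kernel parameters that are absent from cmdline."""
    present = set(cmdline.split())
    return [p for p in expected if p not in present]


# Flags taken from the GRUB_CMDLINE_LINUX_DEFAULT line in this post.
EXPECTED = ["intel_iommu=on", "iommu=pt", "intremap=no_x2apic_optout"]

# Illustrative input; on the host you would pass open("/proc/cmdline").read().
sample = "quiet intel_iommu=on iommu=pt"
print(missing_params(sample, EXPECTED))  # ['intremap=no_x2apic_optout']
```

An empty list means every expected flag is present on the command line that was checked.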

Dmesg
Code:
DMAR: IOMMU enabled
DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
DMAR-IR: Enabled IRQ remapping in x2apic mode
DMAR: Intel(R) Virtualization Technology for Directed I/O
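A rough way to check these lines automatically, as a sketch only: the Python snippet below scans dmesg text for the two messages quoted above. The helper name `iommu_status` is hypothetical, and the matching assumes the kernel prints the messages with exactly the wording shown in this post.

```python
import re


def iommu_status(dmesg_text: str) -> dict:
    """Summarise IOMMU and interrupt-remapping state from dmesg text."""
    mode = re.search(r"DMAR-IR: Enabled IRQ remapping in (\w+) mode", dmesg_text)
    return {
        "iommu_enabled": "DMAR: IOMMU enabled" in dmesg_text,
        "irq_remapping_mode": mode.group(1) if mode else None,
    }


# The dmesg lines quoted in this post.
sample = """DMAR: IOMMU enabled
DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
DMAR-IR: Enabled IRQ remapping in x2apic mode
DMAR: Intel(R) Virtualization Technology for Directed I/O"""

print(iommu_status(sample))
# {'iommu_enabled': True, 'irq_remapping_mode': 'x2apic'}
```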

Groups
Code:
IOMMU Group 1:
    00:01.0 PCI bridge [0604]: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 07)
    01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104BM [GeForce GTX 1070 Mobile] [10de:1be1] (rev a1)
    01:00.1 Audio device [0403]: NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0] (rev a1)
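For reference, a listing like this can be reproduced from sysfs. Here is a minimal Python sketch, assuming the usual layout where each group is a numbered directory under /sys/kernel/iommu_groups with a 'devices' subdirectory; the function name `iommu_groups` is just for illustration, and it returns bare PCI addresses without the lspci descriptions.

```python
from pathlib import Path


def iommu_groups(root: str = "/sys/kernel/iommu_groups") -> dict[int, list[str]]:
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    for group_dir in Path(root).iterdir():
        if not group_dir.name.isdigit():
            continue  # ignore anything that is not a numbered group
        devices = sorted(p.name for p in (group_dir / "devices").iterdir())
        groups[int(group_dir.name)] = devices
    return dict(sorted(groups.items()))


def print_groups(groups: dict[int, list[str]]) -> None:
    """Print the groups in the same shape as the listing above."""
    for number, devices in groups.items():
        print(f"IOMMU Group {number}:")
        for address in devices:
            print(f"    {address}")
```

On the host, `print_groups(iommu_groups())` should show the GPU and its audio function in the same group as the PCI bridge, matching the output above.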

VFIO-PCI Modules
Code:
options vfio-pci ids=10de:1be1,10de:10f0
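To cross-check that the IDs in this file match the [vendor:device] pairs from the IOMMU listing, something like the following Python sketch could be used. The helper name `parse_vfio_ids` is hypothetical, and it only understands the simple 'options vfio-pci ids=...' form shown above.

```python
def parse_vfio_ids(conf_text: str) -> list[str]:
    """Collect vendor:device IDs from 'options vfio-pci ids=...' lines."""
    ids = []
    for line in conf_text.splitlines():
        parts = line.split()
        # modprobe treats '-' and '_' in module names as equivalent.
        if len(parts) >= 3 and parts[0] == "options" and parts[1] in ("vfio-pci", "vfio_pci"):
            for option in parts[2:]:
                if option.startswith("ids="):
                    ids.extend(option[len("ids="):].split(","))
    return ids


conf = "options vfio-pci ids=10de:1be1,10de:10f0"
wanted = {"10de:1be1", "10de:10f0"}  # GPU and its audio function, from the group listing

print(wanted - set(parse_vfio_ids(conf)))  # set()
```

An empty set means both functions of the card are covered by the vfio-pci options line.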

Blacklist
Code:
blacklist nouveau
blacklist nvidia
blacklist snd_hda_intel
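One thing worth double-checking in a blacklist file: module names are matched literally, so a misspelled name is silently ignored and the driver can still load. A small Python sketch for catching that; the function name `missing_blacklist_entries` and the sample text (which contains a deliberate typo) are made up for illustration.

```python
def missing_blacklist_entries(conf_text: str, wanted: list[str]) -> list[str]:
    """Return the modules from 'wanted' that have no exact 'blacklist' line."""
    listed = set()
    for line in conf_text.splitlines():
        parts = line.split("#", 1)[0].split()  # drop trailing comments
        if len(parts) == 2 and parts[0] == "blacklist":
            listed.add(parts[1])
    return [module for module in wanted if module not in listed]


# Hypothetical file contents with a misspelled module name.
sample = """blacklist noveau
blacklist nvidia
blacklist snd_hda_intel
"""

print(missing_blacklist_entries(sample, ["nouveau", "nvidia", "snd_hda_intel"]))
# ['nouveau']
```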

UEFI (OVMF) compatible
Code:
PCIR: type 3 (EFI), vendor: 10de, device: 1be1, class: 030000
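If the ROM contains several images, it can help to list them all rather than eyeballing the output. A minimal Python sketch that pulls the PCIR lines out of rom-parser output, assuming every image line follows the same format as the one quoted above; the function names `pcir_images` and `has_efi_image` are hypothetical, and the type 0 line in the sample is an invented example of a legacy image.

```python
import re

# Assumed line format, based on the rom-parser line quoted in this post.
PCIR_RE = re.compile(
    r"PCIR: type (\w+) \(([^)]*)\), vendor: ([0-9a-fA-F]{4}), device: ([0-9a-fA-F]{4})"
)


def pcir_images(output: str) -> list[tuple[str, str, str, str]]:
    """Return (type, label, vendor, device) for each PCIR line found."""
    return PCIR_RE.findall(output)


def has_efi_image(output: str) -> bool:
    """True if any image is reported as type 3, which rom-parser labels EFI."""
    return any(image_type == "3" for image_type, _, _, _ in pcir_images(output))


sample = """PCIR: type 0 (x86 PC-AT), vendor: 10de, device: 1be1, class: 030000
PCIR: type 3 (EFI), vendor: 10de, device: 1be1, class: 030000"""

print(has_efi_image(sample))  # True
```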

Windows VM configuration
Code:
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0;net0;ide0
cores: 4
cpu: host
efidisk0: local-lvm:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:01:00,pcie=1,romfile=patched.rom
machine: pc-q35-8.0
memory: 8192
meta: creation-qemu=8.0.2,ctime=xx
name: windows
net0: virtio=BC:24:11:93:C6:D2,bridge=vmbr0,firewall=1
numa: 0
ostype: win11
scsi0: local-lvm:vm-100-disk-1,discard=on,iothread=1,size=128G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=8d392251-2ed1-4297-9248-171863f557a2
sockets: 1
tpmstate0: local-lvm:vm-100-disk-2,size=4M,version=v2.0
vga: none
vmgenid: 63083e1a-4f1f-4d0b-bda2-b66112623072
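For anyone comparing configs, the passthrough-related settings can be pulled out of a VM config with a few lines of Python. This is only a sketch: the function names `parse_vm_conf` and `parse_hostpci` are made up, and it assumes the flat 'key: value' layout shown above with no snapshot sections.

```python
def parse_vm_conf(text: str) -> dict[str, str]:
    """Parse flat 'key: value' lines into a dict, skipping comments."""
    conf = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def parse_hostpci(value: str) -> dict[str, str]:
    """Split a hostpciN value into the host address and its options."""
    opts = {}
    for index, part in enumerate(value.split(",")):
        key, sep, val = part.partition("=")
        if not sep and index == 0:
            opts["host"] = part  # bare PCI address given first
        else:
            opts[key] = val
    return opts


conf = parse_vm_conf("bios: ovmf\nhostpci0: 0000:01:00,pcie=1,romfile=patched.rom\nvga: none\n")
print(parse_hostpci(conf["hostpci0"]))
# {'host': '0000:01:00', 'pcie': '1', 'romfile': 'patched.rom'}
```

With the config above, this confirms that the whole 01:00 device is passed through as PCIe with the patched ROM.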
 
I also tested this with a SeaBIOS-based Windows 11 VM, because GPU-Z and TechPowerUp report the ROM as not UEFI-compatible, but the host still crashes when the GPU is under load.
 
Sounds like a hardware issue: thermal protection, or an insufficient (or old) power supply.
Is this a laptop? Maybe the VM does not use the special drivers that are intended to keep the power usage of the GPU low enough for the thermal/power envelope of the system?
 
Yes, it is a laptop, an ASUS GL502VS. Its power supply is undersized: the charger that came with the laptop can't deliver enough power on its own, so the laptop also draws from the battery under heavy load (e.g. gaming). It's always been like this, so that's normal behavior for this machine.

For testing's sake, I put the laptop under heavy load on a near-stock bare-metal Windows 11 install for about 20 minutes, and it didn't crash. The GPU core temperature averaged around 90-92°C, while the GPU hot spot was around 101-103°C. These match the temperatures seen under Proxmox when the VM is under heavy load. They're high, but I don't think they'd be enough to crash Proxmox, right?
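Since the host goes down hard, one way to see what the sensors read right before a crash is to log them to disk and force each line out with fsync. A minimal Python sketch, assuming the host exposes sensors under /sys/class/thermal with 'type' and 'temp' files (temp in millidegrees Celsius); the function names are hypothetical, and a GPU bound to vfio-pci may not show up here at all, so this mostly covers CPU and platform sensors.

```python
import os
import time
from pathlib import Path


def read_thermal_zones(root: str = "/sys/class/thermal") -> dict[str, float]:
    """Return {zone:type -> temperature in degrees C} for readable zones."""
    readings = {}
    for zone in sorted(Path(root).glob("thermal_zone*")):
        try:
            label = (zone / "type").read_text().strip()
            millidegrees = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # skip zones that cannot be read
        readings[f"{zone.name}:{label}"] = millidegrees / 1000.0
    return readings


def append_reading(log_path: str, readings: dict[str, float]) -> None:
    """Append one timestamped line and push it to disk immediately."""
    with open(log_path, "a") as log:
        log.write(f"{time.time():.0f} {readings}\n")
        log.flush()
        os.fsync(log.fileno())  # so the last line survives a hard crash
```

Calling `append_reading("temps.log", read_thermal_zones())` every few seconds while the game runs would leave the last readings in the log after a crash, which could help tell a thermal shutdown apart from a power or passthrough problem.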