iGPU passthrough to VM using GVT-g causes VM to eventually crash

Unspec

Aug 28, 2024
Since updating to Proxmox 9, my Debian 13 VM, which has an iGPU passed through for Frigate hardware acceleration, consistently crashes after a few hours, even when Frigate is stopped (i.e. the GPU isn't actually doing anything). I've removed the passthrough for now to avoid the crashes, but could use some help diagnosing this. All packages on both the VM and the host are up to date. Hardware-accelerated transcoding for a Plex LXC works just fine.

System information:

Code:
CPU: 8 x Intel(R) Xeon(R) CPU E3-1245 v6 @ 3.70GHz (1 Socket)
Kernel Version: Linux 6.14.11-1-pve
Boot Mode: EFI (Secure Boot)
Manager Version: pve-manager/9.0.6/49c767b70aeb6648

VM config:

Code:
agent: 1,fstrim_cloned_disks=1
args: -cpu host,+vmx -global intel-iommu.aw-bits=39
bios: ovmf
boot: order=scsi0
cores: 4
cpu: host,flags=+md-clear;+pcid;+spec-ctrl;+ssbd;+pdpe1gb;+hv-evmcs;+aes
efidisk0: local-zfs:vm-103-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci1: mapping=Coral,pcie=1
hotplug: disk,network,usb
localtime: 1
machine: q35,viommu=intel
memory: 8192
meta: creation-qemu=9.0.2,ctime=1727379557
name: debian
net0: virtio=02:DC:0F:00:27:D2,bridge=vmbr20,queues=4
numa: 0
onboot: 1
ostype: l26
scsi0: local-zfs:vm-103-disk-1,discard=on,iothread=1,size=40G,ssd=1
scsi1: nvr:vm-103-disk-0,backup=0,discard=on,iothread=1,size=488282M
scsihw: virtio-scsi-single
serial0: socket
smbios1: uuid=bd744d84-034d-41a5-af69-5f08a58d8a8b
sockets: 1
startup: order=98
tablet: 0
tags: 192.168.20.6
usb0: mapping=Sonoff_Zigbee
hostpci0: mapping=Virt-iGPU,mdev=i915-GVTg_V5_8
vmgenid: 6188e861-4196-4955-b03f-cc019be30dda

VM journalctl around time of crash:

Code:
Aug 30 00:40:46 debian kernel: i915 0000:06:10.0: [drm] GPU HANG: ecode 9:4:fffffffe
Aug 30 00:40:46 debian kernel: i915 0000:06:10.0: [drm] Resetting vcs0 for stopped heartbeat on vcs0
Aug 30 00:40:46 debian kernel: i915 0000:06:10.0: [drm] Got hung context on rcs0 with active request 15:2 not yet started
Aug 30 00:40:47 debian kernel: i915 0000:06:10.0: [drm] Got hung context on vcs0 with active request 16:2 not yet started
Aug 30 00:40:47 debian kernel: i915 0000:06:10.0: [drm] GPU HANG: ecode 9:1:fffffffe
Aug 30 00:40:47 debian kernel: i915 0000:06:10.0: [drm] Resetting rcs0 for stopped heartbeat on rcs0
Aug 30 00:41:05 debian kernel: i915 0000:06:10.0: [drm] Got hung context on rcs0 with active request 15:2 not yet started
Aug 30 00:41:12 debian kernel: i915 0000:06:10.0: [drm] Got hung context on vcs0 with active request 16:2 not yet started
Aug 30 00:41:12 debian kernel: i915 0000:06:10.0: [drm] GPU HANG: ecode 9:4:fffffffe
Aug 30 00:41:12 debian kernel: i915 0000:06:10.0: [drm] Resetting vcs0 for stopped heartbeat on vcs0
Aug 30 00:41:12 debian kernel: i915 0000:06:10.0: [drm] Got hung context on rcs0 with active request 15:2 not yet started
Aug 30 00:41:14 debian kernel: i915 0000:06:10.0: [drm] Got hung context on vcs0 with active request 16:2 not yet started

On the host side, journalctl filtered with grep for "i915|drm|gvt|mdev" shows errors like these, which typically repeat many times:

Code:
Aug 30 00:33:54 pve kernel: gvt: guest page write error, gpa 45ffbf76
Aug 30 00:33:54 pve kernel: gvt: guest page write error, gpa 45ffbf77
Aug 30 00:33:54 pve kernel: gvt: guest page write error, gpa 45ffbf78

...etc

Aug 30 00:34:19 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xff77e000 type 12
Aug 30 00:34:19 pve kernel: gvt: vgpu 1: fail to populate guest root pointer
Aug 30 00:34:19 pve kernel: gvt: vgpu 1: failed to shadow ppgtt mm
Aug 30 00:34:19 pve kernel: gvt: vgpu 1: fail to create mm
Aug 30 00:34:19 pve kernel: gvt: vgpu 1: failed to submit desc 1
Aug 30 00:34:19 pve kernel: gvt: vgpu 1: fail submit workload on ring vcs0
Aug 30 00:34:19 pve kernel: gvt: vgpu 1: fail to emulate MMIO write 00012230 len 4
Aug 30 00:34:21 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xff4ff000 type 12
Aug 30 00:34:21 pve kernel: gvt: vgpu 1: fail to populate guest root pointer
Aug 30 00:34:21 pve kernel: gvt: vgpu 1: failed to shadow ppgtt mm
Aug 30 00:34:21 pve kernel: gvt: vgpu 1: fail to create mm
Aug 30 00:34:21 pve kernel: gvt: vgpu 1: failed to submit desc 1
Aug 30 00:34:21 pve kernel: gvt: vgpu 1: fail submit workload on ring rcs0
Aug 30 00:34:21 pve kernel: gvt: vgpu 1: fail to emulate MMIO write 00002230 len 4
Aug 30 00:34:46 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xff77e000 type 12
Aug 30 00:34:46 pve kernel: gvt: vgpu 1: fail to populate guest root pointer
Aug 30 00:34:46 pve kernel: gvt: vgpu 1: failed to shadow ppgtt mm
Aug 30 00:34:46 pve kernel: gvt: vgpu 1: fail to create mm

...etc

Aug 30 00:41:37 pve kernel: WARNING: CPU: 2 PID: 18766 at drivers/gpu/drm/i915/gvt/gtt.c:316 gtt_get_entry64+0xd4/0xe0 [kvmgt]
Aug 30 00:41:37 pve kernel: Modules linked in: bluetooth tcp_diag inet_diag wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel nft_compat nft_chain_nat cfg80211 xt_MASQUERADE xt_tcpudp xt_mark veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw 8021q garp mrp bonding tls nf_tables ip6table_nat ip6table_filter ip6_tables iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 qrtr nf_defrag_ipv4 iptable_filter sunrpc binfmt_misc nfnetlink_log sch_fq_codel vhost_net vhost vhost_iotlb tap snd_hda_codec_hdmi drivetemp snd_ctl_led kvmgt mdev snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common snd_soc_avs snd_soc_hda_codec snd_hda_ext_core intel_tcc_cooling snd_soc_core x86_pkg_temp_thermal snd_compress intel_powerclamp ac97_bus coretemp snd_pcm_dmaengine dell_pc mei_hdcp mei_pxp kvm_intel platform_profile snd_hda_intel polyval_clmulni dell_wmi
Aug 30 00:41:37 pve kernel:  polyval_generic snd_intel_dspcfg dell_smm_hwmon snd_intel_sdw_acpi ghash_clmulni_intel sha256_ssse3 sha1_ssse3 snd_hda_codec aesni_intel crypto_simd snd_hda_core cryptd snd_hwdep dell_smbios snd_pcm rapl dcdbas intel_pmc_core snd_timer intel_cstate sparse_keymap dell_wmi_descriptor wmi_bmof intel_wmi_thunderbolt pcspkr pmt_telemetry snd mei_me input_leds pmt_class ee1004 soundcore cdc_acm mei intel_pch_thermal ie31200_edac acpi_pad intel_vsec mac_hid i915 drm_buddy ttm drm_display_helper cec rc_core kvm vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd msr nvme_fabrics nvme_keyring efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq hid_generic usbmouse usbkbd usbhid hid xhci_pci nvme i2c_i801 igb i2c_mux e1000e xhci_hcd nvme_core ahci i2c_algo_bit i2c_smbus dca libahci nvme_auth video wmi
Aug 30 00:41:37 pve kernel: WARNING: CPU: 2 PID: 18766 at drivers/gpu/drm/i915/gvt/gtt.c:316 gtt_get_entry64+0xd4/0xe0 [kvmgt]
Aug 30 00:41:37 pve kernel: Modules linked in: bluetooth tcp_diag inet_diag wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel nft_compat nft_chain_nat cfg80211 xt_MASQUERADE xt_tcpudp xt_mark veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw 8021q garp mrp bonding tls nf_tables ip6table_nat ip6table_filter ip6_tables iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 qrtr nf_defrag_ipv4 iptable_filter sunrpc binfmt_misc nfnetlink_log sch_fq_codel vhost_net vhost vhost_iotlb tap snd_hda_codec_hdmi drivetemp snd_ctl_led kvmgt mdev snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common snd_soc_avs snd_soc_hda_codec snd_hda_ext_core intel_tcc_cooling snd_soc_core x86_pkg_temp_thermal snd_compress intel_powerclamp ac97_bus coretemp snd_pcm_dmaengine dell_pc mei_hdcp mei_pxp kvm_intel platform_profile snd_hda_intel polyval_clmulni dell_wmi

...etc
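
Since the host log is mostly the same few messages repeated, a rough way to condense it is to group the kernel lines by message and note when each kind first appears. The following is only a sketch: the function names (`normalize`, `summarize`) are made up for illustration, and the line pattern assumes the default short journalctl format shown above ("Mon DD HH:MM:SS host kernel: message").

```python
import re
from collections import Counter

# Matches lines in the short journal format used above, e.g.
#   "Aug 30 00:34:19 pve kernel: gvt: vgpu 1: fail to create mm"
# Assumption: other journalctl output formats would need a different pattern.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+(?P<msg>.*)$"
)

# Same keywords as the grep filter mentioned in the post.
KEYWORDS = re.compile(r"i915|drm|gvt|mdev")


def normalize(msg):
    """Replace long hex values (addresses, offsets) so repeats group together."""
    return re.sub(r"\b(?:0x)?[0-9a-f]{8,}\b", "<hex>", msg)


def summarize(lines):
    """Return (message counts, first-seen timestamp per message)."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or not KEYWORDS.search(m.group("msg")):
            continue  # not a kernel line, or not GPU-related
        key = normalize(m.group("msg"))
        counts[key] += 1
        first_seen.setdefault(key, m.group("ts"))
    return counts, first_seen
```

Passing `sys.stdin` to `summarize` with the output of `journalctl -k` piped in would give one entry per distinct message, which makes it easier to see whether the "guest page write error" lines always precede the shadow page failures.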