iGPU passthrough attempt crashes host kernel

markc

PVE 8.1.3 host on a Minisforum UM780 XTX (AMD Ryzen 7 7840HS w/ Radeon 780M Graphics) with a fairly standard iGPU passthrough setup. The guest is Manjaro/KDE with BIOS OVMF, Display none, Machine q35, and hostpci 0000:c5:00.0,pcie=1 (plus rombar and all functions). If I enable Display SPICE and disable hostpci, the guest VM works fine, but when I enable the iGPU PCI Device I get the kernel crash below, which forces the host machine to reboot. I've seen lots of errors trying to get iGPU passthrough to work on half a dozen machines now, but I've never seen a total kernel crash like this before. Any suggestions?

Code:
Jan 07 15:16:57 pve5 pvedaemon[2490]: start VM 102: UPID:pve5:000009BA:000027A2:659A33C9:qmstart:102:root@pam:
Jan 07 15:16:57 pve5 pvedaemon[1249]: <root@pam> starting task UPID:pve5:000009BA:000027A2:659A33C9:qmstart:102:root@pam:
Jan 07 15:16:57 pve5 kernel: xhci_hcd 0000:c5:00.3: remove, state 4
Jan 07 15:16:57 pve5 kernel: usb usb2: USB disconnect, device number 1
Jan 07 15:16:57 pve5 kernel: usb 2-1: USB disconnect, device number 2
Jan 07 15:16:57 pve5 kernel: usb 2-2: USB disconnect, device number 3
Jan 07 15:16:57 pve5 kernel: xhci_hcd 0000:c5:00.3: USB bus 2 deregistered
Jan 07 15:16:57 pve5 kernel: xhci_hcd 0000:c5:00.3: remove, state 1
Jan 07 15:16:57 pve5 kernel: usb usb1: USB disconnect, device number 1
Jan 07 15:16:57 pve5 kernel: usb 1-1: USB disconnect, device number 2
Jan 07 15:16:57 pve5 kernel: usb 1-1.1: USB disconnect, device number 4
Jan 07 15:16:57 pve5 kernel: usb 1-2: USB disconnect, device number 3
Jan 07 15:16:57 pve5 kernel: usb 1-5: USB disconnect, device number 5
Jan 07 15:16:57 pve5 kernel: xhci_hcd 0000:c5:00.3: USB bus 1 deregistered
Jan 07 15:16:57 pve5 systemd[1]: Starting systemd-rfkill.service - Load/Save RF Kill Switch Status...
Jan 07 15:16:57 pve5 systemd[1]: Stopped target bluetooth.target - Bluetooth Support.
Jan 07 15:16:57 pve5 systemd[1]: Started systemd-rfkill.service - Load/Save RF Kill Switch Status.
Jan 07 15:16:58 pve5 kernel: xhci_hcd 0000:c5:00.4: remove, state 4
Jan 07 15:16:58 pve5 kernel: usb usb4: USB disconnect, device number 1
Jan 07 15:16:58 pve5 kernel: xhci_hcd 0000:c5:00.4: USB bus 4 deregistered
Jan 07 15:16:58 pve5 kernel: xhci_hcd 0000:c5:00.4: remove, state 4
Jan 07 15:16:58 pve5 kernel: usb usb3: USB disconnect, device number 1
Jan 07 15:16:58 pve5 kernel: xhci_hcd 0000:c5:00.4: USB bus 3 deregistered
Jan 07 15:16:58 pve5 systemd[1]: Stopped target sound.target - Sound Card.
Jan 07 15:16:58 pve5 kernel: ------------[ cut here ]------------
Jan 07 15:16:58 pve5 kernel: remove_proc_entry: removing non-empty directory 'irq/111', leaking at least 'ACP_PCI_IRQ'
Jan 07 15:16:58 pve5 kernel: WARNING: CPU: 13 PID: 2490 at fs/proc/generic.c:717 remove_proc_entry+0x1b4/0x1e0
Jan 07 15:16:58 pve5 kernel: Modules linked in: ceph libceph fscache netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables nvme_fabrics softdog sunrpc binfmt_misc bonding tls nfnetlink_log nfnetlink snd_sof_amd_rembrandt snd_sof_amd_renoir intel_rapl_msr snd_sof_amd_acp intel_rapl_common snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_hda_codec_realtek snd_soc_core snd_hda_codec_generic edac_mce_amd snd_compress ledtrig_audio kvm_amd ac97_bus snd_hda_intel snd_pcm_dmaengine kvm snd_intel_dspcfg snd_pci_ps btusb crct10dif_pclmul snd_intel_sdw_acpi snd_rpl_pci_acp6x btrtl polyval_clmulni iwlmvm snd_hda_codec snd_acp_pci btbcm polyval_generic snd_hda_core snd_pci_acp6x mac80211 btintel ghash_clmulni_intel snd_hwdep snd_pci_acp5x libarc4 btmtk vhost_net aesni_intel snd_pcm snd_rn_pci_acp3x vhost bluetooth iwlwifi crypto_simd snd_timer snd_acp_config vhost_iotlb cryptd ecdh_generic cfg80211 snd snd_soc_acpi tap input_leds rapl
Jan 07 15:16:58 pve5 kernel:  pcspkr ecc k10temp ccp snd_pci_acp3x soundcore amd_pmc serio_raw mac_hid vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 hid_generic usbkbd usbhid zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c simplefb xhci_pci nvme xhci_pci_renesas i2c_hid_acpi r8169 nvme_core video thunderbolt xhci_hcd i2c_hid crc32_pclmul i2c_piix4 realtek nvme_common wmi hid
Jan 07 15:16:58 pve5 kernel: CPU: 13 PID: 2490 Comm: task UPID:pve5: Tainted: P           O       6.5.11-7-pve #1
Jan 07 15:16:58 pve5 kernel: Hardware name: Micro Computer (HK) Tech Limited Venus series/F7BSD, BIOS 1.04 11/15/2023
Jan 07 15:16:58 pve5 kernel: RIP: 0010:remove_proc_entry+0x1b4/0x1e0
Jan 07 15:16:58 pve5 kernel: Code: 90 78 ff ff ff 48 0f 45 c2 49 8b 57 f0 48 89 f1 48 c7 c6 40 1b 65 a7 48 8b 92 a0 00 00 00 4c 8b 80 a0 00 00 00 e8 0c 0d bc ff <0f> 0b e9 64 ff ff ff 49 8b 77 18 48 c7 c7 18 5c b7 a7 e8 f5 0c bc
Jan 07 15:16:58 pve5 kernel: RSP: 0018:ffffb67499dbba68 EFLAGS: 00010246
Jan 07 15:16:58 pve5 kernel: RAX: 0000000000000000 RBX: ffff9e07c6a12a80 RCX: 0000000000000000
Jan 07 15:16:58 pve5 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jan 07 15:16:58 pve5 kernel: RBP: ffffb67499dbbab0 R08: 0000000000000000 R09: 0000000000000000
Jan 07 15:16:58 pve5 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff9e07c6a12b00
Jan 07 15:16:58 pve5 kernel: R13: ffffb67499dbbac6 R14: ffffb67499dbbac6 R15: ffff9e07c6a12b08
Jan 07 15:16:58 pve5 kernel: FS:  00007f0547c11b80(0000) GS:ffff9e1e40340000(0000) knlGS:0000000000000000
Jan 07 15:16:58 pve5 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 07 15:16:58 pve5 kernel: CR2: 0000562f4b41a6f0 CR3: 0000000179e16000 CR4: 0000000000750ee0
Jan 07 15:16:58 pve5 kernel: PKRU: 55555554
Jan 07 15:16:58 pve5 kernel: Call Trace:
Jan 07 15:16:58 pve5 kernel:  <TASK>
Jan 07 15:16:58 pve5 kernel:  ? show_regs+0x6d/0x80
Jan 07 15:16:58 pve5 kernel:  ? __warn+0x89/0x160
Jan 07 15:16:58 pve5 kernel:  ? remove_proc_entry+0x1b4/0x1e0
Jan 07 15:16:58 pve5 kernel:  ? report_bug+0x17e/0x1b0
Jan 07 15:16:58 pve5 kernel:  ? handle_bug+0x46/0x90
Jan 07 15:16:58 pve5 kernel:  ? exc_invalid_op+0x18/0x80
Jan 07 15:16:58 pve5 kernel:  ? asm_exc_invalid_op+0x1b/0x20
Jan 07 15:16:58 pve5 kernel:  ? remove_proc_entry+0x1b4/0x1e0
Jan 07 15:16:58 pve5 kernel:  ? remove_proc_entry+0x1b4/0x1e0
Jan 07 15:16:58 pve5 kernel:  unregister_irq_proc+0xf2/0x120
Jan 07 15:16:58 pve5 kernel:  free_desc+0x41/0xe0
Jan 07 15:16:58 pve5 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Jan 07 15:16:58 pve5 kernel:  ? __kmem_cache_free+0x306/0x350
Jan 07 15:16:58 pve5 kernel:  irq_free_descs+0x52/0x80
Jan 07 15:16:58 pve5 kernel:  irq_domain_free_irqs+0x150/0x1c0
Jan 07 15:16:58 pve5 kernel:  mp_unmap_irq+0x8e/0x90
Jan 07 15:16:58 pve5 kernel:  acpi_unregister_gsi_ioapic+0x2e/0x50
Jan 07 15:16:58 pve5 kernel:  acpi_unregister_gsi+0x17/0x30
Jan 07 15:16:58 pve5 kernel:  acpi_pci_irq_disable+0x7b/0xd0
Jan 07 15:16:58 pve5 kernel:  pcibios_disable_device+0x20/0x40
Jan 07 15:16:58 pve5 kernel:  do_pci_disable_device+0x45/0x90
Jan 07 15:16:58 pve5 kernel:  pci_disable_device+0xd3/0xf0
Jan 07 15:16:58 pve5 kernel:  snd_acp63_remove+0x95/0xd0 [snd_pci_ps]
Jan 07 15:16:58 pve5 kernel:  pci_device_remove+0x36/0xb0
Jan 07 15:16:58 pve5 kernel:  device_remove+0x40/0x80
Jan 07 15:16:58 pve5 kernel:  device_release_driver_internal+0x20b/0x270
Jan 07 15:16:58 pve5 kernel:  ? bus_find_device+0xb8/0xf0
Jan 07 15:16:58 pve5 kernel:  device_driver_detach+0x14/0x20
Jan 07 15:16:58 pve5 kernel:  unbind_store+0xac/0xc0
Jan 07 15:16:58 pve5 kernel:  drv_attr_store+0x21/0x50
Jan 07 15:16:58 pve5 kernel:  sysfs_kf_write+0x3b/0x60
Jan 07 15:16:58 pve5 kernel:  kernfs_fop_write_iter+0x130/0x210
Jan 07 15:16:58 pve5 kernel:  vfs_write+0x251/0x440
Jan 07 15:16:58 pve5 kernel:  ksys_write+0x73/0x100
Jan 07 15:16:58 pve5 kernel:  __x64_sys_write+0x19/0x30
Jan 07 15:16:58 pve5 kernel:  do_syscall_64+0x58/0x90
Jan 07 15:16:58 pve5 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Jan 07 15:16:58 pve5 kernel:  ? do_syscall_64+0x67/0x90
Jan 07 15:16:58 pve5 kernel:  ? syscall_exit_to_user_mode+0x37/0x60
Jan 07 15:16:58 pve5 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Jan 07 15:16:58 pve5 kernel:  ? do_syscall_64+0x67/0x90
Jan 07 15:16:58 pve5 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Jan 07 15:16:58 pve5 kernel:  ? do_syscall_64+0x67/0x90
Jan 07 15:16:58 pve5 kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Jan 07 15:16:58 pve5 kernel: RIP: 0033:0x7f0547d47140
Jan 07 15:16:58 pve5 kernel: Code: 40 00 48 8b 15 c1 9c 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d a1 24 0e 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
Jan 07 15:16:58 pve5 kernel: RSP: 002b:00007fff3df76928 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
Jan 07 15:16:58 pve5 kernel: RAX: ffffffffffffffda RBX: 0000562f4ae792a0 RCX: 00007f0547d47140
Jan 07 15:16:58 pve5 kernel: RDX: 000000000000000c RSI: 0000562f52de0730 RDI: 000000000000000d
Jan 07 15:16:58 pve5 kernel: RBP: 0000562f52de0730 R08: 0000000000000000 R09: 0000562f52dbb7d0
Jan 07 15:16:58 pve5 kernel: R10: 0000562f4f98e400 R11: 0000000000000202 R12: 000000000000000c
Jan 07 15:16:58 pve5 kernel: R13: 0000562f4ae792a0 R14: 000000000000000d R15: 0000562f52dddd40
Jan 07 15:16:58 pve5 kernel:  </TASK>
Jan 07 15:16:58 pve5 kernel: ---[ end trace 0000000000000000 ]---
Jan 07 15:16:58 pve5 systemd[1]: Created slice qemu.slice - Slice /qemu.
Jan 07 15:16:58 pve5 systemd[1]: Started 102.scope.
Jan 07 15:16:58 pve5 kernel: tap102i0: entered promiscuous mode
Jan 07 15:16:58 pve5 kernel: vmbr0: port 2(tap102i0) entered blocking state
Jan 07 15:16:58 pve5 kernel: vmbr0: port 2(tap102i0) entered disabled state
Jan 07 15:16:58 pve5 kernel: tap102i0: entered allmulticast mode
Jan 07 15:16:58 pve5 kernel: vmbr0: port 2(tap102i0) entered blocking state
Jan 07 15:16:58 pve5 kernel: vmbr0: port 2(tap102i0) entered forwarding state
Jan 07 15:16:59 pve5 kernel: vfio-pci 0000:c5:00.0: enabling device (0002 -> 0003)
Jan 07 15:16:59 pve5 kernel: vfio-pci 0000:c5:00.1: enabling device (0000 -> 0002)
Jan 07 15:17:00 pve5 pvedaemon[1249]: <root@pam> end task UPID:pve5:000009BA:000027A2:659A33C9:qmstart:102:root@pam: OK
client_loop: send disconnect: Broken pipe
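
For what it's worth, the call trace above goes through unbind_store and snd_acp63_remove [snd_pci_ps], so the warning fires while the snd_pci_ps driver is being unbound from its PCI device. A minimal sketch (not from the original post) for listing which driver each function of the 0000:c5:00 slot is bound to before starting the VM. The function name list_slot_drivers and the SYSFS_ROOT override are made up for illustration; only the /sys/bus/pci/devices layout is standard.

```shell
# List every function of one PCI slot together with the driver bound to it.
# SYSFS_ROOT is a hypothetical override so the function can be tried on a fake tree.
list_slot_drivers() {
    slot="$1"                                   # e.g. 0000:c5:00
    root="${SYSFS_ROOT:-/sys}"
    for dev in "$root"/bus/pci/devices/"$slot".*; do
        [ -e "$dev" ] || continue               # glob matched nothing
        if [ -L "$dev/driver" ]; then
            # "driver" is a symlink whose last path component is the driver name
            drv=$(basename "$(readlink "$dev/driver")")
        else
            drv="(none)"
        fi
        printf '%s %s\n' "$(basename "$dev")" "$drv"
    done
}

# Slot address taken from the hostpci setting in this thread.
list_slot_drivers 0000:c5:00
```

Any function that still shows a host driver such as xhci_hcd or a snd_* module is one that will be unbound when the VM starts with all functions enabled.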
 
Just to be clear, when I add a PCI Device for the iGPU and start this Linux based guest VM, the HOST kernel crashes and takes down all VM/CTs. If this VM has autostart turned on, then it's just an endless cycle of death. When I remove that PCI Device and just rely on SPICE then all is well. Here are the related settings for this VM that work okay without the PCI Device...

Code:
pve5 ~ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.5.11-7-pve root=ZFS=/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet amd_iommu=on iommu=pt pci=nommconf video=efifb:off initcall_blacklist=sysfb_init textonly pcie_acs_override=downstream,multifunction
Code:
pve5 ~ cat /etc/modprobe.d/blacklist.conf
blacklist amdgpu
Code:
pve5 ~ cat /etc/modprobe.d/vfio-vga.conf
options vfio-pci ids=1002:15bf,1002:1640 disable_vga=1
Code:
pve5 ~ cat /etc/pve/nodes/pve5/qemu-server/102.conf
agent: 1
audio0: device=ich9-intel-hda,driver=spice
balloon: 0
bios: ovmf
boot: order=virtio0
cores: 8
cpu: host
efidisk0: local-zfs:vm-102-disk-0,efitype=4m,size=1M
machine: q35
memory: 16384
meta: creation-qemu=7.0.0,ctime=1662183539
name: mangpu
net0: virtio=BC:24:11:13:79:AC,bridge=vmbr0
numa: 0
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=ab874a54-392b-4942-a2e9-27de027417a1
sockets: 1
vga: virtio,memory=256
virtio0: local-zfs:vm-102-disk-1,cache=writeback,discard=on,size=24G
vmgenid: 9e03ba56-8fa4-44d4-8f7f-0c046527c685
 
I have a Minisforum BD770i motherboard with an onboard AMD Ryzen 7 7745HX CPU and Radeon 610M iGPU. I have tried 5 guides already, and so far none of them work (some of the older ones break the configuration of Proxmox 8.1 and Debian). I managed to switch drivers, set up softdep modules, and apply the noreset and vbios patches.

Without passthrough, the VM works on the generic drivers in the OS. If I add the PCI device to the VM config, the host PC reboots. If I add the PCI device along with a number of options such as romfile, I get a config error in the logs and the VM starts without the PCI device.

So far I can't find a cure for this issue. Maybe an external GPU (connected to a PCIe slot or some kind of OCuLink option) would work, but I don't have one and my NAS case doesn't provide any option for one.
 
So you have a similar CPU and iGPU and couldn't get it to work either. That's not very encouraging. There seems to be some extra IOMMU and SR-IOV work in the 6.8 kernel, so hopefully Proxmox will release an update or patched kernel soon and we can try yet again. I'm not sure of the practical difference between full iGPU passthrough and SR-IOV because I've never managed to get either one to work on any minipc/NAS hardware in the last 3 years, but maybe SR-IOV is an acceptable solution. I just want decent gfx performance for a desktop VM as a daily driver.
 
Did you pass through only the graphics card? I have the same problem with the same device. When I pass through the graphics card alone, everything is normal, but when I also pass through the sound card, the host crashes. Later I found that if I pass through 0000:c5:00.5 together with the sound card at 0000:c5:00.1, everything works normally.
You can refer to https://github.com/isc30/ryzen-7000-series-proxmox/issues/16
Sorry, my English is not good; I used Google Translate.
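
Before adding further functions such as 0000:c5:00.5 to the VM, it can help to see which IOMMU group each device lands in. A rough sketch, assuming the usual /sys/kernel/iommu_groups layout; the function name list_iommu_groups and the SYSFS_ROOT override are hypothetical and only there to make it easy to try out:

```shell
# Print one "group N: <pci address>" line per device in every IOMMU group.
# SYSFS_ROOT is a hypothetical override for trying this on a fake tree.
list_iommu_groups() {
    root="${SYSFS_ROOT:-/sys}"
    for dev in "$root"/kernel/iommu_groups/*/devices/*; do
        [ -e "$dev" ] || continue        # no groups: IOMMU off or glob unmatched
        group=${dev%/devices/*}          # strip "/devices/<address>"
        group=${group##*/}               # keep only the group number
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done
}

list_iommu_groups
```

Note that with pcie_acs_override on the kernel command line the reported groups may be split more finely than the hardware really isolates them.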
 
I was successful with passthrough.
I enabled IOMMU in the BIOS and added the parameter "iommu=pt" to GRUB. I set up a softdep for vfio-pci without blacklisting the GPU drivers. You also have to pass through the audio device that is paired with HDMI. In my case 07:00.0 is the GPU and 07:00.1 is the audio device, so the VM ends up connected to both devices. In addition, you have to dump the vBIOS from the iGPU and use a custom amdgopbios.bin for the audio device, because without it the GPU and audio don't initialise in the VM. This step is essential: without it, VMs with UEFI boot don't initialise the iGPU.
In lspci, the kernel driver in use should be vfio-pci, with amdgpu listed as the kernel module. Otherwise, starting the VM will restart the host.
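
The lspci check described above can be condensed into one line per device. This is only a sketch: summarize_lspci is a made-up helper name, and it assumes the usual "Kernel driver in use:" and "Kernel modules:" lines that lspci -k prints.

```shell
# Condense `lspci -nnk` output to "<address> <driver in use> <kernel modules>".
# A "-" means lspci printed no such line for that device.
summarize_lspci() {
    awk '
        /^[0-9a-f]/ {                       # unindented line starts a new device
            if (addr != "") print addr, drv, mods
            addr = $1; drv = "-"; mods = "-"
        }
        /Kernel driver in use:/ { drv = $NF }
        /Kernel modules:/       { sub(/.*Kernel modules: */, ""); mods = $0 }
        END { if (addr != "") print addr, drv, mods }
    '
}

# Only runs where lspci is installed; 07:00 is the slot from this post.
if command -v lspci >/dev/null 2>&1; then
    lspci -nnk -s 07:00 | summarize_lspci
fi
```

For the GPU and its audio function, the second column should read vfio-pci before the VM is started.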

One final touch: you have to use the full version of the Adrenalin drivers (about 700 MB), because the auto-detect installer (about 46 MB) reports that no AMD hardware was detected and doesn't install.

One small problem with iGPU passthrough: AMD GPUs all have trouble detaching from the VM. If you restart the VM, the iGPU will not start and you get error 43 in Windows. I tried a VM reset service and a host reset service. Sometimes the reset works, but in most cases I have to restart the host.

If you plan to do some light gaming, you have to install Sunshine or Parsec for streaming. The only caveat is that the 610M doesn't support many hardware encoding options (no HEVC, no AV1, no HDR), so you have to stream with more bandwidth and a higher bitrate. I suggest maxing out the bandwidth setting and using 60 fps.

I haven't installed Jellyfin as an external encoding service, so if you try it, write about your results.
 
