[SOLVED] Hailo 8 ai m.2 card crashes server when using passthrough.

scyto

Well-Known Member
Aug 8, 2023
572
135
53
Hi all, I have on obscure sceanrio and i have hit a dead end with my google-fu.

  • I have a Hailo 8 M.2 AI co-processor.
  • It is on an M.2 bifurcation card with 3 NVMes, the NVMEs work fine.
  • The co-processor is in its own IOMMU gtoup.
  • It works perfectly on the host (of course this is the last thing i tried, lol).

When i pass through the card to a VM and when the driver loads in the VM the whole physical server instantly resets. in the BMC i see a PCI SERR entry.
I dont seen any signs this is a graveful shutdown, it is a hard reset of the server.

Code:
  ID  |      TimeStamp      |    Sensor Name   |             Sensor Type            |                          Description                           
======|=====================|==================|====================================|================================================================
 736  | 05/19/2025 10:29:23 | BIOS             | critical_interrupt                 | PCIe SEL Log - Asserted
      |                     |                  |                                    | Data1: PCI SERR
      |                     |                  |                                    | Data2: PCI bus number for failed device: 0x00
      |                     |                  |                                    | Data3: PCI device number: 0x01 PCI function number: 0x01

things i have tried:
  • echo null to the reset_methods on the device (this was to supress FLR errors)
  • add a few vfio option to modprobe - namely
  • passing the device as pci instead of pcie
  • changing the PCIE speed in the BIOS of the x16 slot (this was at suggestion of asrock rack who make the mobo)
  • addng viommu=intel to the machine type
  • adding this to modprobe.d file options vfio-pci disable_vfio_pci_flr=1
Due to the sudden nature of the issue there is nothing that seems useful in jornactl on the host or on the guest below is the host.

Code:
May 19 10:27:53 pve-nas1 pvedaemon[2680]: start VM 101: UPID:pve-nas1:00000A78:00004CBB:682B6A19:qmstart:101:root@pam:
May 19 10:27:53 pve-nas1 pvedaemon[2008]: <root@pam> starting task UPID:pve-nas1:00000A78:00004CBB:682B6A19:qmstart:101:root@pam:
May 19 10:27:54 pve-nas1 chronyd[1807]: Selected source 73.65.80.137 (2.debian.pool.ntp.org)

thats the last line, it literally is a super hard reset

is there any magic someone has, or is this just one of those things where it doesn't work and i live with it?
 
I solved this, the device when it load the driver causes a hotplug event to occur

solution is to enable hotplug in the BIOS on the PCIE slot the card is connected to.

one small issue is that the device then disappears from the PCIE bus in the VM 2 seconds after the driver loads