Proxmox and Coral TPU M.2 Passthrough Broken on Newer Platform - PCI_NUM_PINS' failed

Seed

I got some new hardware with a Z890 chipset and am having trouble passing through the Coral TPU. I've had no issues doing this on various other machines (EPYC Milan, Intel Q670, a Lenovo P360) with 8.2.7. I have both an M.2 and an A+E key card that I experiment with. After I set up Proxmox 8.2.7 with kernel 6.8.12-3-pve and get the apex driver installed, everything seems OK from the Proxmox host's perspective. IOMMU is isolated to the device, but there is one issue:

Code:
[    0.103529] DMAR: IOMMU enabled
[    0.217422] DMAR-IR: IOAPIC id 2 under DRHD base  0xfc810000 IOMMU 1
[    0.399458] pci 0000:00:02.0: DMAR: Skip IOMMU disabling for graphics
[    0.453737] DMAR: IOMMU feature sc_support inconsistent

Code:
lspci -nn | grep 089a
01:00.0 System peripheral [0880]: Global Unichip Corp. Coral Edge TPU [1ac1:089a]

Code:
ls /dev/apex_0
/dev/apex_0

Code:
89:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU (prog-if ff)
    Subsystem: Global Unichip Corp. Coral Edge TPU
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 16
    IOMMU group: 35
    Region 0: Memory at 4000400000 (64-bit, prefetchable) [size=16K]
    Region 2: Memory at 4000300000 (64-bit, prefetchable) [size=1M]
    Capabilities: [80] Express (v2) Endpoint, MSI 00
        DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25W
        DevCtl:    CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta:    CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap:    Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl:    ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
            ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 5GT/s, Width x1
            TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
             10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS- TPHComp- ExtTPHComp-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5-5GT/s, Crosslink- Retimer- 2Retimers- DRS-
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
             EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [d0] MSI-X: Enable+ Count=128 Masked-
        Vector table: BAR=2 offset=00046800
        PBA: BAR=2 offset=00046068
    Capabilities: [e0] MSI: Enable- Count=1/32 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Capabilities: [f8] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [100 v1] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
    Capabilities: [108 v1] Latency Tolerance Reporting
        Max snoop latency: 15728640ns
        Max no snoop latency: 15728640ns
    Capabilities: [110 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
              PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
               T_CommonMode=0us LTR1.2_Threshold=30720ns
        L1SubCtl2: T_PwrOn=10us
    Capabilities: [200 v2] Advanced Error Reporting
        UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt:    DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap:    First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Kernel driver in use: apex
    Kernel modules: apex
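
A quick way to confirm that the TPU really is alone in its IOMMU group is to list the group's members from sysfs. This is only a minimal sketch, assuming the standard /sys/kernel/iommu_groups layout; the function name iommu_group_devices and its optional second argument (an alternate root directory) are made up for illustration.

```bash
# List the PCI addresses that share one IOMMU group.
# $1: group number (35 in the lspci output above)
# $2: optional root directory, defaults to the usual sysfs location
iommu_group_devices() {
    group="$1"
    root="${2:-/sys/kernel/iommu_groups}"
    for dev in "$root/$group"/devices/*; do
        # Skip the unexpanded glob when the group is empty or missing.
        [ -e "$dev" ] || continue
        basename "$dev"
    done
}
```

Running iommu_group_devices 35 on the host should print a single address if the device is isolated.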

When I add the PCI device and attempt to start the VM, I get the following error:


Code:
kvm: ../hw/pci/pci.c:1633: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
TASK ERROR: start failed: QEMU exited with code 1

dmesg shows:


Code:
[  287.110888] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  287.112455] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  287.112548] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  288.281835] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  288.283603] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  289.314424] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  319.169012] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  319.169114] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
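
To see at a glance which device is affected and how often, the log can be summarized per PCI address. A rough sketch, assuming the message format shown above; count_d3cold_errors is a hypothetical helper name, and it only reads text from stdin.

```bash
# Count "Unable to change power state from D3cold" messages per PCI address.
# Reads kernel log text on stdin and prints "<address> <count>" lines.
count_d3cold_errors() {
    grep 'Unable to change power state from D3cold' |
        # Keep only the domain:bus:device.function address before ": Unable".
        sed -E 's/.* ([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): Unable.*/\1/' |
        sort | uniq -c |
        awk '{ print $2, $1 }'
}
```

Usage on the host would be: dmesg | count_d3cold_errors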

And now the device is basically gone until I reboot:

Code:
ls /dev/apex_0
ls: cannot access '/dev/apex_0': No such file or directory

Code:
89:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU (prog-if ff)
    Subsystem: Global Unichip Corp. Coral Edge TPU
    !!! Unknown header type 7f
    Interrupt: pin ? routed to IRQ 16
    IOMMU group: 35
    Region 0: Memory at 4000400000 (64-bit, prefetchable) [size=16K]
    Region 2: Memory at 4000300000 (64-bit, prefetchable) [size=1M]
    Kernel driver in use: vfio-pci
    Kernel modules: apex
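
The "Unknown header type 7f" line usually means the config space reads are coming back as all ones, i.e. the device is no longer answering. A small sketch for spotting such devices in saved lspci -vv output; unreadable_devices is a made-up name, and it assumes device header lines start in the first column as in the output above.

```bash
# Print the address of every device whose lspci block contains
# "Unknown header type". Reads lspci -v/-vv text on stdin.
unreadable_devices() {
    awk '
        /^[0-9a-fA-F]/        { addr = $1 }    # unindented line starts a new device
        /Unknown header type/ { print addr }   # lspci could not parse the header
    '
}
```

Usage on the host would be: lspci -vv | unreadable_devices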


I've tried the following kernels. I expect the older ones to likely fail:

Code:
6.11.0-1-pve
6.8.12-3-pve
6.8.4-2-pve

The kernel driver in use has also changed (from apex to vfio-pci). I'm not sure what to make of this. I know this is a newer chipset, and the M.2 slot is on the CPU lanes as well. Maybe it's a bug? If anyone has tips on how to debug this better, I'm all for it.
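
One thing that might help narrow this down is checking what the kernel reports for the device's power state before and after the failed VM start. This is just a sketch: it assumes the power_state and d3cold_allowed attributes exist under /sys/bus/pci/devices on this kernel, and pci_power_info is a hypothetical helper that only reads them.

```bash
# Print the power-related sysfs attributes of one PCI device.
# $1: full PCI address, e.g. 0000:01:00.0
# $2: optional root directory, defaults to the usual sysfs location
pci_power_info() {
    addr="$1"
    root="${2:-/sys/bus/pci/devices}"
    for attr in power_state d3cold_allowed; do
        file="$root/$addr/$attr"
        if [ -r "$file" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$file")"
        else
            # Attribute missing on this kernel, or the device is gone.
            printf '%s=unavailable\n' "$attr"
        fi
    done
}
```

Usage on the host would be: pci_power_info 0000:01:00.0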
 
If you upgraded to these packages, this is why:

libpve-common-perl/stable 8.2.6 all [upgradable from: 8.2.5]
pve-firmware/stable,stable 3.14-1 all [upgradable from: 3.13-3]
qemu-server/stable 8.2.5 amd64 [upgradable from: 8.2.4]

Downgrade each package to the version shown after "upgradable from".

I ran into a PCI passthrough issue too; as soon as I downgraded, it worked fine.
 
apt-get install libpve-common-perl=8.2.5
apt-get install pve-firmware=3.13-3
apt-get install qemu-server=8.2.4

I don't know exactly which one fixed it but after running these commands everything works again.
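
The "downgrade to the version after upgradable from" rule can be sketched as a small filter that turns saved upgrade-list lines into the matching install commands. This is an illustration only: downgrade_commands is a made-up name, it assumes the line format shown above, and it just prints the commands without running them.

```bash
# Turn lines like
#   pve-firmware/stable,stable 3.14-1 all [upgradable from: 3.13-3]
# into
#   apt-get install pve-firmware=3.13-3
# Lines that do not match the pattern are ignored.
downgrade_commands() {
    sed -n -E 's#^([^/ ]+)/.*\[upgradable from: ([^]]+)\].*#apt-get install \1=\2#p'
}
```

Feed it the list you saved before upgrading, review the printed commands, and run them by hand. The older versions still need to be available from the repository for the install to succeed.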
 
Alright, I'll give it a go. However, 8.2.7 worked fine on a different build with the same M.2 card.