I got some new hardware with a Z890 chipset and am having trouble passing through the Coral TPU. I've had no issues doing this on various servers (EPYC Milan, Intel Q670, a Lenovo P360) with 8.2.7. I have both an M.2 and an A+E key version that I experiment with. After I set up Proxmox 8.2.7 (kernel 6.8.12-3-pve) and get the apex driver installed, everything seems OK from the Proxmox host's perspective. IOMMU is isolated to the device, but there is one issue:
Code:
[ 0.103529] DMAR: IOMMU enabled
[ 0.217422] DMAR-IR: IOAPIC id 2 under DRHD base 0xfc810000 IOMMU 1
[ 0.399458] pci 0000:00:02.0: DMAR: Skip IOMMU disabling for graphics
[ 0.453737] DMAR: IOMMU feature sc_support inconsistent
Code:
lspci -nn | grep 089a
01:00.0 System peripheral [0880]: Global Unichip Corp. Coral Edge TPU [1ac1:089a]
Code:
ls /dev/apex_0
/dev/apex_0
Code:
89:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU (prog-if ff)
Subsystem: Global Unichip Corp. Coral Edge TPU
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 16
IOMMU group: 35
Region 0: Memory at 4000400000 (64-bit, prefetchable) [size=16K]
Region 2: Memory at 4000300000 (64-bit, prefetchable) [size=1M]
Capabilities: [80] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x1
TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-5GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [d0] MSI-X: Enable+ Count=128 Masked-
Vector table: BAR=2 offset=00046800
PBA: BAR=2 offset=00046068
Capabilities: [e0] MSI: Enable- Count=1/32 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [f8] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
Capabilities: [108 v1] Latency Tolerance Reporting
Max snoop latency: 15728640ns
Max no snoop latency: 15728640ns
Capabilities: [110 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=30720ns
L1SubCtl2: T_PwrOn=10us
Capabilities: [200 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Kernel driver in use: apex
Kernel modules: apex
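To cross-check the isolation claim against the `IOMMU group: 35` line above, a minimal sketch like the following lists every PCI function the kernel has placed in a given group. It assumes the usual sysfs layout under `/sys/kernel/iommu_groups`. The function name `list_iommu_group` and the optional second argument (a sysfs root, there only so it can be pointed at a test directory) are made up for illustration.

```shell
#!/bin/sh
# List every PCI function in a given IOMMU group.
# Usage: list_iommu_group GROUP [SYSFS_ROOT]
# SYSFS_ROOT defaults to /sys (hypothetical parameter, for testing only).
list_iommu_group() {
    group="$1"
    root="${2:-/sys}"
    dir="$root/kernel/iommu_groups/$group/devices"
    if [ ! -d "$dir" ]; then
        echo "no such IOMMU group: $group" >&2
        return 1
    fi
    for dev in "$dir"/*; do
        # Skip the unexpanded glob pattern if the directory is empty.
        [ -e "$dev" ] || continue
        basename "$dev"
    done
}
```

Running `list_iommu_group 35` on the host should print only the TPU's address if the group really contains nothing else.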
When I add the PCI device and then attempt to start the VM, I get the following error:
Code:
kvm: ../hw/pci/pci.c:1633: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
TASK ERROR: start failed: QEMU exited with code 1
dmesg shows:
Code:
[ 287.110888] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 287.112455] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 287.112548] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 288.281835] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 288.283603] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 289.314424] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 319.169012] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 319.169114] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
And now the device is basically gone until I reboot:
Code:
ls /dev/apex_0
ls: cannot access '/dev/apex_0': No such file or directory
Code:
89:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU (prog-if ff)
Subsystem: Global Unichip Corp. Coral Edge TPU
!!! Unknown header type 7f
Interrupt: pin ? routed to IRQ 16
IOMMU group: 35
Region 0: Memory at 4000400000 (64-bit, prefetchable) [size=16K]
Region 2: Memory at 4000300000 (64-bit, prefetchable) [size=1M]
Kernel driver in use: vfio-pci
Kernel modules: apex
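One way to look closer at the D3cold messages above is to read the device's driver binding and power attributes straight from sysfs. This is only a sketch: it assumes the `power_state` and `d3cold_allowed` attributes are exposed for the device on this kernel, and the function name `show_pci_power` and its optional sysfs-root argument are made up for illustration.

```shell
#!/bin/sh
# Print driver binding and power-management state for one PCI function.
# Usage: show_pci_power BDF [SYSFS_ROOT]
# SYSFS_ROOT defaults to /sys (hypothetical parameter, for testing only).
show_pci_power() {
    bdf="$1"
    root="${2:-/sys}"
    dev="$root/bus/pci/devices/$bdf"
    if [ ! -d "$dev" ]; then
        echo "no such PCI device: $bdf" >&2
        return 1
    fi
    # "driver" is a symlink to the bound driver; it is absent when unbound.
    if [ -e "$dev/driver" ]; then
        echo "driver=$(basename "$(readlink -f "$dev/driver")")"
    else
        echo "driver=none"
    fi
    # Report each power attribute, or note that it is not exposed.
    for attr in power_state d3cold_allowed; do
        if [ -r "$dev/$attr" ]; then
            echo "$attr=$(cat "$dev/$attr")"
        else
            echo "$attr=unavailable"
        fi
    done
}
```

For example, `show_pci_power 0000:01:00.0` can be run before and after the failed VM start to compare the two states. Writing `0` to `d3cold_allowed` as root before starting the VM is one experiment worth trying, though it may not help on this platform.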
I've tried the following kernels. I expect the older ones to likely fail:
Code:
6.11.0-1-pve
6.8.12-3-pve
6.8.4-2-pve
The kernel driver in use has also changed (from apex to vfio-pci). I'm not sure what to make of this. I know this is a newer chipset. The M.2 slot is on the CPU lanes as well. Maybe it's a bug? If anyone has tips on how to debug this better, I'm all for it.