Coral TPU (pcie) on Proxmox

edge69

New Member
Jan 9, 2024
I have been trying for hours to get the Coral TPU (installed in a PCIe slot) operating correctly in Proxmox.

I'm on the latest dev kernel, 6.5.11-7.

I've followed the instructions on the Coral website (and tried many workarounds from Google searches!) and am at a loss.

The Coral TPU is recognised correctly:

lspci -nn | grep 089a
06:00.0 System peripheral [0880]: Global Unichip Corp. Coral Edge TPU [1ac1:089a]

Loading the PCIe driver reports success, but I have no /dev/apex device:

ls /dev/apex_0
ls: cannot access '/dev/apex_0': No such file or directory

I therefore cannot get the device working (before then passing it through to a Home Assistant VM).
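Before going further, it's worth confirming on the PVE host whether the apex/gasket modules are even built and loaded, and which driver owns the card. A hedged diagnostic sequence (assumes a standard Proxmox/Debian toolset, run as root):

```shell
# Is the apex module built for the running kernel (DKMS should have built it)?
modinfo apex 2>/dev/null | grep -E '^(filename|vermagic)'
# Are apex/gasket currently loaded?
lsmod | grep -E '^(apex|gasket)'
# Which driver is actually bound to the TPU right now?
lspci -nnk -d 1ac1:089a
# Any apex/gasket messages from boot?
dmesg | grep -iE 'apex|gasket'
```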

Output of lspci -vvv:

06:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU (prog-if ff)
Subsystem: Global Unichip Corp. Coral Edge TPU
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 255
IOMMU group: 20
Region 0: Memory at 4012200000 (64-bit, prefetchable) [disabled] [size=16K]
Region 2: Memory at 4012100000 (64-bit, prefetchable) [disabled] [size=1M]
Capabilities: [80] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 10W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x1
TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-5GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [d0] MSI-X: Enable- Count=128 Masked-
Vector table: BAR=2 offset=00046800
PBA: BAR=2 offset=00046068
Capabilities: [e0] MSI: Enable- Count=1/32 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [f8] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
Capabilities: [108 v1] Latency Tolerance Reporting
Max snoop latency: 3145728ns
Max no snoop latency: 3145728ns
Capabilities: [110 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=81920ns
L1SubCtl2: T_PwrOn=10us
Capabilities: [200 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Kernel driver in use: vfio-pci

No apex or gasket drivers appear.
Can anyone help?!
So frustrating!

Thanks
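Note the "Kernel driver in use: vfio-pci" line in the dump above: the TPU has been claimed by vfio-pci (usually via an ids= entry somewhere in /etc/modprobe.d), so the apex driver never gets a chance to bind and /dev/apex_0 is never created. A hedged sketch of reclaiming the card for the host, assuming the 06:00.0 address from the dump and that gasket-dkms is installed:

```shell
# Release the device from vfio-pci (run as root)
echo 0000:06:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind
# Load the apex driver (pulls in gasket as a dependency) and bind the card
modprobe apex
echo 0000:06:00.0 > /sys/bus/pci/drivers/apex/bind
ls -l /dev/apex_0
# To make this permanent, remove 1ac1:089a from any "options vfio-pci ids=..."
# line under /etc/modprobe.d/ and rebuild the initramfs:
#   update-initramfs -u -k all
```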
 
Working on mine running pve-manager/7.4-17/513c62be. I've not tried on v8 as the python3 version is too high... but if you're passing it through, I doubt that matters. I've found that power usage goes up, as C-states don't drop lower than C3.

My cheat sheet is as follows.

apt update && apt dist-upgrade -y
apt install wget curl git -y
apt install pve-headers -y
echo "deb [signed-by=/etc/apt/keyrings/coral.gpg] https://packages.cloud.google.com/apt coral-edgetpu-stable main" | tee /etc/apt/sources.list.d/coral-edgetpu.list
mkdir -m 0755 -p /etc/apt/keyrings/
curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor -o /etc/apt/keyrings/coral.gpg
apt update
apt-get install gasket-dkms libedgetpu1-std -y
sh -c "echo 'SUBSYSTEM==\"apex\", MODE=\"0660\", GROUP=\"apex\"' >> /etc/udev/rules.d/65-apex.rules"
groupadd apex
adduser $USER apex

reboot

lspci -nn | grep 089a
ls /dev/apex_0
lspci -v
apt-get reinstall gasket-dkms libedgetpu1-std

reboot and test

apt-get install python3-pycoral
mkdir coral
cd coral/
git clone https://github.com/google-coral/pycoral.git
cd pycoral
bash examples/install_requirements.sh classify_image.py
python3 examples/classify_image.py --model test_data/mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite --labels test_data/inat_bird_labels.txt --input test_data/parrot.jpg

Hope this helps.
 
I am also having trouble getting my Coral PCIe A+E key working in an AliExpress N100 mini PC.

What is very interesting (strange?) is that even without any drivers present (no gasket, no libcoral, no blacklist), the lspci command sometimes hangs for 10-15 seconds before listing PCI devices.

When this happens, I see the following in logs:

Code:
[   19.751415] pcieport 0000:00:1d.3: broken device, retraining non-functional downstream link at 2.5GT/s
[   20.755345] pcieport 0000:00:1d.3: retraining failed
[   21.999305] pcieport 0000:00:1d.3: broken device, retraining non-functional downstream link at 2.5GT/s
[   23.003265] pcieport 0000:00:1d.3: retraining failed
[   23.003270] vfio-pci 0000:06:00.0: not ready 1023ms after bus reset; waiting
[   24.047219] vfio-pci 0000:06:00.0: not ready 2047ms after bus reset; waiting
[   26.159132] vfio-pci 0000:06:00.0: not ready 4095ms after bus reset; waiting
[   30.510911] vfio-pci 0000:06:00.0: not ready 8191ms after bus reset; waiting
[   38.958594] vfio-pci 0000:06:00.0: not ready 16383ms after bus reset; waiting
[   56.366006] vfio-pci 0000:06:00.0: not ready 32767ms after bus reset; waiting
[   91.180525] vfio-pci 0000:06:00.0: not ready 65535ms after bus reset; giving up
[   91.253880] vfio-pci 0000:06:00.0: Unable to change power state from D0 to D3hot, device inaccessible

The mini PC otherwise works fine; PCIe passthrough is also working fine (I'm already passing through Intel igc 2.5GbE adapters to pfSense). When lspci works, ASPM reports as disabled. ASPM is disabled at device level on all PCIe ports (this is my BIOS default).

After a while, I can only get to this:

Code:
06:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU (prog-if ff)
        Subsystem: Global Unichip Corp. Coral Edge TPU
        !!! Unknown header type 7f
        Interrupt: pin ? routed to IRQ 19
        IOMMU group: 19
        Region 0: Memory at 60e0100000 (64-bit, prefetchable) [size=16K]
        Region 2: Memory at 60e0000000 (64-bit, prefetchable) [size=1M]
        Kernel driver in use: vfio-pci

Also tried adding options vfio-pci ids=1ac1:089a disable_idle_d3=1 to modprobe.d, but got the same result, except the error message is now different:

Code:
Unable to change power state from D3cold to D0, device inaccessible

Is this a hardware issue with the Coral PCIe card? A hardware issue with the mini PC? A BIOS configuration issue?
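The log blames the root port (0000:00:1d.3) rather than the TPU itself, so interrogating that port directly may narrow it down. A hedged diagnostic sketch, with addresses taken from the log above:

```shell
# Link capability/status of the root port above the Coral
lspci -vv -s 00:1d.3 | grep -E 'LnkCap|LnkCtl|LnkSta'
# Collect all link-training and vfio chatter from this boot
journalctl -b -k | grep -iE 'pcieport|vfio-pci|retrain'
# As an experiment, boot with ASPM forced off kernel-wide:
# add pcie_aspm=off to GRUB_CMDLINE_LINUX_DEFAULT, then run update-grub
```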
 
The problem with building your own DKMS modules is that they need to be signed on a Secure Boot enabled system; otherwise the driver will not load. Today I am spending more time trying to figure out how to do just that.

If anyone has already found the best way, please let us know.
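For what it's worth, recent dkms (3.x, as shipped with Debian 12) generates its own signing key and signs built modules automatically; what's left is enrolling that key as a Machine Owner Key. A hedged sketch — the /var/lib/dkms path is the dkms 3.x default and may differ on your system:

```shell
# Import the public half of the dkms signing key (sets a one-time password;
# confirm the import in the blue MOK manager screen on the next boot)
mokutil --import /var/lib/dkms/mok.pub
reboot
# After the reboot, verify Secure Boot state and that the module now loads
mokutil --sb-state
modprobe apex && lsmod | grep apex
```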
 
I was thinking of moving my Blue Iris VM into Proxmox. Does the Coral PCIe A+E key pass through on Proxmox?
 
Why do you guys want to load the apex driver at the hypervisor (Proxmox) level? I'd assume you want to pass the Coral through to a VM? Or is it a case of using the device in a container? In the former case, you want to avoid loading it on the host.
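For the VM route, the setup is the mirror image of the host setup: keep gasket-dkms off the host, pin the TPU to vfio-pci, and install the Coral driver inside the guest. A hedged sketch, assuming VMID 100 and the 1ac1:089a / 06:00.0 identifiers from the posts above:

```shell
# Reserve the TPU for vfio at boot (do NOT install gasket-dkms on the host)
echo "options vfio-pci ids=1ac1:089a" > /etc/modprobe.d/coral-vfio.conf
update-initramfs -u -k all
reboot
# Attach the card to the guest; gasket/apex then get installed inside the VM
qm set 100 -hostpci0 0000:06:00.0
```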
 
Minor modification to the step that adds the repository:

Code:
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor -o /etc/apt/trusted.gpg.d/coral-edgetpu.gpg

Also run
Code:
apt update
after adding the repo, and before continuing to install gasket-dkms and libedgetpu1-std.
 
Why do you guys want to load the apex driver at the hypervisor (Proxmox) level? I'd assume you want to pass the Coral through to a VM? Or is it a case of using the device in a container? In the former case, you want to avoid loading it on the host.
Because running it within an LXC container is more power and performance efficient... I converted every VM I could into an LXC, and CPU utilisation is barely visible (1-2%), whereas with plenty of VMs running it was about 20%. Power draw also drops significantly - from over 100 W to 30 W on my machine. That is why...
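For the LXC route, the apex driver does stay on the host and the container just gets the device node bind-mounted in. A hedged fragment for /etc/pve/lxc/<CTID>.conf — the 120:0 major:minor is an assumption, check yours with ls -l /dev/apex_0:

```
lxc.cgroup2.devices.allow: c 120:0 rwm
lxc.mount.entry: /dev/apex_0 dev/apex_0 none bind,optional,create=file
```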
 
