Coral M.2 Accelerator dual edge tpu with Dual Edge TPU Adapter - PCIe x1 Low Profile only one tpu working #53

Open
duindain opened this issue Mar 27, 2024 · 8 comments

Comments

@duindain

duindain commented Mar 27, 2024

Hi,

Hoping someone can help diagnose this.

I've bought an M.2 Dual Edge TPU accelerator and an adapter from makerfab.

I've just got all the cameras working and am running Frigate, but the Frigate container constantly restarts, saying it can't find one of the TPUs.

I'm running Frigate in a Docker container.

dmesg looks like it's reporting an error, possibly from the adapter or the accelerator:

ls -l /dev/apex*
crw-rw---- 1 root apex 120, 0 Mar 25 18:23 /dev/apex_0
crw-rw---- 1 root apex 120, 1 Mar 25 18:23 /dev/apex_1
ls /sys/class/apex/
apex_0  apex_1
dmesg | grep apex
[   35.356036] apex 0000:05:00.0: enabling device (0000 -> 0002)
[   35.371387] apex 0000:06:00.0: enabling device (0000 -> 0002)
[   40.512237] apex 0000:05:00.0: Apex performance not throttled due to temperature
[   48.172225] apex 0000:06:00.0: RAM did not enable within timeout (12000 ms)
[   53.312236] apex 0000:06:00.0: Apex performance not throttled due to temperature
[   58.432237] apex 0000:06:00.0: Apex performance not throttled due to temperature
[   63.552236] apex 0000:06:00.0: Apex performance not throttled due to temperature
[   68.672243] apex 0000:06:00.0: Apex performance not throttled due to temperature
[   83.300290] apex 0000:06:00.0: RAM did not enable within timeout (12000 ms)
[   83.300301] apex 0000:06:00.0: Error in device open cb: -110
[   83.300315] apex 0000:06:00.0: Apex performance not throttled due to temperature
[   88.384283] apex 0000:06:00.0: Apex performance not throttled due to temperature
[   93.504246] apex 0000:06:00.0: Apex performance not throttled due to temperature
[   98.628059] apex 0000:06:00.0: Apex performance not throttled due to temperature
[  115.671423] apex 0000:06:00.0: RAM did not enable within timeout (12000 ms)
[  115.671430] apex 0000:06:00.0: Error in device open cb: -110
[  115.671442] apex 0000:06:00.0: Apex performance not throttled due to temperature
[  120.895350] apex 0000:06:00.0: Apex performance not throttled due to temperature
[  126.015273] apex 0000:06:00.0: Apex performance not throttled due to temperature
[  131.135234] apex 0000:06:00.0: Apex performance not throttled due to temperature
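
To make the pattern in the dmesg output above easier to see, here is a minimal Python sketch that counts the error-looking apex messages per PCI address. The function name `summarize_apex_errors`, the regular expression, and the rule for what counts as an error are assumptions made for illustration; they are not part of the apex driver or of Frigate. On Linux, error code -110 corresponds to ETIMEDOUT, which matches the "RAM did not enable within timeout" message.

```python
import re
from collections import Counter

# Matches lines such as:
# [   48.172225] apex 0000:06:00.0: RAM did not enable within timeout (12000 ms)
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+apex\s+(?P<dev>[0-9a-f:.]+):\s+(?P<msg>.*)$"
)


def summarize_apex_errors(dmesg_text):
    """Count error-looking apex messages per PCI address.

    Assumption: a message is an error if it mentions a RAM enable timeout
    or starts with "Error". Other messages are ignored.
    """
    errors = {}
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        msg = match.group("msg")
        if "did not enable" in msg or msg.startswith("Error"):
            errors.setdefault(match.group("dev"), Counter())[msg] += 1
    return errors


# A few lines copied from the dmesg output in this issue.
SAMPLE = """\
[   35.356036] apex 0000:05:00.0: enabling device (0000 -> 0002)
[   35.371387] apex 0000:06:00.0: enabling device (0000 -> 0002)
[   40.512237] apex 0000:05:00.0: Apex performance not throttled due to temperature
[   48.172225] apex 0000:06:00.0: RAM did not enable within timeout (12000 ms)
[   83.300290] apex 0000:06:00.0: RAM did not enable within timeout (12000 ms)
[   83.300301] apex 0000:06:00.0: Error in device open cb: -110
"""

if __name__ == "__main__":
    for device, counts in summarize_apex_errors(SAMPLE).items():
        for message, count in counts.items():
            print(f"{device}: {count} x {message}")
```

In the sample, every error belongs to 0000:06:00.0 and none to 0000:05:00.0, which is consistent with only one of the two TPUs failing to initialise.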

frigate docker compose file

version: "3.9"
services:
  frigate:
    container_name: frigate
    privileged: true # this may not be necessary for all setups
    restart: unless-stopped
    image: ghcr.io/blakeblackshear/frigate:stable
    shm_size: "850mb" # update for your cameras based on calculation above
    devices:
      #- /dev/bus/usb:/dev/bus/usb # passes the USB Coral, needs to be modified for other versions
      - /dev/apex_0:/dev/apex_0 # passes a PCIe Coral, follow driver instructions here https://coral.ai/docs/m2/get-started/#2a-on-linux
      - /dev/apex_1:/dev/apex_1
      #- /dev/dri/renderD128 # for intel hwaccel, needs to be updated for your hardware
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /home/user/Software/Scripts/docker/frigate/config/frigate.yml:/config/config.yml
      - /home/user/Software/Scripts/docker/frigate/config/go2rtc:/config/go2rtc
      - /mnt/CamFootage:/media/frigate
      - /home/user/Software/Scripts/docker/frigate:/db
      - type: tmpfs # Optional: 1GB of memory, reduces SSD/SD Card wear
        target: /tmp/cache
        tmpfs:
          size: 1000000000
    networks:
      - enp8s0
    ports:
      - "5001:5000"
      - "1935:1935" # RTMP feeds
      - "8554:8554" # RTSP feeds
      - "8555:8555/tcp" # WebRTC over tcp
      - "8555:8555/udp" # WebRTC over udp
    environment:
      FRIGATE_RTSP_PASSWORD: "password"

networks:
  enp8s0:

Frigate logs

2024-03-26 01:50:01.995727962  [INFO] Preparing Frigate...
2024-03-26 01:50:01.996292271  [INFO] Starting NGINX...
2024-03-26 01:50:01.998032480  [INFO] Preparing new go2rtc config...
s6-rc: info: service legacy-services successfully started
2024-03-26 01:50:02.002815522  [INFO] Starting Frigate...
2024-03-26 01:50:02.190391428  [INFO] Starting go2rtc...
2024-03-26 01:50:02.232511640  01:50:02.232 INF go2rtc version 1.8.4 linux/amd64
2024-03-26 01:50:02.232851879  01:50:02.232 INF [rtsp] listen addr=:8554
2024-03-26 01:50:02.232879651  01:50:02.232 INF [api] listen addr=:1984
2024-03-26 01:50:02.232996601  01:50:02.232 INF [webrtc] listen addr=:8555
2024-03-26 01:50:02.752193839  [2024-03-26 01:50:02] frigate.app                    INFO    : Starting Frigate (0.13.2-6476f8a)
2024-03-26 01:50:02.824835738  [2024-03-26 01:50:02] peewee_migrate.logs            INFO    : Starting migrations
2024-03-26 01:50:02.827668738  [2024-03-26 01:50:02] peewee_migrate.logs            INFO    : There is nothing to migrate
2024-03-26 01:50:02.833159370  [2024-03-26 01:50:02] frigate.app                    INFO    : Recording process started: 729
2024-03-26 01:50:02.834944512  [2024-03-26 01:50:02] frigate.app                    INFO    : go2rtc process pid: 89
2024-03-26 01:50:02.856538296  [2024-03-26 01:50:02] detector.coral1                INFO    : Starting detection process: 739
2024-03-26 01:50:02.862384796  [2024-03-26 01:50:02] detector.coral2                INFO    : Starting detection process: 744
2024-03-26 01:50:02.863193935  [2024-03-26 01:50:02] frigate.detectors.plugins.edgetpu_tfl INFO    : Attempting to load TPU as pci:0
2024-03-26 01:50:02.866746978  [2024-03-26 01:50:02] frigate.detectors.plugins.edgetpu_tfl INFO    : TPU found
2024-03-26 01:50:02.866967582  [2024-03-26 01:50:02] frigate.app                    INFO    : Output process started: 761
2024-03-26 01:50:02.882466830  [2024-03-26 01:50:02] frigate.app                    INFO    : Camera processor started for camera1: 768
2024-03-26 01:50:02.888466686  [2024-03-26 01:50:02] frigate.app                    INFO    : Camera processor started for camera2: 770
2024-03-26 01:50:02.894560801  [2024-03-26 01:50:02] frigate.app                    INFO    : Camera processor started for camera3: 771
2024-03-26 01:50:02.900379829  [2024-03-26 01:50:02] frigate.app                    INFO    : Camera processor started for camera4: 773
2024-03-26 01:50:02.906395507  [2024-03-26 01:50:02] frigate.app                    INFO    : Camera processor started for camera5: 776
2024-03-26 01:50:02.918211517  [2024-03-26 01:50:02] frigate.app                    INFO    : Camera processor started for camera6: 778
2024-03-26 01:50:02.919196908  [2024-03-26 01:50:02] frigate.app                    INFO    : Camera processor started for doorcam: 781
2024-03-26 01:50:02.925658092  [2024-03-26 01:50:02] frigate.app                    INFO    : Capture process started for camera1: 783
2024-03-26 01:50:02.931928107  [2024-03-26 01:50:02] frigate.app                    INFO    : Capture process started for camera2: 789
2024-03-26 01:50:02.938740020  [2024-03-26 01:50:02] frigate.app                    INFO    : Capture process started for camera3: 795
2024-03-26 01:50:02.944340839  [2024-03-26 01:50:02] frigate.app                    INFO    : Capture process started for camera4: 800
2024-03-26 01:50:02.950290741  [2024-03-26 01:50:02] frigate.app                    INFO    : Capture process started for camera5: 806
2024-03-26 01:50:02.957319010  [2024-03-26 01:50:02] frigate.app                    INFO    : Capture process started for camera6: 827
2024-03-26 01:50:02.963700124  [2024-03-26 01:50:02] frigate.app                    INFO    : Capture process started for doorcam: 830
2024-03-26 01:50:11.998523966  [INFO] Starting go2rtc healthcheck service...
2024-03-26 01:50:15.730801593  [2024-03-26 01:50:02] frigate.detectors.plugins.edgetpu_tfl INFO    : Attempting to load TPU as pci:1
2024-03-26 01:50:15.730965721  [2024-03-26 01:50:15] frigate.detectors.plugins.edgetpu_tfl ERROR   : No EdgeTPU was detected. If you do not have a Coral device yet, you must configure CPU detectors.
2024-03-26 01:50:15.730982292  Process detector:coral2:
2024-03-26 01:50:15.732237840  Traceback (most recent call last):
2024-03-26 01:50:15.732251045    File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 160, in load_delegate
2024-03-26 01:50:15.732251827      delegate = Delegate(library, options)
2024-03-26 01:50:15.732252678    File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 119, in __init__
2024-03-26 01:50:15.732255824      raise ValueError(capture.message)
2024-03-26 01:50:15.732264821  ValueError
2024-03-26 01:50:15.732280571
2024-03-26 01:50:15.732281462  During handling of the above exception, another exception occurred:
2024-03-26 01:50:15.732281993
2024-03-26 01:50:15.732282755  Traceback (most recent call last):
2024-03-26 01:50:15.732303764    File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
2024-03-26 01:50:15.732304536      self.run()
2024-03-26 01:50:15.732305658    File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
2024-03-26 01:50:15.732307110      self._target(*self._args, **self._kwargs)
2024-03-26 01:50:15.732307882    File "/opt/frigate/frigate/object_detection.py", line 102, in run_detector
2024-03-26 01:50:15.732309405      object_detector = LocalObjectDetector(detector_config=detector_config)
2024-03-26 01:50:15.732328400    File "/opt/frigate/frigate/object_detection.py", line 53, in __init__
2024-03-26 01:50:15.732329753      self.detect_api = create_detector(detector_config)
2024-03-26 01:50:15.732330735    File "/opt/frigate/frigate/detectors/__init__.py", line 18, in create_detector
2024-03-26 01:50:15.732331336      return api(detector_config)
2024-03-26 01:50:15.732332117    File "/opt/frigate/frigate/detectors/plugins/edgetpu_tfl.py", line 41, in __init__
2024-03-26 01:50:15.732332929      edge_tpu_delegate = load_delegate("libedgetpu.so.1.0", device_config)
2024-03-26 01:50:15.732333901    File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 162, in load_delegate
2024-03-26 01:50:15.732364258      raise ValueError('Failed to load delegate from {}\n{}'.format(
2024-03-26 01:50:15.732365560  ValueError: Failed to load delegate from libedgetpu.so.1.0
2024-03-26 01:50:15.732366051
2024-03-26 01:50:23.098926324  [2024-03-26 01:50:23] frigate.watchdog               INFO    : Detection appears to have stopped. Exiting Frigate...
s6-rc: info: service legacy-services: stopping
s6-rc: info: service legacy-services successfully stopped
s6-rc: info: service nginx: stopping
s6-rc: info: service go2rtc-healthcheck: stopping
2024-03-26 01:50:23.115132162  [INFO] The go2rtc-healthcheck service exited with code 256 (by signal 15)
s6-rc: info: service go2rtc-healthcheck successfully stopped
2024-03-26 01:50:23.252760402  [INFO] Service NGINX exited with code 0 (by signal 0)
s6-rc: info: service nginx successfully stopped
s6-rc: info: service nginx-log: stopping
s6-rc: info: service frigate: stopping
2024-03-26 01:50:23.258859999  [2024-03-26 01:50:23] frigate.app                    INFO    : Stopping...
s6-rc: info: service nginx-log successfully stopped
2024-03-26 01:50:23.259005072  [2024-03-26 01:50:23] root                           INFO    : Waiting for detection process to exit gracefully...
2024-03-26 01:50:23.259067950  [2024-03-26 01:50:23] frigate.stats                  INFO    : Exiting stats emitter...
2024-03-26 01:50:23.259199507  [2024-03-26 01:50:23] frigate.watchdog               INFO    : Exiting watchdog...
2024-03-26 01:50:23.259248820  [2024-03-26 01:50:23] frigate.ptz.autotrack          INFO    : Exiting autotracker...
2024-03-26 01:50:23.259361050  [2024-03-26 01:50:23] frigate.storage                INFO    : Exiting storage maintainer...
2024-03-26 01:50:23.259403901  [2024-03-26 01:50:23] frigate.record.cleanup         INFO    : Exiting recording cleanup...
2024-03-26 01:50:23.259478781  [2024-03-26 01:50:23] frigate.events.cleanup         INFO    : Exiting event cleanup...
2024-03-26 01:50:23.260938894  [2024-03-26 01:50:23] detector.coral1                INFO    : Signal to exit detection process...
2024-03-26 01:50:23.263547072  Fatal Python error: Segmentation fault

If I comment out the line - /dev/apex_1:/dev/apex_1 in the frigate config and restart the Frigate container, it runs without rebooting, and dmesg stops reporting:
[ 115.671430] apex 0000:06:00.0: Error in device open cb: -110

I've removed the adapter, checked that it's seated well and free of dust, and reinserted it into the PCIe slot.

CPU: Ryzen 7 5700G
Motherboard: B550M Steel Legend
GPU: Onboard
OS: Linux Mint 21.3 Virginia

@magic-blue-smoke
Owner

Hi @duindain
Could you please try/tell:

  • comment out apex_0 instead of apex_1 to see if there's any difference?
  • do you have a heatsink for TPUs?

@duindain
Author

duindain commented Apr 7, 2024

In Docker, passing through any of these works fine individually when the Frigate config is only using pci:0:

- /dev/apex_0:/dev/apex_1
- /dev/apex_1:/dev/apex_0
- /dev/apex_1:/dev/apex_1
- /dev/apex_0:/dev/apex_0

If I set the Frigate config to use pci:1, it fails.

I don't have a heatsink at the moment, but I can add one.
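
One way to check each TPU index outside Frigate is to attempt the same delegate load that appears in the traceback above, `load_delegate("libedgetpu.so.1.0", device_config)`, once per device. The sketch below is a hedged illustration: `probe_edgetpus` and its `loader` parameter are hypothetical names, and the `{"device": "pci:N"}` option format is assumed from the "Attempting to load TPU as pci:0" log lines. It needs `tflite_runtime` and `libedgetpu` installed, for example inside the Frigate container.

```python
def probe_edgetpus(indices, loader=None):
    """Try to load the Edge TPU delegate for each PCI index.

    Returns a dict mapping "pci:N" to None on success, or to the error text
    on failure. `loader` exists only so the function can be exercised without
    hardware; by default it is tflite_runtime's load_delegate.
    """
    if loader is None:
        # Imported lazily so the module can be loaded without tflite_runtime.
        from tflite_runtime.interpreter import load_delegate as loader

    results = {}
    for index in indices:
        device = f"pci:{index}"
        try:
            loader("libedgetpu.so.1.0", {"device": device})
            results[device] = None
        except ValueError as exc:
            # The traceback in this issue shows load_delegate raising ValueError.
            results[device] = str(exc)
    return results
```

Calling `probe_edgetpus([0, 1])` with both devices passed through should show whether pci:1 fails on its own, independent of Frigate's watchdog restarting the container.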

@magic-blue-smoke
Owner

@duindain please try it with a heatsink, as it's needed anyway. If that doesn't help, we'll consider replacing the adapter.

@duindain
Author

duindain commented Apr 12, 2024

I've put a passive heatsink on with a thermal pad. It's definitely not high quality, but the case is well ventilated, has a 120mm fan, and it's fairly cool here at the moment (14-20°C ambient).

I'm not sure if this is accurate or how you are meant to check (there didn't seem to be much info out there), but I get these values, both when passing through just apex_0 from Docker and when passing through both:

cat /sys/class/apex/apex_0/temp returns 48300, varying within a range, so 46-48 degrees C
cat /sys/class/apex/apex_1/temp returns -89700, and it seems to always return this number

I assume the -89700 is because it's not being used, or because it's just not running?

I've tried a few combinations, but apex_1 always seems to return -89700 regardless.

The temp drops a bit when I configure Frigate to use both TPUs, presumably because it's spending all its time rebooting and not actually sending anything to be processed.
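
The readings above can be collected and sanity-checked with a short Python sketch. It follows the interpretation used in this comment, that the sysfs value is in millidegrees Celsius (48300 read as about 48 degrees C). The function names and the plausibility bounds of 0 to 110 degrees C are assumptions chosen for illustration, not values taken from the driver.

```python
from pathlib import Path


def parse_apex_temp(raw):
    """Convert a raw sysfs reading (assumed millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000.0


def is_plausible(temp_c, low=0.0, high=110.0):
    """Flag readings outside an assumed sane range for a running chip."""
    return low <= temp_c <= high


def read_apex_temps(sysfs_root="/sys/class/apex"):
    """Read the temp file of every apex_* device found under sysfs_root."""
    temps = {}
    for temp_file in sorted(Path(sysfs_root).glob("apex_*/temp")):
        temps[temp_file.parent.name] = parse_apex_temp(temp_file.read_text())
    return temps


if __name__ == "__main__":
    for name, temp_c in read_apex_temps().items():
        note = "" if is_plausible(temp_c) else " (implausible reading)"
        print(f"{name}: {temp_c:.1f} C{note}")
```

With the values reported here, apex_0 parses to 48.3 degrees C and apex_1 to -89.7 degrees C. The second value falls outside the assumed range, so it is more likely a sign that the sensor is not being read correctly than a real temperature.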

@magic-blue-smoke
Owner

@duindain it feels like something's wrong with either the TPU card or the adapter itself. If you can't inspect the flip chips on your TPU card with a microscope or try another card, we can try replacing the adapter.

@duindain
Author

@magic-blue-smoke unfortunately the best I have is a magnifying lens, and I can't see anything that looks broken or badly soldered. I don't have another card to try.

@magic-blue-smoke
Owner

@duindain we can try replacing the adapter board. Could you contact me using the contact form at the bottom of the page?

@duindain
Author

duindain commented May 3, 2024

Thanks, I've sent a message with order details and other info.
