Indexify - Extraction and Retrieval from Videos, PDF and Audio for Interactive AI Applications

LLM applications backed by Indexify are designed not to answer from outdated information, because indexes are updated as new data is ingested.

Indexify is an open-source engine for building fast data pipelines for unstructured data (video, audio, images and documents) using reusable extractors for embedding, transformation and feature extraction. LLM applications can query the transformed, LLM-friendly content through semantic search and SQL queries.

Indexify keeps vector databases and structured databases (Postgres) updated by automatically invoking the pipelines as new data is ingested into the system from external data sources.

Why use Indexify

Makes unstructured data queryable with SQL and semantic search.
Real-time extraction engine that keeps indexes updated automatically as new data is ingested.
Extraction Graphs describe data transformation, embedding extraction and structured extraction.
Incremental extraction and selective deletion when content is updated or deleted.
The Extractor SDK allows adding new extraction capabilities, and many extractors are readily available for PDF, image and video indexing and extraction.
Works with any LLM framework, including Langchain and DSPy.
Runs on your laptop during prototyping and scales to thousands of machines in the cloud.
Works with many blob stores, vector stores and structured databases.
Open-source automation is available for deploying to Kubernetes in production.

Detailed Getting Started

To get started follow our documentation.

Quick Start

Download and start Indexify

curl https://getindexify.ai | sh
./indexify server -d

Install the Indexify Extractor and Client SDKs

virtualenv ve
source ve/bin/activate
pip install indexify indexify-extractor-sdk

Download some extractors

indexify-extractor download hub://embedding/minilm-l6
indexify-extractor download hub://pdf/pdf-extractor
indexify-extractor download hub://image/yolo
indexify-extractor download hub://text/chunking
indexify-extractor download hub://audio/whisper-asr
indexify-extractor join-server

Basic RAG

This example shows how to implement RAG on text

Create an Extraction Graph

from indexify import IndexifyClient, ExtractionGraph
client = IndexifyClient()

extraction_graph_spec = """
name: 'sportsknowledgebase'
extraction_policies:
   - extractor: 'tensorlake/minilm-l6'
     name: 'minilml6'
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph) 
print("indexes", client.indexes())

Add Texts

client.add_documents("sportsknowledgebase", ["Adam Silver is the NBA Commissioner", "Roger Goodell is the NFL commissioner"])

Retrieve

context = client.search_index(name="sportsknowledgebase.minilml6.embedding", query="NBA commissioner", top_k=1)
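The index name passed to search_index above follows the pattern graph name, policy name, then "embedding", joined by dots. The sketch below composes that name and turns search results into a prompt for an LLM. The helpers index_name and build_prompt are hypothetical, and the shape of a search result (a dict with a "text" key) is an assumption that this README does not confirm; check the client documentation before relying on it.

```python
def index_name(graph: str, policy: str, output: str = "embedding") -> str:
    """Compose an index name in the '<graph>.<policy>.<output>' form used above."""
    return f"{graph}.{policy}.{output}"


def build_prompt(question: str, results: list) -> str:
    """Join retrieved passages into a context block followed by the question.

    Assumption: each search result is a dict carrying its passage under "text".
    Results without that key are skipped.
    """
    passages = [r["text"] for r in results if isinstance(r, dict) and "text" in r]
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    # Stand-in for the list returned by client.search_index(...).
    fake_results = [{"text": "Adam Silver is the NBA Commissioner"}]
    print(index_name("sportsknowledgebase", "minilml6"))
    print(build_prompt("Who is the NBA commissioner?", fake_results))
```

The prompt string can then be handed to whichever LLM framework the application uses.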

Podcast Summarization and Embedding

This example shows how to transcribe audio and create a pipeline that embeds the transcription. More details about audio use cases - https://docs.getindexify.ai/usecases/audio_extraction/

Create an Extraction Graph

from indexify import IndexifyClient, ExtractionGraph
client = IndexifyClient()

extraction_graph_spec = """
name: 'audiosummary'
extraction_policies:
   - extractor: 'tensorlake/asrdiarization'
     name: 'asrextractor'
   - extractor: 'tensorlake/summarization'
     name: 'summarizer'
     input_params:
        max_length: 400
        min_length: 300
        chunk_method: 'recursive'
     content_source: 'asrextractor'
   - extractor: 'tensorlake/minilm-l6'
     name: 'minilml6'
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
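Each policy in the graph reads either the ingested content or, when content_source is set, the output of another policy. The sketch below checks that every content_source names a defined policy and returns an order in which the policies could run. It is an illustration of the dependency structure only, not how Indexify schedules work; the policies are written as plain dicts mirroring the YAML, and execution_order is a hypothetical helper.

```python
def execution_order(policies: list) -> list:
    """Return policy names so that each one appears after its content_source.

    Policies without a content_source read the ingested content directly.
    Raises ValueError on an unknown source or a cycle.
    """
    names = {p["name"] for p in policies}
    for p in policies:
        source = p.get("content_source")
        if source is not None and source not in names:
            raise ValueError(f"{p['name']} reads from undefined policy {source!r}")

    ordered = []
    remaining = list(policies)
    while remaining:
        # A policy is ready once its source (if any) has already been placed.
        ready = [p for p in remaining
                 if p.get("content_source") is None or p["content_source"] in ordered]
        if not ready:
            raise ValueError("content_source references form a cycle")
        for p in ready:
            ordered.append(p["name"])
        remaining = [p for p in remaining if p["name"] not in ordered]
    return ordered


if __name__ == "__main__":
    policies = [
        {"name": "summarizer", "extractor": "tensorlake/summarization",
         "content_source": "asrextractor"},
        {"name": "asrextractor", "extractor": "tensorlake/asrdiarization"},
        {"name": "minilml6", "extractor": "tensorlake/minilm-l6"},
    ]
    print(execution_order(policies))
```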

Upload an Audio

import requests

# Download the sample audio file, then upload it to the 'audiosummary' graph.
with open("sample.mp3", "wb") as file:
    file.write(requests.get("https://extractor-files.diptanu-6d5.workers.dev/sample-000009.mp3").content)
content_id = client.upload_file("audiosummary", "sample.mp3")

Adding texts and files can take time, so ingestion is asynchronous by default, which allows operations to run in parallel. As a result, the retrieval calls that follow may fail until extraction has completed. To block until it finishes, call client.wait_for_extraction(content_id) with the content_id returned above.
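A minimal sketch of the blocking pattern, using only the two client calls named here, upload_file and wait_for_extraction. The upload_and_wait helper and the stub client are hypothetical; the stub exists only so the call order can be shown without a running server, and the id it returns is made up.

```python
def upload_and_wait(client, graph_name: str, path: str) -> str:
    """Upload a file, block until extraction finishes, and return its content id."""
    content_id = client.upload_file(graph_name, path)
    client.wait_for_extraction(content_id)
    return content_id


class RecordingClient:
    """Stand-in client that records calls instead of talking to a server."""

    def __init__(self):
        self.calls = []

    def upload_file(self, graph_name, path):
        self.calls.append(("upload_file", graph_name, path))
        return "content-123"  # made-up id for the demonstration

    def wait_for_extraction(self, content_id):
        self.calls.append(("wait_for_extraction", content_id))


if __name__ == "__main__":
    stub = RecordingClient()
    print(upload_and_wait(stub, "audiosummary", "sample.mp3"))
    print(stub.calls)
```

With a real IndexifyClient in place of the stub, the returned content_id is safe to pass to the retrieval calls because extraction has already finished.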

Retrieve Summary

client.get_extracted_content(content_id)

Search Transcription Index

context = client.search_index(name="audiosummary.minilml6.embedding", query="President of America", top_k=1)

Object Detection on Images

This example shows how to create a pipeline that performs object detection on images using the Yolo extractor. More details about Image understanding and retrieval - https://docs.getindexify.ai/usecases/image_retrieval/

Create an Extraction Graph

from indexify import IndexifyClient, ExtractionGraph
client = IndexifyClient()

extraction_graph_spec = """
name: 'imageknowledgebase'
extraction_policies:
   - extractor: 'tensorlake/yolo-extractor'
     name: 'object_detection'
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)

Upload Images

import requests

# Download the sample image, then upload it to the 'imageknowledgebase' graph.
with open("sample.jpg", "wb") as file:
    file.write(requests.get("https://extractor-files.diptanu-6d5.workers.dev/people-standing.jpg").content)
content_id = client.upload_file("imageknowledgebase", "sample.jpg")

Retrieve Features of an Image

client.get_extracted_content(content_id)

Query using SQL

result = client.sql_query("select * from ingestion where object_name='skateboard';")
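To make the shape of this query concrete, the sketch below runs an equivalent SELECT against an in-memory SQLite table standing in for the structured store. SQLite is swapped in purely for illustration: Indexify's actual store and schema are not described here, every column other than object_name is an assumption, and the rows are invented.

```python
import sqlite3


def make_demo_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the 'ingestion' table with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table ingestion (content_id text, object_name text, score real)")
    conn.executemany(
        "insert into ingestion values (?, ?, ?)",
        [
            ("img-1", "skateboard", 0.91),
            ("img-1", "person", 0.88),
            ("img-2", "person", 0.75),
        ],
    )
    return conn


def find_objects(conn: sqlite3.Connection, object_name: str) -> list:
    """Return rows whose object_name matches, using a bound parameter."""
    cursor = conn.execute(
        "select content_id, object_name, score from ingestion where object_name = ?",
        (object_name,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    print(find_objects(make_demo_store(), "skateboard"))
```

The bound parameter keeps user-supplied object names out of the SQL text, which is worth doing whenever the filter value comes from outside the program.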

PDF Extraction and Retrieval

This example shows how to create a pipeline that extracts from PDF documents. More information here - https://docs.getindexify.ai/usecases/pdf_extraction/

Create an Extraction Graph

from indexify import IndexifyClient, ExtractionGraph
client = IndexifyClient()

extraction_graph_spec = """
name: 'pdfqa'
extraction_policies:
   - extractor: 'tensorlake/pdf-extractor'
     name: 'docextractor'
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)

Upload a Document

import requests

# Download the sample PDF, then upload it to the 'pdfqa' graph.
with open("sample.pdf", "wb") as file:
    file.write(requests.get("https://extractor-files.diptanu-6d5.workers.dev/scientific-paper-example.pdf").content)
content_id = client.upload_file("pdfqa", "sample.pdf")

Get Text, Image and Tables

client.get_extracted_content(content_id)
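Since a PDF yields text, images and tables, it can help to bucket the extracted items before using them. The sketch below does that under a loud assumption: that get_extracted_content returns a list of dicts, each with a MIME type under "content_type". This README does not show the return shape, so inspect a real result first. group_by_type is a hypothetical helper and the sample items are invented.

```python
from collections import defaultdict


def group_by_type(items: list) -> dict:
    """Bucket items by the first part of their MIME type (text, image, ...).

    Assumption: each item is a dict with a "content_type" such as "text/plain".
    Items without one are grouped under "unknown".
    """
    groups = defaultdict(list)
    for item in items:
        mime = item.get("content_type", "unknown")
        groups[mime.split("/")[0]].append(item)
    return dict(groups)


if __name__ == "__main__":
    # Invented stand-in for the result of client.get_extracted_content(content_id).
    fake_items = [
        {"content_type": "text/plain"},
        {"content_type": "image/png"},
        {"content_type": "application/json"},
    ]
    for kind, members in group_by_type(fake_items).items():
        print(kind, len(members))
```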

LLM Framework Integration

Indexify can work with any LLM framework, or with your applications directly. We have an example of a Langchain application here and DSPy here.

Try out other extractors

Many other extractors are available. List them and try them out:

indexify-extractor list

Custom Extractors

Any extraction or transformation algorithm can be expressed as an Indexify Extractor. We provide an SDK to write your own. Please follow the docs here for instructions.
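To give a feel for what "an algorithm expressed as an extractor" means, here is a plain-Python sketch of a text-chunking transformation: content goes in, derived pieces with attached features come out. This is deliberately not the Extractor SDK's API. The Piece class and chunk_text function are hypothetical stand-ins, and the SDK documentation defines the real base classes and signatures.

```python
from dataclasses import dataclass, field


@dataclass
class Piece:
    """Hypothetical stand-in for a unit of content; not an SDK class."""
    data: str
    features: dict = field(default_factory=dict)


def chunk_text(piece: Piece, max_chars: int = 80) -> list:
    """Split text on whitespace into chunks of roughly max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    Each output piece records its length as a feature.
    """
    chunks, current = [], ""
    for word in piece.data.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return [Piece(data=c, features={"chars": len(c)}) for c in chunks]


if __name__ == "__main__":
    source = Piece("Adam Silver is the NBA Commissioner. Roger Goodell is the NFL commissioner.")
    for out in chunk_text(source, max_chars=40):
        print(out.features["chars"], out.data)
```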

Structured Data

Structured data produced by extractors, such as bounding boxes and object types, or line items of invoices, is stored in a structured store. You can query the extracted structured data using Indexify's SQL interface.

We have an example here

Contributions

Please open an issue to discuss new features, or join our Discord group. Contributions are welcome, and there are a number of open tasks we could use help with!

If you want to contribute to the Rust codebase, please read the developer readme.
