GitHub - Netruk44/repo-search: Search for code by what it does in natural language, using machine learning embeddings.

RepoSearch

Description

RepoSearch is a tool for searching through repositories of text and source code using natural language queries, based on embeddings from a custom-specified model.

The currently supported models are:

Instructor for local generation (default, GPU recommended but not required)
OpenAI Embeddings for remote generation

Example Usage

Local Repository

Generating embeddings from a local copy of the OpenMW (open source game engine) repository, then querying it.

$ repo_search generate openmw ~/Developer/openmw/apps
  <output trimmed for brevity>
100%|██████████████████████████████| 1386/1386 [05:53<00:00,  3.92it/s]

$ repo_search query openmw "Example of making an NPC navigate towards a specific destination."
  <output trimmed for brevity>
100%|██████████████████████████████| 1386/1386 [00:00<00:00, 3533.53it/s]

"Example of making an NPC navigate towards a specific destination."

89.55% match    openmw/mwmechanics/aipackage.cpp [25%-31% of the way through]
88.99% match    openmw/mwmechanics/aiwander.cpp [74%-78% of the way through]
87.65% match    openmw/mwmechanics/aipursue.cpp [33%-67% of the way through]
87.42% match    openmw/mwmechanics/aicombat.cpp [33%-38% of the way through]
87.19% match    openmw/mwmechanics/aitravel.cpp [40%-60% of the way through]
87.11% match    openmw/mwmechanics/pathfinding.cpp [82%-88% of the way through]
86.88% match    openmw/mwgui/dialogue.cpp [64%-68% of the way through]
86.81% match    openmw/mwmechanics/aipackage.hpp [40%-60% of the way through]
86.63% match    openmw/mwmechanics/aiwander.hpp [60%-80% of the way through]
86.30% match    openmw/mwmechanics/character.cpp [65%-66% of the way through]

Zip File Download + Embedding with OpenAI

Downloading the latest state of the Borg Backup repository from GitHub, generating embeddings using OpenAI Embeddings, then querying it.

$ export OPENAI_API_KEY=sk-...
$ repo_search generate borg https://github.com/borgbackup/borg/archive/refs/heads/master.zip --model_type openai
  <output trimmed for brevity>
100%|██████████████████████████████| 425/425 [02:17<00:00,  3.09it/s]

$ repo_search query borg "Code implementing file chunking and deduplication."
  <output trimmed for brevity>
100%|██████████████████████████████| 425/425 [00:00<00:00, 3524.80it/s]

"Code implementing file chunking and deduplication."

77.95% match    borg-master/scripts/fuzz-cache-sync/testcase_dir/test_simple [0%-100% of the way through]
77.67% match    borg-master/src/borg/chunker.pyx [0%-100% of the way through]
76.25% match    borg-master/docs/usage/notes.rst [0%-100% of the way through]
76.01% match    borg-master/docs/misc/internals-picture.txt [0%-100% of the way through]
75.92% match    borg-master/src/borg/hashindex.pyi [0%-100% of the way through]
75.88% match    borg-master/src/borg/chunker.pyi [0%-100% of the way through]
75.46% match    borg-master/src/borg/testsuite/chunker.py [0%-100% of the way through]
74.90% match    borg-master/src/borg/_chunker.c [0%-100% of the way through]
74.68% match    borg-master/src/borg/cache.py [50%-100% of the way through]
74.18% match    borg-master/src/borg/_hashindex.c [0%-100% of the way through]

Install

Install Steps

Open a terminal.
[Optional] Create a virtual/conda/whatever environment.
[Optional] Install requirements into your environment.
pip install git+https://github.com/Netruk44/repo-search
repo_search --help

Requirements

Pip should install missing requirements automatically, though you may want to install the following ahead of time to speed up the process:

Arguments

repo_search <generate|query> <repository_name> <arguments>

Argument	Description
`generate`	Generate embeddings for a repository.
`query`	Query a repository for files similar to the given query.
`repository_name`	The name for a collection of embeddings.

Optional Shared Arguments

Argument	Description
`--embeddings_dir`	The directory to store the generated embeddings in. Default: An `embeddings` directory located in the folder RepoSearch was installed to
`--verbose`	Whether or not to print verbose output. Default: Off

Generate Arguments

repo_search generate <repository_name> <repository_source> <arguments>

repository_source can be one of:

A path to a local directory
A path to a local zip file
A URL to a zip file to download
A URL to a GitHub repository to download (the main branch is downloaded)
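The four source kinds listed above can be told apart from the string alone. The following is a minimal sketch of one way to do that, not RepoSearch's actual logic: the function name `classify_source` and the returned labels are hypothetical, and the rules (a `.zip` suffix, a `github.com` host) are assumptions made for illustration.

```python
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Guess which kind of repository_source a string is (hypothetical helper)."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # A URL that points directly at a zip archive.
        if parsed.path.lower().endswith(".zip"):
            return "zip_url"
        # Any other URL on github.com is assumed to be a repository page.
        if parsed.netloc.lower() in ("github.com", "www.github.com"):
            return "github_url"
        raise ValueError(f"Unsupported URL: {source}")
    # Not a URL: treat it as a local path and decide by file extension.
    if source.lower().endswith(".zip"):
        return "local_zip"
    return "local_directory"


if __name__ == "__main__":
    for example in (
        "~/Developer/openmw/apps",
        "https://github.com/borgbackup/borg/archive/refs/heads/master.zip",
        "https://github.com/Netruk44/repo-search",
    ):
        print(example, "->", classify_source(example))
```

The zip check runs before the GitHub check so that an archive URL hosted on GitHub, like the Borg example above, is treated as a plain zip download.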

Argument	Description
`--model_type`	The type of model to use for generating or querying embeddings. See Available Model Types for more information. Default: `instructor`
`--model_name`	The name of the model to use for generating or querying embeddings. Options available depend on model type.

Query Arguments

repo_search query <repository_name> <query>

query is a string containing the query to search for.

How does it work?

For each file in the repository, the file's contents are sent to a customizable model (default: instructor-large) to generate an embedding. If a file is too long to be embedded in a single pass, it is split into smaller chunks and each chunk is embedded separately.
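The chunking step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the tool's real implementation: both function names are hypothetical, the split is by character count (the real limit is more likely measured in tokens), and `max_chars=2000` is an arbitrary placeholder. The second helper shows one plausible way a chunk's position could be turned into a range like the `[33%-67% of the way through]` seen in the example output, assuming equally sized chunks.

```python
def split_into_chunks(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into consecutive pieces of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # An empty file produces no chunks at all.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def chunk_span_percent(index: int, total: int) -> tuple[int, int]:
    """Approximate where chunk `index` (0-based) of `total` sits in the file."""
    if not 0 <= index < total:
        raise ValueError("index must be in range(total)")
    start = round(100 * index / total)
    end = round(100 * (index + 1) / total)
    return start, end


if __name__ == "__main__":
    chunks = split_into_chunks("x" * 5000)
    for i, chunk in enumerate(chunks):
        start, end = chunk_span_percent(i, len(chunks))
        print(f"chunk {i}: {len(chunk)} chars, {start}%-{end}% of the way through")
```

Each chunk would then be passed to the embedding model on its own, so a long file ends up with several embeddings.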

The retrieved embeddings are stored in a HuggingFace Datasets dataset. Check out the schema for more information about using the generated dataset.

Possible TODO: Index the embeddings using FAISS, which would allow fast nearest-neighbor searches for your queries.

Dataset Schema

The generated dataset consists of just two columns.

Column Name	Description
`file_path`	The path to the file that was embedded. Useful for displaying to the user.
`embeddings`	An array of embeddings for the file. An empty array indicates an error occurred when generating embeddings for the file. The array may have one or more embeddings within it, depending on the source file length.
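Given rows with the two columns above, a query could be answered by comparing the query's embedding against every chunk embedding and keeping the best chunk per file. The sketch below assumes cosine similarity as the score and plain Python lists as the embeddings; both are assumptions, and `rank_files` is a hypothetical helper rather than part of RepoSearch.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_files(rows, query_embedding, top_k=10):
    """Return (score, file_path, best_chunk_index) tuples, best match first."""
    results = []
    for row in rows:
        chunks = row["embeddings"]
        if not chunks:
            # An empty array means embedding generation failed for this file.
            continue
        scores = [cosine_similarity(query_embedding, chunk) for chunk in chunks]
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append((scores[best], row["file_path"], best))
    results.sort(key=lambda item: item[0], reverse=True)
    return results[:top_k]


if __name__ == "__main__":
    # Toy two-dimensional embeddings, for illustration only.
    rows = [
        {"file_path": "a.py", "embeddings": [[1.0, 0.0], [0.0, 1.0]]},
        {"file_path": "b.py", "embeddings": []},
        {"file_path": "c.py", "embeddings": [[1.0, 1.0]]},
    ]
    for score, path, chunk in rank_files(rows, [0.0, 1.0]):
        print(f"{score:.2%} match    {path} [chunk {chunk}]")
```

Scanning every chunk like this is linear in the size of the dataset, which is fine for a few thousand files.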

Available Model Types

--model_type specifies which model should be used to generate the embeddings. Currently there are two options: instructor and openai.

Instructor (Default)

--model_type instructor

By default, RepoSearch uses instructor-large to generate the embeddings.

--model_name:

Model Name	Description
`hkunlp/instructor-large`	The default model. Requires ~2.5 GB of VRAM to run.
`hkunlp/instructor-xl`	A larger version of the default model. Requires ~6 GB of VRAM.

OpenAI

--model_type openai

Note: Using this model type requires you to supply your own OpenAI API key!

export OPENAI_API_KEY=sk-...

Warning: You should not use this model with any extremely sensitive code or data! The contents of all files will be sent to OpenAI's API for embedding generation.

--model_name:

Model Name	Description
`text-embedding-ada-002`	The default model.

Cost: Cost per query is negligible, almost always less than 1/10th of a penny unless you're writing paragraphs of text.

Generating embeddings:

For the OpenMW repository (~9 MB of source files), generating embeddings costs ~$0.20 USD.

For the Borg Backup repository (<5 MB of source), generating embeddings costs ~$0.10 USD.
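For a rough idea of what another repository might cost, the two figures above can be extrapolated linearly. The sketch below assumes cost scales with source size at the rate implied by the OpenMW figure (~$0.20 for ~9 MB). That rate is derived only from the numbers in this README, not from OpenAI's price list, and the function name is hypothetical.

```python
# Rate implied by the OpenMW figure above: ~$0.20 USD for ~9 MB of source.
# This is an assumption for illustration; actual API pricing may differ.
USD_PER_MB = 0.20 / 9


def estimate_embedding_cost(size_mb: float, usd_per_mb: float = USD_PER_MB) -> float:
    """Estimate the cost in USD of embedding size_mb megabytes of source."""
    if size_mb < 0:
        raise ValueError("size_mb must be non-negative")
    return size_mb * usd_per_mb


if __name__ == "__main__":
    for size in (1, 5, 9, 50):
        print(f"{size} MB -> ~${estimate_embedding_cost(size):.2f} USD")
```

With this rate, a 5 MB repository lands slightly above $0.10, which is in line with the Borg Backup figure.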
