- A unified API for testing and integrating OpenAI and Hugging Face LLM models.
- Load models from Hugging Face with just a URL.
- Uses the llama.cpp server API rather than bindings, so this project will remain usable as long as the llama.cpp server API stays stable.
- Prebuilt agents - not chatbots - to unlock the true power of LLMs.
```rust
// Use an OpenAI model
let llm_definition = LlmDefinition::OpenAiLlm(OpenAiDef::Gpt35Turbo);

// Or use a model from Hugging Face
let llm_definition: LlmDefinition = LlmDefinition::LlamaLlm(LlamaDef::new(
    MISTRAL7BCHAT_MODEL_URL,
    LlamaPromptFormat::Mistral7BChat,
    Some(9001),  // Max tokens for model AKA context size
    Some(2),     // Number of threads to use for the server
    Some(22),    // Layers to load to GPU; dependent on available VRAM
    Some(false), // Start the llama.cpp server with the embedding flag disabled
    Some(true),  // Logging enabled
));

// llm_definition is already an LlmDefinition, so pass it by reference
let response = basic_text_gen::generate(&llm_definition, Some("Howdy!")).await?;
println!("{}", response);
```
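The `LlamaPromptFormat::Mistral7BChat` parameter above tells the client how to wrap a raw prompt in the model's instruction template before sending it to the server. As an illustration of what such a format does, here is a minimal sketch of a Mistral-style instruct template; the exact template string the crate emits is an assumption and may differ in detail:

```rust
// Illustrative sketch only: wrap a user message in a Mistral-style
// instruct template. Not necessarily the crate's exact formatting.
fn format_mistral_chat(user_message: &str) -> String {
    format!("<s>[INST] {} [/INST]", user_message)
}

fn main() {
    // The model sees the wrapped prompt, not the raw string.
    println!("{}", format_mistral_chat("Howdy!"));
}
```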
```rust
if !boolean_classifier::classify(
    llm_definition,
    Some(hopefully_a_list),
    Some("Is the attached feature a list of content split into discrete entries?"),
)
.await?
{
    panic!("{} was not properly split into a list!", hopefully_a_list)
}
```
```rust
let client_openai: ProviderClient =
    ProviderClient::new(&LlmDefinition::OpenAiLlm(OpenAiDef::EmbeddingAda002), None).await;

let _: Vec<Vec<f32>> = client_openai
    .generate_embeddings(
        &vec![
            "Hello, my dog is cute".to_string(),
            "Hello, my cat is cute".to_string(),
        ],
        Some(EmbeddingExceedsMaxTokensBehavior::Panic),
    )
    .await
    .unwrap();
```
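The returned `Vec<Vec<f32>>` can be compared directly; cosine similarity is the usual metric for embeddings. A self-contained sketch using only the standard library (no crate API assumed):

```rust
// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|).
// Values near 1.0 mean the embedded texts are semantically close.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Parallel vectors score 1.0; orthogonal vectors score 0.0.
    println!("{}", cosine_similarity(&[1.0, 2.0], &[2.0, 4.0]));
}
```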
- Currently with limited support for llama.cpp
```shell
# Start the server with a model from a Hugging Face URL
cargo run -p llm_client --bin server_runner start --model_url "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/blob/main/mistral-7b-instruct-v0.2.Q8_0.gguf"
# Once ready, the server logs: llama server listening at http://localhost:8080

# Stop the server
cargo run -p llm_client --bin server_runner stop

# Load a model with the model loader CLI
cargo run -p llm_client --bin model_loader_cli --model_url "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/blob/main/mistral-7b-instruct-v0.2.Q8_0.gguf"
```
async-openai is used to interact with the OpenAI API. A modified version of the async-openai crate is used for the llama.cpp server. If you just need an OpenAI API interface, I suggest using the async-openai crate.
Hugging Face's Rust client is used for model downloads from the Hugging Face Hub.
- Clone repo:
```shell
git clone https://github.com/ShelbyJenkins/llm_client.git
cd llm_client
```
- Optional: build the devcontainer from `llm_client/.devcontainer/devcontainer.json`. This will build out a dev container with NVIDIA dependencies installed.
- Add llama.cpp:
```shell
git submodule init
git submodule update
```
- Build llama.cpp (this is dependent on your hardware; please see the full instructions here):
```shell
# Example build for NVIDIA GPUs
cd llm_client/src/providers/llama_cpp/llama_cpp
make LLAMA_CUDA=1
```
- Test the llama.cpp `./server`:
```shell
cargo run -p llm_client --bin server_runner start --model_url "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/blob/main/mistral-7b-instruct-v0.2.Q8_0.gguf"
```
This will download and load the given model, and then start the server. When you see `llama server listening at http://localhost:8080`, you can load the llama.cpp UI in your browser. Stop the server with:
```shell
cargo run -p llm_client --bin server_runner stop
```
- Using OpenAI: add a `.env` file in the llm_client dir with the variable `OPENAI_API_KEY=<key>`.
- Handle the various prompt formats of LLM models more gracefully
- Unit tests
- Add additional classifier agents:
- many from many
- one from many
- Implement all OpenAI functionality with llama.cpp
- More external APIs (Claude, etc.)
This is my first Rust crate. All contributions and feedback are more than welcome!
Distributed under the MIT License. See LICENSE.txt for more information.
Shelby Jenkins - Here or LinkedIn