Welcome to the Retrieval Augmented Generation (RAG) repository! This project empowers users to perform Question-Answering (QnA) tasks over their own documents using the state-of-the-art RAG technique. By combining open-source Large Language Models (LLMs), LangChain, and FastAPI, we provide a powerful and user-friendly platform for document-based QnA.
RAG Pipeline source.
In this section, we'll guide you through setting up and running RAG for your document-based QnA. Follow these steps to get started:
Create a virtual Python environment in your local directory and activate it.
python3.9 -m venv llm_env/
source llm_env/bin/activate
- Clone this repository to your local machine.
git clone https://github.com/AshishSinha5/rag_api.git
cd rag_api
- Install the required Python packages.
pip install -r requirements.txt
- The project currently uses the plain C/C++ implementation of the Llama 2 model from the llama.cpp repository. The model can be downloaded from TheBloke's HuggingFace page.
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q8_0.gguf
We'll be using the Swagger UI (which comes bundled with the FastAPI library) to interact with our API.
cd src/rag_app
uvicorn main:app
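By default, uvicorn serves the app on 127.0.0.1:8000. If you need a different host or port, or want auto-reload while developing, you can pass the standard uvicorn flags (the values below are just examples):

uvicorn main:app --host 0.0.0.0 --port 8000 --reload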
In your favorite browser, go to the following link -
http://127.0.0.1:8000/docs
To upload our document we'll send a POST request (an example curl request is shown below). During the upload procedure the following parameters are required -

- `collection_name` - Name of the vector db where you want to upload your document. A new `db` will be created if it doesn't exist already, otherwise the document will be appended to the existing `db`.
- `file` - File to be uploaded. Currently only `pdf` and `html` files are supported.
Uploading Documents to vector_db.
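For example, a document upload from the command line might look like the request below. The endpoint path (`upload/`), the collection name `my_docs`, and the file name `my_document.pdf` are placeholders/assumptions - check the exact path and request schema in the Swagger UI at `/docs` before running it.

# NOTE: the endpoint path and parameter placement are assumptions - verify them in the Swagger UI
curl -X POST "http://127.0.0.1:8000/upload/?collection_name=my_docs" -H "accept: application/json" -F "file=@my_document.pdf"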
To perform the QnA on our documents we'll hit the `query/` endpoint of our API. We'll need the following parameters to perform our query -

- `query` - The query string.
- `n_result` - Number of most similar document chunks to load from our `vector_db` to create the relevant context for our `query`.
- `collection_name` - Name of the `vector_db` we want to query.
Query Documents Using the LLM.
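For example, a query from the command line might look like the request below. The HTTP method, the placement of the parameters in the URL query string, and the example values are assumptions - confirm the exact schema in the Swagger UI.

# NOTE: the HTTP method and parameter placement are assumptions - confirm in the Swagger UI
curl -X POST "http://127.0.0.1:8000/query/?query=What%20is%20this%20document%20about%3F&n_result=3&collection_name=my_docs" -H "accept: application/json"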
As we start our application the llama.cpp LLM gets initialized with the default parameters. But we may wish to configure the LLM model to our liking. We can use the `init_llm/` endpoint to configure the model. Currently the following parameters are available to configure -

- `n_gpu_layers` - Number of layers to load on the GPU.
- `n_ctx` - Token context window.
- `n_batch` - Number of tokens to process in parallel. Should be a number between 1 and `n_ctx`.
- `max_tokens` - The maximum number of tokens to generate.
- `temperature` - Temperature for sampling. Higher values mean more random samples.
As the `llama.cpp` model allows for more configurable parameters, they may be added in the future.
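For example, re-initializing the LLM from the command line might look like the request below. The parameter values are only illustrative, and passing them as URL query parameters (rather than a JSON body) is an assumption - verify the exact schema in the Swagger UI.

# NOTE: parameter placement and example values are assumptions - verify in the Swagger UI
curl -X POST "http://127.0.0.1:8000/init_llm/?n_gpu_layers=0&n_ctx=2048&n_batch=512&max_tokens=256&temperature=0.7" -H "accept: application/json"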