Multi Mediawiki RAG Chatbot

Chatbots are very popular right now, and much of the openly accessible information on the web is stored in some form of Mediawiki. Building a Retrieval-Augmented Generation (RAG) chatbot over that data is becoming a powerful alternative to traditional data gathering. This project provides a basic template for creating your own chatbot that runs locally on Linux.

About

Mediawikis hosted by Fandom usually allow you to download an XML dump of the entire wiki as it currently exists. This project primarily leverages Langchain, along with a few other open source projects, to combine many of the readily available quickstart guides into a complete vertical application built on Mediawiki data.
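The page structure inside such an XML dump can be illustrated with the standard library alone. The project itself uses Langchain's MWDumpLoader to do this; the sketch below only shows the export format, and the sample XML content is invented.

```python
# Sketch: parsing the page structure of a Mediawiki XML export with the
# stdlib. The real project uses Langchain's MWDumpLoader instead; the
# sample dump below is invented for illustration.
import xml.etree.ElementTree as ET

SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Tako</title>
    <ns>0</ns>
    <revision><text>An intelligent race of octopuses.</text></revision>
  </page>
</mediawiki>"""

NS = {"mw": "http://www.mediawiki.org/xml/export-0.10/"}

def extract_pages(xml_text: str) -> list[tuple[str, str]]:
    """Return (title, text) pairs for namespace-0 (article) pages in a dump."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.findall("mw:page", NS):
        if page.findtext("mw:ns", namespaces=NS) != "0":
            continue  # skip non-article namespaces, matching the wikiteam3 advice below
        title = page.findtext("mw:title", namespaces=NS)
        text = page.findtext("mw:revision/mw:text", namespaces=NS)
        pages.append((title, text))
    return pages

print(extract_pages(SAMPLE_DUMP))
# → [('Tako', 'An intelligent race of octopuses.')]
```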

Architecture

```mermaid
graph TD;
    a[/xml dump a/] --MWDumpLoader--> emb
    b[/xml dump b/] --MWDumpLoader--> emb
    emb{Embedding} --> db
    db[(Chroma)] --Document Retriever--> lc
    hf(Huggingface) --Sentence Transformer--> emb
    hf --LLM--> modelfile
    modelfile[/Modelfile/] --> Ollama
    Ollama(((Ollama))) <-.ChatOllama.-> lc
    Mem[(Memory)] <--Chat History--> lc
    lc{Langchain} <-.LLMChain.-> cl(((Chainlit)))
    click db href "https://github.com/chroma-core/chroma"
    click hf href "https://huggingface.co/"
    click cl href "https://github.com/Chainlit/chainlit"
    click lc href "https://github.com/langchain-ai/langchain"
    click Ollama href "https://github.com/jmorganca/ollama"
```

Runtime Filesystem

```text
multi-mediawiki-rag
├── .chainlit
│   ├── .langchain.db # Server Cache
│   └── config.toml # Server Config
├── chainlit.md
├── config.yaml
├── data # VectorDB
│   ├── 47e4e036-****-****-****-************
│   │   └── *
│   └── chroma.sqlite3
├── app.py
├── discord.py
├── embed.py
├── memory
│   └── cache.db # Chat Cache
└── model
    └── sentence-transformers_all-mpnet-base-v2
        └── *
```

Quickstart

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

These steps assume you are using a modern Linux OS like Ubuntu 22.04 with Python 3.10+.

```sh
apt-get install -y curl git python3-venv sqlite3
git clone https://github.com/tylertitsworth/multi-mediawiki-rag.git
cd multi-mediawiki-rag
curl https://ollama.ai/install.sh | sh
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip setuptools wheel
pip install -r requirements.txt
```
1. Run the above setup steps.
2. Download a Mediawiki's XML dump by browsing to `/wiki/Special:Statistics` or by using a tool like wikiteam3.
    1. If downloading directly, download only the current pages, not the entire history.
    2. If using wikiteam3, scrape only namespace 0.
    3. Provide the dump in the following format: `sources/<wikiname>_pages_current.xml`.
3. Edit `config.yaml` with the location of the XML Mediawiki data you downloaded in step 2 and any other configuration information.
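A sketch of what `config.yaml` might contain is shown below. The exact keys are defined by this project's configuration schema; treat every field name and value here as an assumption, not documentation.

```yaml
# Hypothetical config.yaml sketch -- field names are assumptions, check the
# repository's actual config.yaml for the real schema.
model: volo                # Ollama model name created from the Modelfile
source: sources            # directory containing <wikiname>_pages_current.xml
mediawikis:                # one entry per XML dump to embed
  - forgottenrealms
  - eberron
embeddings_model: sentence-transformers/all-mpnet-base-v2
```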

Caution

Installing Ollama with the install script creates a new user and a service on your system. Follow the manual installation steps to avoid this, and instead launch the Ollama API with `ollama serve`.

Create Custom LLM

After installing Ollama we can use a Modelfile to download and tune an LLM to be more precise for Document Retrieval QA.

```sh
ollama create volo -f ./Modelfile
```

Tip

Choose a model from the Ollama model library and download with ollama pull <modelname>:<version>, then edit the model field in config.yaml with the same information.
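For reference, an Ollama Modelfile uses `FROM`, `PARAMETER`, and `SYSTEM` directives. The base model, parameter values, and system prompt below are illustrative assumptions, not the project's actual Modelfile.

```text
# Illustrative Modelfile sketch -- base model and values are assumptions.
FROM mistral:7b
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
SYSTEM """Answer questions using only the retrieved wiki passages provided as context."""
```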

Use Model from Huggingface

  1. Download a model of choice from Huggingface with git clone https://huggingface.co/<org>/<modelname> model/<modelname>.
  2. If your model of choice is not in GGUF format, convert it with docker run --rm -v $PWD/model/<modelname>:/model ollama/quantize -q q4_0 /model.
  3. Modify the Modelfile's FROM line to contain the path to the q4_0.bin file in the modelname directory.

Create Vector Database

Your XML data needs to be loaded and transformed into embeddings to create a Chroma VectorDB.

```sh
python embed.py
```

Expected Output

```text
2023-12-16 09:50:53 - Loaded .env file
2023-12-16 09:50:55 - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
2023-12-16 09:51:18 - Use pytorch device: cpu
2023-12-16 09:56:09 - Anonymized telemetry enabled. See
https://docs.trychroma.com/telemetry for more information.
Batches: 100%|████████████████████████████████████████| 1303/1303 [1:23:14<00:00,  3.83s/it]
...
Batches: 100%|████████████████████████████████████████| 1172/1172 [1:04:08<00:00,  3.28s/it]
2023-12-16 19:47:01 - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
2023-12-16 19:47:33 - Use pytorch device: cpu
Batches: 100%|████████████████████████████████████████████████| 1/1 [00:00<00:00, 40.41it/s]
A Tako was an intelligent race of octopuses found in the Kara-Tur setting. They were known for
their territorial nature and combat skills, as well as having incredible camouflaging abilities
that allowed them to blend into various environments. Takos lived in small tribes with a
matriarchal society led by one or two female rulers. Their diet consisted mainly of crabs,
lobsters, oysters, and shellfish, while their ink was highly sought after for use in calligraphy
within Kara-Tur.
```
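Before embedding, long pages are split into overlapping chunks. The project uses Langchain's text splitters for this; the stdlib-only sketch below just shows the sliding-window idea, and the chunk size and overlap values are illustrative.

```python
# Sketch of the chunking step that precedes embedding. The project itself
# uses Langchain's text splitters; this stdlib-only version only shows the
# sliding-window idea. chunk_size/overlap values are illustrative.
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most chunk_size chars, overlapping by overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance by chunk_size minus the shared overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

page = "A Tako was an intelligent race of octopuses found in the Kara-Tur setting. " * 4
chunks = chunk_text(page, chunk_size=100, overlap=20)
print(len(chunks), all(len(c) <= 100 for c in chunks))
```

Overlap keeps sentences that straddle a chunk boundary retrievable from either side, at the cost of some duplicated text in the vector database.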

Add Different Document Type to DB

Choose a new File type Document Loader or App Document Loader and add it using your own script. Check out the provided Example.
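Langchain loaders expose a `load()` method that returns Documents carrying `page_content` and `metadata`. The hypothetical loader below mimics that shape with a plain dataclass so the pattern is clear without the dependency; it is a sketch, not the project's actual example.

```python
# Hypothetical custom document loader sketch. It mimics the shape of a
# Langchain loader (load() returning Documents with page_content/metadata)
# using only the standard library.
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

class PlainTextLoader:
    """Load every .txt file under a directory as one Document each."""
    def __init__(self, directory: str):
        self.directory = Path(directory)

    def load(self) -> list[Document]:
        return [
            Document(page_content=p.read_text(), metadata={"source": str(p)})
            for p in sorted(self.directory.glob("*.txt"))
        ]
```

The resulting Documents can then be embedded into the same Chroma database alongside the Mediawiki pages.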

Start Chatbot

```sh
chainlit run app.py -h
```

Access the Chatbot GUI at http://localhost:8000.

Start Discord Bot

```sh
export DISCORD_BOT_TOKEN=...
chainlit run app.py -h
```

Tip

Develop locally with ngrok.

Hosting

This chatbot is hosted for free on Huggingface Spaces, which means it is very slow due to the minimal hardware resources allocated to it. Despite this, the provided Dockerfile offers a generic method for hosting the solution as one unified container; however, this approach is not ideal and can lead to many issues if used in professional production systems.

Testing

Cypress

Cypress tests modern web applications with visual debugging. It is used to test the Chainlit UI functionality.

```sh
npm install
# Run Test Suite
bash cypress/test.sh
```

Note

Cypress requires node >= 16.

Pytest

Pytest is a mature full-featured Python testing tool that helps you write better programs.

```sh
pip install pytest
# Test Embedding Functions
pytest test/test_embed.py -W ignore::DeprecationWarning
# Test e2e with Ollama Backend
pytest test -W ignore::DeprecationWarning
```

License

FOSSA Status