CourtBot is a chat bot which can access all decisions of the Supreme Court of the United States. It can answer questions based on the information in these decisions and cite the specific decisions which it used to answer the question.
pip install -r requirements.txt
./run.sh
[-s : scrape the supreme court cases]
[-e : to embed court decisions]
[-d : start backend]
[-c cases|db : delete all cases or database]
- Run
./run.sh -s
scrapes the supreme court cases from FindLaw and prepossesses them for embedding. The time taken to scrape the data is dependent on the number of threads which you allocate and it takes around 3 hours with 50 threads. The data is stored inSupremeCourtCases
directory as txt files. - Run
./run.sh -e
to embed the court cases for the Chroma database. This process is run on the gpu but could be run on the cpu. When running on the gpu this process takes about 14 hours to embed all 21 thousand cases. The Chroma database is saved in the.chroma
directory. - Run
./run.sh -d
to start the server. All queries to the vector database and the chat bot are made through this api.
- The data for the Supreme Court cases is scraped from FindLaw using Selenium and a chromium web driver.
- The embeddings are calculated using all-mpnet-base-v2 which is a fine tuned version of Microsoft's Masked and Permuted Language Modeling (MPNet) (arXiv).
- The vector database used is Chroma. Chroma is a vector database which allows you to search for similar vectors. It is built on top of DuckDB and Apache Parquet.
- The chat bot is built using Microsoft's Grounded Open Dialogue Language Model (GODEL) (arXiv). This model is trained on 551 million dialogs from Reddit and 5 million instruction and knowledge dialogs.
- The api is built using FastAPI which uses a uvicorn web server.