Easily generate synthetic data for classification tasks using LLMs

DistillClassifier

About

DistillClassifier is a tool built on top of LLM-VM that makes it easy to generate synthetic data for classification tasks using LLMs. The generated data can then be used to distill an LLM's knowledge of a classification task into much smaller and faster-to-run classification models.

This project was built for the ANARCHY October 2023 Hackathon. Check out ANARCHY on their GitHub and website.

Team Members:

Setup

clone the project from GitHub

git clone https://github.com/daspartho/DistillClassifier

cd into the project

cd DistillClassifier

install LLM-VM

git clone https://github.com/anarchy-ai/LLM-VM.git
cd LLM-VM
pip3 install .
cd ..

install python dependencies

pip3 install -r requirements.txt

create a .env file and set your OpenAI API key (if you want to use OpenAI models) and your Hugging Face Hub token (if you want to push the dataset to Hugging Face):

OPENAI_API_KEY=
HF_HUB_TOKEN=
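Before running the tool, it can help to confirm that the .env file actually contains both values. The following is a minimal sketch using only the Python standard library; the helper names `load_env_file` and `missing_keys` are hypothetical and are not part of DistillClassifier, which may load the file through a different mechanism.

```python
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values, required=("OPENAI_API_KEY", "HF_HUB_TOKEN")):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env = load_env_file()
    for key in missing_keys(env):
        print(f"{key} is not set in .env")
```

Only the key you need has to be filled in: a missing HF_HUB_TOKEN matters only if you plan to push the dataset, and a missing OPENAI_API_KEY only if you use OpenAI models.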

Run

You can run the tool from the command line like this:

python3 generation.py <columns> <n_examples> [-m <model>] [-f <filename>] [-r <repo>]

Arguments:

  • <columns>: Column information as a dictionary.
  • <n_examples>: Number of examples to be generated.
  • -m, --model: (Optional) Model name. Defaults to "chat_gpt".
  • -f, --filename: (Optional) Dataset filename. Defaults to "dataset.json".
  • -r, --repo: (Optional) Hugging Face repo ID. Defaults to "None".
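The arguments above map naturally onto Python's `argparse`. The sketch below shows one way such an interface could be declared; it is an illustration, not the actual contents of generation.py. Parsing `<columns>` with `json.loads` is an assumption, and `build_parser` is a hypothetical name.

```python
import argparse
import json


def build_parser():
    """Build a parser mirroring the documented command-line interface."""
    parser = argparse.ArgumentParser(
        description="Generate synthetic classification data with an LLM."
    )
    # Positional arguments, in the order shown in the usage line.
    # Assumption: the column dictionary is passed as a JSON string.
    parser.add_argument("columns", type=json.loads,
                        help="Column information as a dictionary.")
    parser.add_argument("n_examples", type=int,
                        help="Number of examples to be generated.")
    # Optional arguments with the documented defaults.
    parser.add_argument("-m", "--model", default="chat_gpt",
                        help="Model name.")
    parser.add_argument("-f", "--filename", default="dataset.json",
                        help="Dataset filename.")
    parser.add_argument("-r", "--repo", default=None,
                        help="Hugging Face repo ID.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.columns, args.n_examples, args.model, args.filename, args.repo)
```

With this layout, an invalid dictionary or a non-integer example count is rejected by `argparse` with a usage message before any generation starts.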

Example:

python3 generation.py '{"text": "either spoiler or not spoiler text", "label": "if text is spoiler or not"}' 25 -m 'chat_gpt' -f 'dataset.json' -r 'spoiler_or_not'
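Quoting a dictionary by hand on the shell is easy to get wrong. As a hedged alternative, the command can be assembled from a Python dict; the sketch below assumes the column dictionary is accepted as JSON, and `build_command` is a hypothetical helper that is not part of the project.

```python
import json
import shlex
import sys


def build_command(columns, n_examples, model="chat_gpt",
                  filename="dataset.json", repo=None):
    """Return the argument list for one generation.py run."""
    command = [
        sys.executable, "generation.py",
        json.dumps(columns),  # the column dictionary as a single argument
        str(n_examples),
        "-m", model,
        "-f", filename,
    ]
    # Only pass -r when a Hugging Face repo ID was given.
    if repo is not None:
        command += ["-r", repo]
    return command


columns = {
    "text": "either spoiler or not spoiler text",
    "label": "if text is spoiler or not",
}
command = build_command(columns, 25, repo="spoiler_or_not")

# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(command))
# To run it directly instead: subprocess.run(command, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting entirely, since each element is delivered to the script as one argument.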

or run the demo.py file directly:

python3 demo.py

example output dataset:

LICENSE

MIT