ManiTest is a simplified version of the EleutherAI LM evaluation harness that uses Manifest as its backend model server.
- The EleutherAI LM evaluation harness lets you evaluate a large language model (LLM) on tasks formulated as prompts.
- Manifest is a model server that enables fast inference via built-in support for HuggingFace Parallelize, Accelerate, DeepSpeed, and BitsAndBytes.
Note: We use our own fork of Manifest, which we hope to merge back into the main repo soon.
pip3 install manitest "git+https://github.com/som-shahlab/manifest.git@eval-michael#egg=manifest-ml[api]"
To run the eval harness, you must first have a Manifest server running in the background with your desired model. You can then run the eval harness on your desired task. ManiTest comes with a couple of tasks pre-loaded in the `manitest.tasks` module, such as `manitest.tasks.mednli` and `manitest.tasks.scitail`.
# Run Manifest server with your desired model
python3 -m manifest.api.app \
--model_type huggingface \
--model_name_or_path gpt2 \
--model_generation_type text-generation \
--port 5001 &
# Run ManiTest evaluation harness on your desired task
# Note: To run the MedNLI task, you must first download the dataset from: https://physionet.org/content/mednli/1.0.0/
python3 -m manitest.main \
--manifest_url http://127.0.0.1:5001 \
--path_to_task manitest.tasks.mednli \
--output_dir ./ignore \
--data_dir /Users/mwornow/Downloads/mednli-a-natural-language-inference-dataset-for-the-clinical-domain-1.0.0/ \
--dataset_splits test
# Test a few-shot prompting setup
python3 src/manitest/main.py \
--manifest_url http://127.0.0.1:5001 \
--path_to_task manitest.tasks.mednli_fewshot \
--output_dir ./ignore \
--data_dir /Users/mwornow/Downloads/mednli-a-natural-language-inference-dataset-for-the-clinical-domain-1.0.0/ \
--dataset_splits test \
--n_shots 3
If you're using a causal LM (e.g. GPT, OPT, Llama, BLOOM)...
- Run Manifest with the `--model_generation_type text-generation` flag

If you're using a seq2seq LM (e.g. T5, T0)...
- Run Manifest with the `--model_generation_type text2text-generation` flag
We recommend starting from the `task_template.py` file as a template. You can also view `tests/mednli/mednli.py` or `tests/scitail/scitail.py` for worked-out examples of tasks.
To create your own task, you must...

- Create a file called `your_task.py`. You can save this anywhere.
- Create a `Task` class that inherits from `manifest.base.Task`. It must define three attributes (`name`, `task_type`, and `prompts`) and one method (`load_dataset()`).
from typing import Optional

from datasets import DatasetDict

from base import Task, TaskType

class YourTask(Task):
    name: str = "Your Task Name"
    task_type: TaskType = TaskType.GENERATION
    prompts: list = []  # list of Prompt objects associated with this task

    def load_dataset(self, dataloader: Optional[str], data_dir: Optional[str]) -> DatasetDict:
        # Load your dataset here
        return DatasetDict()
- Create a `Prompt` class that inherits from `manifest.base.Prompt` for each individual prompt associated with your task. It must define one attribute (`name`) and two methods (`generate_prompt()` and `get_label()`).
from base import Prompt

class YourPrompt(Prompt):
    name: str = "Some globally unique name for this prompt"

    def generate_prompt(self, example: dict) -> str:
        """Takes a dataset example and returns a prompted version of that example."""
        return f"Premise: {example['premise']}\nHypothesis: {example['hypothesis']}. Does the premise entail the hypothesis?"

    def get_label(self, example: dict) -> str:
        """Gets the ground-truth label for a dataset example."""
        return example['true_label']
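As a sanity check, you can exercise a prompt's two methods on a sample example before wiring it into a task. The snippet below uses a stand-in `Prompt` base class so it runs on its own (in ManiTest you would inherit from `manifest.base.Prompt`), and the class and field names are illustrative:

```python
class Prompt:  # stand-in for manifest.base.Prompt, for a self-contained demo
    name: str = ""

class MedNLIPrompt(Prompt):
    name: str = "mednli-entailment-v1"  # hypothetical prompt name

    def generate_prompt(self, example: dict) -> str:
        """Turn a dataset example into the text sent to the model."""
        return (
            f"Premise: {example['premise']}\n"
            f"Hypothesis: {example['hypothesis']}. "
            "Does the premise entail the hypothesis?"
        )

    def get_label(self, example: dict) -> str:
        """Return the ground-truth label used for scoring."""
        return example["true_label"]

example = {
    "premise": "The patient has a fever.",
    "hypothesis": "The patient is sick.",
    "true_label": "entailment",
}
prompt = MedNLIPrompt()
print(prompt.generate_prompt(example))
# Premise: The patient has a fever.
# Hypothesis: The patient is sick. Does the premise entail the hypothesis?
```

The harness calls `generate_prompt()` on each example to build the model input and `get_label()` to score the model's output against the ground truth.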
- Run the evaluation harness with your task. This assumes a Manifest server is already running on `localhost:5000`:
python3 main.py \
--manifest_url http://localhost:5000 \
--path_to_task path/to/your_task.py \
--output_dir ./ignore
Installation:
# Download repo
git clone https://github.com/som-shahlab/manitest
cd manitest
# Create virtual environment + install dependencies
conda create --name manitest_env python=3.10 -y
conda activate manitest_env
poetry install
If you are running this on a computer without internet access (e.g. Stanford Nero), you will need to download the HuggingFace dataset, dataloader, and model that you want to use.
Assuming you've downloaded these, your commands will look like the following:
# --model_name_or_path points to a locally downloaded HuggingFace model
python3 -m manifest.api.app \
--model_type huggingface \
--model_name_or_path /local-scratch-nvme/nigam/huggingface/pretrained/gpt2-small \
--model_generation_type text-generation

# --data_dir points to a locally downloaded HuggingFace dataset
# --dataloader points to a locally downloaded HuggingFace dataloader
python3 main.py \
--manifest_url http://127.0.0.1:5000 \
--path_to_task tests/mednli/mednli.py \
--output_dir ./ignore \
--data_dir /local-scratch/nigam/projects/clinical_llm/data/mednli/ \
--dataset_splits test \
--dataloader /local-scratch/nigam/projects/clinical_llm/dataloaders/mednli/mednli.py
- Combine the Manifest command and the `main.py` command into a single command
- Merge the Manifest fork back into the main repo
- Support specifying multiple tasks in a single run
- Pretty-print / format results
- Support passing text generation flags to `main.py`, and pass these along to Manifest
- Add tests
- Documentation
- Multi-label classification task support
- Few-shot in-context examples
- Multi-class classification task support
- Text generation task support
- Test case with a MedNLI replacement classification task
- Abstract things to work outside Nero (e.g. load HF models from the Hub, load local HF models, hit external APIs)
- Clean up output to be more user-friendly
- Convert `.yaml` prompt files to `.py` `Task` files
To run pre-commit checks:
pre-commit run --all-files
To run tests:
pytest tests