Benchmarking Anthropic's Tool Use Beta API

You can find a discussion of the results in the blog post and the details of the experiments here.

TL;DR:

  • Haiku is the best model for tool use when only a single function call should be generated.
  • However, when you need parallel tool use, GPT-4 Turbo is still the best model.
  • Notably, GPT-3.5 Turbo appears biased toward generating multiple function calls in parallel, whether or not that is required.
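To make the single vs. parallel distinction concrete, here is a minimal sketch (not code from this repo) of what a tool definition looks like in Anthropic's tool use API, and a toy helper that counts how many `tool_use` blocks a response contains. The `get_weather` tool is a hypothetical example; a benchmark like this one would compare the number and content of generated calls against the dataset's ground truth.

```python
# Hypothetical example tool in the schema shape Anthropic's tool use API expects.
get_weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def count_tool_calls(content_blocks):
    """Count the tool_use blocks in a model response's content list."""
    return sum(1 for b in content_blocks if b.get("type") == "tool_use")

# A single-call response vs. a parallel (multi-call) response:
single = [
    {"type": "tool_use", "name": "get_weather", "input": {"city": "Paris"}},
]
parallel = single + [
    {"type": "tool_use", "name": "get_weather", "input": {"city": "Tokyo"}},
]
```

A parallel-tool-use test case expects `count_tool_calls` to be greater than one; the bias noted above means GPT-3.5 Turbo tends to land there even on single-call cases.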

Results

Prepare Data

Following the Gorilla repo, download the data from Hugging Face into the ./data folder:

huggingface-cli download gorilla-llm/Berkeley-Function-Calling-Leaderboard --local-dir ./data --repo-type dataset

Then, manually download the possible answers into data/possible_answer.
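Once downloaded, the leaderboard files can be read as JSON lines (one JSON object per line). A minimal loader sketch, assuming that layout; the filename in the comment is illustrative, not a guaranteed path:

```python
import json

def load_jsonl(path):
    """Yield one parsed record per non-empty line of a JSON-lines file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# e.g. records = list(load_jsonl("data/<some_test_file>.json"))
```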

Reproduce Results

  1. Install the requirements: pip install -r requirements.txt
  2. Get a Parea API key from here.
  3. Copy the .env.example file to .env and fill in the API keys for Parea, OpenAI & Anthropic.
  4. Run the experiments: python3 experiment.py
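For step 3, the filled-in `.env` might look like the following (variable names assumed from each SDK's common conventions; substitute your own keys):

```shell
PAREA_API_KEY=pai-...
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```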
