Benchmarking Anthropic's Tool Use Beta API

You can find a discussion of the results in the blog post and the details of the experiments here.

TL;DR:

  • Haiku is the best model for tool use when only a single function call should be generated.
  • However, when you need parallel tool use, GPT-4 Turbo is still the best model.
  • Notably, GPT-3.5 Turbo appears biased toward generating multiple function calls in parallel, whether or not that is required.
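To make the single vs. parallel distinction concrete, here is a minimal sketch (not code from this repo) of what a tool definition looks like in Anthropic's tool use API, and a toy helper that counts how many `tool_use` blocks a response contains. The `get_weather` tool is a hypothetical example; a benchmark like this one would compare the number and content of generated calls against the dataset's ground truth.

```python
# Hypothetical example tool in the schema shape Anthropic's tool use API expects.
get_weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def count_tool_calls(content_blocks):
    """Count the tool_use blocks in a model response's content list."""
    return sum(1 for b in content_blocks if b.get("type") == "tool_use")

# A single-call response vs. a parallel (multi-call) response:
single = [
    {"type": "tool_use", "name": "get_weather", "input": {"city": "Paris"}},
]
parallel = single + [
    {"type": "tool_use", "name": "get_weather", "input": {"city": "Tokyo"}},
]
```

A parallel-tool-use test case expects `count_tool_calls` to be greater than one; the bias noted above means GPT-3.5 Turbo tends to land there even on single-call cases.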

Results

Prepare Data

Following the Gorilla repo, download the data from Hugging Face into the ./data folder:

huggingface-cli download gorilla-llm/Berkeley-Function-Calling-Leaderboard --local-dir ./data --repo-type dataset

Then, manually download the possible answers into data/possible_answer.
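Once downloaded, the leaderboard files can be read as JSON lines (one JSON object per line). A minimal loader sketch, assuming that layout; the filename in the comment is illustrative, not a guaranteed path:

```python
import json

def load_jsonl(path):
    """Yield one parsed record per non-empty line of a JSON-lines file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# e.g. records = list(load_jsonl("data/<some_test_file>.json"))
```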

Reproduce Results

  1. Install the requirements: pip install -r requirements.txt
  2. Get a Parea API key from here.
  3. Copy the .env.example file to .env and fill in the API keys for Parea, OpenAI & Anthropic.
  4. Run the experiments: python3 experiment.py
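For step 3, the filled-in `.env` might look like the following (variable names assumed from each SDK's common conventions; substitute your own keys):

```shell
PAREA_API_KEY=pai-...
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```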
