
The Official Repo of ML-Bench: Evaluating Large Language Models for Code Generation in Repository-Level Machine Learning Tasks





ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

📖 Paper • 🚀 Github Page • 📊 Data



📋 Prerequisites

To clone this repository with all its submodules, use the --recurse-submodules flag:

git clone --recurse-submodules
cd ML-Bench

If you have already cloned the repository without the --recurse-submodules flag, you can run the following commands to fetch the submodules:

git submodule update --init --recursive

Then run pip install -r requirements.txt

🦙 ML-LLM-Bench

📋 Prerequisites

After cloning the submodules, run

cd utils

bash to generate the full and quarter benchmarks as merged_full_benchmark.jsonl and merged_quarter_benchmark.jsonl
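To sanity-check the generated files, a short Python sketch like the following can count the records in each one. It assumes only that each file holds one JSON object per line (the usual JSONL layout); the helper name `count_jsonl_records` is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Count the JSON objects in a JSONL file, skipping blank lines."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as error:
                # Report the offending line so a truncated file is easy to spot.
                raise ValueError(f"{path}: invalid JSON on line {line_number}") from error
            count += 1
    return count


if __name__ == "__main__":
    # File names come from the step above; adjust the paths to where they were written.
    for name in ("merged_full_benchmark.jsonl", "merged_quarter_benchmark.jsonl"):
        if Path(name).exists():
            print(name, count_jsonl_records(name))
        else:
            print(name, "not found")
```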

🌍 Environment Setup

To run the ML-LLM-Bench Docker container, you can use the following command:

docker pull
docker run -it -v ML_Bench:/deep_data /bin/bash

To download model weights and prepare files, run

cd utils

Preparing them automatically may take about 2 hours.

🛠️ Usage

Place your results in the utils/results directory, and update the --result_path in with your path. Also update the log path.

Then run bash

You can check the run logs in your log file, view the overall results in eval_total_user.jsonl, and see the per-repository results in eval_result_user.jsonl.

The JSONL files whose names start with eval_result and eval_total contain partial execution results from our paper.

  • The `utils/results` folder includes the model-generated outputs we used for testing.
  • The `utils/exec_logs` folder stores our execution logs.
  • The `` file is not for users; it stores the code written by models.
  • The execution process may also generate new, unnecessary files.

📞 API Calling

To reproduce OpenAI's performance on this task, use the following script:

bash script/openai/

You need to change the parameter settings in script/openai/

  • type: Choose from quarter or full.
  • model: Model name.
  • input_file: File path of the dataset.
  • answer_file: Original answer in JSON format from GPT.
  • parsing_file: Post-process the output of GPT in JSONL format to obtain executable code segments.
  • readme_type: Choose from oracle_segment and readme.
    • oracle_segment: The code paragraph in the README that is most relevant to the task.
    • readme: The entire text of the README in the repository where the task is located.
  • engine_name: Choose from gpt-35-turbo-16k and gpt-4-32.
  • n_turn: Number of executable code samples GPT returns (5 in the paper's experiments).
  • openai_key: Your OpenAI API key.
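Before launching a run, the settings above can be checked with a small helper. The following is a minimal sketch, assuming the settings are held in a Python dict keyed by the parameter names listed above; `validate_settings` is a hypothetical helper, not part of this repository, and the script itself may read its parameters differently.

```python
# Allowed values are taken from the parameter list above.
ALLOWED_VALUES = {
    "type": {"quarter", "full"},
    "readme_type": {"oracle_segment", "readme"},
    "engine_name": {"gpt-35-turbo-16k", "gpt-4-32"},
}

REQUIRED_KEYS = (
    "type", "model", "input_file", "answer_file", "parsing_file",
    "readme_type", "engine_name", "n_turn", "openai_key",
)


def validate_settings(settings):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in settings or settings[key] in ("", None):
            problems.append(f"missing value for {key}")
    for key, allowed in ALLOWED_VALUES.items():
        value = settings.get(key)
        if value not in (None, "") and value not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}, got {value!r}")
    n_turn = settings.get("n_turn")
    if n_turn is not None and (not isinstance(n_turn, int) or n_turn < 1):
        problems.append("n_turn must be a positive integer")
    return problems


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = {
        "type": "quarter",
        "model": "gpt-4",
        "input_file": "merged_quarter_benchmark.jsonl",
        "answer_file": "answers.json",
        "parsing_file": "parsed.jsonl",
        "readme_type": "oracle_segment",
        "engine_name": "gpt-4-32",
        "n_turn": 5,
        "openai_key": "YOUR_KEY",
    }
    print(validate_settings(example))
```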

Please refer to openai for details.
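The parsing_file step described above turns raw GPT output into executable code segments. A minimal sketch of that kind of post-processing follows, assuming the model wraps code in Markdown-style fenced blocks and that an unfenced answer should be treated as a single segment. Both assumptions, and the name `extract_code_segments`, are illustrative; the repository's own parser may work differently.

```python
import re

# Build the fence marker programmatically so this example can itself sit in a fenced block.
FENCE = "`" * 3

# Match an opening fence with an optional language tag, then capture up to the closing fence.
CODE_BLOCK = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code_segments(answer):
    """Return the code segments found in one model answer."""
    blocks = [block.strip() for block in CODE_BLOCK.findall(answer)]
    blocks = [block for block in blocks if block]
    if blocks:
        return blocks
    # Assumption: with no fences, the whole answer is treated as one segment.
    stripped = answer.strip()
    return [stripped] if stripped else []


if __name__ == "__main__":
    sample = "Run this:\n" + FENCE + "bash\npython train.py --lr 0.1\n" + FENCE + "\nDone."
    print(extract_code_segments(sample))
```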

🔧 Open Source Model Fine-tuning

📋 Prerequisites

Llama-recipes provides a pip distribution for easy installation and usage in other projects. Alternatively, it can be installed from source.

  1. Install with pip
pip install --extra-index-url llama-recipes
  2. Install from source. To install from source, e.g. for development, use the following commands. We're using hatchling as our build backend, which requires an up-to-date pip as well as the setuptools package.
git clone
cd llama-recipes
pip install -U pip setuptools
pip install --extra-index-url -e .

🏋️ Fine-tuning

The paper defines three tasks.

  • Task 1: Given a task description + Code, generate a code snippet.
  • Task 2: Given a task description + Retrieval, generate a code snippet.
  • Task 3: Given a task description + Oracle, generate a code snippet.
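The three tasks differ only in the context that accompanies the task description. A minimal sketch of that mapping follows; the prompt layout and the name `build_prompt` are hypothetical and chosen for illustration, not taken from the paper or the fine-tuning code.

```python
# Task number -> kind of context supplied with the task description (from the list above).
TASK_CONTEXT = {1: "Code", 2: "Retrieval", 3: "Oracle"}


def build_prompt(task, description, context):
    """Assemble an illustrative prompt for one of the three tasks."""
    if task not in TASK_CONTEXT:
        raise ValueError(f"task must be one of {sorted(TASK_CONTEXT)}, got {task!r}")
    label = TASK_CONTEXT[task]
    # Hypothetical layout: description first, then the task-specific context.
    return (
        f"Task description:\n{description}\n\n"
        f"{label}:\n{context}\n\n"
        "Generate a code snippet:"
    )


if __name__ == "__main__":
    print(build_prompt(3, "Train the model on the sample dataset.", "README excerpt goes here"))
```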

You can use the following script to reproduce CodeLlama-7b's fine-tuning performance on this task:

torchrun --nproc_per_node 2 \
    --use_peft \
    --peft_method lora \
    --enable_fsdp \
    --model_name codellama/CodeLlama-7b-Instruct-hf \
    --context_length 8192 \
    --dataset mlbench_dataset \
    --output_dir OUTPUT_PATH \
    --task TASK \
    --data_path DATA_PATH

Set OUTPUT_PATH, TASK, and DATA_PATH to match your setup.

  • OUTPUT_PATH: The directory to save the model.
  • TASK: Choose from 1, 2 and 3.
  • DATA_PATH: The directory of the dataset.
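When launching several runs (for example, one per task), it can help to assemble the command programmatically. The sketch below builds the argument list from the flags shown above. The fine-tuning script path is not given here, so the caller must supply it; placing it directly after the torchrun options is an assumption, and `build_finetune_command` is a hypothetical helper.

```python
import shlex


def build_finetune_command(script, output_path, task, data_path, nproc_per_node=2):
    """Return the torchrun argument list for one fine-tuning run."""
    if task not in (1, 2, 3):
        raise ValueError("TASK must be 1, 2, or 3")
    return [
        "torchrun", "--nproc_per_node", str(nproc_per_node),
        script,  # assumption: the script path follows the torchrun options
        "--use_peft",
        "--peft_method", "lora",
        "--enable_fsdp",
        "--model_name", "codellama/CodeLlama-7b-Instruct-hf",
        "--context_length", "8192",
        "--dataset", "mlbench_dataset",
        "--output_dir", output_path,
        "--task", str(task),
        "--data_path", data_path,
    ]


if __name__ == "__main__":
    # "path/to/finetuning_script.py" is a placeholder, not a real path in this repository.
    command = build_finetune_command("path/to/finetuning_script.py", "OUTPUT_PATH", 1, "DATA_PATH")
    print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the placeholders are replaced with real paths.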

🔍 Inference

You can use the following script to reproduce CodeLlama-7b's inference performance on this task:

python \
    --model_name 'codellama/CodeLlama-7b-Instruct-hf' \
    --peft_model PEFT_MODEL \
    --prompt_file PROMPT_FILE \
    --task TASK

Set PEFT_MODEL, PROMPT_FILE, and TASK to match your setup.

  • PEFT_MODEL: The path of the PEFT model.
  • PROMPT_FILE: The path of the prompt file.
  • TASK: Choose from 1, 2 and 3.

Please refer to finetune for details.

🤖 ML-Agent-Bench

🌍 Environment Setup

To run the ML-Agent-Bench Docker container, you can use the following command:

docker pull
docker run -it /bin/bash

This will pull the latest ML-Agent-Bench Docker image and run it in an interactive shell. The container includes all the necessary dependencies to run the ML-Agent-Bench codebase.

For ML-Agent-Bench in OpenDevin, please refer to the OpenDevin setup guide.

Please refer to envs for details.

📝 Cite Us

This project is inspired by some related projects. We would like to thank the authors for their contributions. If you find this project or dataset useful, please cite it:

      title={ML-Bench: Evaluating Large Language Models for Code Generation in Repository-Level Machine Learning Tasks}, 
      author={Yuliang Liu and Xiangru Tang and Zefan Cai and Junjie Lu and Yichi Zhang and Yanjun Shao and Zexuan Deng and Helan Hu and Zengxian Yang and Kaikai An and Ruijun Huang and Shuzheng Si and Sheng Chen and Haozhe Zhao and Zhengliang Li and Liang Chen and Yiming Zong and Yan Wang and Tianyu Liu and Zhiwei Jiang and Baobao Chang and Yujia Qin and Wangchunshu Zhou and Yilun Zhao and Arman Cohan and Mark Gerstein},
      journal={arXiv preprint arXiv:2311.09835},

📜 License

Distributed under the MIT License. See LICENSE for more information.

