MileBench 🛣️

🌐 Homepage | 🤗 Dataset | 🤗 Paper | 📖 arXiv | GitHub

This repo contains evaluation code for the paper "MileBench: Benchmarking MLLMs in Long Context".

Requires Python 3.10+, PyTorch 2.1.1, transformers, and accelerate.

🌈 Update

  • [2024.4.15] 🎉🎉🎉 MileBench is public! 🎉🎉🎉


Introduction

We introduce MileBench, a pioneering benchmark designed to rigorously test the MultImodal Long-contExt capabilities of MLLMs. The benchmark comprises a mix of text and images, long contexts, multiple tasks, and tasks requiring both comprehension and generation. To systematically assess these capabilities, MileBench consists of two distinct evaluation sets: diagnostic evaluation and realistic evaluation. The former probes the long-context recall abilities of MLLMs using needle-in-a-haystack and image-retrieval tasks, while the latter stress-tests the model under conditions akin to real-world usage, with both temporal multi-image tasks and semantic multi-image tasks.

After evaluating 20 models, we found that the closed-source Gemini 1.5 excelled in the realistic evaluation with a score of 54.7%, still well short of a perfect 100%, while GPT-4(Vision) reached a peak score of 99.4% in the diagnostic evaluation. In contrast, most open-source MLLMs struggled with long-context tasks: only VILA and Qwen-VL-7B managed average scores of 44.4% and 37.2% in the realistic and diagnostic evaluations, respectively. These results underscore that there are "miles to go" towards fully-realized long-context MLLMs, prompting a call for increased research focus on such tasks, especially those involving numerous images.

MileBench Examples

Preparation

🤗 Dataset Preparation

The MileBench dataset comprises 6,440 samples from 29 datasets, with each sample containing multiple images. The data has been archived on the cloud and can be downloaded from the HuggingFace link or the BaiduYun link. Save the dataset under the `data` folder.
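If you prefer a scripted download, the snippet below is a minimal sketch using `huggingface_hub`; the `repo_id` here is an assumption, so substitute the id from the HuggingFace link above:

from huggingface_hub import snapshot_download

# Download the dataset archive into ./data
snapshot_download(
    repo_id="FreedomIntelligence/MileBench",  # assumed id; use the one from the HuggingFace link
    repo_type="dataset",
    local_dir="data",
)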

🤖 Environment Setup

Install required packages:

pip install -r requirements.txt

ℹ️ How to Evaluate

Modify model configuration file


In configs/model_configs.yaml:

# Add a new model "my_model"
my_model:
    model_name: "my_model"
    model_dir: "path/to/full/model" # HuggingFace model weights
    cfg_path: "path/to/full/model_config"   # can be none
    gen_kwargs:
        max_new_tokens: 512
        min_new_tokens: 1
        do_sample: false
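As a quick sanity check, the entry above can be loaded like any other YAML mapping. This is only an illustrative snippet, not the repo's actual loader; it just shows where `model_dir` and `gen_kwargs` end up:

import yaml

with open("configs/model_configs.yaml") as f:
    model_configs = yaml.safe_load(f)

cfg = model_configs["my_model"]
print(cfg["model_dir"])    # "path/to/full/model"
print(cfg["gen_kwargs"])   # {'max_new_tokens': 512, 'min_new_tokens': 1, 'do_sample': False}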

Modify model worker


In workers/model_workers.py:

  1. Add a new model class (a hedged end-to-end sketch follows this list):

class MyModel(BaseWorker):

    def init_components(self, config) -> None:
        # Initialize the model components (processor/tokenizer, weights, etc.)
        ...

    def forward(self, questions: list[str], image_paths: list[list], device, gen_kwargs) -> list[str]:
        # Prepare the images and text, run the model's generate function,
        # and return one answer string per question
        ...

  2. For the GitHub packages of the different VLMs, we recommend saving them to the ./packages directory; then you don't need to pip-install those packages in your environment.
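For orientation, here is a minimal, hypothetical worker built on a generic HuggingFace vision-language model. It is only a sketch, not the repo's actual implementation: the `AutoProcessor`/`AutoModelForVision2Seq` calls, the `config.model_dir` attribute, and the per-question loop are assumptions, and the exact prompt and image packing differ per VLM.

from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

class MyModel(BaseWorker):  # BaseWorker: the worker base class used in workers/model_workers.py

    def init_components(self, config) -> None:
        # Assumes config.model_dir points at HuggingFace-format weights (see model_configs.yaml)
        self.processor = AutoProcessor.from_pretrained(config.model_dir)
        self.model = AutoModelForVision2Seq.from_pretrained(
            config.model_dir, torch_dtype=torch.float16
        )

    def forward(self, questions: list[str], image_paths: list[list], device, gen_kwargs) -> list[str]:
        self.model.to(device).eval()
        answers = []
        for question, paths in zip(questions, image_paths):
            images = [Image.open(p).convert("RGB") for p in paths]
            # Exact prompt/image packing varies per VLM; this is the generic processor call
            inputs = self.processor(text=question, images=images, return_tensors="pt").to(device)
            output_ids = self.model.generate(**inputs, **gen_kwargs)
            answers.append(self.processor.batch_decode(output_ids, skip_special_tokens=True)[0])
        return answers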

Modify utils.py


In utils.py: import your model

from workers.model_workers import MyModel   # modify here

name2worker = {
    "my_model": MyModel,  # modify here
}

Generate response

Set the GPU count in `configs/accelerate_configs.yaml`:
num_processes: GPU_NUM    # modify here

Modify eval.sh:

gpu_num=GPU_NUM  # modify here

for model in my_model; do  # modify here
    for dataset_name in dataset_name; do  # modify here
...

and run:

source eval.sh

Run evaluation


run:

python score.py \
    --result-dir outputs \
    --models my_model  # models to eval
# Result saved to outputs/result.csv
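To inspect the aggregated scores programmatically, a small sketch with `pandas` follows; the column layout of `result.csv` is an assumption, so adjust to whatever the file actually contains:

import pandas as pd

# Load the scores written by score.py and print the first few rows
df = pd.read_csv("outputs/result.csv")
print(df.head())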

License


All software is licensed under the Apache License, Version 2.0 (Apache 2.0). All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY).

Declaration

The dataset we're using is an aggregation of publicly accessible datasets licensed under the Creative Commons license (CC-BY) or other open-source licenses. We have carefully followed all required legal procedures to incorporate this data into our research, recognizing the importance of transparency in data licensing for proper attribution and suitable data utilization. Our dataset also encompasses images derived from publicly accessible datasets and language data created through the GPT-4V API. While measures have been put in place to secure suitable content, we acknowledge the potential existence of problematic content. Should you come across any such content, we urge you to inform us immediately so we can make the necessary adjustments to keep the dataset free from inappropriate content. We are committed to maintaining a high-quality, ethically responsible dataset and promise to uphold principles of privacy and transparency throughout our work.

Contact

Citation

If you find this repository helpful, please consider citing it:

@article{song2024milebench,
  title={MileBench: Benchmarking MLLMs in Long Context},
  author={Song, Dingjie and Chen, Shunian and Chen, Guiming Hardy and Yu, Fei and Wan, Xiang and Wang, Benyou},
  journal={arXiv preprint arXiv:2404.18532},
  year={2024}
}