
VPGTrans: Transfer Visual Prompt Generator across LLMs

Ao Zhang, Hao Fei*, Yuan Yao*, Wei Ji, Li Li, Zhiyuan Liu and Tat-Seng Chua. (*Correspondence)

National University of Singapore, Tsinghua University

Project page: VPGTrans

This repo contains all the source code for the proposed methods and the two newly built VL-LLMs.


What's New: πŸŽ‰

  • 2023.10.13 Our VPGTrans is accepted to NeurIPS 2023!!!
  • 2023.05.01 Release the code.

Table of Contents

  • Introduction
  • VL-LLaMA
  • VL-Vicuna
  • Installation
  • VL-Vicuna Demo
  • Evaluation
  • Training
  • Acknowledgement
  • License

Introduction

Developing a new vision-language LLM (VL-LLM) by pre-training on massive amounts of image-text pairs from scratch is exceedingly resource-consuming, so connecting an existing LLM with a comparatively lightweight visual prompt generator (VPG) has become a feasible paradigm. However, further tuning the VPG part of the VL-LLM still incurs substantial computational cost.

In this project, we develop the VPGTrans framework for transferring a VPG across LLMs to build VL-LLMs at significantly lower cost: GPU hours can be reduced by more than 10x, and the required training data can be reduced to around 10%.

VPGTrans comprises two training stages: (1) projector warm-up and (2) direct fine-tuning.

Most excitingly, VPGTrans makes it possible to customize new VL-LLMs from newly released LLMs. In this project, we release two novel VL-LLMs built via VPGTrans: VL-LLaMA and VL-Vicuna.

VL-LLaMA

We customize VL-LLaMA, a multimodal version of LLaMA, by transferring the VPG of BLIP-2 OPT-6.7B to LLaMA via VPGTrans. The performance of VL-LLaMA is summarized in our paper and on the project page.

VL-Vicuna

We also build a GPT-4-like multimodal chatbot, VL-Vicuna, based on the Vicuna LLM. Check out the VL-Vicuna demo on our project page.

Installation

1. Prepare the code

git clone https://github.com/VPGTrans/VPGTrans.git
cd VPGTrans
pip install -r requirements.txt
pip install -e .
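
As a quick sanity check of the editable install (assuming the package is importable under the lavis name, as in upstream LAVIS), you can run:

python -c "import lavis; print(lavis.__file__)"  # should print a path inside the cloned repo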

VL-Vicuna Demo

1. Prepare the pretrained Vicuna weights
To run VL-Vicuna locally, you first need to prepare the Vicuna weights.
The current version of VL-Vicuna is built on the v0 version of Vicuna-7B. Please refer to the instructions here to prepare the Vicuna weights. The final weights should be in a single folder with a structure similar to the following:

vicuna_weights
β”œβ”€β”€ config.json
β”œβ”€β”€ generation_config.json
β”œβ”€β”€ pytorch_model.bin.index.json
β”œβ”€β”€ pytorch_model-00001-of-00003.bin
...   

2. Run the code
Please set the llama_model field on line 12 of lavis/projects/blip2/demo/vl_vicuna_demo.yaml to the path of your Vicuna checkpoint.
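
The edited line would look similar to the following sketch (surrounding keys omitted; the path is a placeholder for your local folder):

# lavis/projects/blip2/demo/vl_vicuna_demo.yaml, line 12
llama_model: "/path/to/vicuna_weights"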

Then, run:

python webui_demo.py --cfg-path lavis/projects/blip2/demo/vl_vicuna_demo.yaml  --gpu-id 0

Note that the pretrained checkpoint will be downloaded automatically.

Evaluation

Prepare the data
Please first refer to here to prepare the datasets you want to evaluate on. We use COCO caption, NoCaps, VQAv2, GQA, and OK-VQA in our paper.
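
For example, if this fork retains LAVIS's dataset download scripts (an assumption; check lavis/datasets/download_scripts/ in your clone), the COCO caption images can be fetched with:

python lavis/datasets/download_scripts/download_coco.py  # download script inherited from upstream LAVIS (assumption); otherwise follow the linked instructions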

VL-LLaMA Evaluation

1. Prepare the pretrained LLaMA weights
The LLaMA checkpoint is the base model from which Vicuna is derived. Please also refer to the instructions here to prepare the LLaMA weights.

2. Run the code
Please run:

bash run_scripts/blip2/scale_up_eval/eval_vqa_llama.sh /path/to/llama_7b_dir/ # zero-shot vqav2 eval
bash run_scripts/blip2/scale_up_eval/eval_gqa_llama.sh /path/to/llama_7b_dir/ # zero-shot gqa eval
bash run_scripts/blip2/scale_up_eval/eval_okvqa_llama.sh /path/to/llama_7b_dir/ # zero-shot okvqa eval

Training

Stage-1 pre-training requires the COCO caption and SBU datasets. Stage-2 additionally requires VG caption and Laion-COCO. Please refer to here for data downloading.

If you want to exclude some datasets from training, comment out the corresponding items in the config file, for example (here SBU is disabled):

# lavis/projects/blip2/train/llama_vpgtrans_step1_proj_warmup.yaml
datasets:
  coco_caption:
    vis_processor:
        train:
          name: "blip2_image_train"
          image_size: 224
        eval:
          name: "blip_image_eval"
          image_size: 224
    text_processor:
        train:
          name: "blip_caption"
        eval:
          name: "blip_caption"
#  the SBU will not be used in the pre-training
#  sbu_caption:
#    vis_processor:
#        train:
#          name: "blip2_image_train"
#          image_size: 224
#    text_processor:
#        train:
#          name: "blip_caption"

VL-LLaMA Training

1. Stage-1 Projector Warm-up
First, you need to download the BLIP2 OPT-6.7B checkpoint.

Optionally, you can first initialize the projector with the word converter:

CUDA_VISIBLE_DEVICES=0 python tools/linear_proj/train_linear_proj_opt_and_llama.py \
  facebook/opt-6.7b \
  /path/to/llama_7b_dir/ \
  /path/to/blip2_opt6.7b_ckpt \
  /path/to/output_dir
# example:
# CUDA_VISIBLE_DEVICES=0 python tools/linear_proj/train_linear_proj_opt_and_llama.py \
#   facebook/opt-6.7b \
#   ./llama-7b/ \
#   ./blip2_pretrained_opt6.7b.pth  \
#   ./lavis/output/proj_init

Then, run the projector warm-up:

bash run_scripts/blip2/scale_up_train/llama_vpgtrans_step1_proj_warmup.sh \
  /path/to/blip2_opt6.7b_ckpt \
  /path/to/projector_init_weight \
  /path/to/llama_7b_dir/
 # example:
# bash run_scripts/blip2/scale_up_train/llama_vpgtrans_step1_proj_warmup.sh \
#  ./blip2_pretrained_opt6.7b.pth \
#  ./lavis/output/proj_init/xxx.pth \
#  ./llama-7b/

2. Stage-2 Direct Fine-tuning
Please run:

bash run_scripts/blip2/scale_up_train/llama_vpgtrans_step2_direct_finetune.sh \
  /path/to/blip2_opt6.7b_ckpt \
  /path/to/stage1_proj_warmup_checkpoint \
  /path/to/llama_7b_dir/
 # example:
# bash run_scripts/blip2/scale_up_train/llama_vpgtrans_step2_direct_finetune.sh \
#  ./blip2_pretrained_opt6.7b.pth \
#  ./lavis/output/vl-llama_stage1/checkpoint_0.pth \
#  ./llama-7b/

VL-Vicuna Training

Vicuna is an instruction-tuned version of LLaMA, so most of the scripts are the same as those for VL-LLaMA training.

1. Stage-1 Projector Warm-up
First, you need to download the BLIP2 OPT-6.7B checkpoint.

Optionally, you can first initialize the projector with the word converter:

CUDA_VISIBLE_DEVICES=0 python tools/linear_proj/train_linear_proj_opt_and_llama.py \
  facebook/opt-6.7b \
  /path/to/vicuna_7b_dir/ \
  /path/to/blip2_opt6.7b_ckpt \
  /path/to/output_dir
# example:
# CUDA_VISIBLE_DEVICES=0 python tools/linear_proj/train_linear_proj_opt_and_llama.py \
#   facebook/opt-6.7b \
#   ./vicuna-7b/ \
#   ./blip2_pretrained_opt6.7b.pth  \
#   ./lavis/output/proj_init

Then, run the projector warm-up:

bash run_scripts/blip2/scale_up_train/llama_vpgtrans_step1_proj_warmup.sh \
  /path/to/blip2_opt6.7b_ckpt \
  /path/to/projector_init_weight \
  /path/to/vicuna_7b_dir/
 # example:
# bash run_scripts/blip2/scale_up_train/llama_vpgtrans_step1_proj_warmup.sh \
#  ./blip2_pretrained_opt6.7b.pth \
#  ./lavis/output/proj_init/xxx.pth \
#  ./vicuna-7b/

2. Stage-2 Direct Fine-tuning
Please run:

bash run_scripts/blip2/scale_up_train/llama_vpgtrans_step2_direct_finetune.sh \
  /path/to/blip2_opt6.7b_ckpt \
  /path/to/stage1_proj_warmup_checkpoint \
  /path/to/vicuna_7b_dir/
 # example:
# bash run_scripts/blip2/scale_up_train/llama_vpgtrans_step2_direct_finetune.sh \
#  ./blip2_pretrained_opt6.7b.pth \
#  ./lavis/output/vl-vicuna_stage1/checkpoint_0.pth \
#  ./vicuna-7b/

3. Stage-3 Visual Instruction Tuning
To align the model with conversational scenarios, we conduct a short tuning stage on MiniGPT-4's self-instruct data (around 3,000 images). Please refer to the instructions for downloading it, then run:

bash run_scripts/blip2/scale_up_train/vicuna_vpgtrans_step3_self_instruct.sh \
  /path/to/stage2_direct_tuning_checkpoint \
  /path/to/vicuna_7b_dir/
 # example:
# bash run_scripts/blip2/scale_up_train/vicuna_vpgtrans_step3_self_instruct.sh \
#  ./lavis/output/vl-vicuna_stage2/checkpoint_50000.pth \
#  ./vicuna-7b/

Acknowledgement

  • Lavis: This repository is built upon Lavis.
  • Vicuna: VL-Vicuna is built on the Vicuna LLM.
  • MiniGPT-4: The web UI and part of this README are based on MiniGPT-4.

If you're using VPGTrans in your research or applications, please cite using this BibTeX:

@article{2023vpgtrans,
  author  = {Ao Zhang and Hao Fei and Yuan Yao and Wei Ji and Li Li and Zhiyuan Liu and Tat-Seng Chua},
  title   = {Transfer Visual Prompt Generator across LLMs},
  journal = {CoRR},
  volume  = {abs/2305.01278},
  year    = {2023},
  url     = {https://doi.org/10.48550/arXiv.2305.01278},
}

License

BSD 3-Clause License
