


Question Aware Vision Transformer for Multimodal Reasoning

Roy Ganz, Yair Kittenplon, Aviad Aberdam, Elad Ben Avraham, Oren Nuriel, Shai Mazor, Ron Litman

Installation

First, clone this repository:

git clone https://github.com/amazon-science/QA-ViT.git
cd QA-ViT

Next, to install the requirements in a new conda environment, run:

conda env create -f qavit.yml
conda activate qavit
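
To verify the environment (optional; this assumes PyTorch is among the packages installed from qavit.yml, which is a reasonable but unconfirmed assumption), run:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"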

Data preparation

Download the following datasets from their official websites and organize them as follows:

QA-ViT
├── configs
│   ├── ...
├── data
│   ├── textvqa
│   ├── stvqa
│   ├── OCRVQA
│   ├── vqav2
│   ├── vg
│   ├── textcaps
│   ├── docvqa
│   ├── infovqa
│   ├── vizwiz
├── models
│   ├── ...
├── ...
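
As an optional sanity check, the short shell snippet below (not part of the repository; it only mirrors the folder names in the tree above) reports any missing dataset directory:

for d in textvqa stvqa OCRVQA vqav2 vg textcaps docvqa infovqa vizwiz; do
  [ -d "data/$d" ] || echo "missing: data/$d"
done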

DeepSpeed Configuration

Our framework is built on DeepSpeed ZeRO stage 2 (via accelerate) and should be configured accordingly:

accelerate config

Running accelerate config opens an interactive dialog; answer it according to the settings below:

Model        | DeepSpeed stage | Grad accumulation | Grad clipping | Dtype
ViT+T5 base  | 2               | -                 | -             | bf16
ViT+T5 large | 2               | -                 | -             | bf16
ViT+T5 xl    | 2               | 2                 | -             | bf16
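
For reference, the generated accelerate configuration for the ViT+T5 xl setting should end up with entries roughly like the sketch below. This is illustrative only: the exact file layout depends on your accelerate version, and num_machines/num_processes are placeholders for your own hardware.

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
  gradient_accumulation_steps: 2
mixed_precision: bf16
num_machines: 1
num_processes: 8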

Training

Training script and instructions will be available soon.

Evaluation

After setting up DeepSpeed, run the following command to evaluate a trained model:

accelerate launch run_eval.py --config <config> --ckpt <ckpt>

where <config> and <ckpt> specify the desired evaluation configuration and trained model checkpoint, respectively.
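
For example (paths here are purely illustrative; pick an evaluation configuration from configs/ and point --ckpt at a downloaded or trained checkpoint):

accelerate launch run_eval.py --config configs/<eval_config> --ckpt checkpoints/<model_checkpoint>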

Trained Checkpoints

We provide trained checkpoints of QA-ViT in the table below:

ViT+T5 base | ViT+T5 large | ViT+T5 xl
Download    | Download     | Download

LLaVA's checkpoints will be uploaded soon.

Main Results

Method       | VQAv2 (vqa-score) | COCO (CIDEr) | VQA^T (vqa-score) | VQA^ST (ANLS) | TextCaps (CIDEr) | VizWiz (vqa-score) | General Average | Scene-Text Average
ViT+T5-base  | 66.5 | 110.0 | 40.2 | 47.6 | 86.3  | 23.7 | 88.3 | 65.1
+ QA-ViT     | 71.7 | 114.9 | 45.0 | 51.1 | 96.1  | 23.9 | 93.3 | 72.1
Δ            | +5.2 | +4.9  | +4.8 | +3.5 | +9.8  | +0.2 | +5.0 | +7.0
ViT+T5-large | 70.0 | 114.3 | 44.7 | 50.6 | 96.0  | 24.6 | 92.2 | 71.8
+ QA-ViT     | 72.0 | 118.7 | 48.7 | 54.4 | 106.2 | 26.0 | 95.4 | 78.9
Δ            | +2.0 | +4.4  | +4.0 | +3.8 | +10.2 | +1.4 | +3.2 | +7.1
ViT+T5-xl    | 72.7 | 115.5 | 48.0 | 52.7 | 103.5 | 27.0 | 94.1 | 77.0
+ QA-ViT     | 73.5 | 116.5 | 50.3 | 54.9 | 108.2 | 28.3 | 95.0 | 80.4
Δ            | +0.8 | +1.0  | +2.3 | +2.2 | +4.7  | +1.3 | +0.9 | +3.4

Citation

If you find this code or data useful for your research, please consider citing:

@article{ganz2024question,
  title={Question Aware Vision Transformer for Multimodal Reasoning},
  author={Ganz, Roy and Kittenplon, Yair and Aberdam, Aviad and Avraham, Elad Ben and Nuriel, Oren and Mazor, Shai and Litman, Ron},
  journal={arXiv preprint arXiv:2402.05472},
  year={2024}
}
