hoannc0506/Visual-Question-Answering

Visual Question Answering project

Dataset

VQA COCO dataset

Pipeline

CNN + LSTM approach

  • Image Encoder: ResNet50
  • Text Encoder: BiLSTM

Transformers approach

  • Image Encoder: Vision Transformer, ViTMAE
  • Text Encoder: RoBERTa base model
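
A minimal sketch of how a ViT image encoder and a RoBERTa text encoder might be fused, assuming both are loaded through Hugging Face Transformers. The class name `VQAClassifier`, the concatenation-based fusion, and the single linear head are hypothetical. The encoders here are tiny and randomly initialised so the example runs without downloading weights; in practice they would be created with `from_pretrained`.

```python
import torch
from torch import nn
from transformers import RobertaConfig, RobertaModel, ViTConfig, ViTModel


class VQAClassifier(nn.Module):
    """Hypothetical fusion model: encode image and question, concatenate, classify."""

    def __init__(self, visual_encoder, text_encoder, num_answers):
        super().__init__()
        self.visual_encoder = visual_encoder
        self.text_encoder = text_encoder
        fused_dim = visual_encoder.config.hidden_size + text_encoder.config.hidden_size
        self.classifier = nn.Linear(fused_dim, num_answers)

    def forward(self, pixel_values, input_ids, attention_mask):
        image_feat = self.visual_encoder(pixel_values=pixel_values).pooler_output
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).pooler_output
        return self.classifier(torch.cat([image_feat, text_feat], dim=-1))


# Tiny stand-in configurations (assumed values, chosen only to keep the sketch fast).
vit_cfg = ViTConfig(hidden_size=32, num_hidden_layers=1, num_attention_heads=2,
                    intermediate_size=64, image_size=32, patch_size=16)
roberta_cfg = RobertaConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                            num_attention_heads=2, intermediate_size=64,
                            max_position_embeddings=40)
model = VQAClassifier(ViTModel(vit_cfg), RobertaModel(roberta_cfg), num_answers=10).eval()

pixel_values = torch.randn(2, 3, 32, 32)
input_ids = torch.randint(3, 100, (2, 8))  # avoids the special-token ids 0, 1, 2
attention_mask = torch.ones(2, 8, dtype=torch.long)
with torch.no_grad():
    logits = model(pixel_values, input_ids, attention_mask)
print(logits.shape)
```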

Train scripts

python train_vqa_basic_trainer.py \
--visual-pretrained "google/vit-base-patch16-224" \
--text-pretrained "roberta-base" \
--device "cuda:0"
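
The script's argument handling is not shown, so the following is only a sketch of how the three flags in the command above could be declared with `argparse`. The flag names come from the command; the defaults, help text, and the `build_parser` helper are assumptions.

```python
import argparse


def build_parser():
    # Flag names mirror the command above; defaults and help strings are assumed.
    parser = argparse.ArgumentParser(description="Train a VQA model (illustrative sketch)")
    parser.add_argument("--visual-pretrained", default="google/vit-base-patch16-224",
                        help="identifier of the pretrained image encoder")
    parser.add_argument("--text-pretrained", default="roberta-base",
                        help="identifier of the pretrained text encoder")
    parser.add_argument("--device", default="cuda:0",
                        help="device string, for example cuda:0 or cpu")
    return parser


# Parse an explicit argument list so the example runs outside the command line.
args = build_parser().parse_args(["--text-pretrained", "roberta-base", "--device", "cpu"])
print(args.visual_pretrained, args.text_pretrained, args.device)
```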

Train results

Models                                           Val acc    Test acc
ResNet50 + LSTM                                  0.5358     -
VisTrans + RoBERTa (pooler_output)               0.6690     0.6636
VisTrans + RoBERTa (last_hidden_state output)    0.6931     0.6874
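
The last two rows differ in which encoder output feeds the classifier. Assuming the encoders come from Hugging Face Transformers, as the model identifiers suggest, `pooler_output` is one vector per input (the first token's hidden state passed through a dense layer and tanh), while `last_hidden_state` holds one vector per token. The sketch below shows the two shapes on a tiny randomly initialised RoBERTa. How the repository reduces `last_hidden_state` to a fixed-size feature is not stated, so the mean pooling here is only one assumed option.

```python
import torch
from transformers import RobertaConfig, RobertaModel

# Tiny stand-in configuration (assumed values) so no weights are downloaded.
cfg = RobertaConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64,
                    max_position_embeddings=40)
encoder = RobertaModel(cfg).eval()

input_ids = torch.randint(3, 100, (2, 8))  # batch of 2 questions, 8 tokens each
with torch.no_grad():
    out = encoder(input_ids=input_ids)

pooled = out.pooler_output        # (batch, hidden): one vector per question
tokens = out.last_hidden_state    # (batch, seq_len, hidden): one vector per token
mean_pooled = tokens.mean(dim=1)  # assumed reduction to (batch, hidden)

print(pooled.shape, tokens.shape, mean_pooled.shape)
```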

To do

  • Publish the trained models
  • Inference code
  • Compare with other models
