Visual Entities Empowered Zero-Shot Image-to-Text Generation Transfer Across Domains

Problem statement:

  • Given an image I, the goal is to generate a textual description using a pre-trained Vision-Language Model (VLM) while leveraging real-world knowledge from a Large Language Model (LLM).
  • The primary focus is on addressing challenges related to modality bias and object hallucination.

(Figure: problem overview)

Method:

(Figure: method overview)

  • Training: Using a text-only corpus, nouns are extracted from each sentence by a grammar parser to construct the hard prompt; the soft prompt then encodes the overall context of the sentence via the CLIP text encoder.
  • Inference: The CLIP image encoding is passed to a projector, which produces the soft prompt; in parallel, a CLIP-based entity classifier constructs the entity-aware hard prompt. Because the hard prompt is training-agnostic, it transfers strongly across domains.
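The training-time prompt construction above can be sketched as follows. This is a minimal illustration, not the repo's actual code: the "There are ... in the image." template and the helper names are assumptions, and the tiny tag filter stands in for a real grammar parser.

```python
def extract_nouns(tokens_with_tags):
    """Keep tokens tagged as nouns (stand-in for a real grammar/POS parser)."""
    return [tok for tok, tag in tokens_with_tags if tag.startswith("NN")]

def build_hard_prompt(nouns):
    """Format the entity-aware hard prompt from the extracted nouns
    (the template here is illustrative, not necessarily the paper's)."""
    if not nouns:
        return "There is something in the image."
    return "There are " + ", ".join(nouns) + " in the image."

# Example sentence: "a dog catches a frisbee", already POS-tagged.
tagged = [("a", "DT"), ("dog", "NN"), ("catches", "VBZ"),
          ("a", "DT"), ("frisbee", "NN")]
print(build_hard_prompt(extract_nouns(tagged)))
# -> There are dog, frisbee in the image.
```

At inference the same template would be filled by the CLIP-based entity classifier's predictions instead of parsed nouns, which is what makes the hard prompt training-agnostic.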

Analysis:

(Figure: analysis)

Code:

Requirements:

Install the package and clone the COCO caption evaluation toolkit:

pip install .
git clone https://github.com/tylin/coco-caption

Data Preparation:

cd Code/utils/
python get_entities.py
cd Code/Feature_Extraction/
python CLIP_texts_features_extraction.py
python CLIP_images_features_extraction.py
cd Code/utils/
python prompt_ensemble.py
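The last step, prompt ensembling, typically means embedding each class name under several text templates, averaging the embeddings, and re-normalizing. A minimal sketch of that idea is below; the templates are illustrative and `fake_encode` is a stand-in for the CLIP text encoder (which `prompt_ensemble.py` would call instead).

```python
import numpy as np

TEMPLATES = ["a photo of a {}.", "a picture of a {}."]  # illustrative set

def fake_encode(text):
    """Deterministic stand-in for CLIP's text encoder: returns a unit vector."""
    rng = np.random.default_rng(len(text))  # seed from the text, for the demo
    v = rng.standard_normal(8)
    return v / np.linalg.norm(v)

def ensemble_embedding(class_name):
    """Average the per-template embeddings and re-normalize."""
    vecs = [fake_encode(t.format(class_name)) for t in TEMPLATES]
    mean = np.mean(vecs, axis=0)
    return mean / np.linalg.norm(mean)

emb = ensemble_embedding("dog")
print(emb.shape)  # (8,)
```

The resulting ensembled embeddings are what the CLIP-based entity classifier compares image features against at inference.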

Training:

cd scripts/
bash train_coco.sh 0
bash train_flickr30k.sh 0

where 0 represents the GPU ID.

Inference:

To run the model on a single image, you can use the provided Notebook.

Evaluation:

  • Cross-Domain
bash eval_nocaps.sh coco_train_1 0 '--top_k 3 --threshold 0.2' 14
bash eval_flickr30k.sh coco_train_1 0 '--top_k 3 --threshold 0.2' 14
bash eval_coco.sh flicker30K_1 0 '--top_k 3 --threshold 0.2 --using_greedy_search' 29
  • In-Domain
bash eval_coco.sh coco_train_1 0 '' 14
bash eval_flickr30k.sh flicker30K_1 0 '' 29
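The `--top_k` and `--threshold` flags plausibly control how the entity classifier's predictions are filtered before building the hard prompt: keep at most `top_k` entities, and only those above the probability threshold. A hypothetical sketch (the function name and dict-based interface are assumptions, not the repo's API):

```python
def select_entities(probs, top_k=3, threshold=0.2):
    """Keep at most top_k entities whose predicted probability
    exceeds threshold, ordered by descending probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, p in ranked[:top_k] if p > threshold]

# Classifier scores for one image (made-up numbers).
probs = {"dog": 0.91, "frisbee": 0.55, "grass": 0.18, "cat": 0.05}
print(select_entities(probs))
# -> ['dog', 'frisbee']
```

Raising the threshold or lowering `top_k` trades recall for precision, which is one lever against object hallucination mentioned in the problem statement.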

Other: