
BLIP2-Japanese

This project builds upon the LAVIS library's BLIP2 model.

The main idea is to replace the tokenizer and the underlying BERT model in BLIP2's Q-former with ones pretrained on Japanese datasets, and to retrain the updated model on Japanese captioning data.
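As a minimal sketch of that substitution (not the repository's exact code), the Japanese checkpoint named later in this README can be loaded with Hugging Face transformers and used in place of the English BERT that the original Q-former is initialized from:

```python
# Sketch only: load a Japanese tokenizer/BERT checkpoint in place of the
# English bert-base-uncased used by the original Q-former initialization.
# Requires the transformers, fugashi and ipadic packages.
from transformers import BertJapaneseTokenizer, BertModel

japanese_ckpt = "cl-tohoku/bert-base-japanese-whole-word-masking"

tokenizer = BertJapaneseTokenizer.from_pretrained(japanese_ckpt)
bert = BertModel.from_pretrained(japanese_ckpt)

# The Q-former's embeddings and encoder layers are then initialized from
# `bert`, and captions are tokenized with `tokenizer` instead of the
# original English WordPiece tokenizer.
print(tokenizer.tokenize("二匹の犬が道路で喧嘩をしている"))
```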

The model has been trained on the COCO dataset with STAIR Captions.

Quick Start

The weights of Blip2_Japanese_qformer trained on STAIR can be obtained from this link.

Copy the whole folder under the lavis directory and make sure the folder is named pretrained.

Also download the bert-base-japanese-whole-word-masking weights and config from the Hugging Face link.

You should now be able to run the example.ipynb notebook.

For directory naming conventions, you can also refer to the .gitignore file.
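Once the weights are in place, loading and captioning should follow LAVIS' usual interface. The sketch below is illustrative only; the `name` and `model_type` strings are placeholders, so check example.ipynb for the identifiers this repository actually registers:

```python
# Illustrative loading sketch via the standard LAVIS interface.
# The name/model_type strings are placeholders, not confirmed identifiers;
# see example.ipynb for the actual registered names.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_japanese",   # placeholder name
    model_type="pretrain",   # placeholder type
    is_eval=True,
    device=device,
)

image = vis_processors["eval"](Image.open("demo.jpg").convert("RGB"))
captions = model.generate({"image": image.unsqueeze(0).to(device)})
print(captions)
```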

Use Case: Generate Japanese Captions for Captioning Datasets

Captions generated for the flickr30k dataset can be found in flickr30k_caption.json; the generation script is in flickr30k_caption_generate.ipynb.

These captions were generated using top-k sampling instead of nucleus sampling.
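For reference, top-k sampling keeps only the k most probable tokens at each decoding step and samples from that truncated distribution, whereas nucleus (top-p) sampling keeps the smallest token set whose cumulative probability exceeds p. A self-contained illustration of a single top-k step (not the repository's generation code):

```python
# Illustration of one top-k sampling step: keep the k highest-probability
# tokens and sample from the renormalized distribution over them.
import torch

def sample_top_k(logits: torch.Tensor, k: int = 50) -> int:
    """logits: (vocab_size,) unnormalized next-token scores."""
    topk_logits, topk_indices = torch.topk(logits, k)
    probs = torch.softmax(topk_logits, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return topk_indices[choice].item()

# Toy example with a random 32000-token vocabulary distribution.
print(sample_top_k(torch.randn(32000), k=50))
```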

Captions generated by the pretrained and finetuned models are shown below:

1001773457

pretrained: {'image': '1001773457.jpg', 'caption': ['二 匹 の 犬 が 道路 で フリスビー を し て いる']} # No frisbee ("Two dogs are playing frisbee on the road")

finetuned: {'image': '1001773457.jpg', 'caption': ['二 匹 の 犬 が 道路 で 喧嘩 を し て いる']} # "Two dogs are fighting on the road"

1001573224

pretrained: {'image': '1001573224.jpg', 'caption': ['6 人 の 女性 が 屋内 で 飛び跳ね て いる']} # Wrong head count ("Six women are jumping indoors")

finetuned: {'image': '1001573224.jpg', 'caption': ['黒い 服 を 着 た 女性 たち が 飛び跳ね て いる']} # "Women in black clothes are jumping"

In general, captions generated by the finetuned model are more accurate.

Use Case: Image Retrieval

Refer to the example.ipynb notebook for more details. The idea is to compute, for each query token, the cosine similarity between the image embeddings and the multimodal embeddings, and then average over the query tokens.
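A sketch of that scoring, assuming the image embeddings and multimodal (image + text) embeddings have already been extracted as (num_query_tokens, dim) tensors as in example.ipynb:

```python
# Sketch of the retrieval score: per-query-token cosine similarity between
# the image embeddings and the multimodal embeddings, averaged over the
# query tokens. The tensors here are placeholders; example.ipynb shows how
# they are actually extracted from the model.
import torch
import torch.nn.functional as F

def retrieval_score(image_embeds: torch.Tensor,
                    multimodal_embeds: torch.Tensor) -> float:
    """Both inputs: (num_query_tokens, dim). Returns a scalar score."""
    sims = F.cosine_similarity(image_embeds, multimodal_embeds, dim=-1)
    return sims.mean().item()

# Toy example with 32 query tokens and 256-dim features.
print(retrieval_score(torch.randn(32, 256), torch.randn(32, 256)))
```

Candidate images can then be ranked by this score for a given text query.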

Model training

The model was trained on a single RTX 4080 (laptop) GPU, so the config used during training was modified as follows:

In blip2_pretrain.yaml: vit_precision = 'fp16'

In pretrain_stage1.yaml: batch_size = 25

During evaluation, you have to change vit_precision back to 'fp32'.

The pretrained and finetuned weights may be updated without prior notice, so if you cannot reproduce the results in the example notebook, please re-download the weights and try again.

User Interface for Japanese Caption Generator

A simple interface for demo purposes can be found in generator-ui.py. To run the UI:

   python generator-ui.py
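For reference, a caption-generation UI like this can be wired up in a few lines with Gradio; the sketch below is hypothetical and may not match how generator-ui.py is actually implemented:

```python
# Hypothetical sketch of a caption-generator UI using Gradio; the real
# generator-ui.py may use a different framework and model-loading code.
import gradio as gr

def generate_caption(image):
    # Placeholder: call the BLIP2-Japanese model's generate() here.
    return "ここに生成されたキャプションが表示されます"  # "the generated caption appears here"

demo = gr.Interface(
    fn=generate_caption,
    inputs=gr.Image(type="pil"),
    outputs="text",
    title="BLIP2-Japanese Caption Generator",
)
demo.launch()
```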

