
CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding

CLIP for Visual Grounding.

This repository is the official PyTorch implementation of the paper CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding.

If you have any questions, please feel free to open an issue or contact me via email: xiaolinhui16@mails.ucas.ac.cn.

Please leave a STAR ⭐ if you like this project!

News

  • Update on 2024/04/20: We release an advanced version of CLIP-VG, namely HiVG (paper, github).
  • Update on 2023/12/13: All of the code, models and datasets have been released.
  • Update on 2023/9/25: Our paper has been accepted by the top journal IEEE Transactions on Multimedia (2023)!
  • Update on 2023/05/18: Release the repository and training code.

Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@article{xiao2023clip,
  title={CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding},
  author={Xiao, Linhui and Yang, Xiaoshan and Peng, Fang and Yan, Ming and Wang, Yaowei and Xu, Changsheng},
  journal={IEEE Transactions on Multimedia},
  year={2023},
  publisher={IEEE}
}

Contents

  1. Introduction
  2. Usage
  3. Results
  4. Contacts
  5. Acknowledgments

Highlight

  • CLIP for Visual Grounding. A state-of-the-art baseline for unsupervised and fully supervised visual grounding with the CLIP model.
  • Single-source and Multi-source pseudo-language labels. The generation and usage of multi-source pseudo-labels.
  • Self-paced Curriculum Adapting Algorithm. A plugin-like algorithmic idea that can be applied to any pseudo-label scenario.

TODO

  • Release model code and inference code.
  • Release unsupervised and fully supervised checkpoints.
  • Release the complete multi-source pseudo-language labels and its generation code.
  • Release the reliability measurement code.

Introduction

In order to utilize vision-and-language pre-trained models for the grounding problem, and to reasonably take advantage of pseudo-labels, we propose CLIP-VG, a novel method that conducts self-paced curriculum adapting of CLIP with pseudo-language labels.

We propose a simple yet efficient end-to-end network architecture to transfer CLIP to visual grounding. On top of this CLIP-based architecture, we further propose single-source and multi-source curriculum adapting algorithms, which progressively find more reliable pseudo-labels to learn an optimal model, thereby achieving a balance between reliability and diversity of the pseudo-language labels.
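The high-level idea of the architecture can be illustrated with a few lines of PyTorch. The following is only a minimal sketch, not the actual CLIP-VG implementation: it keeps the CLIP encoders frozen and attaches a tiny box-regression head on pooled features, whereas the real model fuses token-level features; all module names here are illustrative assumptions.

# A minimal, illustrative sketch of adapting frozen CLIP encoders for box
# prediction. This is NOT the actual CLIP-VG architecture; the real model
# fuses token-level features, while this sketch only uses pooled embeddings
# to show the overall flow.
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: https://github.com/openai/CLIP


class ToyCLIPGrounder(nn.Module):
    def __init__(self, clip_name="ViT-B/16"):
        super().__init__()
        self.clip, _ = clip.load(clip_name, device="cpu")
        for p in self.clip.parameters():          # keep the CLIP backbone frozen
            p.requires_grad_(False)
        dim = self.clip.text_projection.shape[1]  # joint embedding dim (512 for ViT-B)
        self.box_head = nn.Sequential(            # tiny fusion + regression head
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 4),                    # (cx, cy, w, h), normalized
        )

    def forward(self, images, text_tokens):
        with torch.no_grad():
            img_feat = self.clip.encode_image(images).float()
            txt_feat = self.clip.encode_text(text_tokens).float()
        fused = torch.cat([img_feat, txt_feat], dim=-1)
        return self.box_head(fused).sigmoid()


# usage: model = ToyCLIPGrounder(); boxes = model(images, clip.tokenize(["the red cup"]))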

Our method outperforms the current state-of-the-art unsupervised method Pseudo-Q, by a significant margin on RefCOCO/+/g datasets in both single-source and multi-source scenarios. Furthermore, our approach even outperforms existing weakly supervised methods. In comparison with the fully supervised SOTA model QRNet, we achieve comparable results with only 7.7% of its updated parameters, while obtaining significant speedups in both training and inference, up to 26.84× and 7.41×, respectively.

In summary, the contributions of this work are four-fold:

  • As far as we know, we are the first to adapt CLIP to unsupervised visual grounding. Our method transfers the cross-modal learning ability of CLIP to visual grounding at only a small training cost.
  • We are the first to introduce self-paced curriculum learning into unsupervised visual grounding. Our proposed reliability measurement and single-source self-paced adapting progressively enhance the CLIP-based grounding model by utilizing pseudo-labels in an easy-to-hard learning paradigm (see the sketch after this list).
  • We are the first to propose a multi-source self-paced adapting algorithm, which extends our method to multiple sources of pseudo-labels and flexibly improves the diversity of the language taxonomy.
  • We conduct extensive experiments to evaluate the effectiveness of our approach. Results show that our method obtains significant improvements in the unsupervised setting and is also competitive in the fully supervised setting.
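As a rough illustration of the self-paced selection idea mentioned above, the following sketch scores pseudo-labelled samples and keeps only those above a reliability threshold. The reliability function and the threshold schedule are illustrative assumptions, not the exact measures defined in the paper.

# A minimal sketch of the single-source self-paced selection idea: score each
# pseudo-labelled sample with the current model and keep only the samples
# whose reliability exceeds a threshold. The scoring function and schedule
# are illustrative assumptions, not the paper's exact reliability measure.
from typing import Callable, List, Tuple

def select_reliable(samples: List[dict],
                    reliability_fn: Callable[[dict], float],
                    threshold: float = 0.5) -> Tuple[List[dict], List[dict]]:
    """Split pseudo-labelled samples into a reliable (kept) and an unreliable set."""
    kept, dropped = [], []
    for sample in samples:
        score = reliability_fn(sample)       # e.g. a model-confidence or IoU proxy
        (kept if score >= threshold else dropped).append(sample)
    return kept, dropped

# Curriculum loop (easy-to-hard): gradually lower the threshold so harder
# pseudo-labels are admitted only after the model has improved.
# for threshold in (0.7, 0.6, 0.5):
#     train_split, _ = select_reliable(pseudo_samples, reliability_fn, threshold)
#     train(model, train_split)              # hypothetical training routine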

For more details, please refer to our paper.

Usage

Dependencies

  • Python 3.9.10
  • PyTorch 1.9.0 + cu111 + cp39
  • Check requirements.txt for other dependencies.

Our model is easy to deploy in a variety of environments and has been successfully tested on multiple PyTorch versions. If you are interested in the pseudo-language label generation module, more detailed instructions can be found in the usage instructions.

Image Data Preparation

1. You can download the images from the original sources and place them in your disk folder, such as $/path_to_image_data:

  • MS COCO 2014 (for RefCOCO, RefCOCO+, RefCOCOg dataset, almost 13.0GB)

  • ReferItGame

  • Flickr30K Entities

    We provide a script to download the mscoco2014 dataset; you just need to run the following command in a terminal:

    bash download_mscoco2014.sh
    

    Or you can also follow the data preparation of TransVG, which can be found in GETTING_STARTED.md.

Only the image data in these datasets is used, and it can easily be found in similar visual grounding repositories, such as TransVG. Finally, the $/path_to_image_data folder will have the following structure:

|-- image_data
   |-- Flickr30k
      |-- flickr30k-images
   |-- other
      |-- images
        |-- mscoco
            |-- images
                |-- train2014
   |-- referit
      |-- images
  • $/path_to_image_data/image_data/Flickr30k/flickr30k-images/: Image data for the Flickr30K dataset; please download it from this link. Fill in the form and download the images.
  • $/path_to_image_data/image_data/other/images/: Image data for RefCOCO/RefCOCO+/RefCOCOg, i.e., mscoco2014.
  • $/path_to_image_data/image_data/referit/images/: Image data for ReferItGame.
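If you want to quickly verify that the image folders match the layout above, a small check such as the following can help (a minimal sketch; the paths simply mirror the structure listed above):

# Verify the expected image-data layout described in this section.
from pathlib import Path

def check_image_data(root: str) -> None:
    expected = [
        "image_data/Flickr30k/flickr30k-images",
        "image_data/other/images/mscoco/images/train2014",
        "image_data/referit/images",
    ]
    for rel in expected:
        path = Path(root) / rel
        status = "ok" if path.is_dir() else "MISSING"
        print(f"[{status}] {path}")

# check_image_data("/path_to_image_data")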

Text-Box Annotations / Pseudo-Label Preparation

The following are the pseudo-language labels generated by the pseudo-language label generation module in an unsupervised setting.

The single-source scenario includes only pseudo-template labels, derived from Pseudo-Q.

The multi-source scenario includes pseudo-template labels, pseudo-relation labels, and pseudo-caption labels. Please refer to the pseudo-language label generation module for details on how they are generated, if interested.

Additionally, we also provide the pseudo-labels selected by our single-source self-paced curriculum adapting (SSA) and multi-source self-paced curriculum adapting (MSA) algorithms, which can be directly used by other researchers.

The labels in the fully supervised scenario are consistent with previous works such as TransVG. It is worth noting that the test splits in the unsupervised scenario are exactly the same as those used in the fully supervised scenario.

Unsupervised setting

Single-source scenario

  • Datasets: RefCOCO, RefCOCO+, RefCOCOg-g, RefCOCOg-u, ReferIt, Flickr
  • Original: all of the six datasets, 36.7 MB
  • With curriculum selecting: all of the six datasets, 31.4 MB

Multi-source scenario

  • Datasets: RefCOCO, RefCOCO+, RefCOCOg-g, RefCOCOg-u, ReferIt, Flickr
  • Original: all of the six datasets, 144.7 MB; each dataset contains 3 sources of pseudo-labels
  • With curriculum selecting: all of the six datasets, 87.3 MB; each dataset contains 3 sources of pseudo-labels

Fully supervised setting

  • Datasets: RefCOCO, RefCOCO+, RefCOCOg-g, RefCOCOg-u, ReferIt, Flickr
  • URL, size: all of the six datasets, 89.0 MB
  • With curriculum selecting: RefCOCOg-u only (see the note below)

* Since we observed a relatively clear performance increase on the RefCOCOg-u dataset in the fully supervised setting, we provide the data for this dataset after applying our SSA algorithm for curriculum selecting. Typically, using this filtered data yields an increase of approximately 1.0 point on both val-u and test-u.

Download the above annotations to a disk directory such as $/path_to_split; it will then have a directory structure similar to the following:

Each of the split directories (unsup_single_source, unsup_single_source_ssa, unsup_multi_source, unsup_multi_source_msa, full_sup_data) shares the following structure:
    ├── flickr
    │   ├── flickr_test.pth
    │   ├── flickr_train_pseudo.pth
    │   └── flickr_val.pth
    ├── gref
    │   ├── gref_train_pseudo.pth
    │   └── gref_val.pth
    ├── gref_umd
    │   ├── gref_umd_test.pth
    │   ├── gref_umd_train_pseudo.pth
    │   └── gref_umd_val.pth
    ├── referit
    │   ├── referit_test.pth
    │   ├── referit_train_pseudo.pth
    │   └── referit_val.pth
    ├── unc
    │   ├── unc_testA.pth
    │   ├── unc_testB.pth
    │   ├── unc_train_pseudo.pth
    │   └── unc_val.pth
    └── unc+
        ├── unc+_testA.pth
        ├── unc+_testB.pth
        ├── unc+_train_pseudo.pth
        └── unc+_val.pth
    In the multi-source scenario, each dataset additionally contains a train_separate directory for further research purposes:
        ├── train_separate
            ├── 1_unc+_train_pseudo_template_0_5.pth
            ├── 2_unc+_train_pseudo_relation_0_5.pth
            └── 3_unc+_train_pseudo_caption_0_5.pth

* The number at the end of each filename in the train_separate directory represents the reliability threshold defined in the paper.
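To sanity-check a downloaded split, you can load it directly with torch.load. This is only a minimal sketch: it assumes each *.pth split file stores a list-like container of annotation records, which may not match the exact record format used by the repository, and the path is a placeholder.

# Quick sanity check of a downloaded annotation / pseudo-label split.
# Assumption: each *.pth file is a pickled list-like container of records.
import torch

split = torch.load("/path_to_split/full_sup_data/unc/unc_val.pth", map_location="cpu")
print(type(split), len(split))   # expected: a list-like container of samples
print(split[0])                  # inspect one record (image file, box, expression, ...)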

Pre-trained Checkpoints

Unsupervised setting

Single-source scenario

  • Datasets: RefCOCO, RefCOCO+, RefCOCOg-g, RefCOCOg-u, ReferIt, Flickr
  • URL, size: all of the six models, 3.0 GB

* Note that the performance of the provided model on the RefCOCOg val-g split in the unsupervised single-source scenario is approximately 2.0 points higher than reported in the paper, i.e., 54.16 --> 56.46.

Multi-source scenario

  • Datasets: RefCOCO, RefCOCO+, RefCOCOg-g, RefCOCOg-u, ReferIt, Flickr
  • URL, size: all of the six models, 3.0 GB

Fully supervised setting

  • Datasets: RefCOCO, RefCOCO+, RefCOCOg-g, RefCOCOg-u, ReferIt, Flickr
  • Separate: one model per dataset
  • URL, size: all of the six models, 3.0 GB

* Note that the performance of the provided model on the RefCOCO+ dataset in the fully supervised setting is approximately 2.0 points higher than reported in the paper, i.e., (69.55, 77.33, 57.62) --> (71.08, 79.17, 59.40).
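Before running evaluation, it can be useful to inspect a downloaded checkpoint. The snippet below is a minimal sketch: the key names such as 'model' are assumptions about a typical torch.save'd training checkpoint, and the path is a placeholder.

# Inspect a downloaded checkpoint before evaluation (key names are assumptions).
import torch

ckpt = torch.load("/path_to_output/output_v01/unc/best_checkpoint.pth", map_location="cpu")
print(list(ckpt.keys()))                  # e.g. ['model', 'optimizer', 'epoch', ...]
state_dict = ckpt.get("model", ckpt)      # fall back to a raw state_dict
print(len(state_dict), "tensors in the model state dict")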

Training and Evaluation

You only need to change $/path_to_split, $/path_to_image_data, and $/path_to_output to your own directories to execute the following commands. The first time you run a command below, it will take some time to download the CLIP model.

  1. Training on RefCOCO with unsupervised setting.

    CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=6 --master_port 28887 --use_env train_clip_vg.py --num_workers 2 --epochs 110 --batch_size 64 --lr 0.00025  --lr_scheduler cosine --aug_crop --aug_scale --aug_translate      --imsize 224 --max_query_len 77 --dataset unc      --data_root $/path_to_image_data --split_root $/path_to_split      --output_dir $/path_to_output/output_v01/unc;
    

    Please refer to train_and_eval_script/train_and_eval_unsup.sh for training commands on other datasets.

  2. Training on RefCOCO with fully supervised setting. The only difference is an additional control flag: --sup_type full

    CUDA_VISIBLE_DEVICES=3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=5 --master_port 28887 --use_env train_clip_vg.py --num_workers 32 --epochs 120 --batch_size 64 --lr 0.00025  --lr_scheduler cosine --aug_crop --aug_scale --aug_translate    --imsize 224 --max_query_len 77  --sup_type full --dataset unc      --data_root $/path_to_image_data --split_root $/path_to_split --output_dir $/path_to_output/output_v01/unc;
    

    Please refer to train_and_eval_script/train_and_eval_full_sup.sh for training commands on other datasets.

  3. Evaluation on RefCOCO. The instructions are the same for the unsupervised and fully supervised settings.

    CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=6 --master_port 28888 --use_env eval.py --num_workers 2 --batch_size 128    --dataset unc      --imsize 224 --max_query_len 77 --data_root $/path_to_image_data --split_root $/path_to_split --eval_model $/path_to_output/output_v01/unc/best_checkpoint.pth      --eval_set val    --output_dir $/path_to_output/output_v01/unc;
    

    Please refer to train_and_eval_script/train_and_eval_unsup.sh for evaluation commands on other splits or datasets.

  4. We strongly recommend using the following scripts for training or testing on different datasets and splits, which will significantly reduce the manual workload.

    bash train_and_eval_script/train_and_eval_unsup.sh   
    bash train_and_eval_script/train_and_eval_full_sup.sh
    
  5. Curriculum reliability measurement or scoring for the pseudo-language labels:

    You only need to change eval.py to eval_for_reliability_distribution.py and rename the training pseudo-labels to test.pth in the corresponding dataset directory during evaluation:

    CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=6 --master_port 28888 --use_env eval_for_reliability_distribution.py --num_workers 2 --batch_size 128    --dataset unc      --imsize 224 --max_query_len 77 --data_root $/path_to_image_data --split_root $/path_to_split --eval_model $/path_to_output/output_v01/unc/best_checkpoint.pth      --eval_set val    --output_dir $/path_to_output/output_v01/unc;
    

    Besides, if you need to merge the pseudo train splits for further research, just run the following commands:

    python ./pseudo_label_generation_module/utils/merge_file.py $/path_to_split/unsup_multi_source/unc/train_separate unc;
    cp $/path_to_split/full_sup_data/unc/unc_val.pth $/path_to_split/unsup_multi_source/unc/train_separate/unc/unc_val.pth
    

    Then, you can construct a new pseudo-label training split.
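For reference, such a merge conceptually amounts to concatenating the per-source lists and saving them as a single training file. The sketch below is an assumption-laden illustration, not the actual merge_file.py: it assumes each file in train_separate stores a Python list of annotation records, and the paths are placeholders.

# Conceptual sketch of merging per-source pseudo-label files into one split.
# Assumption: each *.pth file stores a Python list of annotation records.
import glob
import torch

def merge_pseudo_splits(separate_dir: str, out_file: str) -> None:
    merged = []
    for path in sorted(glob.glob(f"{separate_dir}/*_train_pseudo_*.pth")):
        records = torch.load(path, map_location="cpu")
        merged.extend(records)
        print(f"{path}: {len(records)} samples")
    torch.save(merged, out_file)
    print(f"merged {len(merged)} samples -> {out_file}")

# merge_pseudo_splits("/path_to_split/unsup_multi_source/unc/train_separate",
#                     "/path_to_split/unsup_multi_source/unc/unc_train_pseudo.pth")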

Results

[Table: results on the RefCOCO, RefCOCO+, and RefCOCOg datasets]
[Table: results on the ReferIt and Flickr datasets]
Our model also has significant energy efficiency advantages.
[Table: energy-efficiency comparison]

Compared to QRNet, we updated only 7.7% of its parameters and achieved impressive training and inference speedups, up to 26.84× and 7.41×, respectively, while also obtaining competitive results.

Methods

Visualization

The figure presents the histograms of Single-source Reliability (SR) and Cross-source Reliability (CR) for pseudo-language labels in the range (0.0, 1.0] with 1000 bins, where each bin counts the number of samples. The figure illustrates that different sources exhibit distinct distributions due to the specific quality and language taxonomy of their pseudo-language labels (e.g., Fig.5-(a1)-(b2)-(c3)), while different reliability measures have varying discrimination abilities on the same source (e.g., Fig.5-(a1)-(b1)-(c1)).

Before the execution of MSA, the distributions of the pseudo-language labels and the ground-truth query labels are quite different, but after the execution of MSA, the distribution discrepancy becomes significantly smaller. This shows that MSA can effectively select pseudo-labels that are more reliable or closer to the distribution of ground-truth query labels.

Among the various types of unreliable pseudo-language labels, referring ambiguity is the most frequent, particularly in images containing multiple objects of similar classes. If future research aims to further enhance model performance, addressing this ambiguity is a critical issue.

Contacts

Email: xiaolinhui16@mails.ucas.ac.cn. Any kind of discussion is welcome!

Acknowledgement

Our model is related to CLIP, Pseudo-Q, TransVG. Thanks for their great work!

We also thank the great previous work including DETR, QRNet, M2, CLIPCap, RelTR, bottom-up-attention, ReSC, etc.

Thanks OpenAI for their awesome models.
