Code for our ACM MM 2023 workshop paper: Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model (SW-CLIP).
Comparison of SW-CLIP and CLIP on zero-shot classification on ImageNet1K. The image encoder backbone is RN50, and the models are pre-trained on CC3M for 30 epochs.
| Method | Text (fraction kept) | Max text length | Relative training time | ImageNet1K accuracy |
|---|---|---|---|---|
| CLIP | original, 100.00% | 32 | 1.00x | 16.9% |
| SW-CLIP | SW, 42.30% | 16 | 0.86x | 17.2% |
We use the CC3M dataset for training. Generate the subsampling file with `python3 src/data/subsampling.py`.
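Subsampling of frequent words follows the word2vec idea: each word is kept with a probability that shrinks as its relative corpus frequency grows, so common function words are dropped often while rare content words survive. The following is a minimal, hypothetical sketch of that idea, not the repo's actual `subsampling.py` (the threshold `t = 1e-5`, the function names, and the toy corpus are assumptions):

```python
import math
import random
from collections import Counter

def keep_probability(freq: float, t: float = 1e-5) -> float:
    """Word2vec-style keep probability for a word with relative frequency `freq`."""
    return min(1.0, math.sqrt(t / freq)) if freq > 0 else 1.0

def subsample_caption(tokens, rel_freq, t=1e-5, rng=random):
    """Stochastically drop frequent words; rare words are almost always kept."""
    return [w for w in tokens if rng.random() < keep_probability(rel_freq.get(w, 0.0), t)]

# Build relative frequencies from a toy corpus of captions.
corpus = [["a", "dog", "on", "a", "beach"], ["a", "cat", "on", "a", "mat"]]
counts = Counter(w for cap in corpus for w in cap)
total = sum(counts.values())
rel_freq = {w: c / total for w, c in counts.items()}
```

With realistic thresholds, very frequent words like "a" get a tiny keep probability, which is what shortens the captions (32 down to 16 tokens in the table above).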
Train our model on SLURM: sbatch clip_run_experiment_cluster_das_train.sh
torchrun --nproc_per_node=8 --master_port=25678 training/main.py \
--save-frequency=1 \
--report-to=tensorboard \
--train-data="./path/to/cc3m_train.csv" \
--imagenet-val="./path/to/imagenet_validation" \
--csv-img-key=image \
--csv-caption-key=caption \
--model=RN50 \
--batch-size=256 \
--lr=1e-3 \
--wd=0.1 \
--epochs=30 \
--workers=8 \
--seed=42 \
--local-loss \
--gather-with-grad \
--force-custom-text \
--subsample \
--name pretrain_cc3m_train_RN50_subsample
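The sbatch script itself is not reproduced here; a minimal wrapper might look like the following sketch (the partition name, resource requests, and environment name are placeholders for your cluster, not the repo's actual script):

```shell
#!/bin/bash
#SBATCH --job-name=swclip_pretrain
#SBATCH --partition=gpu          # placeholder: your cluster's GPU partition
#SBATCH --gres=gpu:8             # 8 GPUs to match --nproc_per_node=8
#SBATCH --cpus-per-task=64
#SBATCH --time=48:00:00

# Activate your environment (assumption: a conda env named "swclip").
source activate swclip

# Launch the same torchrun command shown above, with the remaining flags unchanged.
torchrun --nproc_per_node=8 --master_port=25678 training/main.py \
    --train-data="./path/to/cc3m_train.csv" \
    # ... remaining flags as in the torchrun command above
```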
Fine-tune our model without subsampling frequent words on SLURM: sbatch clip_run_experiment_cluster_das_finetune.sh
torchrun --nproc_per_node=8 --master_port=25698 training/main.py \
--save-frequency=1 \
--report-to=tensorboard \
--zeroshot-frequency=1 \
--train-data="../path/to/cc3m/cc3m_train.csv" \
--imagenet-val="./path/to/imagenet_validation" \
--csv-img-key=image \
--csv-caption-key=caption \
--model=RN50 \
--pretrained="./path/to/checkpoints/epoch_K.pt" \
--batch-size=768 \
--warmup=125 \
--lr=1e-3 \
--wd=0.1 \
--epochs=1 \
--workers=8 \
--seed=42 \
--local-loss \
--gather-with-grad \
--force-custom-text \
--name pretrain_cc3m_train_RN50_subsample_finetune
Test our model on SLURM:
sbatch ml_run_with_slurm_das_test.sh
We upload our pre-trained models here. You can download them and put them in the model directory.
Test the model with: sbatch clip_run_experiment_cluster_das_test.sh
python -u training/main.py \
--report-to tensorboard \
--imagenet-val="./path/to/imagenet_validation/" \
--csv-img-key=image \
--csv-caption-key=caption \
--batch-size=256 \
--workers=6 \
--model=RN50 \
--pretrained="./path/to/checkpoints/epoch_K.pt" \
--seed=42 \
--local-loss \
--gather-with-grad \
--force-custom-text
@inproceedings{swclip2023liang,
  author    = {Liang, Mingliang and Larson, Martha},
  title     = {Subsampling of Frequent Words in Text for Pre-Training a Vision-Language Model},
  year      = {2023},
  publisher = {Association for Computing Machinery},
  booktitle = {Proceedings of the 1st Workshop on Large Generative Models Meet Multimodal Applications},
}
Our code is based on open_clip.