Code for our ACM MM 2023 workshop paper: Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model (SW-CLIP).
Comparison of SW-CLIP and CLIP on zero-shot classification on ImageNet1K. The image encoder backbone is RN50, and the models are pre-trained on CC3M for 30 epochs.
| Method | Text (fraction kept) | Max text length | Relative training time | ImageNet1K accuracy |
|---|---|---|---|---|
| CLIP | original, 100.00% | 32 | 1.00x | 16.9% |
| SW-CLIP | SW, 42.30% | 16 | 0.86x | 17.2% |
We use the CC3M dataset for training. Generate the subsampling file with `python3 src/data/subsampling.py`.
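Subsampling of frequent words follows the word2vec idea: each word is kept with a probability that shrinks as its relative corpus frequency grows, so common function words are dropped often while rare content words survive. The following is a minimal, hypothetical sketch of that idea, not the repo's actual `subsampling.py` (the threshold `t = 1e-5`, the function names, and the toy corpus are assumptions):

```python
import math
import random
from collections import Counter

def keep_probability(freq: float, t: float = 1e-5) -> float:
    """Word2vec-style keep probability for a word with relative frequency `freq`."""
    return min(1.0, math.sqrt(t / freq)) if freq > 0 else 1.0

def subsample_caption(tokens, rel_freq, t=1e-5, rng=random):
    """Stochastically drop frequent words; rare words are almost always kept."""
    return [w for w in tokens if rng.random() < keep_probability(rel_freq.get(w, 0.0), t)]

# Build relative frequencies from a toy corpus of captions.
corpus = [["a", "dog", "on", "a", "beach"], ["a", "cat", "on", "a", "mat"]]
counts = Counter(w for cap in corpus for w in cap)
total = sum(counts.values())
rel_freq = {w: c / total for w, c in counts.items()}
```

With realistic thresholds, very frequent words like "a" get a tiny keep probability, which is what shortens the captions (32 down to 16 tokens in the table above).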
Train our model on SLURM: sbatch clip_run_experiment_cluster_das_train.sh
torchrun --nproc_per_node=8 --master_port=25678 training/main.py \
--save-frequency=1 \
--report-to=tensorboard \
--train-data="./path/to/cc3m_train.csv" \
--imagenet-val="./path/to/imagenet_validation" \
--csv-img-key=image \
--csv-caption-key=caption \
--model=RN50 \
--batch-size=256 \
--lr=1e-3 \
--wd=0.1 \
--epochs=30 \
--workers=8 \
--seed=42 \
--local-loss \
--gather-with-grad \
--force-custom-text \
--subsample \
--name pretrain_cc3m_train_RN50_subsample
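The sbatch script itself is not reproduced here; a minimal wrapper might look like the following sketch (the partition name, resource requests, and environment name are placeholders for your cluster, not the repo's actual script):

```shell
#!/bin/bash
#SBATCH --job-name=swclip_pretrain
#SBATCH --partition=gpu          # placeholder: your cluster's GPU partition
#SBATCH --gres=gpu:8             # 8 GPUs to match --nproc_per_node=8
#SBATCH --cpus-per-task=64
#SBATCH --time=48:00:00

# Activate your environment (assumption: a conda env named "swclip").
source activate swclip

# Launch the same torchrun command shown above, with the remaining flags unchanged.
torchrun --nproc_per_node=8 --master_port=25678 training/main.py \
    --train-data="./path/to/cc3m_train.csv" \
    # ... remaining flags as in the torchrun command above
```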
Fine-tune our model without subsampling frequent words on SLURM: sbatch clip_run_experiment_cluster_das_finetune.sh
torchrun --nproc_per_node=8 --master_port=25698 training/main.py \
--save-frequency=1 \
--report-to=tensorboard \
--zeroshot-frequency=1 \
--train-data="../path/to/cc3m/cc3m_train.csv" \
--imagenet-val="./path/to/imagenet_validation" \
--csv-img-key=image \
--csv-caption-key=caption \
--model=RN50 \
--pretrained="./path/to/checkpoints/epoch_K.pt" \
--batch-size=768 \
--warmup=125 \
--lr=1e-3 \
--wd=0.1 \
--epochs=1 \
--workers=8 \
--seed=42 \
--local-loss \
--gather-with-grad \
--force-custom-text \
--name pretrain_cc3m_train_RN50_subsample_finetune
Test our model on SLURM:
sbatch ml_run_with_slurm_das_test.sh
We upload our pre-trained models here. You can download them and put them in the model directory.
Test the model with: sbatch clip_run_experiment_cluster_das_test.sh
python -u training/main.py \
--report-to tensorboard \
--imagenet-val="./path/to/imagenet_validation/" \
--csv-img-key=image \
--csv-caption-key=caption \
--batch-size=256 \
--workers=6 \
--model=RN50 \
--pretrained="./path/to/checkpoints/epoch_K.pt" \
--seed=42 \
--local-loss \
--gather-with-grad \
--force-custom-text
@inproceedings{swclip2023liang,
  author    = {Liang, Mingliang and Larson, Martha},
  title     = {Subsampling of Frequent Words in Text for Pre-Training a Vision-Language Model},
  year      = {2023},
  publisher = {Association for Computing Machinery},
  booktitle = {Proceedings of the 1st Workshop on Large Generative Models Meet Multimodal Applications},
}
Our code is based on open_clip.