The most impactful papers related to contrastive pretraining for multimodal models!

jacobmarks/awesome-clip-papers


CLIP Papers

This repository contains a comprehensive collection of the most important papers related to contrastive pretraining for vision, language, and audio. The papers are organized by category and sorted by year and month of publication.

Contrastive Language-Image Pretraining (CLIP)

The following table lists papers that are directly related to CLIP, or that extend CLIP in some way, such as by improving the training process or by changing the data filtering process. Every entry in this table uses contrastive learning as its sole primary pretraining objective, as opposed to models that combine contrastive learning with additional pretraining objectives such as masked language modeling (MLM).

| Model | Year | Month | Paper Title | Novel Development | Arxiv | Github | Open Source | License | Model Card | OpenCLIP Integration |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | 2021 | 2 | Learning Transferable Visual Models From Natural Language Supervision | Simplified Contrastive Language-Image Pretraining | arXiv | GitHub | ✔️ | License | Model Card | ✔️ |
| ALIGN | 2021 | 2 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Extend from captions to noisy alt-text to avoid expensive filtering and post-processing | arXiv | | ✔️ | | Model Card | |
| CLOOB | 2021 | 10 | CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP | Avoid saturation of InfoNCE objective | arXiv | GitHub | ✔️ | License | | |
| DeCLIP | 2021 | 10 | Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm | Data efficiency through supervision | arXiv | GitHub | ✔️ | License | | |
| FILIP | 2021 | 11 | FILIP: Fine-grained Interactive Language-Image Pre-Training | Adds token-wise maximum similarity between visual and textual features for efficient and fine-grained semantic alignment | arXiv | | ✔️ | | | |
| DeFILIP | 2022 | 3 | Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision | Combines DeCLIP and FILIP | arXiv | GitHub | ✔️ | License | | |
| PyramidCLIP | 2022 | 4 | PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining | Relax assumption that image and metadata are in one-to-one correspondence | arXiv | | | | | |
| KLITE | 2022 | 4 | K-LITE: Learning Transferable Visual Models with External Knowledge | Augment caption text with external knowledge | arXiv | GitHub | ✔️ | License | | |
| CyCLIP | 2022 | 5 | CyCLIP: Cyclic Contrastive Language-Image Pretraining | Formalize and optimize for geometric consistency in image and text spaces | arXiv | GitHub | ✔️ | License | | |
| FLIP | 2022 | 12 | Scaling Language-Image Pre-training via Masking | Masking images prior to encoding improves speed-accuracy trade-off for CLIP | arXiv | GitHub | ✔️ | License | | |
| OpenCLIP | 2022 | 12 | Reproducible scaling laws for contrastive language-image learning | Open-source implementation of CLIP | arXiv | GitHub | ✔️ | License | Model Card | ✔️ |
| EVA-CLIP | 2023 | 3 | EVA-CLIP: Improved Training Techniques for CLIP at Scale | Improved representation learning, optimization, and augmentation for faster training | arXiv | GitHub | ✔️ | | Model Card | ✔️ |
| SigLIP | 2023 | 3 | Sigmoid Loss for Language Image Pre-Training | Sigmoid loss allows disentangling loss from batch size | arXiv | GitHub | ✔️ | License | | ✔️ |
| CLIPA | 2023 | 5 | An Inverse Scaling Law for CLIP Training | Insight into relationship between encoder size and training input sequence lengths leads to more efficient training | arXiv | GitHub | ✔️ | License | | ✔️ |
| MetaCLIP | 2023 | 9 | Demystifying CLIP Data | Rigorous study to reveal CLIP's data curation process | arXiv | GitHub | ✔️ | License | | ✔️ |
| DFN | 2023 | 11 | Data Filtering Networks | A model trained on high-quality data can be used to filter massive online data employed to train the final CLIP model | arXiv | | ✔️ | License | Model Card | ✔️ |
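
The two loss functions that distinguish the entries above — the softmax-based InfoNCE objective introduced by CLIP and the pairwise sigmoid objective introduced by SigLIP — can be sketched in a few lines of NumPy. This is a minimal illustration, not code from any listed paper; the temperature value, bias, and embedding shapes are arbitrary assumptions.

```python
import numpy as np

def clip_contrastive_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings."""
    # L2-normalize so dot products become cosine similarities.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix
    idx = np.arange(len(img))                   # true pairs sit on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def siglip_loss(img, txt, temperature=0.07, bias=0.0):
    """Pairwise sigmoid loss: each (image, text) pair is an independent
    binary classification problem, so the loss does not depend on a
    softmax normalization over the whole batch."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature + bias
    z = 2.0 * np.eye(len(img)) - 1.0            # +1 on diagonal, -1 elsewhere
    return np.mean(np.log1p(np.exp(-z * logits)))
```

In practice both losses are computed on encoder outputs and minimized with a learnable temperature (and, for SigLIP, a learnable bias); the sketch above omits those training details.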

CLIP + Additional Pretraining Objectives

Models that extend CLIP by adding additional pretraining objectives, such as masked language modeling (MLM).

The acronyms used in the table below are as follows:

  • DR: Dataset Reinforcement
  • H-ITC: Hierarchical Image-Text Contrastive
  • ISS: Image Self-Supervision
  • ITM: Image-Text Matching
  • LM: Language Modeling
  • MIM: Masked Image Modeling
  • MLM: Masked Language Modeling
  • MMM: Masked Multimodal Modeling
  • MSD: Masked Self-Distillation

All models in this table also use CLIP-style contrastive learning as a pretraining objective.
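
These additional objectives are typically optimized jointly with the contrastive loss as a weighted sum. As a hedged sketch, here is what one auxiliary term, masked language modeling, and the combination step might look like in NumPy; the shapes and weights are illustrative assumptions, not taken from any listed paper.

```python
import numpy as np

def mlm_loss(logits, targets, mask):
    """Cross-entropy computed over masked token positions only.
    logits: (T, V) predicted token scores, targets: (T,) true token ids,
    mask: (T,) boolean, True where a token was masked out."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets][mask].mean()

def total_pretraining_loss(l_contrastive, l_mlm, w_mlm=1.0):
    # Multiple objectives are typically combined as a weighted sum.
    return l_contrastive + w_mlm * l_mlm
```

Other auxiliary terms (ITM, MIM, and so on) slot into the same weighted sum in the same way.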

| Model | Year | Month | Paper Title | Pretraining Techniques | Arxiv | Github | Open Source | License |
|---|---|---|---|---|---|---|---|---|
| SLIP | 2021 | 12 | SLIP: Self-supervision meets Language-Image Pre-training | ISS | arXiv | GitHub | ✔️ | License |
| FLAVA | 2021 | 12 | FLAVA: A Foundational Language And Vision Alignment Model | ITM+MMM+MIM+MLM | arXiv | GitHub | ✔️ | License |
| BLIP | 2022 | 1 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | ITM+LM | arXiv | GitHub | ✔️ | License |
| MaskCLIP | 2022 | 8 | MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining | MLM+MSD | arXiv | GitHub | | |
| ViCHA | 2022 | 8 | Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | H-ITC+ITM+MMM+MIM+MLM | arXiv | GitHub | ✔️ | License |
| RILS | 2023 | 1 | RILS: Masked Visual Reconstruction in Language Semantic Space | MIM | arXiv | GitHub | | |
| MobileCLIP | 2023 | 11 | MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training | MMR | arXiv | | ✔️ | License |

Contrastive Pretraining for Other Modalities

This section collects papers on contrastive pretraining for other modalities, such as audio, video, and 3D data.

Audio

Models that use CLIP-style contrastive learning as a pretraining objective for audio.

| Model | Year | Month | Paper Title | Modalities | Arxiv | Github | Open Source | License |
|---|---|---|---|---|---|---|---|---|
| AudioCLIP | 2021 | 6 | AudioCLIP: Extending CLIP to Image, Text and Audio | audio+image+text | arXiv | GitHub | ✔️ | License |
| WAV2CLIP | 2021 | 10 | Wav2CLIP: Learning Robust Audio Representations from CLIP | audio+image+text | arXiv | GitHub | ✔️ | License |
| SpeechCLIP | 2022 | 10 | SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model | speech+image+text | arXiv | GitHub | ✔️ | License |
| CLAP | 2023 | 4 | Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | audio+text | arXiv | GitHub | ✔️ | License |
| CLVP | 2023 | 5 | Better speech synthesis through scaling | speech+text | arXiv | GitHub | ✔️ | License |

Video

Models that extend CLIP to the video domain.

| Model | Year | Month | Paper Title | Arxiv | Github | Open Source | License |
|---|---|---|---|---|---|---|---|
| CLIP4Clip | 2021 | 4 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | arXiv | GitHub | ✔️ | License |
| VideoCLIP | 2021 | 9 | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | arXiv | GitHub | ✔️ | License |
| X-CLIP | 2022 | 7 | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | arXiv | GitHub | ✔️ | License |

3D

Models that extend CLIP to the 3D domain.

| Model | Year | Month | Paper Title | Modalities | Arxiv | Github | Open Source | License |
|---|---|---|---|---|---|---|---|---|
| PointCLIP | 2021 | 12 | PointCLIP: Point Cloud Understanding by CLIP | point cloud + text | arXiv | GitHub | ✔️ | |
| CLIP2Point | 2022 | 10 | CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training | point cloud + text | arXiv | GitHub | ✔️ | |
| PointCLIPV2 | 2022 | 11 | PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning | point cloud + text | arXiv | GitHub | | |
| CLIP2 | 2023 | 3 | CLIP2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data | point cloud + image + text | arXiv | | | |

👋 Contributing

Contributions are welcome! Submit a pull request to add a new paper or to update an existing one. Please follow the format of the existing entries in the tables 😄
