The most impactful papers related to contrastive pretraining for multimodal models!

jacobmarks/awesome-clip-papers


CLIP Papers

This repository contains a comprehensive collection of the most important papers related to contrastive pretraining for vision, language, and audio. The papers are organized by category and sorted by year and month of publication.

Contrastive Language-Image Pretraining (CLIP)

The following table lists papers that are directly related to CLIP, or that extend CLIP in some way, such as by improving the training process or by changing the data filtering process. Every entry in this table uses contrastive learning as its sole primary pretraining objective, as opposed to models that combine contrastive learning with additional pretraining objectives such as masked language modeling (MLM).

| Model | Year | Month | Paper Title | Novel Development | Arxiv | Github | Open Source | License | Model Card | OpenCLIP Integration |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | 2021 | 2 | Learning Transferable Visual Models From Natural Language Supervision | Simplified Contrastive Language-Image Pretraining | arXiv | GitHub | ✔️ | License | Model Card | ✔️ |
| ALIGN | 2021 | 2 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Extend from captions to noisy alt-text to avoid expensive filtering and post-processing | arXiv | | ✔️ | | Model Card | |
| CLOOB | 2021 | 10 | CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP | Avoid saturation of InfoNCE objective | arXiv | GitHub | ✔️ | License | | |
| DeCLIP | 2021 | 10 | Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm | Data efficiency through supervision | arXiv | GitHub | ✔️ | License | | |
| FILIP | 2021 | 11 | FILIP: Fine-grained Interactive Language-Image Pre-Training | Adds token-wise maximum similarity between visual and textual features for efficient and fine-grained semantic alignment | arXiv | | ✔️ | | | |
| DeFILIP | 2022 | 3 | Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision | Combines DeCLIP and FILIP | arXiv | GitHub | ✔️ | License | | |
| PyramidCLIP | 2022 | 4 | PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining | Relax assumption that image and metadata are in one-to-one correspondence | arXiv | | | | | |
| KLITE | 2022 | 4 | K-LITE: Learning Transferable Visual Models with External Knowledge | Augment caption text with external knowledge | arXiv | GitHub | ✔️ | License | | |
| CyCLIP | 2022 | 5 | CyCLIP: Cyclic Contrastive Language-Image Pretraining | Formalize and optimize for geometric consistency in image and text spaces | arXiv | GitHub | ✔️ | License | | |
| FLIP | 2022 | 12 | Scaling Language-Image Pre-training via Masking | Masking images prior to encoding improves speed-accuracy trade-off for CLIP | arXiv | GitHub | ✔️ | License | | |
| OpenCLIP | 2022 | 12 | Reproducible scaling laws for contrastive language-image learning | Open-source implementation of CLIP | arXiv | GitHub | ✔️ | License | Model Card | ✔️ |
| EVA-CLIP | 2023 | 3 | EVA-CLIP: Improved Training Techniques for CLIP at Scale | Improved representation learning, optimization, and augmentation for faster training | arXiv | GitHub | ✔️ | | Model Card | ✔️ |
| SigLIP | 2023 | 3 | Sigmoid Loss for Language Image Pre-Training | Sigmoid loss allows disentangling loss from batch size | arXiv | GitHub | ✔️ | License | | ✔️ |
| CLIPA | 2023 | 5 | An Inverse Scaling Law for CLIP Training | Insight into relationship between encoder size and training input sequence lengths leads to more efficient training | arXiv | GitHub | ✔️ | License | | ✔️ |
| MetaCLIP | 2023 | 9 | Demystifying CLIP Data | Rigorous study to reveal CLIP's data curation process | arXiv | GitHub | ✔️ | License | | ✔️ |
| DFN | 2023 | 11 | Data Filtering Networks | A model trained on high-quality data can be used to filter massive online data employed to train the final CLIP model | arXiv | | ✔️ | License | Model Card | ✔️ |
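
The two loss functions that distinguish the entries above — the softmax-based InfoNCE objective introduced by CLIP and the pairwise sigmoid objective introduced by SigLIP — can be sketched in a few lines of NumPy. This is a minimal illustration, not code from any listed paper; the temperature value, bias, and embedding shapes are arbitrary assumptions.

```python
import numpy as np

def clip_contrastive_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings."""
    # L2-normalize so dot products become cosine similarities.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix
    idx = np.arange(len(img))                   # true pairs sit on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def siglip_loss(img, txt, temperature=0.07, bias=0.0):
    """Pairwise sigmoid loss: each (image, text) pair is an independent
    binary classification problem, so the loss does not depend on a
    softmax normalization over the whole batch."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature + bias
    z = 2.0 * np.eye(len(img)) - 1.0            # +1 on diagonal, -1 elsewhere
    return np.mean(np.log1p(np.exp(-z * logits)))
```

In practice both losses are computed on encoder outputs and minimized with a learnable temperature (and, for SigLIP, a learnable bias); the sketch above omits those training details.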

CLIP + Additional Pretraining Objectives

Models that extend CLIP by adding additional pretraining objectives, such as masked language modeling (MLM).

The acronyms used in the table below are as follows:

  • DR: Dataset Reinforcement
  • H-ITC: Hierarchical Image-Text Contrastive
  • ISS: Image Self-Supervision
  • ITM: Image-Text Matching
  • LM: Language Modeling
  • MIM: Masked Image Modeling
  • MLM: Masked Language Modeling
  • MMM: Masked Multimodal Modeling
  • MSD: Masked Self-Distillation

All models in this table also use CLIP-style contrastive learning as a pretraining objective.
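
These additional objectives are typically optimized jointly with the contrastive loss as a weighted sum. As a hedged sketch, here is what one auxiliary term, masked language modeling, and the combination step might look like in NumPy; the shapes and weights are illustrative assumptions, not taken from any listed paper.

```python
import numpy as np

def mlm_loss(logits, targets, mask):
    """Cross-entropy computed over masked token positions only.
    logits: (T, V) predicted token scores, targets: (T,) true token ids,
    mask: (T,) boolean, True where a token was masked out."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets][mask].mean()

def total_pretraining_loss(l_contrastive, l_mlm, w_mlm=1.0):
    # Multiple objectives are typically combined as a weighted sum.
    return l_contrastive + w_mlm * l_mlm
```

Other auxiliary terms (ITM, MIM, and so on) slot into the same weighted sum in the same way.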

| Model | Year | Month | Paper Title | Pretraining Techniques | Arxiv | Github | Open Source | License |
|---|---|---|---|---|---|---|---|---|
| SLIP | 2021 | 12 | SLIP: Self-supervision meets Language-Image Pre-training | ISS | arXiv | GitHub | ✔️ | License |
| FLAVA | 2021 | 12 | FLAVA: A Foundational Language And Vision Alignment Model | ITM+MMM+MIM+MLM | arXiv | GitHub | ✔️ | License |
| BLIP | 2022 | 1 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | ITM+LM | arXiv | GitHub | ✔️ | License |
| MaskCLIP | 2022 | 8 | MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining | MLM+MSD | arXiv | GitHub | | |
| ViCHA | 2022 | 8 | Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | H-ITC+ITM+MMM+MIM+MLM | arXiv | GitHub | ✔️ | License |
| RILS | 2023 | 1 | RILS: Masked Visual Reconstruction in Language Semantic Space | MIM | arXiv | GitHub | | |
| MobileCLIP | 2023 | 11 | MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training | MMR | arXiv | | ✔️ | License |

Contrastive Pretraining for Other Modalities

This section collects papers on contrastive pretraining for other modalities, such as audio, video, and 3D data.

Audio

Models that use CLIP-style contrastive learning as a pretraining objective for audio.

| Model | Year | Month | Paper Title | Modalities | Arxiv | Github | Open Source | License |
|---|---|---|---|---|---|---|---|---|
| AudioCLIP | 2021 | 6 | AudioCLIP: Extending CLIP to Image, Text and Audio | audio+image+text | arXiv | GitHub | ✔️ | License |
| WAV2CLIP | 2021 | 10 | Wav2CLIP: Learning Robust Audio Representations from CLIP | audio+image+text | arXiv | GitHub | ✔️ | License |
| SpeechCLIP | 2022 | 10 | SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model | speech+image+text | arXiv | GitHub | ✔️ | License |
| CLAP | 2023 | 4 | Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | audio+text | arXiv | GitHub | ✔️ | License |
| CLVP | 2023 | 5 | Better speech synthesis through scaling | speech+text | arXiv | GitHub | ✔️ | License |

Video

Models that extend CLIP to the video domain.

| Model | Year | Month | Paper Title | Arxiv | Github | Open Source | License |
|---|---|---|---|---|---|---|---|
| CLIP4Clip | 2021 | 4 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | arXiv | GitHub | ✔️ | License |
| VideoCLIP | 2021 | 9 | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | arXiv | GitHub | ✔️ | License |
| X-CLIP | 2022 | 7 | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | arXiv | GitHub | ✔️ | License |

3D

Models that extend CLIP to the 3D domain.

| Model | Year | Month | Paper Title | Modalities | Arxiv | Github | Open Source | License |
|---|---|---|---|---|---|---|---|---|
| PointCLIP | 2021 | 12 | PointCLIP: Point Cloud Understanding by CLIP | point cloud + text | arXiv | GitHub | ✔️ | |
| CLIP2Point | 2022 | 10 | CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training | point cloud + text | arXiv | GitHub | ✔️ | |
| PointCLIPV2 | 2022 | 11 | PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning | point cloud + text | arXiv | GitHub | | |
| CLIP2 | 2023 | 3 | CLIP2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data | point cloud + image + text | arXiv | | | |

👋 Contributing

Contributions are welcome! Submit a pull request to add a new paper or to update an existing one. Please follow the format of the existing entries in the tables 😄
