Awesome-Multimodal-LLM

A curated list of papers related to multi-modal machine learning, especially multi-modal large language models (LLMs).

Table of Contents

Tutorials
Datasets
Research Papers
  Survey Papers
  Core Areas
    Multimodal Understanding
    Vision-Centric Understanding
    Embodied-Centric Understanding
    Domain-Specific Models
    Multimodal Evaluation

Tutorials

Recent Advances in Vision Foundation Models, CVPR 2023 Workshop [pdf]

Datasets

M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning, arxiv 2023 [data]

LLaVA Instruction 150K, arxiv 2023 [data] (see the loading sketch below)

Youku-mPLUG 10M, arxiv 2023 [data]

MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning, ACL 2023 [data]
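
These instruction-tuning sets are typically distributed as JSON annotation files that pair an image reference with a multi-turn conversation. The sketch below shows one way to pull the LLaVA Instruction 150K annotations and inspect a record; the Hugging Face repo id, file name, and record fields are assumptions based on the common LLaVA release format, so treat the [data] link above as the authoritative source.

```python
# Minimal sketch (not from the listed papers): download the LLaVA Instruction 150K
# annotation file and print one conversation. The repo id, file name, and record
# fields ("image", "conversations") are assumptions; verify against the [data] link.
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="liuhaotian/LLaVA-Instruct-150K",  # assumed dataset repo id
    filename="llava_instruct_150k.json",       # assumed annotation file name
    repo_type="dataset",
)

with open(path, "r", encoding="utf-8") as f:
    records = json.load(f)

sample = records[0]
print(sample["image"])  # image file name; the images themselves are fetched separately
for turn in sample["conversations"]:
    print(f'{turn["from"]}: {turn["value"][:80]}')
```

Note that only the text annotations live in such a file; the referenced images must be downloaded separately before any visual instruction tuning.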

Research Papers

Survey Papers

A Survey on Multimodal Large Language Models, arxiv 2023 [project page]

Vision-Language Models for Vision Tasks: A Survey, arxiv 2023

Core Areas

Multimodal Understanding

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic, arxiv 2023 [code]

PandaGPT: One Model To Instruction-Follow Them All, arxiv 2023 [code]

MIMIC-IT: Multi-Modal In-Context Instruction Tuning, arxiv 2023 [code]

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding, arxiv 2023 [code]

MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models, arxiv 2023

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality, arxiv 2023 [code]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, arxiv 2023 [code]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, ICML 2023 [code]

Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models, arxiv 2023 [code]

MultiModal-GPT: A Vision and Language Model for Dialogue with Humans, arxiv 2023 [code]

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model, arxiv 2023 [code]

Language Is Not All You Need: Aligning Perception with Language Models, arxiv 2023 [code]

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities, arxiv 2023 [code]

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages, arxiv 2023 [code]

Visual Instruction Tuning, arxiv 2023 [code]

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models, arxiv 2023 [code]

PaLI: A Jointly-Scaled Multilingual Language-Image Model, ICLR 2023 [blog]

Grounding Language Models to Images for Multimodal Inputs and Outputs, ICML 2023 [code]

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, ICML 2022 [code]

Flamingo: a Visual Language Model for Few-Shot Learning, NeurIPS 2022

Vision-Centric Understanding

LISA: Reasoning Segmentation via Large Language Model, arxiv 2023 [code]

Contextual Object Detection with Multimodal Large Language Models, arxiv 2023 [code]

KOSMOS-2: Grounding Multimodal Large Language Models to the World, arxiv 2023 [code]

Fast Segment Anything, arxiv 2023 [code]

Multi-Modal Classifiers for Open-Vocabulary Object Detection, ICML 2023 [code]

Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT, arxiv 2023

Images Speak in Images: A Generalist Painter for In-Context Visual Learning, arxiv 2023 [code]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding, arxiv 2023 [code]

SegGPT: Segmenting Everything In Context, arxiv 2023 [code]

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks, arxiv 2023 [code]

Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching, arxiv 2023

Personalize Segment Anything Model with One Shot, arxiv 2023 [code]

Segment Anything, arxiv 2023 [code]

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks, CVPR 2023 [code]

A Generalist Framework for Panoptic Segmentation of Images and Videos, arxiv 2022

A Unified Sequence Interface for Vision Tasks, NeurIPS 2022 [code]

Pix2seq: A Language Modeling Framework for Object Detection, ICLR 2022 [code]

Embodied-Centric Understanding

Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition, arxiv 2023 [code]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, arxiv 2023 [project page]

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models, arxiv 2023 [project page]

MotionGPT: Human Motion as a Foreign Language, arxiv 2023 [code]

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model, arxiv 2023 [code]

PaLM-E: An Embodied Multimodal Language Model, arxiv 2023 [blog]

Generative Agents: Interactive Simulacra of Human Behavior, arxiv 2023

Vision-Language Models as Success Detectors, arxiv 2023

TidyBot: Personalized Robot Assistance with Large Language Models, arxiv 2023 [code]

LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action, CoRL 2022 [blog] [code]

Domain-Specific Models

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day, arxiv 2023 [code]

Multimodal Evaluation

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark, arxiv 2023 [code]

LOVM: Language-Only Vision Model Selection, arxiv 2023 [code]

Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation, arxiv 2023 [project page]