A curated list of awesome Multimodal studies.
Title | Venue | Date | Code | Supplement |
---|---|---|---|---|
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning | arXiv | 2024-06-18 | - | |
LOVA3: Learning to Visual Question Answering, Asking and Assessment | arXiv | 2024-05-23 | - | |
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI | arXiv | 2024-04-24 | ||
BLINK: Multimodal Large Language Models Can See but Not Perceive | arXiv | 2024-04-18 | ||
Ferret: Refer and Ground Anything Anywhere at Any Granularity (Ferret-Bench) | ICLR 2024 | 2023-10-11 | - | |
Aligning Large Multimodal Models with Factually Augmented RLHF (LLaVA-RLHF, MMHal-Bench (hallucination)) | arXiv | 2023-09-25 | ||
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | CVPR 2024 | 2023-07-30 | - | |
Title | Venue | Date | Code | Supplement |
---|---|---|---|---|
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? (LaDiC) | NAACL 2024 | 2024-04-16 | - | |
Title | Venue | Date | Code | Supplement |
---|---|---|---|---|
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions (Mira) | arXiv | 2024-07-08 | - | - |
VIMI: Grounding Video Generation through Multi-modal Instruction | arXiv | 2024-07-08 | - | - |
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation | arXiv | 2024-07-02 | - | - |
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (Lumina-T2X, Flag-DiT) (Text2Any) | | 2024-05-09 | - | - |
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text (Long Video Generation) | arXiv | 2024-03-21 | - | - |
AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks | arXiv | 2024-03-21 | - | - |
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation (FRESCO) (NTU, Ziwei Liu) | CVPR 2024 | 2024-03-19 | - | - |
Latte: Latent Diffusion Transformer for Video Generation (Latte) (NTU, Ziwei Liu) | arXiv | 2024-01-05 | - | - |
FreeInit: Bridging Initialization Gap in Video Diffusion Models (FreeInit) (NTU, Ziwei Liu) | arXiv | 2023-12-12 | - | - |
VideoBooth: Diffusion-based Video Generation with Image Prompts (VideoBooth) (NTU, Ziwei Liu) | arXiv | 2023-12-01 | - | - |
VBench: Comprehensive Benchmark Suite for Video Generative Models [Benchmark] (VBench) (NTU, Ziwei Liu) | CVPR 2024 | 2023-11-29 | - | - |
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets (SVD) | arXiv | 2023-11-25 | - | - |
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction (NTU, Ziwei Liu) | ICLR 2024 | 2023-10-31 | - | - |
FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling (FreeNoise) (NTU, Ziwei Liu) | ICLR 2024 | 2023-10-23 | - | - |
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models (LaVie) (NTU, Ziwei Liu) | | 2023-09-26 | - | - |