Date | Paper | Authors | Code | Demo | Comments |
---|---|---|---|---|---|
01.02.2024 | Visual Instruction Tuning | H. Liu, C. Li, Q. Wu, Y. J. Lee | GitHub, Project Page | Demo | |
08.02.2024 | When and why vision-language models behave like bags-of-words, and what to do about it? | M. Yuksekgonul, F. Bianchi, P. Kalluri, D. Jurafsky, J. Zou | https://github.com/mertyg/vision-language-models-are-bows | Colab | Why did they expect CLIP to take word order into account, given that CLIP is trained to match a bag of words with the corresponding image? See the word-order probe sketched below the table. |
22.02.2024 | Learning Transferable Visual Models From Natural Language Supervision | A. Radford, J.W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever | GitHub, Project Page | Colab | See also the open-source implementation of CLIP; Scaling laws for contrastive language-image learning |
29.02.2024 | Continue | | | | Fig. 2 is unclear. How do they obtain a vector for a bag of words? |
07.03.2024 | Still (sic!) continue | | | | It seems that they train using BoW, even though their inference pipeline does not reflect this. |
14.03.2024 | Sigmoid Loss for Language Image Pre-Training | X. Zhai, B. Mustafa, A. Kolesnikov, L. Beyer | HuggingFace | | A toy comparison of the softmax (CLIP) and sigmoid (SigLIP) losses is sketched below the table. |
21.03.2024 | Continue | ||||
28.03.2024 | Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning | W. Liang, Y. Zhang, Y. Kwon, S. Yeung, J. Zou | GitHub, Project Page | ||
04.04.2024 | What Makes Training Multi-modal Classification Networks Hard? | Wang, Tran, Feiszli | |||
11.04.2024 | MultiBench: Multiscale Benchmarks for Multimodal Representation Learning | Liang, Lyu, Fan, Wu, Cheng, Wu, Chen, Wu, Lee, Zhu, Salakhutdinov, Morency | GitHub, Project Page | Demos | |
16.04.2024 | Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | Tong, Liu, Zhai, Ma, LeCun, Xie | GitHub, Project Page | HuggingFace | |
23.04.2024 | Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies | Li, Xie, Cubuk | |||
30.04.2024 | Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | Lu, Peng, Cheng, Galley, Chang, Wu, Zhu, Gao | GitHub, Project Page | ||
07.05.2024 | Many-Shot In-Context Learning | Agarwal, Singh, Zhang, Bohnet, Chan, Anand, Abbas, Nova, Co-Reyes, Chu, Behbahani, Faust, Larochelle | Not Provided | ||
28.05.2024 | BABILong: a long-context needle-in-a-haystack benchmark for LLMs | Kuratov, Bulatov, Anokhin, Sorokin, Sorokin, Burtsev | GitHub | ||
04.06.2024 | Continue | ||||
11.06.2024 | 4M: Massively Multimodal Masked Modeling | Mizrahi, Bachmann, Kar, Yeo, Gao, Dehghan, Zamir | GitHub, Project Page | ||
18.06.2024 | Continue | ||||
25.06.2024 | GLaMM: Pixel Grounding Large Multimodal Model | Rasheed, Maaz, Shaji, Shaker, Khan, Cholakkal, Anwer, Xing, Yang, Khan | GitHub, Project Page | Demo | |
02.07.2024 | Code Reading Group | ||||
09.07.2024 | Knowledge Distillation | Gemma 2 (pdf), MobileLLM, Knowledge distillation, On-Policy distillation of Language Models | |||
16.07.2024 | Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities | Menon, Zemel, Vondrick | Project Page | ||
23.07.2024 | Multimodal Neurons in Artificial Neural Networks | Goh, Cammarata, Voss, Carter, Petrov, Schubert, Radford, Olah | |||
30.07.2024 | Continue + (very briefly) CLIPPO | ||||
06.08.2024 | Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think! | Hessel, Lee | |||
13.08.2024 | Graph of Thoughts and Monte Carlo Tree Search | Monte Carlo Tree Search from Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B; Graph of Thoughts; Large Language Monkeys; STaR: Self-Taught Reasoner; Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents; Bonus! DeepSeek-Prover-V1.5 | Tinygrad example of MCTS | ||
15.10.2024 | |||||
22.10.2024 | Calibrating Multimodal Learning | Ma, Zhang, Wu, Fu, Hu | |||
08.11.2024 | Towards Mamba: the S4 model and surrounding topics (HiPPO, S4 paper, Annotated S4 blog post) | Gu, Goel, Ré | GitHub | ||
15.11.2024 | Continue: HiPPO and S4 | ||||
22.11.2024 | Continue: Mamba & differences between Transformers and SSMs | |||
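
The word-order question from the 08.02.2024 discussion can be probed directly. Below is a minimal sketch, assuming the HuggingFace `transformers` CLIP checkpoint `openai/clip-vit-base-patch32` and a publicly hosted COCO image of two cats on a couch; the captions are illustrative. If CLIP really scores captions like bags of words, the original and the word-shuffled caption should receive similar scores.

```python
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a public CLIP checkpoint (assumed available on the HuggingFace Hub).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# COCO validation image of two cats lying on a couch (assumed reachable URL).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

captions = [
    "two cats lying on a pink couch",  # correct word order
    "a pink couch lying on two cats",  # same words, scrambled relations
]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image  # shape (1, 2): image-text similarity
print(logits.softmax(dim=-1))  # near-equal scores would indicate bag-of-words behaviour
```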
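
For the CLIP (22.02.2024) and SigLIP (14.03.2024) sessions, the difference between the softmax contrastive objective and the pairwise sigmoid objective is easiest to see side by side. A minimal PyTorch sketch with random tensors standing in for encoder outputs; the scale and bias values follow the papers' reported initializations but are otherwise illustrative.

```python
import torch
import torch.nn.functional as F

def clip_softmax_loss(img_emb, txt_emb, logit_scale):
    """Symmetric softmax contrastive loss over an NxN similarity matrix (CLIP)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale * img_emb @ txt_emb.t()
    labels = torch.arange(img_emb.size(0))
    # Each image must pick its own text and each text its own image.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def siglip_sigmoid_loss(img_emb, txt_emb, t, b):
    """Pairwise sigmoid loss (SigLIP): every (image, text) pair is a binary decision."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = t * img_emb @ txt_emb.t() + b
    n = img_emb.size(0)
    signs = 2 * torch.eye(n) - 1  # +1 on the diagonal (matches), -1 elsewhere
    return -F.logsigmoid(signs * logits).sum() / n

# Toy usage: random embeddings standing in for image/text encoder outputs.
img, txt = torch.randn(8, 512), torch.randn(8, 512)
print(clip_softmax_loss(img, txt, logit_scale=100.0))
print(siglip_sigmoid_loss(img, txt, t=10.0, b=-10.0))
```

The practical difference: the softmax loss normalizes over the whole batch, so the batch size enters the objective, while the sigmoid loss treats each pair as an independent binary decision, which is what lets SigLIP scale the batch size more freely.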
Datasets and benchmarks
Surveys
Representation Learning
Latent Space Structure
Fusion
- Liu et al. Visual Instruction Tuning
- Radford et al. Learning Transferable Visual Models From Natural Language Supervision
- Zhai et al. Sigmoid Loss for Language Image Pre-Training
- Nagrani et al. Attention Bottlenecks for Multimodal Fusion
- Baevski et al. data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
- Recasens et al. Zorro: the masked multimodal transformer
- Jaegle et al. Perceiver: General Perception with Iterative Attention
- Liu et al. Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space For Multi-Modal Retrieval
- Kwon et al. Masked Vision And Language Modeling For Multi-Modal Representation Learning
- Liang et al. High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning
- Girdhar et al. OMNIVORE: A Single Model for Many Visual Modalities
- Shvetsova et al. Everything at Once – Multi-modal Fusion Transformer for Video Retrieval