Date | Paper | Authors | Code | Demo | Comments |
---|---|---|---|---|---|
01.02.2024 | Visual Instruction Tuning | H. Liu, C. Li, Q. Wu, Y. J. Lee | GitHub, Project Page | Demo | |
08.02.2024 | When and why vision-language models behave like bags-of-words, and what to do about it? | M. Yuksekgonul, F. Bianchi, P. Kalluri, D. Jurafsky, J. Zou | https://github.com/mertyg/vision-language-models-are-bows | Colab | Why did they expect CLIP to take word order into account, given that CLIP is trained to match a bag of words with the corresponding image? See the word-order probe sketched below the table. |
22.02.2024 | Learning Transferable Visual Models From Natural Language Supervision | A. Radford, J.W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever | GitHub, Project Page | Colab | See also the open-source implementation of CLIP; Scaling laws for contrastive language-image learning |
29.02.2024 | Continue | | | | Fig. 2 is unclear. How do they obtain a vector for a bag of words? |
07.03.2024 | Still (sic!) continue | | | | It seems that they train using BoW, even though their inference pipeline does not reflect this. |
14.03.2024 | Sigmoid Loss for Language Image Pre-Training | X. Zhai, B. Mustafa, A. Kolesnikov, L. Beyer | HuggingFace | | A toy comparison of the softmax (CLIP) and sigmoid (SigLIP) losses is sketched below the table. |
21.03.2024 | Continue | ||||
28.03.2024 | Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning | W. Liang, Y. Zhang, Y. Kwon, S. Yeung, J. Zou | GitHub, Project Page | ||
04.04.2024 | What Makes Training Multi-modal Classification Networks Hard? | Wang, Tran, Feiszli | |||
11.04.2024 | MultiBench: Multiscale Benchmarks for Multimodal Representation Learning | Liang, Lyu, Fan, Wu, Cheng, Wu, Chen, Wu, Lee, Zhu, Salakhutdinov, Morency | GitHub, Project Page | Demos | |
16.04.2024 | Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | Tong, Liu, Zhai, Ma, LeCun, Xie | GitHub, Project Page | HuggingFace | |
23.04.2024 | Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies | Li, Xie, Cubuk | |||
30.04.2024 | Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | Lu, Peng, Cheng, Galley, Chang, Wu, Zhu, Gao | GitHub, Project Page | ||
07.05.2024 | Many-Shot In-Context Learning | Agarwal, Singh, Zhang, Bohnet, Chan, Anand, Abbas, Nova, Co-Reyes, Chu, Behbahani, Faust, Larochelle | Not Provided | ||
28.05.2024 | BABILong: a long-context needle-in-a-haystack benchmark for LLMs | Kuratov, Bulatov, Anokhin, Sorokin, Sorokin, Burtsev | GitHub | ||
04.06.2024 | Continue | ||||
11.06.2024 | 4M: Massively Multimodal Masked Modeling | Mizrahi, Bachmann, Kar, Yeo, Gao, Dehghan, Zamir | GitHub, Project Page | ||
18.06.2024 | Continue | ||||
25.06.2024 | GLaMM: Pixel Grounding Large Multimodal Model | Rasheed, Maaz, Shaji, Shaker, Khan, Cholakkal, Anwer, Xing, Yang, Khan | GitHub, Project Page | Demo | |
02.07.2024 | Code Reading Group | ||||
09.07.2024 | Knowledge Distillation | Gemma 2 (pdf), MobileLLM, Knowledge distillation, On-Policy distillation of Language Models | |||
16.07.2024 | Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities | Menon, Zemel, Vondrick | Project Page | ||
23.07.2024 | Multimodal Neurons in Artificial Neural Networks | Goh, Cammarata, Voss, Carter, Petrov, Schubert, Radford, Olah | |||
30.07.2024 | Continue + (very briefly) CLIPPO | ||||
06.08.2024 | Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think! | Hessel, Lee | |||
13.08.2024 | Graph of Thoughts and Monte Carlo Tree Search | Monte Carlo Tree Search from Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B; Graph of Thoughts; Large Language Monkeys; STaR: Self-Taught Reasoner; Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents; Bonus! DeepSeek-Prover-V1.5 | Tinygrad example of MCTS | ||
15.10.2024 | |||||
22.10.2024 | Calibrating Multimodal Learning | Ma, Zhang, Wu, Fu, Hu | |||
08.11.2024 | Towards Mamba: the S4 model and surrounding topics (HiPPO, S4 paper, Annotated S4 blog post) | Gu, Goel, Ré | GitHub | ||
15.11.2024 | Continue: HiPPO and S4 | ||||
22.11.2024 | Continue: Mamba & differences between Transformers and SSMs | |||
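
The word-order question from the 08.02.2024 discussion can be probed directly. Below is a minimal sketch, assuming the HuggingFace `transformers` CLIP checkpoint `openai/clip-vit-base-patch32` and a publicly hosted COCO image of two cats on a couch; the captions are illustrative. If CLIP really scores captions like bags of words, the original and the word-shuffled caption should receive similar scores.

```python
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a public CLIP checkpoint (assumed available on the HuggingFace Hub).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# COCO validation image of two cats lying on a couch (assumed reachable URL).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

captions = [
    "two cats lying on a pink couch",  # correct word order
    "a pink couch lying on two cats",  # same words, scrambled relations
]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image  # shape (1, 2): image-text similarity
print(logits.softmax(dim=-1))  # near-equal scores would indicate bag-of-words behaviour
```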
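
For the CLIP (22.02.2024) and SigLIP (14.03.2024) sessions, the difference between the softmax contrastive objective and the pairwise sigmoid objective is easiest to see side by side. A minimal PyTorch sketch with random tensors standing in for encoder outputs; the scale and bias values follow the papers' reported initializations but are otherwise illustrative.

```python
import torch
import torch.nn.functional as F

def clip_softmax_loss(img_emb, txt_emb, logit_scale):
    """Symmetric softmax contrastive loss over an NxN similarity matrix (CLIP)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale * img_emb @ txt_emb.t()
    labels = torch.arange(img_emb.size(0))
    # Each image must pick its own text and each text its own image.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def siglip_sigmoid_loss(img_emb, txt_emb, t, b):
    """Pairwise sigmoid loss (SigLIP): every (image, text) pair is a binary decision."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = t * img_emb @ txt_emb.t() + b
    n = img_emb.size(0)
    signs = 2 * torch.eye(n) - 1  # +1 on the diagonal (matches), -1 elsewhere
    return -F.logsigmoid(signs * logits).sum() / n

# Toy usage: random embeddings standing in for image/text encoder outputs.
img, txt = torch.randn(8, 512), torch.randn(8, 512)
print(clip_softmax_loss(img, txt, logit_scale=100.0))
print(siglip_sigmoid_loss(img, txt, t=10.0, b=-10.0))
```

The practical difference: the softmax loss normalizes over the whole batch, so the batch size enters the objective, while the sigmoid loss treats each pair as an independent binary decision, which is what lets SigLIP scale the batch size more freely.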
Datasets and benchmarks
Surveys
Representation Learning
Latent Space Structure
Fusion
- Liu et al. Visual Instruction Tuning
- Radford et al. Learning Transferable Visual Models From Natural Language Supervision
- Zhai et al. Sigmoid Loss for Language Image Pre-Training
- Nagrani et al. Attention Bottlenecks for Multimodal Fusion
- Baevski et al. data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
- Recasens et al. Zorro: the masked multimodal transformer
- Jaegle et al. Perceiver: General Perception with Iterative Attention
- Liu et al. Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space For Multi-Modal Retrieval
- Kwon et al. Masked Vision And Language Modeling For Multi-Modal Representation Learning
- Liang et al. High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning
- Girdhar et al. OMNIVORE: A Single Model for Many Visual Modalities
- Shvetsova et al. Everything at Once – Multi-modal Fusion Transformer for Video Retrieval