IEEE Transactions on Multimedia (TMM)
IEEE Multimedia Magazine (MM)
Information Fusion (IF)
ACM International Conference on Multimedia (ACM MM)
IEEE International Conference on Multimedia & Expo (ICME)
Multimodal Communication and Machine Learning Laboratory (MultiComp Lab)
Louis-Philippe Morency
Carnegie Mellon University
[Homepage]
awesome-multimodal-ml
[Github]
Awesome-Multimodal-Research
[Github]
Multimodal Machine Learning: A Survey and Taxonomy.
T Baltrušaitis, C Ahuja, LP Morency.
TPAMI, 2019. [PDF]
Deep multimodal learning: A survey on recent advances and trends.
D Ramachandram, GW Taylor.
IEEE Signal Processing Magazine, 2017. [PDF]
Multimodal intelligence: Representation learning, information fusion, and applications.
C Zhang, Z Yang, X He, L Deng.
IEEE Journal of Selected Topics in Signal Processing, 2020. [PDF]
A survey on multimodal large language models.
S Yin, C Fu, S Zhao, K Li, X Sun, T Xu, et al.
arXiv, 2023. [PDF]
Multimodal foundation models: From specialists to general-purpose assistants.
C Li, Z Gan, Z Yang, J Yang, L Li, L Wang, J Gao.
Foundations and Trends® in Computer Graphics and Vision, 2024. [PDF]
Multimodality representation learning: A survey on evolution, pretraining and its applications.
MA Manzoor, S Albarri, Z Xian, Z Meng, P Nakov, S Liang.
ACM Transactions on Multimedia Computing, Communications and Applications, 2023. [PDF]
VLP: A survey on vision-language pre-training.
FL Chen, DZ Zhang, ML Han, XY Chen, J Shi, S Xu, B Xu.
Machine Intelligence Research, 2023. [PDF]
Vision-language models for vision tasks: A survey.
J Zhang, J Huang, S Jin, S Lu.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. [PDF]
MM-LLMs: Recent advances in multimodal large language models.
D Zhang, Y Yu, C Li, J Dong, D Su, C Chu, D Yu.
arXiv:2401.13601, 2024. [PDF]
DeViSE: A Deep Visual-Semantic Embedding Model.
A Frome, G Corrado, J Shlens, S Bengio, J Dean, et al.
NIPS, 2013. [PDF]
Learning joint embedding with multimodal cues for cross-modal video-text retrieval.
NC Mithun, J Li, F Metze, et al.
ICMR, 2018. [PDF]
Deep multimodal representation learning: A survey.
W Guo, J Wang, S Wang.
IEEE Access, 2019. [PDF]
MultiBench: Multiscale benchmarks for multimodal representation learning.
PP Liang, Y Lyu, X Fan, Z Wu, Y Cheng, J Wu, et al.
NeurIPS, 2021.
Show and tell: A neural image caption generator.
O Vinyals, A Toshev, S Bengio, et al.
CVPR, 2015. [PDF]
Show, attend and tell: Neural image caption generation with visual attention.
K Xu, J Ba, R Kiros, K Cho, A Courville, et al.
ICML, 2015. [PDF]
A survey on automatic image caption generation.
S Bai, S An.
Neurocomputing, 2018. [PDF]
Stacked cross attention for image-text matching.
KH Lee, X Chen, G Hua, H Hu, et al.
ECCV, 2018. [PDF]
Visual Semantic Reasoning for Image-Text Matching.
K Li, Y Zhang, K Li, Y Li, Y Fu.
ICCV, 2019. [PDF]
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks.
X Li, X Yin, C Li, P Zhang, X Hu, L Zhang, et al.
ECCV, 2020. [PDF]
FashionBERT: Text and image matching with adaptive loss for cross-modal retrieval.
D Gao, L Jin, B Chen, M Qiu, P Li, Y Wei, Y Hu, et al.
SIGIR, 2020. [PDF]
Similarity Reasoning and Filtration for Image-Text Matching.
H Diao, Y Zhang, L Ma, H Lu.
AAAI, 2021. [PDF]
Attention-Based Multimodal Fusion for Video Description.
C Hori, T Hori, TY Lee, Z Zhang, et al.
ICCV, 2017. [PDF]
Multimodal keyless attention fusion for video classification.
X Long, C Gan, G De Melo, X Liu, Y Li, F Li, et al.
AAAI, 2018. [PDF]
MFAS: Multimodal Fusion Architecture Search.
JM Pérez-Rúa, V Vielzeuf, S Pateux, et al.
CVPR, 2019. [PDF]
Quantum-inspired multimodal fusion for video sentiment analysis.
Q Li, D Gkoumas, C Lioma, M Melucci.
Information Fusion, 2020. [PDF]
A survey on machine learning for data fusion.
T Meng, X Jing, Z Yan, W Pedrycz.
Information Fusion, 2020. [PDF]
Attention bottlenecks for multimodal fusion.
A Nagrani, S Yang, A Arnab, A Jansen, et al.
NeurIPS, 2021. [PDF]
Dynamic Multimodal Fusion.
Z Xue, R Marculescu.
arXiv, 2022. [PDF]
VisualBERT: A simple and performant baseline for vision and language.
LH Li, M Yatskar, D Yin, CJ Hsieh, et al.
arXiv, 2019.
LXMERT: Learning cross-modality encoder representations from transformers.
H Tan, M Bansal.
EMNLP, 2019. [PDF]
VideoBERT: A joint model for video and language representation learning.
C Sun, A Myers, C Vondrick, et al.
ICCV, 2019.
ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.
J Lu, D Batra, D Parikh, S Lee.
NeurIPS, 2019. [PDF]
UniVL: A unified video and language pre-training model for multimodal understanding and generation.
H Luo, L Ji, B Shi, H Huang, N Duan, T Li, J Li, et al.
arXiv, 2020. [PDF]
ActBERT: Learning global-local video-text representations.
L Zhu, Y Yang.
CVPR, 2020.
What Makes Training Multi-modal Classification Networks Hard?
W Wang, D Tran, M Feiszli.
CVPR, 2020. [PDF]
HERO: Hierarchical encoder for video+language omni-representation pre-training.
L Li, YC Chen, Y Cheng, Z Gan, L Yu, J Liu.
EMNLP, 2020.
VL-BERT: Pre-training of generic visual-linguistic representations.
W Su, X Zhu, Y Cao, B Li, L Lu, F Wei, J Dai.
ICLR, 2020.
Foundations of multimodal co-learning.
A Zadeh, PP Liang, LP Morency.
Information Fusion, 2020. [PDF]
UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning.
W Li, C Gao, G Niu, X Xiao, H Liu, J Liu, H Wu, et al.
arXiv, 2021. [PDF]
M6: A Chinese multimodal pretrainer.
J Lin, R Men, A Yang, C Zhou, M Ding, Y Zhang, et al.
arXiv, 2021.
Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions.
A Rahate, R Walambe, S Ramanna, et al.
arXiv, 2021. [PDF]
Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text.
Q Li, B Gong, Y Cui, D Kondratyuk, X Du, et al.
arXiv, 2021.
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts.
W Wang, H Bao, L Dong, F Wei.
arXiv, 2021.
WenLan: Bridging vision and language by large-scale multi-modal pre-training.
Y Huo, M Zhang, G Liu, H Lu, Y Gao, G Yang, et al.
arXiv, 2021.
WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model.
N Fei, Z Lu, Y Gao, G Yang, Y Huo, J Wen, H Lu, R Song, X Gao, T Xiang, et al.
arXiv, 2021.
Learning transferable visual models from natural language supervision.
A Radford, JW Kim, C Hallacy, et al.
ICML, 2021.
ViLT: Vision-and-language transformer without convolution or region supervision.
W Kim, B Son, I Kim.
ICML, 2021.
Zero-shot text-to-image generation.
A Ramesh, M Pavlov, G Goh, S Gray, et al.
ICML, 2021.
Multimodal few-shot learning with frozen language models.
M Tsimpoukelli, J Menick, S Cabi, et al.
NeurIPS, 2021.
Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework.
J Gu, X Meng, G Lu, L Hou, M Niu, H Xu, et al.
arXiv, 2022.
Deep Collaborative Embedding for Social Image Understanding.
Z Li, J Tang, T Mei.
TPAMI, 2019. [PDF]
LayoutLM: Pre-training of text and layout for document image understanding.
Y Xu, M Li, L Cui, S Huang, F Wei, M Zhou.
KDD, 2020.
Beyond visual semantics: Exploring the role of scene text in image understanding.
AU Dey, SK Ghosh, E Valveny, G Harit.
Pattern Recognition Letters, 2021.
ECO: Efficient convolutional network for online video understanding.
M Zolfaghari, K Singh, T Brox.
ECCV, 2018. [PDF]
TSM: Temporal Shift Module for Efficient Video Understanding.
J Lin, C Gan, S Han.
ICCV, 2019. [PDF]
AssembleNet: Searching for multi-stream neural connectivity in video architectures.
MS Ryoo, AJ Piergiovanni, M Tan, et al.
ICLR, 2020.
Is Space-Time Attention All You Need for Video Understanding?
G Bertasius, H Wang, et al.
arXiv, 2021.
MDMMT: Multidomain Multimodal Transformer for Video Retrieval.
M Dzabraev, M Kalashnikov, et al.
CVPR, 2021. [PDF]