Hi there!
Thanks for your effort in maintaining this amazing repository.
This is a request to add our recent work on the evaluation of video-language models. We propose an evaluation benchmark, VELOCITI.
Please find the relevant details below.
Title:
VELOCITI: Can Video-Language Models Bind Semantic Concepts Through Time?
About
To keep up with the rapid pace at which Video-Language Models (VLMs) are being proposed, our primary motivation is to provide a benchmark for evaluating current state-of-the-art as well as upcoming VLMs on compositionality, a fundamental aspect of vision-language understanding. This is achieved through carefully designed tests that evaluate various aspects of perception and binding. With this, we aim to provide a more accurate gauge of VLM capabilities, encouraging research towards improving VLMs and preventing shortcomings from percolating into systems that rely on such models.
ArXiv
https://arxiv.org/abs/2406.10889v1
GitHub
https://github.com/katha-ai/VELOCITI
Project Page and Demo
https://katha-ai.github.io/projects/velociti/
Please let me know if I missed any required details.
Thanks for your time.
Sorry for the late response. It's incorporated now.
Please also consider citing our works:
@article{yin2024survey,
  title={A survey on multimodal large language models},
  author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Li, Ke and Sun, Xing and Xu, Tong and Chen, Enhong},
  journal={National Science Review},
  pages={nwae403},
  year={2024},
  publisher={Oxford University Press}
}

@article{fu2023mme,
  title={MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models},
  author={Fu, Chaoyou and Chen, Peixian and Shen, Yunhang and Qin, Yulei and Zhang, Mengdan and Lin, Xu and Yang, Jinrui and Zheng, Xiawu and Li, Ke and Sun, Xing and others},
  journal={arXiv preprint arXiv:2306.13394},
  year={2023}
}

@article{fu2024mme,
  title={MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs},
  author={Fu, Chaoyou and Zhang, Yi-Fan and Yin, Shukang and Li, Bo and Fang, Xinyu and Zhao, Sirui and Duan, Haodong and Sun, Xing and Liu, Ziwei and Wang, Liang and others},
  journal={arXiv preprint arXiv:2411.15296},
  year={2024}
}