
The Potential Reason of LLaVA-NeXT-Qwen2's Strong Performance #6

Open
waxnkw opened this issue Jun 29, 2024 · 3 comments

Comments

@waxnkw

waxnkw commented Jun 29, 2024

Great work! I notice that LLaVA-NeXT-Qwen2 (an image model) achieves a surprising 49.5 on Video-MME. In contrast, LLaVA-NeXT-Video (Llama3) only achieves a 30+ Video-MME score (according to the reproduction in https://arxiv.org/pdf/2406.07476). LLaVA-NeXT-Video (Llama3) also follows a standard LLaVA recipe, and even uses more video data. I am curious what the key factor behind LLaVA-NeXT-Qwen2's strong performance is compared with LLaVA-NeXT-Video (Llama3). Does the main improvement come from the Qwen2 LLM?

@jzhang38
Collaborator

jzhang38 commented Jun 30, 2024

> In contrast, LLaVA-NeXT-Video (Llama3) only achieves a 30+ Video-MME score

LLaVA-NeXT-Video-7B (https://huggingface.co/lmms-lab/LLaVA-NeXT-Video-7B) is based on Vicuna, not Llama3.
In our tests, it scores 40+ on Video-MME; see Table 2 of our blog: https://lmms-lab.github.io/posts/lmms-eval-0.2/

@waxnkw
Author

waxnkw commented Jun 30, 2024

Thanks so much for the response, and sorry for my mistake. I see that the result is 41.98 in Table 2. Great result!

BTW, are there any insights into the improvement from 41.98 (LLaVA-NeXT-Video) to 49.5 (LLaVA-NeXT-Qwen2)?

@jzhang38
Collaborator

LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild

https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/

Hi, I believe this blog post is a good read on how a better base LM enables stronger multimodal capabilities. I believe Qwen2 is simply significantly better than Vicuna-1.5 (Llama2).
