Add Automatic GPT4V Evaluation for VLM Originality Evaluation #2576
Conversation
Friendly ping @teetone, please don't forget to take a look at this PR when you are free :)
That's ok, no rush. I just wanted to ask, since people sometimes forget.
@ImKeTT I think we want the metric to have all the knowledge about the Likert scale and criteria, and the client should not be aware of any of this. Also, we want to use this automatic model evaluator for aspects other than originality.
Could we instead follow https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/metrics/image_generation/photorealism_critique_metrics.py and use the CritiqueClient? We would want to set critique_type == "model" to route the request to the ModelCritiqueClient. I think you just have to add multimodal support to ModelCritiqueClient (cc @yifanmai). Sorry for the trouble.
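For reference, here is a rough sketch of what the metric side might look like when it owns the Likert scale and builds the critique request itself. The class and field names below are assumed from the photorealism critique metric and may not match the current HELM API exactly; the template name and option wording are hypothetical.

```python
# Sketch only: the critique dataclasses and their fields are assumed from
# photorealism_critique_metrics.py and may differ from the actual HELM code.
from helm.common.critique_request import (
    CritiqueQuestionTemplate,
    CritiqueRequest,
    CritiqueTaskTemplate,
    QuestionType,
)

# The metric owns the full Likert scale and criteria; the client never sees them.
ORIGINALITY_QUESTION = CritiqueQuestionTemplate(
    name="originality",
    question_type=QuestionType.MULTIPLE_CHOICE,
    text="How original and creative is the story written for this image?",
    options=[
        "A. Not original at all",
        "B. Slightly original",
        "C. Moderately original",
        "D. Very original",
        "E. Extremely original",
    ],
)

ORIGINALITY_TEMPLATE = CritiqueTaskTemplate(
    name="vhelm_originality_critique",  # hypothetical name
    instructions="Read the story written about the attached image: {{story}}",
    num_respondents=1,
    questions=[ORIGINALITY_QUESTION],
)

# With critique_type == "model", the CritiqueClient routes this request to the
# ModelCritiqueClient (which would need multimodal support to attach the image).
request = CritiqueRequest(
    template=ORIGINALITY_TEMPLATE,
    fields={"story": "<model-generated story>"},
)
```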
Thanks for the pointer, will refactor the code soon.
Hi @yifanmai, I have refactored the GPT4V-aided originality evaluation for VHELM. We may still need to discuss the detailed prompts fed to GPT4V to make it a qualified evaluator. The command I used for a canary run is:
Here are some results. Please let me know how I can improve this version, thanks! :)
Hi @teetone, I have run the metric on 100 instances from the refactored Mementos using Qwen-VL-Chat. Here is the result: run_spec.json. There is a chance that GPT-4-Vision starts its answer with something other than one of the capital letters 'A, B, C, D, E' (5 out of 100 cases), which results in a parsing failure and an invalid evaluation instance (the evaluation still runs smoothly, but those 5 instances are ignored).
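As a side note, one way to make the answer parsing more tolerant of such responses is to search the whole reply for the first standalone letter A-E instead of checking only the first word. This is a purely hypothetical helper, not code from this PR:

```python
import re
from typing import Optional


def parse_likert_choice(response: str) -> Optional[int]:
    """Return a 1-5 score from the first standalone capital letter A-E found
    anywhere in the evaluator's response, or None if no such letter appears."""
    match = re.search(r"\b([A-E])\b", response)
    if match is None:
        return None  # caller can still skip the instance as a parsing failure
    return ord(match.group(1)) - ord("A") + 1


# e.g. parse_likert_choice("I would rate this story B, because ...") -> 2
```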
This PR aims to add automatic GPT4V evaluation for vision-language originality evaluation. I also repurposed Mementos for creative storytelling.
The evaluation process works as follows: we feed the input image and the model-generated story to GPT4V and ask it to rate the story from 1 to 5 (a higher score means the story is more original and creative).
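As a purely illustrative sketch (the exact prompt wording and wiring in this PR may differ), the evaluator prompt could be structured roughly like this, with the image attached separately as multimodal input:

```python
# Illustrative only: the actual prompt used in this PR is not reproduced here.
ORIGINALITY_PROMPT = (
    "You are shown an image and a story written about that image.\n"
    "Rate how original and creative the story is on a scale from 1 (not original)\n"
    "to 5 (highly original and creative).\n"
    "Answer with a single letter: A (1), B (2), C (3), D (4), or E (5).\n\n"
    "Story: {story}"
)


def build_evaluator_prompt(story: str) -> str:
    """Fill the rubric template with the model-generated story; the input image
    is passed to GPT4V separately as multimodal content."""
    return ORIGINALITY_PROMPT.format(story=story)
```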
We might need to discuss several prompts used in this evaluation. Currently, they are quite simple ones:
I've run ./pre-commit.sh and passed the local tests. Here are several output files from running the GPT4V evaluation on Mementos (25 instances using Qwen-VL-Chat): stats.json, per_instance_stats.json, scenario_state.json.
The conf file and the command I used are:
conf file:
bash command:
Please let me know how I can improve this PR, thanks!
Oops, there seems to be a bug, fixing it...
It seems that we cannot run the GPT4V evaluator with minimal dependencies only (we need to import openai). What do you recommend doing to pass the test?
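One common way to keep the minimal-dependency test passing is to import openai only inside the code path that actually needs it. This is a sketch under that assumption; the error message and the extras name are guesses, not the exact pattern used elsewhere in HELM:

```python
# Sketch of a lazy import guard; the error message and extras name are assumptions.
def _get_openai_module():
    try:
        import openai  # only needed when the GPT4V evaluator actually runs
    except ModuleNotFoundError as e:
        raise ModuleNotFoundError(
            "The GPT4V evaluator requires the `openai` package; install it via "
            "the appropriate HELM extras (e.g. pip install 'crfm-helm[openai]')."
        ) from e
    return openai
```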