
Add Automatic GPT4V Evaluation for VLM Originality Evaluation #2576

Merged
merged 11 commits into stanford-crfm:main from gpt4v-eval on May 23, 2024

Conversation

@ImKeTT (Collaborator) commented Apr 18, 2024

This PR adds automatic GPT4V evaluation for vision-language originality evaluation. I also repurposed Mementos for creative storytelling.
Basically, the evaluation process is: we feed the input image and the model-generated story to GPT4V and ask it to rate the story from 1 to 5 (higher means the story is more original and creative).

We might need to discuss several prompts used in this evaluation. Currently, they are quite simple:

  1. Evaluation: the prompt fed to the GPT4V model to tell it the specific judging criteria. I think we need to explain the different originality levels in detail; a rough sketch of what such a rubric could look like follows after this list.
  2. Repurposing scenario: the prompt fed to the tested VLMs for generating original and creative stories given the image.
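For discussion, here is a hypothetical sketch of what a more detailed evaluation prompt (point 1) could look like; the rubric wording and the helper name are placeholders, not what this PR actually uses:

# Hypothetical rubric sketch for the GPT4V judge; the wording is illustrative only.
ORIGINALITY_RUBRIC_PROMPT = """You are shown an image and a story written about it.
Rate the ORIGINALITY of the story on a scale of 1 to 5:
1 - Mostly restates what is literally visible in the image.
2 - Minor creative additions; largely descriptive.
3 - Some imaginative elements beyond the literal image content.
4 - A clearly inventive narrative that is still grounded in the image.
5 - A highly original, surprising story that still fits the image.

Story:
{story}

Answer with a single number from 1 to 5."""


def build_originality_prompt(story: str) -> str:
    # Hypothetical helper: fill the model-generated story into the template.
    return ORIGINALITY_RUBRIC_PROMPT.format(story=story)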

I've run ./pre-commit.sh and passed the local tests. Here are several output files from running GPT4V evaluation on Mementos (25 instances using Qwen-VL-Chat):
stats.json
per_instance_stats.json
scenario_state.json

The conf file and run command I used are:
conf file:

entries: [
    {description: "mementos:subject=comics,model=qwen/qwen-vl-chat", priority: 1}
    ]

bash command:

helm-run --conf-paths run_mem_specs.conf --suite v1 --max-eval-instances 25

Please let me know how I can improve this PR, thanks!

Oops, there seems to be a bug, fixing it...

It seems that we cannot run the GPT4V evaluator with only minimal dependencies (we need to import openai). What do you recommend doing to pass the test?
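For what it's worth, other HELM clients guard optional imports at module load time; a minimal sketch of that pattern, assuming the handle_module_not_found_error helper in helm.common.optional_dependencies, is:

# Minimal sketch: guard the optional openai import so the module can still be
# imported in a minimal install; assumes HELM's handle_module_not_found_error helper.
from helm.common.optional_dependencies import handle_module_not_found_error

try:
    import openai
except ModuleNotFoundError as e:
    handle_module_not_found_error(e)  # raises with a hint to install the missing extra

Alternatively, the import could be moved inside the evaluator's methods so it only happens when the evaluator is actually used.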

@yifanmai requested a review from @teetone on April 19, 2024
@ImKeTT (Collaborator, Author) commented Apr 25, 2024

Friendly ping @teetone, please don't forget to take a look at this PR when you are free :)

@teetone (Member) commented Apr 27, 2024

> Friendly ping @teetone, please don't forget to take a look at this PR when you are free :)

@ImKeTT Sorry! I was working on a big merge at the moment. I will take a look by tomorrow.

@ImKeTT (Collaborator, Author) commented Apr 27, 2024

> @ImKeTT Sorry! I was working on a big merge at the moment. I will take a look by tomorrow.

That's ok, no rush. I just wanted to ask, since people sometimes forget.

@teetone (Member) left a review comment


@ImKeTT I think we want the metric to have all the knowledge about the Likert scale and criteria, and the client should not be aware of any of this. Also, we want to use this automatic model evaluator to evaluate other aspects aside from originality.

Could we actually follow https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/metrics/image_generation/photorealism_critique_metrics.py and use the CritiqueClient? We would want to use critique_type == "model" to route it to the ModelCritiqueClient. I think you just have to add multimodal support for ModelCritiqueClient (cc @yifanmai). Sorry for the trouble.
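A rough sketch of that pattern, loosely modeled on photorealism_critique_metrics.py; the dataclass fields, the {{story}} placeholder syntax, and the metric_service / completion_text variables are assumptions, not verified against the current API:

# Rough sketch of delegating the originality judgment to the critique pipeline.
# Field names and placeholder syntax are assumed from critique_request.py.
from helm.common.critique_request import (
    CritiqueQuestionTemplate,
    CritiqueRequest,
    CritiqueTaskTemplate,
    QuestionType,
)

template = CritiqueTaskTemplate(
    name="vlm_originality_critique",  # hypothetical task name
    instructions="Rate the originality of the story written for the image.\n\nStory: {{story}}",
    num_respondents=1,
    questions=[
        CritiqueQuestionTemplate(
            name="originality",
            question_type=QuestionType.MULTIPLE_CHOICE,
            text="How original and creative is the story?",
            options=["1", "2", "3", "4", "5"],
        )
    ],
)

# completion_text: the VLM-generated story; metric_service: the MetricService
# available inside the metric (both assumed here). With critiqueType: model,
# the request should route to the ModelCritiqueClient.
request = CritiqueRequest(template=template, fields={"story": completion_text})
result = metric_service.make_critique_request(request)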

Review thread on src/helm/benchmark/metrics/common_metric_specs.py (outdated, resolved)
@teetone requested a review from @yifanmai on April 29, 2024
@ImKeTT (Collaborator, Author) commented Apr 29, 2024

Thanks for the pointer, will refactor the code soon.

@ImKeTT marked this pull request as a draft on May 10, 2024
@ImKeTT (Collaborator, Author) commented May 12, 2024

Hi @yifanmai, I have refactored the GPT4V-aided originality evaluation for VHELM. We may still need to discuss the detailed prompts that are fed into GPT4V to make it a qualified evaluator.

The command I used for a canary run is:

helm-run --conf-paths run_entries.conf --suite v1 --max-eval-instances 20

The run_entries.conf file looks like:

entries: [
    {description: "mementos:subject=comics,model=qwen/qwen-vl-chat,num_respondents=1", priority: 1}
    ]

And the credentials.conf file in the prod_env folder looks like:

openaiApiKey: the-openai-api-key
critiqueModelName: openai/gpt-4-vision-preview
critiqueType: model

Here are some results:
scenario_state.json
per_instance_stats.json
scenario.json

Please let me know how I can improve this version, thanks! :)

@ImKeTT marked this pull request as ready for review on May 12, 2024
@ImKeTT (Collaborator, Author) commented May 17, 2024

Hi @yifanmai @teetone, could you take a look at the refactored GPT4V-aided evaluation whenever you have time? I would love to get your feedback on this PR before moving forward with another PR for the Rika-Vibe evaluator using the multimodal ModelCritiqueClient.

@teetone self-requested a review on May 21, 2024
Review thread on src/helm/common/gpt4v_originality_request.py (outdated, resolved)
Review thread on src/helm/common/critique_request.py (outdated, resolved)
@ImKeTT (Collaborator, Author) commented May 22, 2024

Hi @teetone, I have run the metric on 100 instances from the refactored Mementos scenario using Qwen-VL-Chat. Here are the results:

run_spec.json
scenario.json
per_instance_stats.json

There is a chance that GPT-4-Vision's response does not begin with one of the capital letters 'A', 'B', 'C', 'D', or 'E' (5 out of 100 cases), which results in a parsing failure and an invalid evaluation instance (the evaluation still runs smoothly, but those 5 instances are ignored).
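One possible mitigation (not part of this PR) would be a more lenient parser that searches the judge's whole response for the first standalone A-E letter rather than only checking the first word; a hypothetical sketch:

import re
from typing import Optional

# Hypothetical lenient parser: find the first standalone A-E letter anywhere in
# the judge's response. Note: this can still misfire on a leading article "A",
# so an "Answer: X"-style pattern could be tried first in practice.
_CHOICE_PATTERN = re.compile(r"\b([A-E])\b")


def parse_choice(response: str) -> Optional[str]:
    match = _CHOICE_PATTERN.search(response)
    return match.group(1) if match else None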

@teetone merged commit 7ead1d6 into stanford-crfm:main on May 23, 2024
6 checks passed
@ImKeTT deleted the gpt4v-eval branch on May 23, 2024