
Add Automatic GPT4V Evaluation for VLM Originality Evaluation #2576

Merged
merged 11 commits into stanford-crfm:main from gpt4v-eval on May 23, 2024

Conversation

@ImKeTT (Collaborator) commented Apr 18, 2024

This PR adds automatic GPT4V evaluation for vision-language originality evaluation. I also repurposed Mementos for creative storytelling.
Basically, the evaluation process is: we feed the input image and the model-generated story to GPT4V and ask it to rate the story from 1 to 5 (higher means the story is more original and creative).

We might need to discuss several prompts used in this evaluation. Currently, they are quite simple:

  1. Evaluation: the prompt fed to the GPT4V model to tell it the specific judging criteria. I think we need to explain the different originality levels in detail; a rough sketch of what such a rubric could look like follows after this list.
  2. Repurposing scenario: the prompt fed to the tested VLMs for generating original and creative stories given the image.
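For discussion, here is a hypothetical sketch of what a more detailed evaluation prompt (point 1) could look like; the rubric wording and the helper name are placeholders, not what this PR actually uses:

# Hypothetical rubric sketch for the GPT4V judge; the wording is illustrative only.
ORIGINALITY_RUBRIC_PROMPT = """You are shown an image and a story written about it.
Rate the ORIGINALITY of the story on a scale of 1 to 5:
1 - Mostly restates what is literally visible in the image.
2 - Minor creative additions; largely descriptive.
3 - Some imaginative elements beyond the literal image content.
4 - A clearly inventive narrative that is still grounded in the image.
5 - A highly original, surprising story that still fits the image.

Story:
{story}

Answer with a single number from 1 to 5."""


def build_originality_prompt(story: str) -> str:
    # Hypothetical helper: fill the model-generated story into the template.
    return ORIGINALITY_RUBRIC_PROMPT.format(story=story)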

I've run ./pre-commit.sh and passed the local tests. Here are several output files from running GPT4V evaluation on Mementos (25 instances using Qwen-VL-Chat):
stats.json
per_instance_stats.json
scenario_state.json

The conf file and run command I used are:
conf file:

entries: [
    {description: "mementos:subject=comics,model=qwen/qwen-vl-chat", priority: 1}
    ]

bash command:

helm-run --conf-paths run_mem_specs.conf --suite v1 --max-eval-instances 25

Please let me know how I can improve this PR, thanks!

Oops, there seems to be a bug, fixing it...

It seems that we cannot run the GPT4V evaluator with only minimal dependencies (we need to import openai). What do you recommend doing to pass the test?
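For what it's worth, other HELM clients guard optional imports at module load time; a minimal sketch of that pattern, assuming the handle_module_not_found_error helper in helm.common.optional_dependencies, is:

# Minimal sketch: guard the optional openai import so the module can still be
# imported in a minimal install; assumes HELM's handle_module_not_found_error helper.
from helm.common.optional_dependencies import handle_module_not_found_error

try:
    import openai
except ModuleNotFoundError as e:
    handle_module_not_found_error(e)  # raises with a hint to install the missing extra

Alternatively, the import could be moved inside the evaluator's methods so it only happens when the evaluator is actually used.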

@yifanmai requested a review from @teetone on April 19, 2024
@ImKeTT (Collaborator, Author) commented Apr 25, 2024

Friendly ping @teetone, please don't forget to take a look at this PR when you are free :)

@teetone (Member) commented Apr 27, 2024

> Friendly ping @teetone, please don't forget to take a look at this PR when you are free :)

@ImKeTT Sorry! I was working on a big merge at the moment. I will take a look by tomorrow.

@ImKeTT (Collaborator, Author) commented Apr 27, 2024

> @ImKeTT Sorry! I was working on a big merge at the moment. I will take a look by tomorrow.

That's ok, no rush. I just wanted to ask, since people sometimes forget.

@teetone (Member) left a review comment


@ImKeTT I think we want the metric to have all the knowledge about the Likert scale and criteria, and the client should not be aware of any of this. Also, we want to use this automatic model evaluator to evaluate other aspects aside from originality.

Could we actually follow https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/metrics/image_generation/photorealism_critique_metrics.py and use the CritiqueClient? We would want to use critique_type == "model" to route it to the ModelCritiqueClient. I think you just have to add multimodal support for ModelCritiqueClient (cc @yifanmai). Sorry for the trouble.
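A rough sketch of that pattern, loosely modeled on photorealism_critique_metrics.py; the dataclass fields, the {{story}} placeholder syntax, and the metric_service / completion_text variables are assumptions, not verified against the current API:

# Rough sketch of delegating the originality judgment to the critique pipeline.
# Field names and placeholder syntax are assumed from critique_request.py.
from helm.common.critique_request import (
    CritiqueQuestionTemplate,
    CritiqueRequest,
    CritiqueTaskTemplate,
    QuestionType,
)

template = CritiqueTaskTemplate(
    name="vlm_originality_critique",  # hypothetical task name
    instructions="Rate the originality of the story written for the image.\n\nStory: {{story}}",
    num_respondents=1,
    questions=[
        CritiqueQuestionTemplate(
            name="originality",
            question_type=QuestionType.MULTIPLE_CHOICE,
            text="How original and creative is the story?",
            options=["1", "2", "3", "4", "5"],
        )
    ],
)

# completion_text: the VLM-generated story; metric_service: the MetricService
# available inside the metric (both assumed here). With critiqueType: model,
# the request should route to the ModelCritiqueClient.
request = CritiqueRequest(template=template, fields={"story": completion_text})
result = metric_service.make_critique_request(request)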

Review thread on src/helm/benchmark/metrics/common_metric_specs.py (outdated, resolved)
@teetone requested a review from @yifanmai on April 29, 2024
@ImKeTT (Collaborator, Author) commented Apr 29, 2024

Thanks for the pointer, will refactor the code soon.

@ImKeTT marked this pull request as a draft on May 10, 2024
@ImKeTT (Collaborator, Author) commented May 12, 2024

Hi @yifanmai, I have refactored the GPT4V-aided originality evaluation for VHELM. We may still need to discuss the detailed prompts that are fed into GPT4V to make it a qualified evaluator.

The command I used for a canary run is:

helm-run --conf-paths run_entries.conf --suite v1 --max-eval-instances 20

The run_entries.conf file looks like:

entries: [
    {description: "mementos:subject=comics,model=qwen/qwen-vl-chat,num_respondents=1", priority: 1}
    ]

And the credentials.conf file in the prod_env folder looks like:

openaiApiKey: the-openai-api-key
critiqueModelName: openai/gpt-4-vision-preview
critiqueType: model

Here are some results:
scenario_state.json
per_instance_stats.json
scenario.json

Please let me know how I can improve this version, thanks! :)

@ImKeTT marked this pull request as ready for review on May 12, 2024
@ImKeTT (Collaborator, Author) commented May 17, 2024

Hi @yifanmai @teetone, could you take a look at the refactored GPT4V-aided evaluation whenever you have time? I would love to get your feedback on this PR before moving forward with another PR for the Rika-Vibe evaluator using the multimodal ModelCritiqueClient.

@teetone self-requested a review on May 21, 2024
Review thread on src/helm/common/gpt4v_originality_request.py (outdated, resolved)
Review thread on src/helm/common/critique_request.py (outdated, resolved)
@ImKeTT (Collaborator, Author) commented May 22, 2024

Hi @teetone, I have run the metric on 100 instances from the refactored Mementos scenario using Qwen-VL-Chat. Here are the results:

run_spec.json
scenario.json
per_instance_stats.json

There is a chance that GPT-4-Vision's response does not begin with one of the capital letters 'A', 'B', 'C', 'D', or 'E' (5 out of 100 cases), which results in a parsing failure and an invalid evaluation instance (the evaluation still runs smoothly, but those 5 instances are ignored).
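One possible mitigation (not part of this PR) would be a more lenient parser that searches the judge's whole response for the first standalone A-E letter rather than only checking the first word; a hypothetical sketch:

import re
from typing import Optional

# Hypothetical lenient parser: find the first standalone A-E letter anywhere in
# the judge's response. Note: this can still misfire on a leading article "A",
# so an "Answer: X"-style pattern could be tried first in practice.
_CHOICE_PATTERN = re.compile(r"\b([A-E])\b")


def parse_choice(response: str) -> Optional[str]:
    match = _CHOICE_PATTERN.search(response)
    return match.group(1) if match else None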

@teetone merged commit 7ead1d6 into stanford-crfm:main on May 23, 2024
6 checks passed
@ImKeTT deleted the gpt4v-eval branch on May 23, 2024