Add GPT-4V as evaluator #276
drcege
commented
Mar 22, 2024
- Initial version that enriches the multimodal evaluation features by using the GPT-4V API to assess models
- Further testing and refinement are welcome
@HYLcool Tested and improved with @zhijianma
Maybe postpone the merge until the sandbox builds the pipeline.
Taking the image-to-text generation task as an example, each JSON object should include the `image` and `text` keys. A sample input file looks like this:

```JSON
{"image": "/path/to/image0", "text": "generated caption"}
```
Should this stay consistent with data-juicer's JSONL structure? It would probably be better for the whole sandbox pipeline to use a single data structure.
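Both are JSONL. Do you mean that the keys differ? Once the sandbox settles on a format, this will need small adjustments to integrate.
The text/image/video/audio keys in DJ are presumably not hard-coded either; they can be specified by passing parameters such as text_key / image_key / ...
There are two related questions here:
- @HYLcool The defaults for image_key / video_key / audio_key currently use the plurals images/videos/audios and appear to always be defined as lists. In evaluation scenarios we usually generate one image/video from an input prompt, or generate a caption from a given image/video, so each test sample should have only one image/video output. Does it always have to be wrapped in a list? If that is the intent, it seems cumbersome; I lean toward changing the default keys to singular, so that they only name the category/modality and allow either a single element or a list.
- @BeachWang I have also implemented a `pairwise comparison` evaluation method here, which compares two outputs for the same input (a head-to-head match). For example, the text-to-image task needs three keys, `text`, `image_0`, and `image_1`. This necessarily differs from DJ's default output structure, so users are expected to build this input themselves.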
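- Could this take two JSON files as input instead, with the samples kept in the same order? That would keep it consistent with the DJ format. I think the sandbox needs to settle on a unified data format first. @HYLcool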
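The two-file idea could look roughly like the sketch below, which zips two order-aligned JSONL outputs into pairwise samples. It is an illustration only: `merge_pairwise` is a hypothetical helper, the singular `image` key is assumed from the README excerpt in this PR, and the prompt-equality check is an added safeguard that was not discussed in the thread.

```python
import json


def merge_pairwise(lines_0, lines_1, text_key="text", image_key="image"):
    """Combine two order-aligned JSONL streams into pairwise samples.

    Hypothetical helper: each stream holds one model's outputs, one JSON
    object per line, with the same prompts in the same order.
    """
    rows_0 = [json.loads(line) for line in lines_0 if line.strip()]
    rows_1 = [json.loads(line) for line in lines_1 if line.strip()]
    if len(rows_0) != len(rows_1):
        raise ValueError("the two files contain a different number of samples")

    merged = []
    for idx, (row_0, row_1) in enumerate(zip(rows_0, rows_1)):
        # Guard against silently comparing outputs of different prompts.
        if row_0[text_key] != row_1[text_key]:
            raise ValueError(f"prompt mismatch at sample {idx}")
        merged.append({
            "text": row_0[text_key],
            "image_0": row_0[image_key],
            "image_1": row_1[image_key],
        })
    return merged


model_a = ['{"text": "a cat", "image": "/path/to/a/0.png"}']
model_b = ['{"text": "a cat", "image": "/path/to/b/0.png"}']
pairs = merge_pairwise(model_a, model_b)
```

Because the inputs are plain iterables of lines, open file handles can be passed directly in place of the lists used here.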
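> - @HYLcool The defaults for image_key / video_key / audio_key currently use the plurals images/videos/audios and appear to always be defined as lists. In evaluation scenarios we usually generate one image/video from an input prompt, or generate a caption from a given image/video, so each test sample should have only one image/video output. Does it always have to be wrapped in a list? If that is the intent, it seems cumbersome; I lean toward changing the default keys to singular, so that they only name the category/modality and allow either a single element or a list.

The main issue is that if a dataset contains both single elements and lists, that column is treated as having mismatched types and cannot be loaded correctly, so we chose lists to accommodate these different cases. Most datasets (including evaluation datasets) do usually contain only one multimodal item per sample, but the dataset compositions in some recent MLLM work show that a single sample can also contain multiple multimodal items.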
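One way to reconcile the two positions is to accept either form at the input boundary and normalize to lists before loading, so the column type stays uniform. The sketch below is only an illustration of that idea: `as_list` and `normalize_sample` are hypothetical helpers, not data-juicer functions, and the plural key names are the defaults mentioned in this thread.

```python
def as_list(value):
    """Wrap a single item in a list; leave lists (and tuples) as lists."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def normalize_sample(sample, multimodal_keys=("images", "videos", "audios")):
    """Return a copy of the sample whose multimodal fields are always lists.

    Hypothetical helper: fields not listed in `multimodal_keys` are untouched.
    """
    return {
        key: as_list(value) if key in multimodal_keys else value
        for key, value in sample.items()
    }


single = normalize_sample({"text": "a caption", "images": "/path/to/image0"})
multi = normalize_sample({"text": "a caption", "images": ["/path/to/a", "/path/to/b"]})
```

After normalization both samples carry a list under `images`, so a mixed dataset no longer has a column with inconsistent types.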
tools/mm_eval/gpt4v/compare.py
Outdated
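The compare function should randomize the positions: for example, randomly swap text0 and text1, record the winner, and then map it back to the original order. Prior work has shown that LLMs are biased by ordering, so we should ensure that E(eval(texts0, texts1)) = E(eval(texts1, texts0)).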
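The suggested randomization can be sketched as a thin wrapper around whatever function calls the judge. This is a minimal illustration rather than the code in `compare.py`: `debiased_compare` is a hypothetical name, and `judge` is assumed to be a callable that returns `0` if its first argument wins, `1` if its second wins, and `None` for a tie.

```python
import random


def debiased_compare(judge, output_0, output_1, rng=random):
    """Compare two outputs with their presentation order chosen at random.

    `judge(first, second)` is an assumed callable returning 0 if `first`
    wins, 1 if `second` wins, or None for a tie. The returned index always
    refers to the original order (0 -> output_0, 1 -> output_1).
    """
    swapped = rng.random() < 0.5
    first, second = (output_1, output_0) if swapped else (output_0, output_1)
    winner = judge(first, second)
    if winner is None or not swapped:
        return winner
    # The judge saw the pair in reverse, so map its verdict back.
    return 1 - winner


def prefers_longer(first, second):
    """Toy stand-in for the GPT-4V call: the longer text wins."""
    return 0 if len(first) > len(second) else 1


result = debiased_compare(prefers_longer, "a detailed caption", "short", rng=random.Random(0))
```

With this wrapper a judge that always favors the first position splits its wins roughly evenly between the two outputs, while a judge with a real preference returns the same winner regardless of the swap.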
This PR is marked as stale because there has been no activity for 21 days. Remove stale label or add new comments or this PR will be closed in 3 day.
Close this stale PR.
LGTM. Plz implement GPT-4V Evaluator accordingly in sandbox later
This PR is marked as stale because there has been no activity for 21 days. Remove stale label or add new comments or this PR will be closed in 3 day.
Close this stale PR.