Add GPT-4V as evaluator #276

Closed
wants to merge 15 commits into from
Collaborator

@drcege drcege commented Mar 22, 2024

  • Initial version that enriches the multimodal evaluation features by using the GPT-4V API to assess models
  • Further testing and refinement are welcome

@drcege drcege added enhancement New feature or request dj:multimodal issues/PRs about multimodal data processing labels Mar 22, 2024
@drcege drcege added this to the DJ-SORA milestone Mar 22, 2024
@drcege drcege self-assigned this Mar 22, 2024
Collaborator Author

drcege commented Mar 25, 2024

@HYLcool Tested and improved with @zhijianma

@drcege drcege requested a review from HYLcool March 25, 2024 12:30
Collaborator Author

drcege commented Mar 25, 2024

Maybe postpone the merge until the sandbox builds the pipeline.

Taking the image-to-text generation task as an example, each JSON object should include the `image` and `text` keys. A sample input file is formatted as follows:

```JSON
{"image": "/path/to/image0", "text": "generated caption"}
```
Collaborator

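Should this stay consistent with data-juicer's JSONL structure? Keeping a single data structure across the whole sandbox pipeline would probably be better.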

Collaborator Author

@drcege drcege Mar 26, 2024

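Both are JSONL, so do you mean the keys differ? That will need minor adjustments once the sandbox settles on a format.
The text/image/video/audio keys in DJ should not be hard-coded at present either; they can be set by passing parameters such as text_key / image_key / ...

There are two related questions here:

  1. @HYLcool The default values of image_key / video_key / audio_key currently use the plural forms images/videos/audios and seem to always be defined as lists. In evaluation scenarios we usually generate one image/video from an input prompt, or generate a caption from a given image/video, so each test sample should have only one image/video output. Does it always have to be wrapped in a list? If so, that looks rather cumbersome; I lean toward changing the default keys to singular forms that only denote the category/modality and accept either a single element or a list.
  2. @BeachWang I have also implemented a pairwise comparison evaluation method here, which compares two outputs for the same input (a head-to-head contest). For example, the text-to-image task needs three keys, text, image_0, and image_1, which is necessarily inconsistent with DJ's default output structure, so users are expected to build this input themselves.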
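
To make the proposal in point 1 concrete, here is a minimal sketch of a loader that reads fields through configurable keys and accepts either a single element or a list. The names `load_samples` and `as_list` are hypothetical and are not part of Data-Juicer or this PR; only the `text_key` / `image_key` parameter names and the sample record come from the discussion above.

```python
import json


def as_list(value):
    """Wrap a single element in a list; leave an existing list unchanged."""
    return value if isinstance(value, list) else [value]


def load_samples(lines, text_key="text", image_key="image"):
    """Parse JSONL lines, reading the text and image fields via configurable keys.

    Hypothetical helper for illustration only.
    """
    samples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        samples.append({
            "text": record[text_key],
            # Downstream code always sees a list, whatever the file contained.
            "images": as_list(record[image_key]),
        })
    return samples


lines = [
    '{"image": "/path/to/image0", "text": "generated caption"}',
    '{"image": ["/path/to/image1", "/path/to/image2"], "text": "another caption"}',
]
samples = load_samples(lines)
```

With this shape, the file can use a singular key for the common one-image case, while the evaluator code still handles a uniform list internally.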

Collaborator

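  1. Could this take two JSON files as input instead, kept in the same order? That way it would stay consistent with the DJ format. I think the sandbox needs to settle on a unified data format first. @HYLcool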
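
One way to read this suggestion, sketched under assumptions: each model's outputs stay in their own JSONL file with the usual keys, and the pairwise records are assembled by position. `build_pairs` is a hypothetical helper, not code from this PR, and the prompt check is an added safeguard because line order is the only link between the two files.

```python
import json


def build_pairs(lines_0, lines_1, text_key="text", image_key="image"):
    """Zip two order-aligned JSONL outputs into pairwise comparison records."""
    records_0 = [json.loads(line) for line in lines_0 if line.strip()]
    records_1 = [json.loads(line) for line in lines_1 if line.strip()]
    if len(records_0) != len(records_1):
        raise ValueError("the two files must contain the same number of samples")
    pairs = []
    for index, (rec_0, rec_1) in enumerate(zip(records_0, records_1)):
        # Both files must describe the same prompt at the same position.
        if rec_0[text_key] != rec_1[text_key]:
            raise ValueError(f"prompt mismatch at sample {index}")
        pairs.append({
            "text": rec_0[text_key],
            "image_0": rec_0[image_key],
            "image_1": rec_1[image_key],
        })
    return pairs


model_a = ['{"text": "a red bus", "image": "/path/to/a0"}']
model_b = ['{"text": "a red bus", "image": "/path/to/b0"}']
pairs = build_pairs(model_a, model_b)
```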

Collaborator

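  1. @HYLcool The default values of image_key / video_key / audio_key currently use the plural forms images/videos/audios and seem to always be defined as lists. In evaluation scenarios we usually generate one image/video from an input prompt, or generate a caption from a given image/video, so each test sample should have only one image/video output. Does it always have to be wrapped in a list? If so, that looks rather cumbersome; I lean toward changing the default keys to singular forms that only denote the category/modality and accept either a single element or a list.

The main issue is that if a dataset contains both single elements and lists, that column is treated as having mismatched types and the dataset cannot be loaded correctly, so at the time we chose lists to cover all of these cases. Although most datasets (including evaluation datasets) do usually contain only one multimodal item per sample, the dataset compositions in some recent MLLM work show that a single sample can also contain multiple multimodal items.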
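
The type-mismatch problem can be illustrated with a small stand-in for a strict schema check. This is only a sketch: `infer_column_type` is a hypothetical function that mimics the behavior described above in plain Python and is not the loader Data-Juicer actually uses.

```python
def infer_column_type(rows, key):
    """Return the one type name shared by every row's value, or raise on a mix."""
    seen = {type(row[key]).__name__ for row in rows}
    if len(seen) > 1:
        # A strict loader cannot assign a single column type in this case.
        raise TypeError(f"column {key!r} mixes types: {sorted(seen)}")
    return seen.pop()


# One sample stores a bare path, another stores a list of paths.
mixed = [
    {"images": "/path/to/a"},
    {"images": ["/path/to/b", "/path/to/c"]},
]
# Every sample stores a list, even when it holds a single path.
uniform = [
    {"images": ["/path/to/a"]},
    {"images": ["/path/to/b", "/path/to/c"]},
]

uniform_type = infer_column_type(uniform, "images")
```

Calling `infer_column_type(mixed, "images")` raises `TypeError`, which is the situation the always-a-list convention avoids.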

Collaborator

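  1. Could this take two JSON files as input instead, kept in the same order? That way it would stay consistent with the DJ format. I think the sandbox needs to settle on a unified data format first. @HYLcool

@yxdyc Please take a look at this.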

Collaborator

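The compare function should randomize the positions: for example, randomly swap text0 and text1, record the winner, and then map it back to the original order. Prior work has shown that LLMs are biased by ordering, so we should ensure E(eval(texts0, texts1)) = E(eval(texts1, texts0)).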
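
A minimal sketch of this randomization, assuming a hypothetical `judge(first, second)` callable that returns 0 when the first argument wins and 1 when the second wins. It stands in for the GPT-4V call and does not reflect the PR's actual compare function; ties are not handled.

```python
import random


def compare_debiased(judge, output_0, output_1, rng=random):
    """Show the two outputs to the judge in random order and map the winner back.

    Returns 0 if output_0 wins and 1 if output_1 wins, in the caller's order.
    """
    swapped = rng.random() < 0.5
    if swapped:
        # The judge saw (output_1, output_0), so flip its answer back.
        return 1 - judge(output_1, output_0)
    return judge(output_0, output_1)


def first_position_judge(first, second):
    """A fully position-biased judge: always prefers whatever is shown first."""
    return 0


def longer_text_judge(first, second):
    """A content-based judge: prefers the longer text, regardless of position."""
    return 0 if len(first) >= len(second) else 1


rng = random.Random(0)
wins_for_0 = sum(
    compare_debiased(first_position_judge, "A", "B", rng) == 0
    for _ in range(1000)
)
```

With the position-biased judge, the wins split roughly evenly between the two outputs instead of always going to `output_0`, while a content-based judge such as `longer_text_judge` gives the same answer whether or not the swap happens.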

This PR is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this PR will be closed in 3 days.

Close this stale PR.

@github-actions github-actions bot closed this Apr 20, 2024
@HYLcool HYLcool reopened this Apr 22, 2024
@HYLcool HYLcool removed the stale-pr label Apr 22, 2024
Collaborator

@yxdyc yxdyc left a comment

LGTM. Plz implement GPT-4V Evaluator accordingly in sandbox later

This PR is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this PR will be closed in 3 days.

Close this stale PR.

@github-actions github-actions bot closed this May 18, 2024
Labels
dj:multimodal issues/PRs about multimodal data processing enhancement New feature or request stale-pr
5 participants