S2 pretrained model of InternVideo2 does not work well for Zero-Shot Video-Text Retrieval #107

Wenju-Huang · 2024-04-22T05:43:24Z

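I ran demo/demo.ipynb directly with the model https://huggingface.co/OpenGVLab/InternVideo2-Stage2_1B-224p-f4/blob/main/InternVideo2-stage2_1b-224p-f4.pt and found that the results are not very good.
Two changes are needed before the model loads correctly:
1. In demo/demo.ipynb, add config['pretrained_path'] = model_pth before the call to setup_internvideo2(config).
2. In demo/utils.py, change lines 82 and 84 to is_pretrain=True.
After these changes, the similarity scores (without softmax) between the video provided in the demo and the 10 sentences are:

The highest-scoring sentence is not the correct description, and the scores of all ten sentences are close to one another.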
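To make the "without softmax" remark concrete, here is a minimal sketch of how raw similarity scores relate to softmax-normalized ones. The scores and the `temperature` parameter are hypothetical and chosen only for illustration; they are not the demo's values or its scaling. The point is that softmax changes the spacing of the scores but never their order, so closely spaced raw scores stay closely ranked.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw similarity scores into probabilities that sum to 1."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ranking(values):
    """Indices sorted from the highest value to the lowest."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

# Hypothetical raw similarities for four candidate sentences (not demo output).
raw_scores = [0.31, 0.29, 0.33, 0.30]

probs = softmax(raw_scores)                    # almost uniform
sharp = softmax(raw_scores, temperature=0.01)  # same order, wider gaps

print(ranking(raw_scores), ranking(probs), ranking(sharp))
print([round(p, 3) for p in probs])
print([round(p, 3) for p in sharp])
```

Because the ranking is unchanged by softmax, a wrong top-1 sentence in the raw scores will still be wrong after normalization.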

SHYuanBest · 2024-04-22T08:22:52Z

I ran into a similar problem.

shepnerd · 2024-04-22T09:47:15Z

The provided example is an extremely hard case. The sentences were generated by GPT from key elements extracted from the video (for example dog, human, snow, playing) to describe its content, so many of them are very close in meaning. We intend to use this case study to show our interest in a deeper understanding of subtle distinctions in motion descriptions.

For ordinary video retrieval, we encourage you to evaluate the model on mainstream benchmarks, where it consistently ranks among the top performers, or to try some more typical examples to get a sense of how it performs.

1093842024 · 2024-04-22T15:58:58Z

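Running the demo code repeatedly gives different results every time. Is there a randomization step inside the model, or is this related to the seed?
Below are the results of two consecutive runs: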

SHYuanBest · 2024-04-23T03:04:38Z

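Adding the code below right after the imports guarantees identical results when you simply rerun. However, there still seems to be some randomization inside the model: if the weights are loaded once more with the same seed, the results change.
seed = 4491734
print("Seed:", seed)

np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)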

leexinhao · 2024-04-23T08:55:11Z

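@SHYuanBest @1093842024 @Wenju-Huang The previous demo had a problem loading the weights: the pretrained weights were not actually loaded correctly. This is now fixed. There is no randomization inside the model; the slight differences between runs are probably due to PyTorch numerical error.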

Wenju-Huang · 2024-04-25T09:03:28Z

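The results I posted above were obtained after fixing the model-loading problem; they just were not normalized with softmax. Even with softmax added, my results still differ from yours. Did you test this model: https://huggingface.co/OpenGVLab/InternVideo2-Stage2_1B-224p-f4/blob/main/InternVideo2-stage2_1b-224p-f4.pt ?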

leexinhao · 2024-04-26T08:00:39Z

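Yes. Is it the ranking of the scores that differs? As I understand it, as long as the ranking is the same, slight differences in the scores are acceptable. We actually noticed this issue quite early. The code may implicitly include some parallel-acceleration modules or the like, which cause small random fluctuations in each computation, but this does not affect the final ranking.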
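One way to check the "same ranking, slightly different scores" criterion is to compare two runs directly. The helper below is a minimal sketch; `compare_runs`, the tolerance, and all score values are hypothetical and are not part of the demo.

```python
def ranking(scores):
    """Indices sorted from the highest score to the lowest."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def compare_runs(scores_a, scores_b, tol=1e-3):
    """Report whether two runs agree on ranking and how far the scores drift."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both runs must score the same number of sentences")
    max_diff = max(abs(a - b) for a, b in zip(scores_a, scores_b))
    rank_a, rank_b = ranking(scores_a), ranking(scores_b)
    return {
        "same_ranking": rank_a == rank_b,
        "same_top1": rank_a[0] == rank_b[0],
        "max_abs_diff": max_diff,
        "within_tol": max_diff <= tol,
    }

# Hypothetical scores for four sentences from three runs (not demo output).
run_1 = [0.3120, 0.2900, 0.3340, 0.3010]
run_2 = [0.3121, 0.2899, 0.3342, 0.3010]  # tiny drift, same order
run_3 = [0.2800, 0.3350, 0.3000, 0.3100]  # order changed

print(compare_runs(run_1, run_2))
print(compare_runs(run_1, run_3))
```

A drift that stays within the tolerance and keeps the ranking would be consistent with numerical noise; a changed ranking, as in the third run, points to something else.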

Wenju-Huang · 2024-04-28T02:41:02Z

The ranking differs as well. These are the results I get:

SHYuanBest · 2024-04-28T04:38:45Z

Yes, the results depend heavily on the random seed that is set.

SHYuanBest · 2024-04-28T04:39:22Z

Could you release the 6B model, even though it may not improve things much?

leexinhao · 2024-04-28T16:32:15Z

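If you switch between several random seeds after the model has been loaded, do the results still change? If they do, it really is numerical error. If they do not, the cause may be that the weights were not loaded correctly, so some weights were randomly initialized.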

SHYuanBest · 2024-04-29T02:18:20Z

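Adding the code below right after the imports guarantees identical results when you simply rerun. However, there still seems to be some randomization inside the model: if the weights are loaded once more with the same seed, the results change.
seed = 4491734
print("Seed:", seed)

np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)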

leexinhao · 2024-04-29T17:09:13Z

@shepnerd @Wenju-Huang @1093842024 @SHYuanBest I found that this bug was caused by our forgetting to call model.eval() in the demo. It has to be called to turn off drop path. The bug is now fixed.
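To illustrate why a missing model.eval() makes results vary from run to run, here is a toy sketch of drop path (stochastic depth) in plain Python. `DropPathBlock` is a hypothetical stand-in, not InternVideo2's code; real implementations typically operate on tensors and drop per sample. It only mimics the convention that a module starts in training mode and that eval() switches the random dropping off.

```python
import random

class DropPathBlock:
    """Toy residual block with drop path (hypothetical, for illustration only)."""

    def __init__(self, drop_prob, rng):
        self.drop_prob = drop_prob
        self.rng = rng
        self.training = True  # modules start in training mode by default

    def eval(self):
        """Switch to inference mode, which disables the random dropping."""
        self.training = False
        return self

    def branch(self, x):
        return 0.5 * x  # stands in for an attention or MLP branch

    def forward(self, x):
        if self.training and self.drop_prob > 0.0:
            if self.rng.random() < self.drop_prob:
                return x  # branch dropped: only the skip connection remains
            # rescale the kept branch so the expected output is unchanged
            return x + self.branch(x) / (1.0 - self.drop_prob)
        return x + self.branch(x)

block = DropPathBlock(drop_prob=0.5, rng=random.Random(0))

# Training mode: the same input gives different outputs across calls.
train_outputs = {block.forward(1.0) for _ in range(50)}

# Eval mode: the output is deterministic.
block.eval()
eval_outputs = {block.forward(1.0) for _ in range(50)}

print(sorted(train_outputs), sorted(eval_outputs))
```

In training mode the branch is randomly skipped, so the scores depend on the random state at inference time; after eval() every call takes the same path, which matches the seed-dependent behaviour reported earlier in the thread.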

JayMay1994 · 2024-05-23T03:56:36Z

I changed text_candidates to a list of the 700 Kinetics class names, but for example.mp4 nothing like "playing with dog" or "playing in snow" is picked. Could you help me? Thanks :P
