Selection range when mining negative samples #790

AugustLHHHHHH · 2024-05-15T12:44:35Z

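Hello, I have a follow-up question about the range that negatives are selected from during mining.

In the multilingual version of the msmarco data (https://microsoft.github.io/msmarco/), each question comes with one negative sample.

When mining more negatives with hn_mine.py, are they selected from the existing neg entries in input_file, or from somewhere else?
Also, can candidate_pool be set to the corpus (the collections provided by msmarco) with the test-set documents excluded?
Thanks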

staoxiao · 2024-05-15T17:42:48Z

@AugustLHHHHHH, we mined hard negatives from the entire msmarco corpus.

AugustLHHHHHH · 2024-05-16T07:35:42Z

Thanks!
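
For the candidate_pool question above, here is a minimal sketch of building a pool file from the msmarco collection while skipping held-out passages. It rests on two assumptions that should be checked against the version of hn_mine.py in use: that the collection is a TSV with one `pid<TAB>passage` row per line, and that candidate_pool expects a JSONL file with one `{"text": ...}` object per line. The function name `build_candidate_pool`, the sample rows, and the excluded ids are hypothetical.

```python
import json
from typing import Iterable, Iterator, Set


def build_candidate_pool(collection_rows: Iterable[str],
                         excluded_ids: Set[str]) -> Iterator[str]:
    """Yield one JSON line per passage whose id is not in excluded_ids.

    Assumes each row looks like "pid<TAB>passage text".
    """
    for row in collection_rows:
        row = row.rstrip("\n")
        if not row:
            continue  # skip blank lines
        pid, _, text = row.partition("\t")
        if pid in excluded_ids or not text:
            continue  # drop held-out or malformed rows
        # Assumed pool format: {"text": "<passage>"} per line.
        yield json.dumps({"text": text}, ensure_ascii=False)


# Toy rows standing in for collection.tsv (hypothetical content).
rows = [
    "0\tfirst passage\n",
    "1\tsecond passage, reserved for testing\n",
    "2\tthird passage\n",
]
pool_lines = list(build_candidate_pool(rows, excluded_ids={"1"}))
```

To write an actual pool file, the same generator can be fed an open file handle for the collection and each yielded line written to a `.jsonl` file, whose path is then passed as candidate_pool.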
