Result discrepancy between the old and new versions #1126

Open
2 tasks done
mengrusun opened this issue May 9, 2024 · 1 comment
Prerequisites

Issue type

I am evaluating with officially supported tasks/models/datasets.

Environment

{'CUDA available': True,
'CUDA_HOME': '/usr',
'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0',
'GPU 0,1,2,3,4,5': 'NVIDIA A800 80GB PCIe',
'MMEngine': '0.10.4',
'MUSA available': False,
'NVCC': 'Cuda compilation tools, release 10.1, V10.1.24',
'OpenCV': '4.9.0',
'PyTorch': '2.3.0+cu121',
'PyTorch compiling details': 'PyTorch built with:\n'
...

Reproducing the issue - code/configuration sample

The configs under configs/models all look like this:
from opencompass.models import HuggingFaceCausalLM

api_meta_template = dict(
    round=[
        dict(role="HUMAN", api_role="HUMAN"),
        dict(role="BOT", api_role="BOT", generate=True),
    ],
)

models = [
    dict(
        abbr="vanilla_llama-2-7b-chat_V1",
        # type=Llama2Chat,
        type=HuggingFaceCausalLM,
        path="xxx",
        tokenizer_path="xxx",
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            use_fast=False,
        ),
        meta_template=api_meta_template,
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        extract_pred_after_decode=True,
        model_kwargs=dict(device_map='auto'),
        batch_padding=False,  # if False, run inference in a for-loop without batch padding
        run_cfg=dict(num_gpus=1, num_procs=1),
    ),
]

Reproducing the issue - command or script

python run.py --models llama --datasets triviaqa_gen

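Reproducing the issue - error message

I ran the triviaqa dataset with a version I downloaded around October of last year and with the current version, and the two results differ considerably.

The .py files written under configs in the outputs directory are as follows: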

(1) triviaqa, original version: result is 45
datasets=[
    dict(abbr='triviaqa',
        eval_cfg=dict(
            evaluator=dict(
                type='opencompass.datasets.TriviaQAEvaluator'),
            pred_role='BOT'),
        infer_cfg=dict(
            inferencer=dict(
                max_out_len=50,
                type='opencompass.openicl.icl_inferencer.GenInferencer'),
            prompt_template=dict(
                template=dict(
                    round=[
                        dict(prompt="Answer these questions, your answer should be as simple as possible, start your answer with the prompt 'The answer is '.\nQ: {question}?",
                            role='HUMAN'),
                        dict(prompt='A:',
                            role='BOT'),
                    ]),
                type='opencompass.openicl.icl_prompt_template.PromptTemplate'),
            retriever=dict(
                type='opencompass.openicl.icl_retriever.ZeroRetriever')),
        path='./data/triviaqa/',
        reader_cfg=dict(
            input_columns=[
                'question',
            ],
            output_column='answer',
            test_split='dev',
            train_split='dev'),
        type='opencompass.datasets.TriviaQADataset'),
]
models=[
    dict(abbr='llama',
        batch_padding=False,
        batch_size=8,
        extract_pred_after_decode=True,
        max_out_len=100,
        max_seq_len=2048,
        meta_template=dict(
            round=[
                dict(api_role='HUMAN',
                    role='HUMAN'),
                dict(api_role='BOT',
                    generate=True,
                    role='BOT'),
            ]),
        model_kwargs=dict(
            device_map='auto'),
        path='xx',
        run_cfg=dict(
            num_gpus=1,
            num_procs=1),
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            use_fast=False),
        tokenizer_path='xx',
        type='opencompass.models.HuggingFaceCausalLM'),
]
summarizer=None
work_dir='./outputs/default/20231012_212838'

(2) triviaqa, current version: result is 55
datasets=[
    dict(abbr='triviaqa',
        eval_cfg=dict(
            evaluator=dict(
                type='opencompass.datasets.TriviaQAEvaluator'),
            pred_role='BOT'),
        infer_cfg=dict(
            inferencer=dict(
                max_out_len=50,
                type='opencompass.openicl.icl_inferencer.GenInferencer'),
            prompt_template=dict(
                template=dict(
                    round=[
                        dict(prompt="Answer these questions, your answer should be as simple as possible, start your answer with the prompt 'The answer is '.\nQ: {question}?",
                            role='HUMAN'),
                        dict(prompt='A:',
                            role='BOT'),
                    ]),
                type='opencompass.openicl.icl_prompt_template.PromptTemplate'),
            retriever=dict(
                type='opencompass.openicl.icl_retriever.ZeroRetriever')),
        path='./data/triviaqa/',
        reader_cfg=dict(
            input_columns=[
                'question',
            ],
            output_column='answer',
            test_split='dev',
            train_split='dev'),
        type='opencompass.datasets.TriviaQADataset'),
]
models=[
    dict(abbr='llama',
        batch_padding=False,
        batch_size=8,
        extract_pred_after_decode=True,
        max_out_len=100,
        max_seq_len=2048,
        meta_template=dict(
            round=[
                dict(api_role='HUMAN',
                    role='HUMAN'),
                dict(api_role='BOT',
                    generate=True,
                    role='BOT'),
            ]),
        model_kwargs=dict(
            device_map='auto'),
        path='xx',
        run_cfg=dict(
            num_gpus=1,
            num_procs=1),
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            use_fast=False),
        tokenizer_path='xx',
        type='opencompass.models.HuggingFaceCausalLM'),
]
summarizer=dict(
    summary_groups=[
        dict(name='agieval-chinese',
            subsets=[
                'agieval-gaokao-chinese',

......

work_dir='./outputs/default/20240507_115016'
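
The two dumps above look almost identical by eye, so a mechanical diff is a quick way to confirm which lines really changed. The following is a minimal sketch using Python's standard `difflib`; the helper names `diff_configs` and `changed_lines` are made up for illustration and are not part of OpenCompass.

```python
import difflib


def diff_configs(old_text, new_text):
    """Return the unified-diff lines between two dumped config texts."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


def changed_lines(diff):
    """Keep only added/removed lines, skipping the two file-header lines."""
    # unified_diff emits '--- old' and '+++ new' first (or nothing at all
    # when the inputs are identical), so slicing off two lines is safe.
    return [line for line in diff[2:] if line[:1] in ("+", "-")]
```

To use it, read the two dumped .py files from each run's configs directory into strings and pass them in. In the excerpts pasted here, the datasets and models sections match line for line and only the summarizer and work_dir entries differ, which would suggest looking outside the dumped config (for example at the evaluator or model wrapper code of each version).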

Other information

What causes the results to differ?
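
One way to narrow this down is to compare the two runs sample by sample rather than only by the final score. The sketch below assumes each run saved a JSON file of predictions that maps a sample id to a dict with fields such as `prediction` and `origin_prompt`; that layout and those field names are assumptions to check against the actual files in each work_dir, and `differing_samples` is a hypothetical helper.

```python
import json


def load_predictions(path):
    """Load one run's prediction file (assumed to be a JSON object)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def differing_samples(old, new, field="prediction"):
    """Return the sample ids whose `field` differs between two runs.

    `old` and `new` are assumed to map sample id -> dict of fields.
    Only ids present in both runs are compared.
    """
    shared = sorted(set(old) & set(new))
    return [k for k in shared if old[k].get(field) != new[k].get(field)]
```

Calling `differing_samples(old, new, field="origin_prompt")` first would show whether the two versions build the prompt differently; if the prompts match but `prediction` differs, the change is more likely in generation or post-processing.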
