The evaluation code has become more lenient:
https://github.com/open-compass/opencompass/blob/19d7e630d6216550a56c8df572cce481c22f2ddc/opencompass/datasets/triviaqa.py#L88C1-L89C66
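The linked lines are the answer-matching logic in triviaqa.py. To see why a looser matching rule alone can move the score, here is a hedged sketch (function names and normalization are illustrative, not the actual OpenCompass implementation) contrasting a strict rule that requires the prediction to begin with a gold answer against a lenient rule that accepts the gold answer anywhere in the normalized prediction:

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def strict_match(prediction: str, golds: list[str]) -> bool:
    """Stricter rule: the normalized prediction must start with a gold answer."""
    pred = normalize(prediction)
    return any(pred.startswith(normalize(g)) for g in golds)


def lenient_match(prediction: str, golds: list[str]) -> bool:
    """More lenient rule: a gold answer may appear anywhere in the prediction."""
    pred = normalize(prediction)
    return any(normalize(g) in pred for g in golds)


pred = "The answer is Paris, the capital of France."
golds = ["Paris"]
print(strict_match(pred, golds))   # False: prediction begins with "answer is ..."
print(lenient_match(pred, golds))  # True: "paris" occurs inside the prediction
```

A model that follows the "start your answer with 'The answer is '" instruction is penalized by the strict rule but credited by the lenient one, which is consistent with the score rising from 45 to 55 on identical configs.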
kennymckormick
Prerequisites
Issue Type

I am evaluating with an officially supported task/model/dataset.
Environment
{'CUDA available': True,
'CUDA_HOME': '/usr',
'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0',
'GPU 0,1,2,3,4,5': 'NVIDIA A800 80GB PCIe',
'MMEngine': '0.10.4',
'MUSA available': False,
'NVCC': 'Cuda compilation tools, release 10.1, V10.1.24',
'OpenCV': '4.9.0',
'PyTorch': '2.3.0+cu121',
'PyTorch compiling details': 'PyTorch built with:\n'
...
Reproduces the problem - code/configuration sample

All configs under configs/models use:
from opencompass.models import HuggingFaceCausalLM  # needed for type= below

api_meta_template = dict(
round=[
dict(role="HUMAN", api_role="HUMAN"),
dict(role="BOT", api_role="BOT", generate=True),
],
)
models = [
dict(
abbr="vanilla_llama-2-7b-chat_V1",
# type=Llama2Chat,
type=HuggingFaceCausalLM,
path="xxx",
tokenizer_path="xxx",
tokenizer_kwargs=dict(padding_side='left',
truncation_side='left',
use_fast=False,
),
meta_template=api_meta_template,
max_out_len=100,
max_seq_len=2048,
batch_size=8,
extract_pred_after_decode=True,
model_kwargs=dict(device_map='auto'),
batch_padding=False,  # if False, run inference in a for-loop without batch padding
run_cfg=dict(num_gpus=1, num_procs=1),
),
]
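For context, the meta_template above only maps the dataset roles (HUMAN/BOT) onto the model's dialogue rounds, with generate=True marking where the model starts decoding. A rough sketch of how the triviaqa round template gets filled in for one question (the render helper is hypothetical and is not how HuggingFaceCausalLM actually serializes prompts):

```python
# Round template from the triviaqa_gen dataset config (see the dump below).
dataset_round = [
    dict(role="HUMAN",
         prompt="Answer these questions, your answer should be as simple as "
                "possible, start your answer with the prompt 'The answer is '.\n"
                "Q: {question}?"),
    dict(role="BOT", prompt="A:"),
]


def render(rounds, question):
    # Hypothetical rendering: fill {question} into each round and join with newlines.
    return "\n".join(r["prompt"].format(question=question) for r in rounds)


print(render(dataset_round, "Which city hosted the 1900 Summer Olympics"))
```

The model then generates its completion after the trailing "A:", and the evaluator scores whatever follows.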
Reproduces the problem - command or script
python run.py --models llama --datasets triviaqa_gen
Reproduces the problem - error message

The version I downloaded around October last year and the current version produce noticeably different results on the triviaqa dataset.

The dumped .py config files under outputs are shown below.

**(1) triviaqa**
Original version, result: 45
datasets=[
dict(abbr='triviaqa',
eval_cfg=dict(
evaluator=dict(
type='opencompass.datasets.TriviaQAEvaluator'),
pred_role='BOT'),
infer_cfg=dict(
inferencer=dict(
max_out_len=50,
type='opencompass.openicl.icl_inferencer.GenInferencer'),
prompt_template=dict(
template=dict(
round=[
dict(prompt="Answer these questions, your answer should be as simple as possible, start your answer with the prompt 'The answer is '.\nQ: {question}?",
role='HUMAN'),
dict(prompt='A:',
role='BOT'),
]),
type='opencompass.openicl.icl_prompt_template.PromptTemplate'),
retriever=dict(
type='opencompass.openicl.icl_retriever.ZeroRetriever')),
path='./data/triviaqa/',
reader_cfg=dict(
input_columns=[
'question',
],
output_column='answer',
test_split='dev',
train_split='dev'),
type='opencompass.datasets.TriviaQADataset'),
]
models=[
dict(abbr='llama',
batch_padding=False,
batch_size=8,
extract_pred_after_decode=True,
max_out_len=100,
max_seq_len=2048,
meta_template=dict(
round=[
dict(api_role='HUMAN',
role='HUMAN'),
dict(api_role='BOT',
generate=True,
role='BOT'),
]),
model_kwargs=dict(
device_map='auto'),
path='xx',
run_cfg=dict(
num_gpus=1,
num_procs=1),
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
use_fast=False),
tokenizer_path='xx',
type='opencompass.models.HuggingFaceCausalLM'),
]
summarizer=None
work_dir='./outputs/default/20231012_212838'
(2) Current version, result: 55
datasets=[
dict(abbr='triviaqa',
eval_cfg=dict(
evaluator=dict(
type='opencompass.datasets.TriviaQAEvaluator'),
pred_role='BOT'),
infer_cfg=dict(
inferencer=dict(
max_out_len=50,
type='opencompass.openicl.icl_inferencer.GenInferencer'),
prompt_template=dict(
template=dict(
round=[
dict(prompt="Answer these questions, your answer should be as simple as possible, start your answer with the prompt 'The answer is '.\nQ: {question}?",
role='HUMAN'),
dict(prompt='A:',
role='BOT'),
]),
type='opencompass.openicl.icl_prompt_template.PromptTemplate'),
retriever=dict(
type='opencompass.openicl.icl_retriever.ZeroRetriever')),
path='./data/triviaqa/',
reader_cfg=dict(
input_columns=[
'question',
],
output_column='answer',
test_split='dev',
train_split='dev'),
type='opencompass.datasets.TriviaQADataset'),
]
models=[
dict(abbr='llama',
batch_padding=False,
batch_size=8,
extract_pred_after_decode=True,
max_out_len=100,
max_seq_len=2048,
meta_template=dict(
round=[
dict(api_role='HUMAN',
role='HUMAN'),
dict(api_role='BOT',
generate=True,
role='BOT'),
]),
model_kwargs=dict(
device_map='auto'),
path='xx',
run_cfg=dict(
num_gpus=1,
num_procs=1),
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
use_fast=False),
tokenizer_path='xx',
type='opencompass.models.HuggingFaceCausalLM'),
]
summarizer=dict(
summary_groups=[
dict(name='agieval-chinese',
subsets=[
'agieval-gaokao-chinese',
......
work_dir='./outputs/default/20240507_115016'
Other information

What is causing the difference in results?