
[Bug] Long text evaluation parameters are not clear #1035

Open
bullw opened this issue Apr 10, 2024 · 3 comments
bullw commented Apr 10, 2024

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

python 3.10.1
OpenCompass 0.2.3
vllm 0.2.3

Reproduces the problem - code/configuration sample

configs/models/chatglm/vllm_chatglm2_6b_32k.py

from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr='chatglm2-6b-32k-vllm',
        path='THUDM/chatglm2-6b-32k',
        max_out_len=512,
        max_seq_len=4096,
        batch_size=32,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]

Reproduces the problem - command or script

python run.py --model vllm_chatglm2_6b_32k --datasets longbench leval

Reproduces the problem - error message

My evaluation scores differ from the documented long-text evaluation results by about 20 points; I cannot reproduce the documented scores.

  1. Should the max_seq_len and max_out_len parameters be modified in some way?

Other information

No response

liushz (Collaborator) commented Apr 10, 2024

For optimal performance, set max_seq_len to the highest value feasible, such as 32768 or even higher if possible. As for max_out_len, it typically has a preset default within the dataset configuration; you can adjust it to 256, or simply keep the default.
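Applied to the config above, that advice would look roughly like the sketch below. This is one reading of the suggestion, not an official recommendation: the 32768 value assumes the model's full 32k context window, and the reduced batch_size is an assumption about GPU memory at longer sequence lengths, not something stated in this thread.

from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr='chatglm2-6b-32k-vllm',
        path='THUDM/chatglm2-6b-32k',
        # Raise max_seq_len to the model's full 32k context so long
        # LongBench/LEval inputs are not truncated at 4096 tokens.
        max_seq_len=32768,
        # Per the reply above, this can be 256, or left at each
        # dataset's preset default.
        max_out_len=256,
        # Assumption: a smaller batch may be needed at 32k sequence length.
        batch_size=8,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]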


bullw commented Apr 12, 2024

Thank you very much. I reproduced most of the scores.

One more question: for the subsets scored with rouge1, rouge2, rougeL, and rougeLsum, the score differences are still very large.

  1. What could be the reason?
  2. Which of these metrics is used for the leaderboard ranking?

[two screenshots of the ROUGE score comparison attached]

bullw commented Apr 12, 2024

@liushz
