[Improvement] Multiprocess Evaluation Time Bug #1115

wdndev opened this issue May 6, 2024 · 1 comment
wdndev commented May 6, 2024

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

This is my CPU and GPU; I used the following machine for the test, with max workers = 32:

CPU Info : AMD EPYC 7713 64-Core Processor (255 logical CPUs)
GPU Info : NVIDIA H800-SXM4-80GB x 8

Reproduces the problem - code/configuration sample

official code

Reproduces the problem - command or script

  1. I used a 2B model (for example Qwen1.5-1.8B) to test 13 datasets, with the model loaded via Hugging Face.
  2. I recorded the time taken for each stage, and found that the inference task took about 20 minutes and the evaluation task took about 12 minutes.
  3. The files used to compute the ppl and gen scores (the predictions dir) total only about 500 MB of data, so why does evaluation take 12 minutes with multiprocessing? Shouldn't processing 500 MB finish in about a minute?
  4. I looked at the code that runs the evaluation task (opencompass/runners/local.py, lines 61–210) and found that the most time-consuming part of evaluation is the serialization and deserialization of the configuration file (writing it to disk and loading it back). The code looks like this:
# opencompass/runners/local.py, lines 180-188
        # Dump task config to file
        mmengine.mkdir_or_exist('tmp/')
        param_file = f'tmp/{os.getpid()}_{index}_params.py'
        try:
            task.cfg.dump(param_file)  # <-- the most time-consuming step
            tmpl = get_command_template(gpu_ids)
            get_cmd = partial(task.get_command,
                              cfg_path=param_file,
                              template=tmpl)
            # ... (rest of the try block omitted)
  5. When the tasks were started, I split the evaluation task from the inference task. The inference task is unchanged, while the evaluation task drops the multi-process subprocess launch:
  • step 1: modify the submit function (opencompass/runners/local.py, line 133)
def submit(task, index):
    # ...

    if num_gpus > 0:
        tqdm.write(f'launch {task.name} on GPU ' +
                   ','.join(map(str, gpu_ids)))
    else:
        tqdm.write(f'launch {task.name} on CPU ')

    # modified: run evaluation tasks in-process instead of spawning a subprocess
    if 'OpenICLEvalTask' in self.task_cfg['type']:
        res = self._launch_eval(task, gpu_ids, index)
    else:
        res = self._launch_infer(task, gpu_ids, index)  # the old self._launch

    pbar.update()

    with lock:
        gpus[gpu_ids] += 1
    return res
  • step 2: add a new self._launch_eval function
def _launch_eval(self, task, gpu_ids, index):
    logger = get_logger()
    task_name = task.name
    out_path = task.get_log_path(file_extension='out')
    mmengine.mkdir_or_exist(osp.split(out_path)[0])

    from opencompass.tasks.openicl_eval import OpenICLEvalTask

    start_time = time.time()
    exitcode = 0
    try:
        # Run the evaluation task directly in this process, skipping the
        # config dump and subprocess launch entirely.
        inferencer = OpenICLEvalTask(task.cfg)
        inferencer.run()
    except Exception as e:
        logger.error(f'task {task_name} raised: {e}')
        exitcode = 1

    end_time = time.time()
    logger.info(f'time elapsed: {end_time - start_time:.2f}s')

    if exitcode != 0:
        logger.error(f'exitcode {exitcode}, task {task_name} failed, see\n{out_path}')

    return task_name, exitcode
  6. With the modified code, evaluating the same 13 datasets took about 40 seconds, down from roughly 12 minutes.
  7. So I hope the maintainers can fix this bug. In my current modification, per-dataset evaluation logs are no longer written to their log files, so I have not opened a PR.
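To make the overhead concrete, here is a minimal, self-contained sketch (not OpenCompass code; `run_eval_direct`, `run_eval_via_file`, and the toy config are hypothetical stand-ins) comparing a direct in-process call against a file-based config handoff like the `task.cfg.dump(param_file)` round trip quoted above:

```python
import json
import os
import tempfile
import time


def run_eval_direct(cfg):
    # In-process evaluation: the config object is passed directly.
    return sum(cfg["scores"]) / len(cfg["scores"])


def run_eval_via_file(cfg, tmp_dir):
    # File-based handoff: dump the config to disk, then re-load it,
    # mimicking the serialize/deserialize round trip each subprocess pays.
    path = os.path.join(tmp_dir, f"{os.getpid()}_params.json")
    with open(path, "w") as f:
        json.dump(cfg, f)
    with open(path) as f:
        loaded = json.load(f)
    os.remove(path)
    return sum(loaded["scores"]) / len(loaded["scores"])


def benchmark(n_tasks=200):
    # Toy stand-in for a task config; real OpenCompass configs are far larger.
    cfg = {"model": "toy-2b", "scores": list(range(1000))}
    with tempfile.TemporaryDirectory() as tmp_dir:
        t0 = time.perf_counter()
        direct = [run_eval_direct(cfg) for _ in range(n_tasks)]
        t_direct = time.perf_counter() - t0

        t0 = time.perf_counter()
        via_file = [run_eval_via_file(cfg, tmp_dir) for _ in range(n_tasks)]
        t_file = time.perf_counter() - t0
    assert direct == via_file  # both paths produce identical results
    return t_direct, t_file
```

On a real setup the gap is far larger than this toy JSON round trip suggests, since OpenCompass dumps a full Python config per task and the launched subprocess must also pay interpreter startup and re-import costs before parsing it.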

Reproduces the problem - error message

Evaluation time improvement

Other information

No response

tonysy (Collaborator) commented May 9, 2024

Thanks for the report. We will look into this issue and update soon.
