[Improvement] Mutiprocess Evaluation time Bug #1115

wdndev · 2024-05-06T11:07:23Z

Prerequisite

I have searched Issues and Discussions but cannot get the expected help.
The bug has not been fixed in the latest version.

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

this is my cpu and gpu, I used the following machine for the test, max-works=32

CPU Info :     255  AMD EPYC 7713 64-Core Processor
GPU Info : NVIDIA H800-SXM4-80GB x 8

Reproduces the problem - code/configuration sample

office code

Reproduces the problem - command or script

I used a 2B model (for example qwen1.5 1.8B) to test 13 datasets, and the model was loaded using Huggingface.
I recoeded the time token for each section, and found that the infer task took about 20minutes and the evaluation task took about 12 minutes.
I found the file to calculate ppl and gen scores (predictions dir), about 500M of memory, so, why does it take 12 minutes in multipreocess? Shouldn't a calculation of 500M be done in about a minute?
I found the code to run the evaluation task (opencompass/runners/local.py line 61 ~ 210 ), and found that the most time-consuming part of the evaluation was the serialization and deserialization of the configuration file (disk landing and loading). The code looks like this :

# opencompass/runners/local.py  line 180 ~ 188
        # Dump task config to file
        mmengine.mkdir_or_exist('tmp/')
        param_file = f'tmp/{os.getpid()}_{index}_params.py'
        try:
            task.cfg.dump(param_file)     # **************  the most time-consuming
            tmpl = get_command_template(gpu_ids)
            get_cmd = partial(task.get_command,
                              cfg_path=param_file,
                              template=tmpl)

When the task was started, I divided the evaluation task and the inference task. The inference task did not change, and the evaluation task eliminated the multi-process operation:

step 1 : Modify the submit function (opencompass/runners/local.py lines 133)

def submit(task, index):
    # ...

    if num_gpus > 0:
        tqdm.write(f'launch {task.name} on GPU ' +
                    ','.join(map(str, gpu_ids)))
    else:
        tqdm.write(f'launch {task.name} on CPU ')

    # Modify  Modify  Modify 
    if "OpenICLEvalTask" in self.task_cfg['type']:
        res = self._launch_eval(task, gpu_ids, index)
    else:
        res = self._launch_infer(task, gpu_ids, index)   # old self._launch

    pbar.update()

    with lock:
        gpus[gpu_ids] += 1 
    return res

step 2: new add self._launch_eval function

def _launch_eval(self, task, gpu_ids, index):
    logger = get_logger()
    task_name = task.name
    out_path = task.get_log_path(file_extension='out')
    mmengine.mkdir_or_exist(osp.split(out_path)[0])

    from opencompass.tasks.openicl_eval import OpenICLEvalTask

    start_time = time.time()
    exitcode = 0
    try:
        inferencer = OpenICLEvalTask(task.cfg)
        inferencer.run()
    except Exception as e:
        print("except: ", e)
        exitcode = 1

    end_time = time.time()
    get_logger().info(f'time elapsed: {end_time - start_time:.2f}s')

    if exitcode != 0:
        logger.error(f'exitcode {exitcode}, task {task_name} failed, see\n{out_path}')

    return task_name, exitcode

It took me 40 seconds to run the 13 datasets before the revised code evaluation.
so, i hope the authorities can improve this bug. And I am now modifying the code so that logs cannot be written to the evaluated datasets logs, so PR is not created!

Reproduces the problem - error message

Evaluation time improvement

Other information

No response

The text was updated successfully, but these errors were encountered:

tonysy · 2024-05-09T04:07:38Z

Thanks for the report. We will look into this issue and update soon.

mm-assistant bot assigned Leymore May 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Improvement] Mutiprocess Evaluation time Bug #1115

[Improvement] Mutiprocess Evaluation time Bug #1115

wdndev commented May 6, 2024

tonysy commented May 9, 2024

[Improvement] Mutiprocess Evaluation time Bug #1115

[Improvement] Mutiprocess Evaluation time Bug #1115

Comments

wdndev commented May 6, 2024

Prerequisite

Type

Environment

Reproduces the problem - code/configuration sample

Reproduces the problem - command or script

Reproduces the problem - error message

Other information

tonysy commented May 9, 2024