
[Feature] How can vLLM-based evaluation use multi-GPU data parallelism like HF does? #1002

noforit opened this issue Mar 26, 2024 · 15 comments

@noforit

noforit commented Mar 26, 2024

Describe the feature

When I evaluate with the model type set to vllm, my parameters are as follows:
[image]
But the GPU usage shows that only one card is being used for the evaluation task.
[image]
I would like the task to be split into several parts and evaluated in parallel on 8 GPUs. Could this feature be added, or is it already possible? Could you please explain? Many thanks!
For comparison, if I set the model type to HF, this happens automatically.
[image]
[image]

Will you implement it yourself?

  • I would like to implement this feature myself and contribute the code to OpenCompass!
@liushz
Collaborator

liushz commented Mar 26, 2024

[image]
Like the config above, you can set model_kwargs=dict(tensor_parallel_size=8) for your case.
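A minimal sketch of what such a model entry might look like (the abbr, path, and other values are placeholders rather than the contents of the screenshot; only model_kwargs and run_cfg carry the point):

```python
from opencompass.models import VLLM

# Hypothetical model entry illustrating tensor parallelism across 8 GPUs;
# every value except model_kwargs/run_cfg is a placeholder.
models = [
    dict(
        type=VLLM,
        abbr='my-model-vllm',                       # placeholder abbreviation
        path='/path/to/your/model',                 # placeholder checkpoint path
        model_kwargs=dict(tensor_parallel_size=8),  # shard one model over 8 GPUs
        max_out_len=100,
        max_seq_len=2048,
        batch_size=32,
        run_cfg=dict(num_gpus=8, num_procs=1),      # reserve 8 GPUs for this worker
    )
]
```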

@noforit
Author

noforit commented Mar 26, 2024

@liushz Thank you for your response; I appreciate your clarification. However, the parameter in your reply pertains to setting tensor parallelism in vLLM. My intention is to load the entire model onto each of the eight GPUs, thereby distributing tasks in parallel across these GPUs. This approach should theoretically yield an eightfold acceleration in evaluation speed.

@andakai

andakai commented Mar 27, 2024

> @liushz Thank you for your response; I appreciate your clarification. However, the parameter in your reply pertains to setting tensor parallelism in vLLM. My intention is to load the entire model onto each of the eight GPUs, thereby distributing tasks in parallel across these GPUs. This approach should theoretically yield an eightfold acceleration in evaluation speed.

Hi @liushz, I would also like to know how to achieve data parallelism with vLLM during evaluation.

@tonysy
Collaborator

tonysy commented Mar 27, 2024

@noforit
Author

noforit commented Mar 28, 2024

@tonysy Could you possibly offer a quick example? I'm quite unsure how to use it. Many thanks for your assistance.

@IcyFeather233
Contributor

@noforit
Author

noforit commented Apr 1, 2024

@IcyFeather233 Thanks 😂. I understand that tensor_parallel_size can be set to the number of GPUs (2, 4, 8) to get sharded, tensor-parallel inference. What I mean is keeping tensor_parallel_size at 1, loading a full copy of the model on every GPU, and then running data parallelism so that different data of the same task are evaluated at the same time. I recently implemented this using NumWorkerPartitioner. The key parameter configuration is below for anyone who needs it. @darrenglow. Thanks also to @tonysy. It would be great if this could make it into the documentation soon.
[image]
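Since the configuration itself is only in the screenshot, here is a minimal sketch of what such a data-parallel setup could look like; the model abbr, path, and batch size are placeholders, and the import paths follow typical OpenCompass configs rather than the screenshot:

```python
from opencompass.models import VLLM
from opencompass.partitioners import NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

# Hypothetical reconstruction of a data-parallel setup: each worker gets a
# slice of the dataset and a full copy of the model on its own GPU.
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),  # split each dataset into 8 shards
    runner=dict(
        type=LocalRunner,
        max_num_workers=8,                 # run up to 8 inference tasks at once
        task=dict(type=OpenICLInferTask),
    ),
)

models = [
    dict(
        type=VLLM,
        abbr='my-model-vllm',                        # placeholder
        path='/path/to/your/model',                  # placeholder
        model_kwargs=dict(tensor_parallel_size=1),   # one full model per GPU
        max_out_len=100,
        max_seq_len=2048,
        batch_size=32,
        run_cfg=dict(num_gpus=1, num_procs=1),       # one GPU per worker
    )
]
```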

@Zbaoli

Zbaoli commented Apr 9, 2024

@noforit This is how I configured it, but still only one GPU is running. Could you help me figure out why?

from opencompass.models import VLLM
from opencompass.partitioners import NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=2),
    runner=dict(
        type=LocalRunner,
        max_num_workers=16,
        task=dict(type=OpenICLInferTask))
)
models = [
    dict(
        type=VLLM,
        abbr='qwen-7b-chat-vllm',
        path="/home/zbl/data/llm/qwen/Qwen-7B-Chat",
        model_kwargs=dict(tensor_parallel_size=1),
        meta_template=_meta_template,
        max_out_len=100,
        max_seq_len=2048,
        batch_size=100,
        generation_kwargs=dict(temperature=0),
        end_str='<|im_end|>',
    )
]

@Zbaoli

Zbaoli commented Apr 9, 2024

@IcyFeather233 I understand what you mean: the tensor_parallel_size parameter enables multi-GPU inference, but when I tried it, multi-GPU inference was not faster than a single GPU.
So what I want is to run multiple tasks in parallel: for example, with n tasks and m model instances, each instance runs inference for one task.

@noforit
Author

noforit commented Apr 9, 2024

@Zbaoli Comparing your parameters with mine, there is one difference:
[image]
Try adding it? (see the sketch below)
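As a sketch of what that addition might look like in the config posted above (the meta_template line is omitted here because _meta_template is defined elsewhere in that config; note that in OpenCompass configs the keyword is usually spelled num_procs):

```python
from opencompass.models import VLLM

# Same model entry as above, with run_cfg added.
models = [
    dict(
        type=VLLM,
        abbr='qwen-7b-chat-vllm',
        path="/home/zbl/data/llm/qwen/Qwen-7B-Chat",
        model_kwargs=dict(tensor_parallel_size=1),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=100,
        generation_kwargs=dict(temperature=0),
        end_str='<|im_end|>',
        run_cfg=dict(num_gpus=1, num_procs=1),  # one GPU, one process per inference worker
    )
]
```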

@Zbaoli

Zbaoli commented Apr 9, 2024

@noforit Thanks for your reply, but after adding run_cfg=dict(num_gpus=1, num_proces=1) to the models config, there is still only one GPU running.

@noforit
Author

noforit commented Apr 9, 2024

@Zbaoli Strange 😂. What about setting CUDA_VISIBLE_DEVICES before launching the program?
[image]
Or you could debug inside /opencompass/opencompass/runners/local.py? It automatically detects the number of GPUs and so on.
Shall we add each other on WeChat? I'll email you.
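For the CUDA_VISIBLE_DEVICES suggestion, a minimal Python launcher sketch (the config path is hypothetical; this is equivalent to prefixing the shell command with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7):

```python
import os
import subprocess

# Make all eight GPUs visible to OpenCompass; the child process inherits this.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"

# Hypothetical config path; replace with your own evaluation config.
subprocess.run(["python", "run.py", "configs/eval_qwen_vllm.py"], check=True)
```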

@guoaoo

guoaoo commented Apr 10, 2024

> @IcyFeather233 Thanks 😂. I understand that tensor_parallel_size can be set to the number of GPUs (2, 4, 8) to get sharded, tensor-parallel inference. What I mean is keeping tensor_parallel_size at 1, loading a full copy of the model on every GPU, and then running data parallelism so that different data of the same task are evaluated at the same time. I recently implemented this using NumWorkerPartitioner. The key parameter configuration is below for anyone who needs it. @darrenglow. Thanks also to @tonysy. It would be great if this could make it into the documentation soon. [image]

After using NumWorkerPartitioner here, the dataset was split into 8 parts, but the final summary cannot aggregate the metric results of the split datasets back together. Do you run into this as well?

@caotianjia

> @liushz Thank you for your response; I appreciate your clarification. However, the parameter in your reply pertains to setting tensor parallelism in vLLM. My intention is to load the entire model onto each of the eight GPUs, thereby distributing tasks in parallel across these GPUs. This approach should theoretically yield an eightfold acceleration in evaluation speed.

May I ask: doesn't the SizePartitioner provided by OpenCompass already split the dataset? Or is NumWorkerPartitioner's way of partitioning more efficient?

@bittersweet1999
Collaborator

> @liushz Thank you for your response; I appreciate your clarification. However, the parameter in your reply pertains to setting tensor parallelism in vLLM. My intention is to load the entire model onto each of the eight GPUs, thereby distributing tasks in parallel across these GPUs. This approach should theoretically yield an eightfold acceleration in evaluation speed.
>
> May I ask: doesn't the SizePartitioner provided by OpenCompass already split the dataset? Or is NumWorkerPartitioner's way of partitioning more efficient?

The SizePartitioner and the NumWorkerPartitioner are two different ways of splitting: one splits by a given task size, the other splits by the number of workers (i.e., GPUs).
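A minimal sketch contrasting the two infer sections, assuming the usual OpenCompass import paths; the numeric values are illustrative only:

```python
from opencompass.partitioners import NumWorkerPartitioner, SizePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

# Split by a fixed chunk size: a task is cut whenever it exceeds max_task_size.
infer_by_size = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=2000),
    runner=dict(type=LocalRunner, max_num_workers=8,
                task=dict(type=OpenICLInferTask)),
)

# Split by worker count: each dataset is cut into num_worker shards, so the
# shards map naturally onto 8 data-parallel GPUs.
infer_by_worker = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, max_num_workers=8,
                task=dict(type=OpenICLInferTask)),
)
```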
