GPU memory is always partly idle during evaluation; how can I use all of it? Single node with 8*80G A800 #1017

listwebit · 2024-04-02T01:48:26Z

Prerequisite

I have searched Issues and Discussions but cannot get the expected help.
The bug has not been fixed in the latest version.

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

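When evaluating the 01.AI (零一万物) Yi 34B model on ceval, only about 20% of GPU memory is used. Which parameter should I adjust to use 100% of the GPU memory?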

Reproduces the problem - code/configuration sample

Adjusting batch_size did not help.

Reproduces the problem - command or script

Adjusting batch_size did not help.

Reproduces the problem - error message

Adjusting batch_size did not help.

Other information

Adjusting batch_size did not help.

IcyFeather233 · 2024-04-08T10:45:40Z

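I have the same problem. Some models can fill the GPU memory: with Qwen14b-chat, for example, I can nearly fill 8*A800. With other models, such as autoj-bilingual-6b, the usage on each of my eight cards stays below 10%. I suspect this depends on the model.

Also, enabling vLLM seems to raise utilization and speed up inference; with HuggingFaceCausalLM, both utilization and speed are lower.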

acylam · 2024-04-28T08:34:23Z

Please check the run_cfg.num_gpus parameter in your model config. The default num_gpus is set to a different value for each supported model:

opencompass/configs/models/yi/hf_yi_34b_chat.py

Line 28 in cce5b6f

run_cfg=dict(num_gpus=2, num_procs=1),
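To make the effect of num_gpus concrete, here is a rough back-of-the-envelope sketch of how much of each card the model weights alone would occupy. The 34e9 parameter count, the 2 bytes per parameter (fp16/bf16), the even split across cards, and the helper name `weight_fraction_per_gpu` are all assumptions for illustration, not part of OpenCompass; activations and the KV cache are ignored.

```python
def weight_fraction_per_gpu(num_params, bytes_per_param, num_gpus, gpu_mem_gb=80.0):
    """Fraction of one GPU's memory taken by an even share of the model weights.

    Hypothetical helper: assumes weights are split evenly across `num_gpus`
    and ignores activations, KV cache, and framework overhead.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    total_weights_gb = num_params * bytes_per_param / 1e9
    return total_weights_gb / num_gpus / gpu_mem_gb


# Same shape as the run_cfg shown in the model config above.
run_cfg = dict(num_gpus=2, num_procs=1)

for n in (run_cfg["num_gpus"], 4, 8):
    frac = weight_fraction_per_gpu(34e9, 2, n)
    print(f"num_gpus={n}: weights fill roughly {frac:.1%} of each 80G card")
```

Under these assumptions, the more cards a single model instance is spread over, the smaller the share of each card it occupies, which is one reason a low per-card figure can appear even when every card is in use.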

202030481266 · 2024-05-03T17:18:31Z

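The underlying reason is that each GPU loads its own instance of the model for inference, so the memory used on each card is low. If batch_size>1, you also need to set batch_padding=True; otherwise inference still effectively runs with a batch of one. In our tests we also found that with batch_padding=True the evaluation accuracy drops, for reasons such as misaligned positional encodings, so we recommend not using batching.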
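The comment does not say exactly how the positional encoding becomes misaligned. As one possible mechanism, and only as an assumption, the sketch below shows how left padding shifts the positions of real tokens when position ids are taken as a plain range instead of being derived from the attention mask. Both function names are hypothetical and this is plain Python, not OpenCompass or Transformers code.

```python
def naive_position_ids(attention_mask):
    """Position ids as a plain 0..n-1 range, ignoring which tokens are padding."""
    return list(range(len(attention_mask)))


def mask_aware_position_ids(attention_mask):
    """Position ids that start counting at the first real (non-pad) token."""
    position_ids = []
    next_pos = 0
    for is_real in attention_mask:
        if is_real:
            position_ids.append(next_pos)
            next_pos += 1
        else:
            position_ids.append(0)  # placeholder value for a padded slot
    return position_ids


# A 3-token prompt left-padded to length 5 (0 = pad, 1 = real token).
mask = [0, 0, 1, 1, 1]
print("naive:     ", naive_position_ids(mask))
print("mask-aware:", mask_aware_position_ids(mask))
```

In the naive case the first real token sits at position 2 rather than 0, so the same prompt is encoded differently depending on how much padding its batch needs.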

202030481266 · 2024-05-03T17:30:34Z

Just add the --max-workers-per-gpu parameter; that way a single card will host multiple worker instances.
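As a simplified model of what this flag changes, the sketch below counts how many evaluation tasks could run at once if every GPU offers `max_workers_per_gpu` slots and every task needs `gpus_per_task` of them. The function name and the slot-counting rule are assumptions made for illustration; the real scheduler may behave differently, and each extra worker still has to fit its own model copy in memory.

```python
def max_concurrent_tasks(total_gpus, gpus_per_task, max_workers_per_gpu=1):
    """Upper bound on simultaneously running tasks under a simple slot model.

    Hypothetical helper: each GPU exposes `max_workers_per_gpu` slots and
    each task occupies one slot on `gpus_per_task` different GPUs.
    """
    if total_gpus <= 0 or gpus_per_task <= 0 or max_workers_per_gpu <= 0:
        raise ValueError("all arguments must be positive")
    if gpus_per_task > total_gpus:
        return 0  # a single task cannot be placed at all
    return (total_gpus * max_workers_per_gpu) // gpus_per_task


# 8 cards, one card per task: doubling workers per GPU doubles the bound.
print(max_concurrent_tasks(8, 1, max_workers_per_gpu=1))
print(max_concurrent_tasks(8, 1, max_workers_per_gpu=2))
```

The bound only helps if the evaluation is split into at least that many tasks; with fewer tasks than slots, some cards stay idle regardless of the flag.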

dh12306 · 2024-05-10T07:31:42Z

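How long does it take you to evaluate MMLU? I am running llama 13b on mmlu_gen with batch=1, and it has not progressed for a long time:

--datasets mmlu_gen --max-out-len 100  --max-seq-len 2048 --batch-size 1 --no-batch-padding --num-gpus 4  --max-workers-per-gpu 2

The GPUs appear to be in use: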

202030481266 · 2024-05-10T18:24:23Z

I have not tested MMLU. On my side, cmmlu + ceval with Internlm-7b on 2*H800 takes about half an hour.

mm-assistant bot assigned bittersweet1999 Apr 2, 2024

bittersweet1999 assigned Leymore Apr 2, 2024
