
Two questions that I want to solve #167

Open
yeptttt opened this issue Mar 18, 2024 · 2 comments
Labels: question (Further information is requested)

Comments

yeptttt commented Mar 18, 2024

Hello, I have two pressing questions at the moment; could you help answer them?
1. GPU VRAM usage seems to hit an upper limit. Is it supported to raise the usable VRAM to 24GB to improve GPU utilization?
2. How can I submit multiple prompts at once for parallel inference?

yeptttt added the question label on Mar 18, 2024
hodlen (Collaborator) commented Apr 6, 2024

  1. Currently our implementation cannot estimate GPU memory usage very precisely, so a small amount of VRAM is wasted. If you want to fill the VRAM as much as possible, you can try setting --vram-budget to a value larger than the physical VRAM size, but one that still does not OOM under your workload.
  2. If you mean parallel inference with the same prompt, you can use examples/batched. For different prompts, --cont-batching in examples/server may help, but we do not recommend it: in our tests it produced clearly incorrect results.
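
For reference, a minimal command-line sketch of both suggestions. The binary and model paths, the prompt, and the positional arguments of the batched example are assumptions (modeled on upstream llama.cpp conventions), not confirmed by this thread; only --vram-budget and --cont-batching are named above, and --vram-budget is assumed to take a value in GB.

```bash
# Sketch only: paths, model file name, and prompt are placeholder assumptions.

# 1. Over-provision the VRAM budget slightly above physical VRAM (e.g. on a
#    24GB card, try a larger value), then back off if the workload OOMs.
./build/bin/main -m ./models/model.powerinfer.gguf -p "Once upon a time" \
  -n 128 --vram-budget 26

# 2a. Same prompt decoded in parallel: the batched example. The positional
#     argument order (model, prompt, n_parallel) follows upstream llama.cpp
#     and is an assumption here.
./build/bin/batched ./models/model.powerinfer.gguf "Once upon a time" 4

# 2b. Different prompts: the server with continuous batching enabled. Per the
#     comment above, this produced visibly wrong results in testing.
./build/bin/server -m ./models/model.powerinfer.gguf --cont-batching
```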

yeptttt (Author) commented Apr 11, 2024


Right now the first problem is that --vram-budget only lets me raise GPU VRAM to 11GB at most, but I would like to use more.
