[Bug] Does 0.4.0 still not support Qwen1.5 110B? #1536

Open
starsliao opened this issue Apr 30, 2024 · 7 comments
Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

I'm running the Qwen1.5-110B model with version 0.4.0 and keep hitting out-of-memory errors; --quant-policy 4 doesn't help either. Is it just not supported yet?
With vLLM I can run Qwen1.5-110B in both FP16 and GPTQ-Int4 without any problem.
My machine has 8x V100 32G.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB. GPU 7 has a total capacity of 31.73 GiB of which 848.44 MiB is free. Including non-PyTorch memory, this process has 30.90 GiB memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Reproduction

lmdeploy serve api_server /Qwen1.5-110B-Chat --server-port 23333 --model-name qwen110b --tp 8 --log-level INFO --quant-policy 4

Environment

LMDeploy: 0.4.0+
transformers: 4.40.1
gradio: Not Found
fastapi: 0.110.3
pydantic: 2.7.1
triton: 2.2.0

Error traceback

No response

@lvhan028
Collaborator

It is supported.

@lvhan028
Collaborator

Please take a look at the notes on memory allocation in the pipeline.md documentation.
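
(For readers landing here, a minimal sketch of the knob that document describes: the TurboMind k/v cache budget is controlled by cache_max_entry_count, which can be lowered either through the --cache-max-entry-count flag of api_server or through TurbomindEngineConfig when using the Python pipeline. The concrete values below are illustrative assumptions, not a configuration verified on 8x V100 32G.)

```python
# Illustrative sketch only: reduce the k/v cache share so more GPU memory is
# left for the model weights. The ratio and session length are assumptions.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    tp=8,                       # tensor parallelism over 8 GPUs
    cache_max_entry_count=0.2,  # fraction of GPU memory given to the k/v cache
    session_len=8192,           # maximum context length per session
)

pipe = pipeline('/Qwen1.5-110B-Chat', backend_config=engine_config)
print(pipe(['Hello!']))
```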

@starsliao
Author

> Please take a look at the notes on memory allocation in the pipeline.md documentation.

I adjusted the memory and token settings and ran the command below, but it still fails.

lmdeploy serve api_server /Qwen1.5-110B-Chat --server-port 23333 --model-name qwen110b --tp 8 --log-level INFO --cache-max-entry-count 0.01 --session-len 2000

With vLLM at 8K tokens, the following command runs fine:

python -m vllm.entrypoints.openai.api_server --served-model-name qwen110b --model /Qwen1.5-110B-Chat/ --dtype=float16 --tensor-parallel-size=8 --gpu-memory-utilization=0.99 --max-model-len=8000 --block-size=32

Does lmdeploy consume more memory?

@lzhangzz
Collaborator

lzhangzz commented May 1, 2024

At the moment the embedding table is not split across the TP ranks, and Qwen's vocabulary is especially large, so the impact is quite noticeable.

I'll look into adding that.
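
(A back-of-the-envelope estimate of why the replicated table hurts; the vocabulary and hidden sizes below are assumptions based on the published Qwen1.5 configuration, not numbers confirmed in this thread.)

```python
# Rough estimate of the memory taken by an unsharded vocabulary table.
# Assumed shapes for Qwen1.5-110B: vocab ~152k, hidden size 8192, fp16 weights.
vocab_size = 152_064
hidden_size = 8_192
bytes_per_param = 2  # fp16

embedding_gib = vocab_size * hidden_size * bytes_per_param / 1024**3
print(f"input embedding:     {embedding_gib:.2f} GiB")      # ~2.3 GiB
print(f"embedding + lm_head: {2 * embedding_gib:.2f} GiB")  # ~4.6 GiB

# Replicated on every rank, that is ~4.6 GiB on each 32 GiB V100; split across
# the 8 TP ranks it would be closer to 0.6 GiB per GPU.
```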

@starsliao
Author

> At the moment the embedding table is not split across the TP ranks, and Qwen's vocabulary is especially large, so the impact is quite noticeable.
>
> I'll look into adding that.

OK, thanks.

@starsliao
Author

> At the moment the embedding table is not split across the TP ranks, and Qwen's vocabulary is especially large, so the impact is quite noticeable.
>
> I'll look into adding that.

Folks, with 8x V100 32G, is there any hope of running Qwen1.5-110B? This framework's performance is really strong: running 72B it beats vLLM by a wide margin, it's fast, and it can fill the full 32K tokens. Really looking forward to it.

@lzhangzz
Collaborator

lzhangzz commented May 8, 2024

It will be supported, but not that quickly; probably about two weeks from now.
