WIP docs(README): add lmdeploy #152

Conversation

tpoisonooo

Here are the Zhihu introduction and the test results for lmdeploy.

Since the wiki cannot receive PRs and only the owner can adjust the wiki pages, I forked Chinese-LLaMA-Alpaca-2 and added the two documents there.

Below is the original markdown content of the two documents:

lmdeploy Installation and Usage

lmdeploy supports transformer architectures (such as LLaMA, LLaMA-2, InternLM, Vicuna, etc.) and currently supports fp16, int8, and int4.

I. Installation

Install the precompiled Python package:

python3 -m pip install lmdeploy
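
To verify the installation, a quick sanity check (this assumes the package exposes a top-level __version__ attribute; if it does not, pip show lmdeploy works as well):

python3 -c "import lmdeploy; print(lmdeploy.__version__)"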

II. fp16 Inference

Convert the model to the lmdeploy inference format. Assuming the Hugging Face version of the LLaMA-2 model has been downloaded to /models/llama-2-7b-chat, the result will be stored in the workspace folder:

python3 -m lmdeploy.serve.turbomind.deploy llama2 /models/llama-2-7b-chat

Test chatting on the command line:

python3 -m lmdeploy.turbomind.chat ./workspace
..
double enter to end input >>> who are you

..
Hello! I'm just an AI assistant ..

You can also launch a Gradio WebUI for chatting:

python3 -m lmdeploy.serve.gradio.app ./workspace

lmdeploy also supports the original Facebook model format and distributed inference for 70B models; see the official lmdeploy documentation for usage.
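
For illustration only, a hypothetical 70B invocation might look like the sketch below. The /models/llama-2-70b-chat path is made up for this example, and the --tp (tensor parallel) option is an assumption about the deploy script; check python3 -m lmdeploy.serve.turbomind.deploy --help for the exact flag in your version.

python3 -m lmdeploy.serve.turbomind.deploy llama2 /models/llama-2-70b-chat --tp 8
python3 -m lmdeploy.turbomind.chat ./workspace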

III. kv cache int8 Quantization

lmdeploy implements kv cache int8 quantization, so the same amount of GPU memory can serve more concurrent users.

First obtain the quantization parameters. The results are saved to the workspace/triton_models/weights directory produced by the fp16 conversion; a 7B model does not need tensor parallelism.

# --work_dir:      Hugging Face format model directory
# --turbomind_dir: directory to save the results
# --kv_sym False:  use asymmetric quantization
# --num_tp:        number of tensor-parallel GPUs
python3 -m lmdeploy.lite.apis.kv_qparams \
  --work_dir /models/llama-2-7b-chat \
  --turbomind_dir ./workspace/triton_models/weights \
  --kv_sym False \
  --num_tp 1

Then modify the inference configuration to enable kv cache int8. Edit workspace/triton_models/weights/config.ini (see the snippet after this list):

  • Change use_context_fmha to 0, which disables FlashAttention
  • Set quant_policy to 4, which enables kv cache quantization
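
For reference, a minimal sketch of how the two edited entries look in config.ini (only these two keys are shown; the rest of the file is left unchanged and may vary between lmdeploy versions):

use_context_fmha = 0
quant_policy = 4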

Finally, run the test:

python3 -m lmdeploy.turbomind.chat ./workspace

Click here to view the kv cache int8 quantization formulas and the accuracy and memory test report.

IV. weight int4 Quantization

lmdeploy implements weight int4 quantization based on the AWQ algorithm. Compared with the fp16 version, inference is 3.16x faster and GPU memory drops from 16 GB to 6.3 GB.

A LLaMA-2 model already optimized with the AWQ algorithm is available here for direct download:

git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4

For your own model, you can use the auto_awq tool to optimize it. Assuming your Hugging Face model is saved in /models/llama-2-7b-chat:

# --w_bits:       number of bits for weight quantization
# --w_group_size: group size used for weight quantization statistics
# --work_dir:     directory to save the quantization results
python3 -m lmdeploy.lite.apis.auto_awq \
  --model /models/llama-2-7b-chat \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir ./llama2-chat-7b-w4

Run the following commands to chat with the model in the terminal:

## Convert the model layout and store it under the default path ./workspace
python3 -m lmdeploy.serve.turbomind.deploy \
    --model-name llama2 \
    --model-path ./llama2-chat-7b-w4 \
    --model-format awq \
    --group-size 128

## Inference
python3 -m lmdeploy.turbomind.chat ./workspace

Click here to view the memory and speed test results of weight int4 quantization.

Note that weight int4 and kv cache int8 do not conflict; they can be enabled at the same time to save even more GPU memory.
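
As a rough end-to-end sketch of combining the two, using only the commands and paths shown earlier in this guide (consult the official lmdeploy documentation for version-specific details):

## 1. weight int4: quantize with AWQ, then convert the layout into ./workspace
python3 -m lmdeploy.lite.apis.auto_awq --model /models/llama-2-7b-chat --w_bits 4 --w_group_size 128 --work_dir ./llama2-chat-7b-w4
python3 -m lmdeploy.serve.turbomind.deploy --model-name llama2 --model-path ./llama2-chat-7b-w4 --model-format awq --group-size 128

## 2. kv cache int8: export quantization parameters into the same workspace
python3 -m lmdeploy.lite.apis.kv_qparams --work_dir /models/llama-2-7b-chat --turbomind_dir ./workspace/triton_models/weights --kv_sym False --num_tp 1

## 3. enable it in workspace/triton_models/weights/config.ini (use_context_fmha = 0, quant_policy = 4), then chat
python3 -m lmdeploy.turbomind.chat ./workspace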

lmdeploy Usage

lmdeploy supports transformer architectures (such as LLaMA, LLaMA-2, InternLM, Vicuna, etc.) and currently supports fp16, int8, and int4.

I. Installation

Install the precompiled Python package:

python3 -m pip install lmdeploy
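
As a quick sanity check of the installation (assuming the package exposes a top-level __version__ attribute; otherwise pip show lmdeploy also works):

python3 -c "import lmdeploy; print(lmdeploy.__version__)"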

II. fp16 Inference

Convert the model to the lmdeploy inference format. Assuming the Hugging Face version of the LLaMA-2 model has been downloaded to the /models/llama-2-7b-chat directory, the result will be stored in the workspace folder:

python3 -m lmdeploy.serve.turbomind.deploy llama2 /models/llama-2-7b-chat

Test chatting on the command line:

python3 -m lmdeploy.turbomind.chat ./workspace
..
double enter to end input >>> who are you

..
Hello! I'm just an AI assistant ..

You can also launch a Gradio WebUI for chatting:

python3 -m lmdeploy.serve.gradio.app ./workspace

lmdeploy also supports the original Facebook model format and distributed inference for 70B models; please refer to the official lmdeploy documentation for usage.
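
As an illustration only, a hypothetical 70B run could be sketched as below; the /models/llama-2-70b-chat path is invented for this example, and the --tp (tensor parallel) option is an assumption about the deploy script, so check python3 -m lmdeploy.serve.turbomind.deploy --help for the exact flag in your version.

python3 -m lmdeploy.serve.turbomind.deploy llama2 /models/llama-2-70b-chat --tp 8
python3 -m lmdeploy.turbomind.chat ./workspace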

III. kv cache int8 Quantization

lmdeploy implements kv cache int8 quantization, so the same amount of GPU memory can serve more concurrent users.

First obtain the quantization parameters. The results are saved to the workspace/triton_models/weights directory produced by the fp16 conversion; a 7B model does not need tensor parallelism.

# --work_dir:      Hugging Face format model directory
# --turbomind_dir: directory to save the results
# --kv_sym False:  use asymmetric quantization
# --num_tp:        number of tensor-parallel GPUs
python3 -m lmdeploy.lite.apis.kv_qparams \
  --work_dir /models/llama-2-7b-chat \
  --turbomind_dir ./workspace/triton_models/weights \
  --kv_sym False \
  --num_tp 1

Then modify the inference configuration to enable kv cache int8. Edit workspace/triton_models/weights/config.ini (see the snippet after this list):

  • Change use_context_fmha to 0, which disables FlashAttention
  • Set quant_policy to 4, which enables kv cache quantization
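
For reference, a minimal sketch of the two edited entries in config.ini (only these keys are shown; the rest of the file stays as generated and may differ between lmdeploy versions):

use_context_fmha = 0
quant_policy = 4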

Finally, run the test:

python3 -m lmdeploy.turbomind.chat ./workspace

Click here to view the kv cache int8 quantization formulas and the accuracy and memory test report.

IV. weight int4 Quantization

lmdeploy implements weight int4 quantization based on the AWQ algorithm. Compared with the fp16 version, inference is 3.16x faster and GPU memory drops from 16 GB to 6.3 GB.

A LLaMA-2 model already optimized with the AWQ algorithm is available here for direct download:

git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4

For your own model, you can use the auto_awq tool to optimize it. Assuming your Hugging Face model is saved in /models/llama-2-7b-chat:

# --w_bits:       number of bits for weight quantization
# --w_group_size: group size used for weight quantization statistics
# --work_dir:     directory to save the quantization results
python3 -m lmdeploy.lite.apis.auto_awq \
  --model /models/llama-2-7b-chat \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir ./llama2-chat-7b-w4

Run the following commands to chat with the model in the terminal:

## Convert the model's layout and store it in the default path ./workspace
python3 -m lmdeploy.serve.turbomind.deploy \
    --model-name llama2 \
    --model-path ./llama2-chat-7b-w4 \
    --model-format awq \
    --group-size 128

## Inference
python3 -m lmdeploy.turbomind.chat ./workspace

Click here to view the memory and speed test results of weight int4 quantization.

Additionally, weight int4 and kv cache int8 do not conflict; they can be enabled at the same time to save even more GPU memory.
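
As a rough end-to-end sketch of combining both, reusing only the commands and paths from the sections above (consult the official lmdeploy documentation for version-specific details):

## 1. weight int4: quantize with AWQ, then convert the layout into ./workspace
python3 -m lmdeploy.lite.apis.auto_awq --model /models/llama-2-7b-chat --w_bits 4 --w_group_size 128 --work_dir ./llama2-chat-7b-w4
python3 -m lmdeploy.serve.turbomind.deploy --model-name llama2 --model-path ./llama2-chat-7b-w4 --model-format awq --group-size 128

## 2. kv cache int8: export quantization parameters into the same workspace
python3 -m lmdeploy.lite.apis.kv_qparams --work_dir /models/llama-2-7b-chat --turbomind_dir ./workspace/triton_models/weights --kv_sym False --num_tp 1

## 3. enable it in workspace/triton_models/weights/config.ini (use_context_fmha = 0, quant_policy = 4), then chat
python3 -m lmdeploy.turbomind.chat ./workspace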

@tpoisonooo
Author

@ymcui please review.

@ymcui
Owner

ymcui commented Aug 18, 2023

Thanks for your contribution. We'll schedule a PR review asap.
Note that we might make necessary modifications to README.md to satisfy our editing policies.

@GoGoJoestar
Collaborator

I tried inference with TurboMind as follows, but it didn't output any response.

[screenshot]

Trying inference with PyTorch works:

[screenshot]

Does TurboMind have any GPU requirements? I ran it on a P40 GPU.

@tpoisonooo
Author

> I tried inference with TurboMind as follows, but it didn't output any response.
>
> [screenshot]
>
> Trying inference with PyTorch works:
>
> [screenshot]
>
> Does TurboMind have any GPU requirements? I ran it on a P40 GPU.

The P40 does not support fp16 precision, so it does not work. We have tested on 3080/4090/A100/A10 GPUs. Let me update the doc.

@tpoisonooo tpoisonooo changed the title from "docs(README): add lmdeploy" to "WIP docs(README): add lmdeploy" on Aug 23, 2023
@tpoisonooo
Author

For the Alpaca model, lmdeploy still needs a chat template, so this PR is WIP.
I will update the PR status once it is finished.

cc @ymcui @GoGoJoestar
