INT4 模型量化和部署

LMDeploy 使用 AWQ 算法，实现模型 4bit 权重量化。推理引擎 TurboMind 提供了非常高效的 4bit 推理 cuda kernel，性能是 FP16 的 2.4 倍以上。它支持以下 NVIDIA 显卡：

图灵架构（sm75）：20系列、T4
安培架构（sm80,sm86）：30系列、A10、A16、A30、A100
Ada Lovelace架构（sm89）：40 系列

在量化和部署之前，请确保安装了 lmdeploy.

pip install lmdeploy[all]

本文由以下章节组成：

模型量化
模型评测
模型推理
推理服务
推理性能

模型量化

仅需执行一条命令，就可以完成模型量化工作。量化结束后，权重文件存放在 $WORK_DIR 下。

export HF_MODEL=internlm/internlm2_5-7b-chat
export WORK_DIR=internlm/internlm2_5-7b-chat-4bit

lmdeploy lite auto_awq \
   $HF_MODEL \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 2048 \
  --w-bits 4 \
  --w-group-size 128 \
  --batch-size 1 \
  --search-scale False \
  --work-dir $WORK_DIR

绝大多数情况下，在执行上述命令时，可选参数可不用填写，使用默认的即可。比如量化 internlm/internlm2_5-7b-chat 模型，命令可以简化为：

lmdeploy lite auto_awq internlm/ianternlm2-chat-7b --work-dir internlm2_5-7b-chat-4bit

Note:

我们建议 --work-dir 参数带有模型名字，就像上面的例子展示的那样。这样在推理时，就不用指定对话模板了。因为推理接口会以模糊搜索方式，选出和 --work-dir 近似的对话模板
如果量化模型精度有损，建议开启 --search-scale 重新量化，并调大 --batch-size，比如 8。search_scale 开启后，量化过程会比较耗时。--batch-size 会影响内存占用量，可以根据实际情况，酌情调整。

量化后的模型，可以用一些工具快速验证对话效果。

比如，直接在控制台和模型对话，

lmdeploy chat ./internlm2_5-7b-chat-4bit --model-format awq

或者，启动gradio服务，

lmdeploy serve gradio ./internlm2_5-7b-chat-4bit --server-name {ip_addr} --server-port {port} --model-format awq

然后，在浏览器中打开 http://{ip_addr}:{port}，即可在线对话

模型评测

我们使用 OpenCompass 评测量化模型在各个维度上的能力

模型推理

量化后的模型，通过以下几行简单的代码，可以实现离线推理：

from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline("./internlm2_5-7b-chat-4bit", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

关于 pipeline 的详细介绍，请参考这里

除了推理本地量化模型外，LMDeploy 还支持直接推理 huggingface hub 上的通过 AWQ 量化的 4bit 权重模型，比如 lmdeploy 空间和 TheBloke 空间下的模型。

# 推理 lmdeploy 空间下的模型
from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline("lmdeploy/llama2-chat-70b-4bit",
                backend_config=TurbomindEngineConfig(model_format='awq', tp=4))
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

# 推理 TheBloke 空间下的模型（试试codellama行不行）
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
pipe = pipeline("TheBloke/LLaMA2-13B-Tiefighter-AWQ",
                backend_config=TurbomindEngineConfig(model_format='awq'),
                chat_template_config=ChatTemplateConfig(model_name='llama2')
                )
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

推理服务

LMDeploy api_server 支持把模型一键封装为服务，对外提供的 RESTful API 兼容 openai 的接口。以下为服务启动的示例：

lmdeploy serve api_server ./internlm2_5-7b-chat-4bit --backend turbomind --model-format awq

服务默认端口是23333。在 server 启动后，你可以在终端通过api_client与server进行对话：

lmdeploy serve api_client http://0.0.0.0:23333

还可以通过 Swagger UI http://0.0.0.0:23333 在线阅读和试用 api_server 的各接口，也可直接查阅文档，了解各接口的定义和使用方法。

推理性能

我们在 NVIDIA GeForce RTX 4090 上使用 profile_generation.py，分别测试了 4-bit Llama-2-7B-chat 和 Llama-2-13B-chat 模型的 token 生成速度。测试配置为 batch size = 1，(prompt_tokens, completion_tokens) = (1, 512)

model	llm-awq	mlc-llm	turbomind
Llama-2-7B-chat	112.9	159.4	206.4
Llama-2-13B-chat	N/A	90.7	115.8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

w4a16.md

w4a16.md

INT4 模型量化和部署

模型量化

模型评测

模型推理

推理服务

推理性能

Files

w4a16.md

Latest commit

History

w4a16.md

File metadata and controls

INT4 模型量化和部署

模型量化

模型评测

模型推理

推理服务

推理性能