Support fp8 w8a8 for pt backend #2959
Merged
Commits (10)
- d663d4c  support w8a8 smooth_quant and loading (RunningLeon)
- 77f5fb8  optimize int8 (grimoire)
- a1081b3  fix fp8 kernels (RunningLeon)
- e71c4cc  update docs for w8a8 (RunningLeon)
- 2dba1a7  resolve comments (RunningLeon)
- 9647661  Merge branch 'main' into support-pt-fp8 (RunningLeon)
- 9d0f467  resolve comments (RunningLeon)
- 8ef2f9a  fix ut (RunningLeon)
- fdcfd8b  disable not quant last norm (RunningLeon)
- 44b3aa1  disable quant last norm for cogvlm and minicpmv26 models (RunningLeon)

# SmoothQuant

LMDeploy provides functions for quantization and inference of large language models using 8-bit integers (INT8). For GPUs such as the NVIDIA H100, LMDeploy also supports 8-bit floating point (FP8).

The following NVIDIA GPUs can be used for INT8 and FP8 inference, respectively:

- INT8
  - Volta(sm70): V100
  - Turing(sm75): 20 series, T4
  - Ampere(sm80,sm86): 30 series, A10, A16, A30, A100
  - Ada Lovelace(sm89): 40 series
  - Hopper(sm90): H100
- FP8
  - Ada Lovelace(sm89): 40 series
  - Hopper(sm90): H100
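
If you are unsure which of these paths your GPU falls into, its compute capability can be queried from PyTorch. The snippet below is a small illustrative helper based on the table above; it is not part of LMDeploy's CLI:

```python
# Illustrative helper (not part of LMDeploy): map the local GPU's compute
# capability onto the INT8/FP8 support matrix listed above.
import torch

def supported_w8a8_dtypes(device: int = 0) -> list:
    if not torch.cuda.is_available():
        return []
    major, minor = torch.cuda.get_device_capability(device)
    sm = major * 10 + minor
    dtypes = []
    if sm >= 70:   # Volta (sm70) and newer support the INT8 path
        dtypes.append("int8")
    if sm >= 89:   # Ada Lovelace (sm89) and Hopper (sm90) also support FP8
        dtypes.append("fp8")
    return dtypes

print(supported_w8a8_dtypes())
```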

First of all, run the following command to install lmdeploy:

```shell
pip install lmdeploy[all]
```

## 8-bit Weight Quantization

Performing 8-bit weight quantization involves three steps:

1. **Smooth Weights**: Start by smoothing the weights of the large language model (LLM). This process makes the weights more amenable to quantization.
2. **Replace Modules**: Locate the DecoderLayers and replace the RMSNorm and nn.Linear modules with QRMSNorm and QLinear modules, respectively. These 'Q' modules are available in the lmdeploy/pytorch/models/q_modules.py file.
3. **Save the Quantized Model**: Once you've made the necessary replacements, save the new quantized model.
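
To make the "smooth weights" step concrete, the sketch below reproduces the core SmoothQuant idea of migrating quantization difficulty from activations into weights with a per-channel scale. It is a conceptual illustration under our own naming (`smooth_linear`, `act_absmax`, `alpha`), not LMDeploy's actual implementation:

```python
# Conceptual sketch of SmoothQuant weight smoothing (not LMDeploy's implementation).
# A per-input-channel scale s moves outliers from activations into weights:
#   y = (x / s) @ (W * s)^T  is mathematically identical to  y = x @ W^T
import torch

def smooth_linear(weight: torch.Tensor, act_absmax: torch.Tensor, alpha: float = 0.5):
    """weight: [out_features, in_features]; act_absmax: per-channel activation abs-max, [in_features]."""
    w_absmax = weight.abs().amax(dim=0)                              # per-input-channel weight range
    scale = (act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)
    smoothed_weight = weight * scale                                 # fold the scale into the weight
    return smoothed_weight, scale                                    # activations are divided by `scale` at runtime

# toy usage
w = torch.randn(4, 8)
a_max = torch.rand(8) * 10
w_smoothed, s = smooth_linear(w, a_max)
```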

LMDeploy provides the `lmdeploy lite smooth_quant` command to accomplish all three tasks detailed above. The `--quant-dtype` argument determines whether int8 or fp8 weight quantization is performed. For more information about CLI usage, run `lmdeploy lite smooth_quant --help`.

Here are two examples:

- int8

  ```shell
  lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-int8 --quant-dtype int8
  ```

- fp8

  ```shell
  lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-fp8 --quant-dtype fp8
  ```

## Inference

With the following code, you can perform batched offline inference with the quantized model:

```python
from lmdeploy import pipeline, PytorchEngineConfig

engine_config = PytorchEngineConfig(tp=1)
pipe = pipeline("internlm2_5-7b-chat-int8", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```
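
Decoding behaviour can be tuned by passing a `GenerationConfig` to the pipeline call. Below is a minimal sketch; the sampling values are arbitrary examples rather than recommended settings for quantized models:

```python
from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline

# Reuse the quantized model from above and tune sampling via GenerationConfig.
pipe = pipeline("internlm2_5-7b-chat-int8",
                backend_config=PytorchEngineConfig(tp=1))
gen_config = GenerationConfig(max_new_tokens=256, top_p=0.8, temperature=0.7)
response = pipe(["Hi, pls intro yourself", "Shanghai is"], gen_config=gen_config)
print(response)
```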

## Service

LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of starting the service:

```shell
lmdeploy serve api_server ./internlm2_5-7b-chat-int8 --backend pytorch
```

The default port of `api_server` is `23333`. After the server is launched, you can communicate with the server from the terminal through `api_client`:

```shell
lmdeploy serve api_client http://0.0.0.0:23333
```

You can browse and try out the `api_server` APIs online via the Swagger UI at `http://0.0.0.0:23333`, or read the API specification [here](../llm/api_server.md).
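
Since the RESTful APIs are OpenAI-compatible, the service can also be called from the official `openai` Python client. A minimal sketch, assuming the server started above is running locally on the default port:

```python
# Minimal sketch: call the OpenAI-compatible endpoints exposed by `lmdeploy serve api_server`.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:23333/v1", api_key="none")  # key is unused but required by the client
model_name = client.models.list().data[0].id                         # name of the served model
resp = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Hi, pls intro yourself"}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```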

# W8A8 LLM Model Deployment

LMDeploy provides functions for quantizing and running inference on neural network models with 8-bit integers (INT8) and 8-bit floating point (FP8).

The following NVIDIA GPUs can be used for INT8 and FP8 inference, respectively:

- INT8
  - Volta(sm70): V100
  - Turing(sm75): 20 series, T4
  - Ampere(sm80,sm86): 30 series, A10, A16, A30, A100
  - Ada Lovelace(sm89): 40 series
  - Hopper(sm90): H100
- FP8
  - Ada Lovelace(sm89): 40 series
  - Hopper(sm90): H100

First, run the following command to install lmdeploy:

```shell
pip install lmdeploy[all]
```

## 8-bit Weight Quantization

Performing 8-bit weight quantization involves three steps:

1. **Smooth Weights**: First, smooth the weights of the language model so that they are easier to quantize.
2. **Replace Modules**: Replace the `RMSNorm` and `nn.Linear` modules in the original model's `DecoderLayer` with the `QRMSNorm` and `QLinear` modules. These quantized modules are defined in `lmdeploy/pytorch/models/q_modules.py`.
3. **Save the Quantized Model**: After making the necessary replacements, save the new quantized model.

LMDeploy provides the `lmdeploy lite smooth_quant` command-line tool, which implements the three steps above. The `--quant-dtype` argument controls whether 8-bit integer or 8-bit floating-point quantization is performed. For more about CLI usage, run `lmdeploy lite smooth_quant --help`.

The following examples show the quantization commands for int8 and fp8:

- int8

  ```shell
  lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-int8 --quant-dtype int8
  ```

- fp8

  ```shell
  lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-fp8 --quant-dtype fp8
  ```

## Model Inference

With a few lines of code, you can run offline inference with the quantized model:

```python
from lmdeploy import pipeline, PytorchEngineConfig

engine_config = PytorchEngineConfig(tp=1)
pipe = pipeline("internlm2_5-7b-chat-int8", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```

For a detailed introduction to the pipeline, please refer to [this guide](../llm/pipeline.md).

## Inference Service

LMDeploy's `api_server` can package a model into a service with a single command, exposing RESTful APIs compatible with the OpenAI interfaces. Here is an example of starting the service:

```shell
lmdeploy serve api_server ./internlm2_5-7b-chat-int8 --backend pytorch
```

The default port of the server is 23333. After the server is launched, you can talk to it from the terminal through `api_client`:

```shell
lmdeploy serve api_client http://0.0.0.0:23333
```

You can also browse and try out the `api_server` APIs online via the Swagger UI at `http://0.0.0.0:23333`, or consult the [documentation](../llm/api_server.md) for the definition and usage of each endpoint.