Support loading hf model directly (#685)
* turbomind support export model params
* fix overflow
* support turbomind.from_pretrained
* fix tp
* support AutoModel
* support load kv qparams
* update auto_awq
* update docstring
* export lmdeploy version
* update doc
* remove download_hf_repo
* LmdeployForCausalLM -> LmdeployForCausalLM
* refactor turbomind.py
* update comment
* add bfloat16 convert back
* support gradio run_local load hf
* support restful api server load hf
* add docs
* support loading previous quantized model
* adapt pr 690
* update docs
* not export turbomind config when quantize a model
* check model_name when can not get it from config.json
* update readme
* remove model_name in auto_awq
* update
* update
* update
* fix build
* absolute import
Showing 29 changed files with 1,196 additions and 232 deletions.
@@ -0,0 +1,71 @@
# Load Hugging Face models directly

Starting from v0.1.0, TurboMind can pre-process model parameters on the fly while loading them from Hugging Face-style models.

## Supported model types

Currently, TurboMind supports loading three types of models:

1. An lmdeploy-quantized model hosted on huggingface.co, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) or [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit).
2. Other LM models on huggingface.co, such as Qwen/Qwen-7B-Chat.
3. A model converted by `lmdeploy convert` (the legacy format).

## Usage

### 1) An lmdeploy-quantized model

TurboMind can directly load models quantized by `lmdeploy.lite`, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) or [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit). Set `repo_id` to the huggingface.co repository ID or to the path of a downloaded copy:

```
repo_id=internlm/internlm-chat-20b-4bit
model_name=internlm-chat-20b
# or
# repo_id=/path/to/downloaded_model
# Inference by TurboMind
lmdeploy chat turbomind "$repo_id" --model-name "$model_name"
# Serving with gradio
lmdeploy serve gradio "$repo_id" --model-name "$model_name"
# Serving with RESTful API
lmdeploy serve api_server "$repo_id" --model-name "$model_name" --instance_num 32 --tp 1
```
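
The three invocations above differ only in the subcommand and the server flags, so they can be generated from one pair of values. The following is a minimal sketch, assuming a hypothetical helper named `print_launch_cmds` that is not part of LMDeploy. It only prints the commands shown above and does not run them.

```shell
# Hypothetical helper, not part of LMDeploy: print the three launch commands
# from this section for one model. Nothing is executed.
#   $1: huggingface.co repository ID or local model path
#   $2: name passed to --model-name
print_launch_cmds() {
    printf 'lmdeploy chat turbomind %s --model-name %s\n' "$1" "$2"
    printf 'lmdeploy serve gradio %s --model-name %s\n' "$1" "$2"
    printf 'lmdeploy serve api_server %s --model-name %s --instance_num 32 --tp 1\n' "$1" "$2"
}

print_launch_cmds internlm/internlm-chat-20b-4bit internlm-chat-20b
```

Printing first makes it easy to review the exact command line before starting a long model load.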

### 2) Other LM models

Other LM models, such as Qwen/Qwen-7B-Chat or baichuan-inc/Baichuan2-7B-Chat, are loaded with the same commands. Run `lmdeploy list` to view the models that LMDeploy supports.

```
repo_id=Qwen/Qwen-7B-Chat
model_name=qwen-7b
# or
# repo_id=/path/to/Qwen-7B-Chat/local_path
# Inference by TurboMind
lmdeploy chat turbomind "$repo_id" --model-name "$model_name"
# Serving with gradio
lmdeploy serve gradio "$repo_id" --model-name "$model_name"
# Serving with RESTful API
lmdeploy serve api_server "$repo_id" --model-name "$model_name" --instance_num 32 --tp 1
```
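
Presumably the value passed to `--model-name` should be one of the names reported by `lmdeploy list`, so it can help to check it before launching. The following is a minimal sketch, assuming that `lmdeploy list` prints one model name per line (the exact output format is not shown here and may differ). `name_in_list` is a hypothetical helper that reads the list from standard input.

```shell
# Hypothetical helper, not part of LMDeploy: succeed if the name given as $1
# appears as a whole line on standard input.
name_in_list() {
    grep -Fxq -- "$1"
}

# Intended use (assumes one name per line in the output of `lmdeploy list`):
#   lmdeploy list | name_in_list qwen-7b && echo "qwen-7b is supported"
# Stand-in input built from the two names used in this document:
printf 'internlm-chat-20b\nqwen-7b\n' | name_in_list qwen-7b && echo "qwen-7b found"
```

Matching whole lines (`-x`) with fixed strings (`-F`) avoids false positives from partial names such as `qwen`.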

### 3) A model converted by `lmdeploy convert`

Usage is the same as in previous releases: convert the model into a workspace, then pass the workspace directory to each command.

```
# Convert a model
lmdeploy convert /path/to/model ./workspace --model-name MODEL_NAME
# Inference by TurboMind
lmdeploy chat turbomind ./workspace
# Serving with gradio
lmdeploy serve gradio ./workspace
# Serving with RESTful API
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```
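
Unlike the first two cases, the commands for a converted workspace take only the workspace directory and no `--model-name`; in the example above the name is given to `lmdeploy convert` instead. The following is a minimal sketch of that difference, assuming a hypothetical helper named `chat_cmd` (not part of LMDeploy) that appends the flag only when a name is supplied. It prints the command and does not run it.

```shell
# Hypothetical helper, not part of LMDeploy: print the chat command for a model.
#   $1: repository ID, local model path, or converted workspace directory
#   $2: optional model name, used for Hugging Face-style models
chat_cmd() {
    if [ -n "${2:-}" ]; then
        printf 'lmdeploy chat turbomind %s --model-name %s\n' "$1" "$2"
    else
        printf 'lmdeploy chat turbomind %s\n' "$1"
    fi
}

chat_cmd ./workspace                  # converted workspace: no --model-name
chat_cmd Qwen/Qwen-7B-Chat qwen-7b    # Hugging Face-style model
```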
@@ -0,0 +1,72 @@
# Load Hugging Face models directly

Starting from v0.1.0, TurboMind can directly read weights in Hugging Face format.

## Supported types

Currently, TurboMind supports loading three types of models:

1. Models on huggingface.co that were quantized by lmdeploy, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) and [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit).
2. Other LM models on huggingface.co, such as Qwen/Qwen-7B-Chat.
3. Models converted with the `lmdeploy convert` command (compatible with the legacy format).

## Usage

### 1) Models quantized by lmdeploy

TurboMind can directly load models quantized by `lmdeploy.lite`, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) and [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit).

```
repo_id=internlm/internlm-chat-20b-4bit
model_name=internlm-chat-20b
# or
# repo_id=/path/to/downloaded_model
# Inference by TurboMind
lmdeploy chat turbomind "$repo_id" --model-name "$model_name"
# Serving with gradio
lmdeploy serve gradio "$repo_id" --model-name "$model_name"
# Serving with RESTful API
lmdeploy serve api_server "$repo_id" --model-name "$model_name" --instance_num 32 --tp 1
```

### 2) Other LM models

Other LM models include, for example, Qwen/Qwen-7B-Chat and baichuan-inc/Baichuan2-7B-Chat. Run `lmdeploy list` to see which models LMDeploy supports.

```
repo_id=Qwen/Qwen-7B-Chat
model_name=qwen-7b
# or
# repo_id=/path/to/Qwen-7B-Chat/local_path
# Inference by TurboMind
lmdeploy chat turbomind "$repo_id" --model-name "$model_name"
# Serving with gradio
lmdeploy serve gradio "$repo_id" --model-name "$model_name"
# Serving with RESTful API
lmdeploy serve api_server "$repo_id" --model-name "$model_name" --instance_num 32 --tp 1
```

### 3) Models converted with the `lmdeploy convert` command

Usage is the same as before.

```
# Convert a model
lmdeploy convert /path/to/model ./workspace --model-name MODEL_NAME
# Inference by TurboMind
lmdeploy chat turbomind ./workspace
# Serving with gradio
lmdeploy serve gradio ./workspace
# Serving with RESTful API
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```