Add more user-friendly CLI #541

Merged: 9 commits, Oct 25, 2023
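At a glance, this PR replaces the per-module `python3 -m lmdeploy.*` entry points with a single `lmdeploy` command. The mapping below is distilled from the diff itself; the exact flags each subcommand takes are shown in the per-file changes that follow.

```shell
# Old module invocation                        ->  new CLI subcommand (from this PR's diff)
# python3 -m lmdeploy.serve.turbomind.deploy   ->  lmdeploy convert
# python3 -m lmdeploy.turbomind.chat           ->  lmdeploy chat turbomind
# python3 -m lmdeploy.pytorch.chat             ->  lmdeploy chat torch
# python3 -m lmdeploy.serve.gradio.app         ->  lmdeploy serve gradio
# python3 -m lmdeploy.serve.openai.api_server  ->  lmdeploy serve api_server
# python3 -m lmdeploy.serve.openai.api_client  ->  lmdeploy serve api_client
# python3 -m lmdeploy.serve.client             ->  lmdeploy serve triton_client
# python3 -m lmdeploy.lite.apis.calibrate      ->  lmdeploy lite calibrate
# python3 -m lmdeploy.lite.apis.kv_qparams     ->  lmdeploy lite kv_qparams
```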
20 changes: 10 additions & 10 deletions README.md
@@ -119,14 +119,14 @@ git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internl
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

```

#### Inference by TurboMind

```shell
python -m lmdeploy.turbomind.chat ./workspace
lmdeploy chat turbomind ./workspace
```

> **Note**<br />
@@ -140,7 +140,7 @@ python -m lmdeploy.turbomind.chat ./workspace
#### Serving with gradio

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace
lmdeploy serve gradio ./workspace
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -150,23 +150,23 @@ python3 -m lmdeploy.serve.gradio.app ./workspace
Launch the inference server by:

```shell
python3 -m lmdeploy.serve.openai.api_server ./workspace server_ip server_port --instance_num 32 --tp 1
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```

Then, you can communicate with it by command line,

```shell
# restful_api_url is the URL printed by api_server.py, e.g. http://localhost:23333
python -m lmdeploy.serve.openai.api_client restful_api_url
lmdeploy serve api_client restful_api_url
```

or webui,

```shell
# restful_api_url is the URL printed by api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
python -m lmdeploy.serve.gradio.app restful_api_url server_ip --restful_api True
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
```

Refer to [restful_api.md](docs/en/restful_api.md) for more details.
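Beyond `api_client` and the gradio UI, the server can also be exercised directly over HTTP. The sketch below is illustrative only: it assumes the server is listening on `http://localhost:23333` and exposes an OpenAI-style `/v1/chat/completions` route, and the model name in the payload is a placeholder; check the swagger UI of your running server for the actual routes and fields.

```shell
# Minimal sketch: query the running api_server with curl.
# Assumes http://localhost:23333 and an OpenAI-style /v1/chat/completions route;
# the model name and message content are placeholders.
curl -s http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "internlm-chat-7b",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```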
@@ -182,13 +182,13 @@ bash workspace/service_docker_up.sh
Then, you can communicate with the inference server by command line,

```shell
python3 -m lmdeploy.serve.client {server_ip_address}:33337
lmdeploy serve triton_client {server_ip_address}:33337
```

or webui,

```shell
python3 -m lmdeploy.serve.gradio.app {server_ip_address}:33337
lmdeploy serve gradio {server_ip_address}:33337
```

For the deployment of other supported models, such as LLaMA, LLaMA-2, vicuna and so on, you can find the guide [here](docs/en/serving.md).
@@ -200,7 +200,7 @@ For detailed instructions on inference with PyTorch models, see [here](docs/en/pytorc
#### Single GPU

```shell
python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL \
lmdeploy chat torch $NAME_OR_PATH_TO_HF_MODEL \
--max_new_tokens 64 \
--temperature 0.8 \
--top_p 0.95 \
20 changes: 10 additions & 10 deletions README_zh-CN.md
@@ -120,14 +120,14 @@ git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internl
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert the model to the format required by TurboMind. The default output path is ./workspace
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

```

#### Inference by TurboMind

```shell
python3 -m lmdeploy.turbomind.chat ./workspace
lmdeploy chat turbomind ./workspace
```

> **Note**<br />
@@ -140,7 +140,7 @@ python3 -m lmdeploy.turbomind.chat ./workspace
#### Serving with gradio

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace
lmdeploy serve gradio ./workspace
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -150,23 +150,23 @@ python3 -m lmdeploy.serve.gradio.app ./workspace
Launch the inference server with the following command:

```shell
python3 -m lmdeploy.serve.openai.api_server ./workspace server_ip server_port --instance_num 32 --tp 1
lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${server_port} --instance_num 32 --tp 1
```

You can chat with the inference server from the command line:

```shell
# restful_api_url is the URL printed by api_server.py, e.g. http://localhost:23333
python -m lmdeploy.serve.openai.api_client restful_api_url
lmdeploy serve api_client restful_api_url
```

You can also chat through the WebUI:

```shell
# restful_api_url is the URL printed by api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
python -m lmdeploy.serve.gradio.app restful_api_url server_ip --restful_api True
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
```

Refer to [restful_api.md](docs/zh_cn/restful_api.md) for more details.
@@ -182,13 +182,13 @@ bash workspace/service_docker_up.sh
You can chat with the inference server from the command line:

```shell
python3 -m lmdeploy.serve.client {server_ip_address}:33337
lmdeploy serve triton_client {server_ip_address}:33337
```

You can also chat through the WebUI:

```shell
python3 -m lmdeploy.serve.gradio.app {server_ip_address}:33337
lmdeploy serve gradio {server_ip_address}:33337
```

For the deployment of other supported models, such as LLaMA, LLaMA-2, vicuna and so on, refer to the guide [here](docs/zh_cn/serving.md).
@@ -204,7 +204,7 @@ pip install deepspeed
#### Single GPU

```shell
python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL \
lmdeploy chat torch $NAME_OR_PATH_TO_HF_MODEL \
--max_new_tokens 64 \
--temperature 0.8 \
--top_p 0.95 \
8 changes: 4 additions & 4 deletions docs/en/kv_int8.md
@@ -18,7 +18,7 @@ dequant: f = q * scale + zp
Convert the Hugging Face model format to the TurboMind inference format to create a workspace directory.

```bash
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b
```

If you already have a workspace directory, skip this step.
@@ -29,15 +29,15 @@ Get the quantization parameters by these two steps:

```bash
# get minmax
python3 -m lmdeploy.lite.apis.calibrate \
lmdeploy lite calibrate \
--model $HF_MODEL \
--calib_dataset 'c4' \ # Support c4, ptb, wikitext2, pileval
--calib_samples 128 \ # Number of samples in the calibration set, if the memory is not enough, it can be adjusted appropriately
--calib_seqlen 2048 \ # Length of a single text, if the memory is not enough, you can adjust it appropriately
--work_dir $WORK_DIR \ # Directory for saving quantized statistical parameters and quantized weights in Pytorch format

# get quant parameters
python3 -m lmdeploy.lite.apis.kv_qparams \
lmdeploy lite kv_qparams \
--work_dir $WORK_DIR \ # Directory of the last output
--turbomind_dir workspace/triton_models/weights/ \ # Directory to save the quantization parameters
--kv_sym False \ # Symmetric or asymmetric quantization, default is False
@@ -64,7 +64,7 @@ Considering there are four combinations of kernels needed to be implemented, pre
Test the chat performance.

```bash
python3 -m lmdeploy.turbomind.chat ./workspace
lmdeploy chat turbomind ./workspace
```
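Stepping back to the two quantization commands earlier in this file: because the inline `# ...` comments sit after the backslash line continuations, copying those blocks verbatim can break the commands in a shell. A consolidated, comment-free sketch might look like this; it keeps `$HF_MODEL`, `$WORK_DIR`, and only the flag values visible in that hunk, so append any remaining flags as needed.

```bash
# Sketch of the two quantization steps without inline comments, so the
# backslash continuations survive a direct copy-paste.
lmdeploy lite calibrate \
  --model $HF_MODEL \
  --calib_dataset 'c4' \
  --calib_samples 128 \
  --calib_seqlen 2048 \
  --work_dir $WORK_DIR

lmdeploy lite kv_qparams \
  --work_dir $WORK_DIR \
  --turbomind_dir workspace/triton_models/weights/ \
  --kv_sym False
```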

## GPU Memory Test
6 changes: 3 additions & 3 deletions docs/en/pytorch.md
@@ -9,21 +9,21 @@ This submodule allows users to chat with a language model through the command line, and
**Example 1**: Chat with default setting

```shell
python -m lmdeploy.pytorch.chat $PATH_TO_HF_MODEL
lmdeploy chat torch $PATH_TO_HF_MODEL
```

**Example 2**: Disable sampling and chat history

```shell
python -m lmdeploy.pytorch.chat \
lmdeploy chat torch \
$PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \
--temperature 0 --max-history 0
```

**Example 3**: Accelerate with deepspeed inference

```shell
python -m lmdeploy.pytorch.chat \
lmdeploy chat torch \
$PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \
--accel deepspeed
```
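The flags from the examples above can also be combined. A sketch that merges Example 2 and Example 3, using the same placeholder path and the flags exactly as shown above, assuming the options are compatible:

```shell
# Sketch: deterministic, history-free chat accelerated with deepspeed,
# combining the flags from Examples 2 and 3 above.
lmdeploy chat torch \
    $PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \
    --temperature 0 --max-history 0 \
    --accel deepspeed
```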
8 changes: 4 additions & 4 deletions docs/en/restful_api.md
@@ -3,7 +3,7 @@
### Launch Service

```shell
python3 -m lmdeploy.serve.openai.api_server ./workspace 0.0.0.0 server_port --instance_num 32 --tp 1
lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${server_port} --instance_num 32 --tp 1
```

Then, the user can open the swagger UI: `http://{server_ip}:{server_port}` for the detailed api usage.
@@ -125,7 +125,7 @@ There is a client script for the restful api server.

```shell
# restful_api_url is the URL printed by api_server.py, e.g. http://localhost:23333
python -m lmdeploy.serve.openai.api_client restful_api_url
lmdeploy serve api_client restful_api_url
```

### webui
@@ -135,8 +135,8 @@ You can also test the restful-api through the webui.
```shell
# restful_api_url is the URL printed by api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
python -m lmdeploy.serve.gradio.app restful_api_url server_ip --restful_api True
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
```

### FAQ
18 changes: 9 additions & 9 deletions docs/en/serving.md
@@ -8,7 +8,7 @@ You can download [llama-2 models from huggingface](https://huggingface.co/meta-l
<summary><b>7B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-7b-chat-hf
lmdeploy convert llama2 /path/to/llama-2-7b-chat-hf
bash workspace/service_docker_up.sh
```

@@ -18,7 +18,7 @@
<summary><b>13B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-13b-chat-hf --tp 2
lmdeploy convert llama2 /path/to/llama-2-13b-chat-hf --tp 2
bash workspace/service_docker_up.sh
```

@@ -28,7 +28,7 @@
<summary><b>70B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-70b-chat-hf --tp 8
lmdeploy convert llama2 /path/to/llama-2-70b-chat-hf --tp 8
bash workspace/service_docker_up.sh
```

@@ -42,7 +42,7 @@ Weights for the LLaMA models can be obtained by filling out [this form](htt
<summary><b>7B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-7b llama \
lmdeploy convert llama /path/to/llama-7b llama \
--tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh
```
@@ -53,7 +53,7 @@
<summary><b>13B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-13b llama \
lmdeploy convert llama /path/to/llama-13b llama \
--tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh
```
@@ -64,7 +64,7 @@
<summary><b>30B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-30b llama \
lmdeploy convert llama /path/to/llama-30b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh
```
@@ -75,7 +75,7 @@
<summary><b>65B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-65b llama \
lmdeploy convert llama /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh
```
@@ -94,7 +94,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-7b \
--delta-path lmsys/vicuna-7b-delta-v1.1

python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-7b
lmdeploy convert vicuna /path/to/vicuna-7b
bash workspace/service_docker_up.sh
```

@@ -110,7 +110,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-13b \
--delta-path lmsys/vicuna-13b-delta-v1.1

python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-13b
lmdeploy convert vicuna /path/to/vicuna-13b
bash workspace/service_docker_up.sh
```

18 changes: 9 additions & 9 deletions docs/en/supported_models/codellama.md
@@ -29,7 +29,7 @@ Based on the above table, download the model that meets your requirements. Execu
python3 -m pip install lmdeploy

# convert weight layout
python3 -m lmdeploy.serve.turbomind.deploy codellama /the/path/of/codellama/model
lmdeploy convert codellama /the/path/of/codellama/model
```

Then, you can communicate with codellama in the console by following the instructions in the next sections
@@ -42,13 +42,13 @@
### Completion

```shell
python3 -m lmdeploy.turbomind.chat ./workspace --cap completion
lmdeploy chat turbomind ./workspace --cap completion
```

### Infilling

```shell
python3 -m lmdeploy.turbomind.chat ./workspace --cap infilling
lmdeploy chat turbomind ./workspace --cap infilling
```

The input code is supposed to have a special placeholder `<FILL>`. For example,
@@ -64,15 +64,15 @@ And the generated code piece by `turbomind.chat` is the one to be filled in `<FI
### Chat

```
python3 -m lmdeploy.turbomind.chat ./workspace --cap chat --sys-instruct "Provide answers in Python"
lmdeploy chat turbomind ./workspace --cap chat --sys-instruct "Provide answers in Python"
```

The `--sys-instruct` instruction can be changed to other coding languages as long as codellama supports them

### Python specialist

```
python3 -m lmdeploy.turbomind.chat ./workspace --cap python
lmdeploy chat turbomind ./workspace --cap python
```

The Python fine-tuned model is highly recommended when the 'python specialist' capability is required.
@@ -90,23 +90,23 @@ Launch the inference server by:
```shell
# --instance_num: the number of instances used to perform inference, which can be viewed as the max request concurrency
# --tp: the number of GPUs used in tensor parallelism
python3 -m lmdeploy.serve.openai.api_server ./workspace server_ip server_port --instance_num 32 --tp 1
lmdeploy serve api_server ./workspace --server_name ${server_ip} --server_port ${server_port} --instance_num 32 --tp 1
```

Then, you can communicate with it by command line,

```shell
# restful_api_url is the URL printed by api_server.py, e.g. http://localhost:23333
python -m lmdeploy.serve.openai.api_client restful_api_url
lmdeploy serve api_client restful_api_url
```

or through webui after launching gradio,

```shell
# restful_api_url is the URL printed by api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
python -m lmdeploy.serve.gradio.app restful_api_url server_ip --restful_api True
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
```

For detailed information about the RESTful API, refer to [restful_api.md](../restful_api.md).