Unified paging #860

Merged on Dec 18, 2023 (86 commits)

Commits
9c3634e
change 'model_format' to 'qwen' when 'model_name' starts with 'qwen' …
lvhan028 Oct 18, 2023
eb3b4dc
avoid split chinese characters during decoding (#566)
AllentDan Oct 18, 2023
70a5c63
add solar chat template (#576)
AllentDan Oct 19, 2023
186bfd2
robust incremental decode for leading space (#581)
AllentDan Oct 19, 2023
baf1801
update solar chat template (#587)
AllentDan Oct 23, 2023
af2f072
Revert "[Docs] Simplify `build.md` (#370)" (#586)
pppppM Oct 23, 2023
ffe4ba9
Fix crash and remove `sys_instruct` from `chat.py` and `client.py`(#591)
irexyc Oct 24, 2023
96f1b8e
bump version to v0.0.12 (#604)
lvhan028 Oct 24, 2023
7283781
Add "build from docker" section (#602)
lvhan028 Oct 25, 2023
169d516
Add more user-friendly CLI (#541)
RunningLeon Oct 25, 2023
ac3500b
support inference a batch of prompts (#467)
AllentDan Oct 25, 2023
56942c4
bump version to v0.0.13 (#620)
lvhan028 Oct 30, 2023
373bd01
Improve api_server and webui usage (#544)
AllentDan Nov 1, 2023
6e91e5c
fix: gradio gr.Button.update deprecated after 4.0.0 (#637)
hscspring Nov 3, 2023
1bbc6e0
add cli to list the supported model names (#639)
RunningLeon Nov 3, 2023
823ad84
Refactor model conversion (#296)
irexyc Nov 3, 2023
994027f
[Enchance] internlm message to prompt (#499)
Harold-lkk Nov 3, 2023
15d1cc2
update turbomind session_len with model.session_len (#634)
AllentDan Nov 3, 2023
c15fbf4
[Fix] Qwen's quantization results are abnormal & Baichuan cannot be q…
pppppM Nov 3, 2023
85d2f66
FIX: fix stop_session func bug (#578)
yunzhongyan0 Nov 6, 2023
11d1093
Manage session id using random int for gradio local mode (#553)
aisensiy Nov 6, 2023
529e56b
fix benchmark serving computation mistake (#630)
AllentDan Nov 8, 2023
9febf61
fix tokenizer_info when convert the model (#661)
irexyc Nov 8, 2023
013000d
Add check env sub command (#654)
RunningLeon Nov 8, 2023
18170ee
fix Tokenizer load error when the path of the being-converted model …
irexyc Nov 9, 2023
7749128
Add UltraCM and WizardLM chat templates (#599)
AllentDan Nov 9, 2023
7b20cfd
bump version to v0.0.14 (#663)
lvhan028 Nov 9, 2023
0612596
Add extra_requires to reduce dependencies (#580)
RunningLeon Nov 10, 2023
ab1767c
TurboMind 2 (#590)
lzhangzz Nov 10, 2023
e641dd8
[Docs] Update Supported Matrix (#679)
pppppM Nov 13, 2023
b7c88ca
update kv8 docs (#681)
pppppM Nov 13, 2023
4eb8dd8
Fix init of batch state (#682)
lzhangzz Nov 14, 2023
7d40d19
fix turbomind stream canceling (#686)
grimoire Nov 15, 2023
0fcc303
[Fix] Fix load_checkpoint_in_model bug (#690)
HIT-cwh Nov 16, 2023
c02e281
[Doc] Update restful api doc (#662)
AllentDan Nov 19, 2023
07640a3
Fix Tokenizer encode (#645)
AllentDan Nov 19, 2023
65d735b
Fix wrong eos_id and bos_id obtained through grpc api (#644)
lvhan028 Nov 20, 2023
911c0a8
Optimize for throughput (#701)
lzhangzz Nov 20, 2023
73386e2
Check-in user guide about turbomind config (#680)
lvhan028 Nov 20, 2023
42e57c8
Replace mmengine with mmengine-lite (#715)
zhouzaida Nov 21, 2023
6b00f62
Support loading hf model directly (#685)
irexyc Nov 22, 2023
434961c
Fix cache/output length calculation (#738)
lzhangzz Nov 23, 2023
d338635
bump version to v0.1.0a0 (#709)
lvhan028 Nov 23, 2023
a7c5007
[Fix] Skip empty batch (#747)
lzhangzz Nov 23, 2023
c07f60f
[Fix] build docker image failed since `packaging` is missing (#753)
lvhan028 Nov 24, 2023
4bcc4f1
[Fix] Rollback the data type of input_ids to TYPE_UINT32 in preproce…
lvhan028 Nov 27, 2023
7868cea
Set the default value of `max_context_token_num` 1 (#761)
lvhan028 Nov 27, 2023
a94cff8
rename pytorch poc
grimoire Nov 27, 2023
dfced00
fix lint
grimoire Nov 27, 2023
d267d31
add docstring
grimoire Nov 27, 2023
9d13761
add docstring
grimoire Nov 28, 2023
0024009
refactor patch
grimoire Nov 28, 2023
dfe3322
add recompute eviction support
grimoire Nov 28, 2023
2f80c55
fix typo (#769)
grimoire Nov 28, 2023
4744b28
add triton server test and workflow yml (#760)
RunningLeon Nov 29, 2023
7cbd2dd
recovery modeling
grimoire Nov 29, 2023
8c672a7
fix turbomind build on sm<80 (#754)
grimoire Nov 29, 2023
8add942
improvement(build): enable ninja and gold linker (#767)
tpoisonooo Nov 29, 2023
5c9e1e2
Report first-token-latency and token-latency percentiles (#736)
lvhan028 Nov 29, 2023
77efebb
convert model with hf repo_id (#774)
irexyc Nov 29, 2023
9c46b27
bump version to 0.1.0a1 (#776)
lvhan028 Nov 29, 2023
d3e2cee
Update benchmark user guide (#763)
lvhan028 Nov 29, 2023
b0f9d3f
Merge branch 'pytorch-poc' into rename-pytorch
grimoire Nov 30, 2023
809e7b3
add docstring
grimoire Nov 30, 2023
1e4fae6
add unified paging attention support
grimoire Nov 29, 2023
38a7a16
refactor block manager
grimoire Dec 1, 2023
7ba559b
do not alloc zero
grimoire Dec 2, 2023
816022e
Fix early exit condition in attention kernel (#788)
lzhangzz Dec 2, 2023
12dc3e1
add chat template for Yi (#779)
AllentDan Dec 4, 2023
2ba9082
Fix missed arguments when benchmark static inference performance (#787)
lvhan028 Dec 4, 2023
7f943a2
Unify prefill & decode passes (#775)
lzhangzz Dec 4, 2023
7990d25
add cuda12.1 build check ci (#782)
irexyc Dec 4, 2023
079f29b
auto upload cuda12.1 python pkg to release when create new tag (#784)
irexyc Dec 5, 2023
bd7c4e3
fix extra colon in InternLMChat7B (#796)
C1rN09 Dec 5, 2023
5b9e454
fix local kv head num (#806)
lvhan028 Dec 6, 2023
ebe90bc
Report the inference benchmark of models with different size (#794)
lvhan028 Dec 6, 2023
fddad30
bump version to v0.1.0a2 (#807)
lvhan028 Dec 6, 2023
2d5f5b3
fix out of bounds access (#809)
lzhangzz Dec 7, 2023
71011dd
update scheduler
grimoire Dec 8, 2023
e2efd55
optimize request
grimoire Dec 8, 2023
a54b16a
Simplify block manager (#812)
lzhangzz Dec 11, 2023
d5a8946
set smem size for repetition penalty kernel (#818)
lzhangzz Dec 11, 2023
aa83317
add mbgemm&mbgemv
grimoire Dec 11, 2023
d08a126
merge main
grimoire Dec 11, 2023
8e38536
fix recompute, fix mbgmm
grimoire Dec 15, 2023
08c1719
merge
grimoire Dec 18, 2023
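The commits above add unified paging attention support and refactor the block manager around fixed-size cache blocks. As a rough illustration only, here is a minimal sketch of a paged KV-cache block manager; the class and method names (`BlockManager`, `append`, `release`) are assumptions for illustration, not the PR's actual implementation.

```python
from typing import Dict, List


class BlockManager:
    """Toy paged KV-cache manager: every sequence owns a block table mapping
    its logical cache slots onto fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[int, List[int]] = {}  # seq_id -> physical block ids
        self.num_tokens: Dict[int, int] = {}          # seq_id -> tokens cached so far

    def can_append(self, seq_id: int, n: int) -> bool:
        """Check whether n more tokens fit without evicting anything."""
        return self._blocks_needed(seq_id, n) <= len(self.free_blocks)

    def append(self, seq_id: int, n: int) -> List[int]:
        """Reserve room for n more tokens and return the sequence's block table."""
        needed = self._blocks_needed(seq_id, n)
        if needed > len(self.free_blocks):
            raise MemoryError('out of cache blocks: evict or preempt a sequence')
        table = self.block_tables.setdefault(seq_id, [])
        table.extend(self.free_blocks[:needed])
        del self.free_blocks[:needed]
        self.num_tokens[seq_id] = self.num_tokens.get(seq_id, 0) + n
        return table

    def release(self, seq_id: int) -> None:
        """Return a finished (or evicted) sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)

    def _blocks_needed(self, seq_id: int, n: int) -> int:
        used = self.num_tokens.get(seq_id, 0)
        have = len(self.block_tables.get(seq_id, []))
        total = -(-(used + n) // self.block_size)  # ceil((used + n) / block_size)
        return max(0, total - have)
```

Paging the cache this way lets a sequence grow block by block, and lets a scheduler free an evicted sequence's blocks and recompute it later, which is the kind of pattern the "add recompute eviction support" commit suggests.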
Changes from 1 commit: Add more user-friendly CLI (#541)
* add

* import fire in main

* wrap to speed up fire cli (see the sketch after this commit message)

* update

* update docs

* update docs

* fix

* resolve comments

* resolve conflict and add test for cli
RunningLeon authored Oct 25, 2023
commit 169d5169fe4f805f39eef4a5b0aa2fe480190afe
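The "wrap to speed up fire cli" bullet refers to keeping the CLI entry point light so that `fire` can build its command tree without importing heavy backends up front. A minimal sketch of that pattern follows, assuming hypothetical module and command names (`CLI`, `convert`, `chat`, `some_package`); it does not reproduce lmdeploy's actual module layout.

```python
import fire


class CLI:
    """Top-level command group; each subcommand defers its heavy imports."""

    def convert(self, model_name: str, model_path: str):
        """Convert a model; the converter is imported only when this runs."""
        # Hypothetical import, shown only to illustrate the lazy pattern.
        from some_package.converter import convert_model
        convert_model(model_name, model_path)

    def chat(self, backend: str, model_path: str):
        """Chat with a model; the inference backend is loaded lazily."""
        # Hypothetical import, shown only to illustrate the lazy pattern.
        from some_package.engine import build_engine
        build_engine(backend, model_path).chat()


def main():
    # fire builds the command tree from CLI's methods without triggering
    # the heavy imports above, so `--help` and dispatch stay fast.
    fire.Fire(CLI)


if __name__ == '__main__':
    main()
```

With such a wrapper exposed as a console script, only the chosen subcommand pays its import cost.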
20 changes: 10 additions & 10 deletions README.md
@@ -119,14 +119,14 @@ git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internl
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

```

#### Inference by TurboMind

```shell
python -m lmdeploy.turbomind.chat ./workspace
lmdeploy chat turbomind ./workspace
```

> **Note**<br />
@@ -140,7 +140,7 @@ python -m lmdeploy.turbomind.chat ./workspace
#### Serving with gradio

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace
lmdeploy serve gradio ./workspace
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -150,23 +150,23 @@ python3 -m lmdeploy.serve.gradio.app ./workspace
Launch inference server by:

```shell
python3 -m lmdeploy.serve.openai.api_server ./workspace server_ip server_port --instance_num 32 --tp 1
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```

Then, you can communicate with it by command line,

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
python -m lmdeploy.serve.openai.api_client restful_api_url
lmdeploy serve api_client restful_api_url
```

or webui,

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
python -m lmdeploy.serve.gradio.app restful_api_url server_ip --restful_api True
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
```

Refer to [restful_api.md](docs/en/restful_api.md) for more details.
@@ -182,13 +182,13 @@ bash workspace/service_docker_up.sh
Then, you can communicate with the inference server by command line,

```shell
python3 -m lmdeploy.serve.client {server_ip_addresss}:33337
lmdeploy serve triton_client {server_ip_addresss}:33337
```

or webui,

```shell
python3 -m lmdeploy.serve.gradio.app {server_ip_addresss}:33337
lmdeploy serve gradio {server_ip_addresss}:33337
```

For the deployment of other supported models, such as LLaMA, LLaMA-2, vicuna and so on, you can find the guide from [here](docs/en/serving.md)
@@ -200,7 +200,7 @@ For detailed instructions on Inference pytorch models, see [here](docs/en/pytorc
#### Single GPU

```shell
python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL \
lmdeploy chat torch $NAME_OR_PATH_TO_HF_MODEL \
--max_new_tokens 64 \
--temperture 0.8 \
--top_p 0.95 \
20 changes: 10 additions & 10 deletions README_zh-CN.md
@@ -120,14 +120,14 @@ git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internl
GIT_LFS_SKIP_SMUDGE=1

# 2. 转换为 trubomind 要求的格式。默认存放路径为 ./workspace
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

```

#### 使用 turbomind 推理

```shell
python3 -m lmdeploy.turbomind.chat ./workspace
lmdeploy chat turbomind ./workspace
```

> **Note**<br />
@@ -140,7 +140,7 @@ python3 -m lmdeploy.turbomind.chat ./workspace
#### 启动 gradio server

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace
lmdeploy serve gradio ./workspace
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -150,23 +150,23 @@ python3 -m lmdeploy.serve.gradio.app ./workspace
使用下面的命令启动推理服务:

```shell
python3 -m lmdeploy.serve.openai.api_server ./workspace server_ip server_port --instance_num 32 --tp 1
lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${server_port} --instance_num 32 --tp 1
```

你可以通过命令行方式与推理服务进行对话:

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
python -m lmdeploy.serve.openai.api_client restful_api_url
lmdeploy serve api_client restful_api_url
```

也可以通过 WebUI 方式来对话:

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
python -m lmdeploy.serve.gradio.app restful_api_url server_ip --restful_api True
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
```

更多详情可以查阅 [restful_api.md](docs/zh_cn/restful_api.md)
@@ -182,13 +182,13 @@ bash workspace/service_docker_up.sh
你可以通过命令行方式与推理服务进行对话:

```shell
python3 -m lmdeploy.serve.client {server_ip_addresss}:33337
lmdeploy serve triton_client {server_ip_addresss}:33337
```

也可以通过 WebUI 方式来对话:

```shell
python3 -m lmdeploy.serve.gradio.app {server_ip_addresss}:33337
lmdeploy serve gradio {server_ip_addresss}:33337
```

其他模型的部署方式,比如 LLaMA,LLaMA-2,vicuna等等,请参考[这里](docs/zh_cn/serving.md)
@@ -204,7 +204,7 @@ pip install deepspeed
#### 单个 GPU

```shell
python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL \
lmdeploy chat torch $NAME_OR_PATH_TO_HF_MODEL \
--max_new_tokens 64 \
--temperture 0.8 \
--top_p 0.95 \
8 changes: 4 additions & 4 deletions docs/en/kv_int8.md
@@ -18,7 +18,7 @@ dequant: f = q * scale + zp
Convert the Hugging Face model format to the TurboMind inference format to create a workspace directory.

```bash
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b
```

If you already have a workspace directory, skip this step.
@@ -29,15 +29,15 @@ Get the quantization parameters by these two steps:

```bash
# get minmax
python3 -m lmdeploy.lite.apis.calibrate \
lmdeploy lite calibrate \
--model $HF_MODEL \
--calib_dataset 'c4' \ # Support c4, ptb, wikitext2, pileval
--calib_samples 128 \ # Number of samples in the calibration set, if the memory is not enough, it can be adjusted appropriately
--calib_seqlen 2048 \ # Length of a single text, if the memory is not enough, you can adjust it appropriately
--work_dir $WORK_DIR \ # Directory for saving quantized statistical parameters and quantized weights in Pytorch format

# get quant parameters
python3 -m lmdeploy.lite.apis.kv_qparams \
lmdeploy lite kv_qparams \
--work_dir $WORK_DIR \ # Directory of the last output
--turbomind_dir workspace/triton_models/weights/ \ # Directory to save the quantization parameters
--kv_sym False \ # Symmetric or asymmetric quantization, default is False
@@ -64,7 +64,7 @@ Considering there are four combinations of kernels needed to be implemented, pre
Test the chat performance.

```bash
python3 -m lmdeploy.turbomind.chat ./workspace
lmdeploy chat turbomind ./workspace
```

## GPU Memory Test
6 changes: 3 additions & 3 deletions docs/en/pytorch.md
@@ -9,21 +9,21 @@ This submodule allow user to chat with language model through command line, and
**Example 1**: Chat with default setting

```shell
python -m lmdeploy.pytorch.chat $PATH_TO_HF_MODEL
lmdeploy chat torch $PATH_TO_HF_MODEL
```

**Example 2**: Disable sampling and chat history

```shell
python -m lmdeploy.pytorch.chat \
lmdeploy chat torch \
$PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \
--temperature 0 --max-history 0
```

**Example 3**: Accelerate with deepspeed inference

```shell
python -m lmdeploy.pytorch.chat \
lmdeploy chat torch \
$PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \
--accel deepspeed
```
8 changes: 4 additions & 4 deletions docs/en/restful_api.md
@@ -3,7 +3,7 @@
### Launch Service

```shell
python3 -m lmdeploy.serve.openai.api_server ./workspace 0.0.0.0 server_port --instance_num 32 --tp 1
lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${server_port} --instance_num 32 --tp 1
```

Then, the user can open the swagger UI: `http://{server_ip}:{server_port}` for the detailed api usage.
@@ -125,7 +125,7 @@ There is a client script for restful api server.

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
python -m lmdeploy.serve.openai.api_client restful_api_url
lmdeploy serve api_client restful_api_url
```

### webui
@@ -135,8 +135,8 @@ You can also test restful-api through webui.
```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
python -m lmdeploy.serve.gradio.app restful_api_url server_ip --restful_api True
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
```

### FAQ
18 changes: 9 additions & 9 deletions docs/en/serving.md
@@ -8,7 +8,7 @@ You can download [llama-2 models from huggingface](https://huggingface.co/meta-l
<summary><b>7B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-7b-chat-hf
lmdeploy convert llama2 /path/to/llama-2-7b-chat-hf
bash workspace/service_docker_up.sh
```

@@ -18,7 +18,7 @@ bash workspace/service_docker_up.sh
<summary><b>13B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-13b-chat-hf --tp 2
lmdeploy convert llama2 /path/to/llama-2-13b-chat-hf --tp 2
bash workspace/service_docker_up.sh
```

@@ -28,7 +28,7 @@ bash workspace/service_docker_up.sh
<summary><b>70B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-70b-chat-hf --tp 8
lmdeploy convert llama2 /path/to/llama-2-70b-chat-hf --tp 8
bash workspace/service_docker_up.sh
```

@@ -42,7 +42,7 @@ Weights for the LLaMA models can be obtained from by filling out [this form](htt
<summary><b>7B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-7b llama \
lmdeploy convert llama /path/to/llama-7b llama \
--tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh
```
@@ -53,7 +53,7 @@ bash workspace/service_docker_up.sh
<summary><b>13B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-13b llama \
lmdeploy convert llama /path/to/llama-13b llama \
--tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh
```
@@ -64,7 +64,7 @@ bash workspace/service_docker_up.sh
<summary><b>30B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-30b llama \
lmdeploy convert llama /path/to/llama-30b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh
```
@@ -75,7 +75,7 @@ bash workspace/service_docker_up.sh
<summary><b>65B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-65b llama \
lmdeploy convert llama /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh
```
@@ -94,7 +94,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-7b \
--delta-path lmsys/vicuna-7b-delta-v1.1

python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-7b
lmdeploy convert vicuna /path/to/vicuna-7b
bash workspace/service_docker_up.sh
```

@@ -110,7 +110,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-13b \
--delta-path lmsys/vicuna-13b-delta-v1.1

python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-13b
lmdeploy convert vicuna /path/to/vicuna-13b
bash workspace/service_docker_up.sh
```

18 changes: 9 additions & 9 deletions docs/en/supported_models/codellama.md
@@ -29,7 +29,7 @@ Based on the above table, download the model that meets your requirements. Execu
python3 -m pip install lmdeploy

# convert weight layout
python3 -m lmdeploy.serve.turbomind.deploy codellama /the/path/of/codellama/model
lmdeploy convert codellama /the/path/of/codellama/model
```

Then, you can communicate with codellama in consolo by following instructions in next sections
@@ -42,13 +42,13 @@ Then, you can communicate with codellama in consolo by following instructions in
### Completion

```shell
python3 -m lmdeploy.turbomind.chat ./workspace --cap completion
lmdeploy chat turbomind ./workspace --cap completion
```

### Infilling

```shell
python3 -m lmdeploy.turbomind.chat ./workspace --cap infilling
lmdeploy chat turbomind ./workspace --cap infilling
```

The input code is supposed to have a special placeholder `<FILL>`. For example,
@@ -64,15 +64,15 @@ And the generated code piece by `turbomind.chat` is the one to be filled in `<FI
### Chat

```
python3 -m lmdeploy.turbomind.chat ./workspace --cap chat --sys-instruct "Provide answers in Python"
lmdeploy chat turbomind ./workspace --cap chat --sys-instruct "Provide answers in Python"
```

`--sys-instruct` instruction can be changed to other coding languages as long as codellama supports it

### Python specialist

```
python3 -m lmdeploy.turbomind.chat ./workspace --cap python
lmdeploy chat turbomind ./workspace --cap python
```

Python fine-tuned model is highly recommended when 'python specialist' capability is required.
@@ -90,23 +90,23 @@ Launch inference server by:
```shell
# --instance_num: number of instances to performance inference, which can be viewed as max requests concurrency
# --tp: the number of GPUs used in tensor parallelism
python3 -m lmdeploy.serve.openai.api_server ./workspace server_ip server_port --instance_num 32 --tp 1
lmdeploy serve api_server ./workspace --server_name ${server_ip} --server_port ${server_port} --instance_num 32 --tp 1
```

Then, you can communicate with it by command line,

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
python -m lmdeploy.serve.openai.api_client restful_api_url
lmdeploy serve api_client restful_api_url
```

or through webui after launching gradio,

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
python -m lmdeploy.serve.gradio.app restful_api_url server_ip --restful_api True
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
```

Regarding the detailed information of RESTful API, you can refer to [restful_api.md](../restful_api.md).