Merge branch 'main' into slora-pipe
Conflicts:
	lmdeploy/serve/async_engine.py
AllentDan committed Mar 15, 2024
2 parents 5bd31fb + 4620ed3 commit 3c3e373
Showing 44 changed files with 1,482 additions and 146 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -27,7 +27,7 @@ ______________________________________________________________________
<summary><b>2024</b></summary>

- \[2024/02\] Support Qwen 1.5, Gemma, Mistral, Mixtral, Deepseek-MOE and so on.
- \[2024/01\] [OpenAOE](https://github.com/InternLM/OpenAOE) seamless integration with [LMDeploy Serving Service](./docs/en/serving/restful_api.md).
- \[2024/01\] [OpenAOE](https://github.com/InternLM/OpenAOE) seamless integration with [LMDeploy Serving Service](./docs/en/serving/api_server.md).
- \[2024/01\] Support for multi-model, multi-machine, multi-card inference services. For usage instructions, please refer to [here](./docs/en/serving/proxy_server.md)
- \[2024/01\] Support [PyTorch inference engine](./docs/en/inference/pytorch.md), developed entirely in Python, helping to lower the barriers for developers and enable rapid experimentation with new features and technologies.

@@ -152,7 +152,7 @@ For detailed user guides and advanced guides, please refer to our [tutorials](ht
- [Inference pipeline](./docs/en/inference/pipeline.md)
- [Inference Engine - TurboMind](docs/en/inference/turbomind.md)
- [Inference Engine - PyTorch](docs/en/inference/pytorch.md)
- [Serving](docs/en/serving/restful_api.md)
- [Serving](docs/en/serving/api_server.md)
- [Quantization](docs/en/quantization)
- Advance Guide
- Add chat template
4 changes: 2 additions & 2 deletions README_zh-CN.md
@@ -27,7 +27,7 @@ ______________________________________________________________________
<summary><b>2024</b></summary>

- \[2024/02\] Support Qwen 1.5, Gemma, Mistral, Mixtral, Deepseek-MOE and other models
- \[2024/01\] [OpenAOE](https://github.com/InternLM/OpenAOE) released, enabling seamless integration with [LMDeploy Serving Service](./docs/zh_cn/serving/restful_api.md)
- \[2024/01\] [OpenAOE](https://github.com/InternLM/OpenAOE) released, enabling seamless integration with [LMDeploy Serving Service](./docs/zh_cn/serving/api_server.md)
- \[2024/01\] Support multi-model, multi-machine, multi-card inference services. For usage instructions, please refer to [here](./docs/zh_cn/serving/proxy_server.md)
- \[2024/01\] Add the [PyTorch inference engine](./docs/zh_cn/inference/pytorch.md) as a complement to the TurboMind engine, helping to lower the development barrier and enable rapid experimentation with new features and technologies

@@ -153,7 +153,7 @@ print(response)
- [Inference pipeline](./docs/zh_cn/inference/pipeline.md)
- [Inference Engine - TurboMind](./docs/zh_cn/inference/turbomind.md)
- [Inference Engine - PyTorch](./docs/zh_cn/inference/pytorch.md)
- [Serving](./docs/zh_cn/serving/restful_api.md)
- [Serving](./docs/zh_cn/serving/api_server.md)
- [Quantization](./docs/zh_cn/quantization)
- Advanced Guide
- Add chat template
4 changes: 2 additions & 2 deletions benchmark/profile_serving.py
@@ -83,7 +83,6 @@ def _inference(self, req_queue: Queue, res_queue: Queue, session_id: int,

chatbot = Chatbot(self.server_addr,
                  ignore_eos=True,
                  profile_serving=True,
                  top_k=self.top_k,
                  top_p=self.top_p,
                  temperature=self.temperature,
@@ -153,7 +152,8 @@ def process_request(self,
session_id, _stats = res_queue.get()
# print(f'\n{"-" * 50}\n'
# f'session {session_id} stats: \n{_stats}\n{"-" * 50}\n')
stats.append(np.array(_stats))
if len(_stats) != 0:
    stats.append(np.array(_stats))

stats = np.concatenate(stats).reshape(-1, 5)

2 changes: 1 addition & 1 deletion docs/en/benchmark/profile_api_server.md
@@ -41,7 +41,7 @@ In this section, we take [internlm/internlm-7b](https://huggingface.co/internlm/
lmdeploy serve api_server internlm/internlm-7b
```

If you would like to change the server's port or other parameters, such as the inference engine, max batch size, etc., please run `lmdeploy serve api_server -h` or read [this](../serving/restful_api.md) guide for a detailed explanation.
If you would like to change the server's port or other parameters, such as the inference engine, max batch size, etc., please run `lmdeploy serve api_server -h` or read [this](../serving/api_server.md) guide for a detailed explanation.

### Profile

3 changes: 2 additions & 1 deletion docs/en/index.rst
@@ -39,6 +39,7 @@ Welcome to LMDeploy's tutorials!
:caption: Inference

inference/pipeline.md
inference/vl_pipeline.md
inference/turbomind.md
inference/turbomind_config.md
inference/pytorch.md
@@ -48,7 +49,7 @@ Welcome to LMDeploy's tutorials!
:maxdepth: 1
:caption: serving

serving/restful_api.md
serving/api_server.md
serving/gradio.md
serving/proxy_server.md

146 changes: 146 additions & 0 deletions docs/en/inference/vl_pipeline.md
@@ -0,0 +1,146 @@
# VLM Offline Inference Pipeline

LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference [pipeline](./pipeline.md).
In this article, we take the [liuhaotian/llava-v1.6-vicuna-7b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b) model as an example and showcase the capabilities of the VLM pipeline through various examples.
First, we demonstrate the most basic usage of the pipeline, then progressively unlock additional functionality by configuring engine parameters and generation arguments, such as tensor parallelism, context window size, random sampling, and custom chat templates. Finally, we provide inference examples for scenarios involving multiple images, batch prompts, and more.

## A 'Hello, world' example

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```

If an `ImportError` occurs while running this example, please install the required dependency packages as prompted.

In the above example, the inference prompt is a tuple of (prompt, image). Besides this structure, the pipeline also supports prompts in the OpenAI format:

```python
from lmdeploy import pipeline

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')

prompts = [
    {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url', 'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}}
        ]
    }
]
response = pipe(prompts)
print(response)
```

### Set tensor parallelism

Tensor parallelism can be activated by setting the engine parameter `tp`:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(tp=2))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```

### Set context window size

When creating the pipeline, you can customize the size of the context window by setting the engine parameter `session_len`.

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(session_len=8192))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```

### Set sampling parameters

You can change the default sampling parameters of the pipeline by passing a `GenerationConfig`:

```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(tp=2, session_len=8192))
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.6)
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image), gen_config=gen_config)
print(response)
```

### Set chat template

During inference, LMDeploy identifies an appropriate chat template from its built-in collection based on the model path and applies it to the input prompts. However, when the chat template cannot be inferred from the model path, users have to specify it explicitly. For example, liuhaotian/llava-v1.5-7b uses the 'vicuna' chat template, but the name 'vicuna' cannot be deduced from the model path. We can specify it by passing 'vicuna' to `ChatTemplateConfig` as follows:

```python
from lmdeploy import pipeline, ChatTemplateConfig
from lmdeploy.vl import load_image
pipe = pipeline('liuhaotian/llava-v1.5-7b',
                chat_template_config=ChatTemplateConfig(model_name='vicuna'))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```

For more information about customizing a chat template, please refer to [this](../advance/chat_template.md) guide.

## Multi-image inference

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the [context window](#set-context-window-size) typically needs to be increased.

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(session_len=8192))

image_urls=[
'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
response = pipe(('describe these images', images))
print(response)
```

## Batch prompts inference

Conducting inference with batch prompts is quite straightforward; just place them within a list structure:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(session_len=8192))

image_urls=[
"https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
"https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)
```
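
The pipeline returns one result per prompt when given a list. Below is a minimal sketch of iterating over the batch results from the example above; it assumes each element behaves like the LLM pipeline's response object and exposes a `text` attribute, which may need adapting to your LMDeploy version.

```python
# A hedged sketch (not part of the original example): walk through the batch
# results produced by `pipe(prompts)` above.
# Assumption: each element exposes a `text` attribute, mirroring the LLM pipeline.
for idx, res in enumerate(response):
    print(f'--- prompt {idx} ---')
    print(res.text)
```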
2 changes: 1 addition & 1 deletion docs/en/quantization/w4a16.md
@@ -131,4 +131,4 @@ The default port of `api_server` is `23333`. After the server is launched, you c
lmdeploy serve api_client http://0.0.0.0:23333
```

You can overview and try out the `api_server` APIs online through the Swagger UI at `http://0.0.0.0:23333`, or read the API specification from [here](../serving/restful_api.md).
You can overview and try out the `api_server` APIs online through the Swagger UI at `http://0.0.0.0:23333`, or read the API specification from [here](../serving/api_server.md).
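
As an illustration, the following is a minimal sketch of calling the OpenAI-compatible chat completions route with `requests`; the default port `23333` and the model id are assumptions to adjust to your deployment (served model ids can be listed via `GET /v1/models`).

```python
# A hedged sketch (not from the original doc): call the OpenAI-compatible
# chat completions endpoint exposed by `api_server`.
import requests

url = 'http://0.0.0.0:23333/v1/chat/completions'   # assumption: default port
payload = {
    'model': 'internlm-7b',   # assumption: use an id returned by GET /v1/models
    'messages': [{'role': 'user', 'content': 'Hi, please introduce yourself'}],
    'temperature': 0.8,
}
resp = requests.post(url, json=payload, timeout=60)
print(resp.json()['choices'][0]['message']['content'])
```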
docs/en/serving/restful_api.md → docs/en/serving/api_server.md
@@ -1,4 +1,4 @@
# Serving with OpenAI Compatible Server
# Serving LLM with OpenAI Compatible Server

This article primarily discusses the deployment of a single LLM across multiple GPUs on a single node, providing a service that is compatible with the OpenAI interface, as well as the usage of the service API.
For the sake of convenience, we refer to this service as `api_server`. Regarding parallel services with multiple models, please refer to the guide about [Request Distribution Server](./proxy_server.md).
2 changes: 1 addition & 1 deletion docs/en/supported_models/codellama.md
@@ -108,4 +108,4 @@ or through webui after launching gradio,
lmdeploy serve gradio api_server_url --server-name ${gradio_ui_ip} --server-port ${gradio_ui_port}
```

For detailed information about the RESTful API, please refer to [restful_api.md](../serving/restful_api.md).
For detailed information about the RESTful API, please refer to the [guide](../serving/api_server.md).
2 changes: 1 addition & 1 deletion docs/zh_cn/benchmark/profile_api_server.md
@@ -41,7 +41,7 @@ $$
lmdeploy serve api_server internlm/internlm-7b
```

If you want to change the server's port, or parameters such as the inference engine and the maximum batch size, please run `lmdeploy serve api_server -h` or read [this document](../serving/restful_api.md) for detailed parameter descriptions.
If you want to change the server's port, or parameters such as the inference engine and the maximum batch size, please run `lmdeploy serve api_server -h` or read [this document](../serving/api_server.md) for detailed parameter descriptions.

### Profile

3 changes: 2 additions & 1 deletion docs/zh_cn/index.rst
@@ -40,6 +40,7 @@
:caption: Inference

inference/pipeline.md
inference/vl_pipeline.md
inference/turbomind.md
inference/turbomind_config.md
inference/pytorch.md
@@ -49,7 +50,7 @@
:maxdepth: 1
:caption: Serving

serving/restful_api.md
serving/api_server.md
serving/gradio.md
serving/proxy_server.md
