Merge branch 'main' into slora-pipe
Conflicts:
	lmdeploy/serve/async_engine.py
AllentDan committed Mar 15, 2024
2 parents 5bd31fb + 4620ed3 commit 3c3e373
Showing 44 changed files with 1,482 additions and 146 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -27,7 +27,7 @@ ______________________________________________________________________
<summary><b>2024</b></summary>

- \[2024/02\] Support Qwen 1.5, Gemma, Mistral, Mixtral, Deepseek-MOE and so on.
- \[2024/01\] [OpenAOE](https://github.com/InternLM/OpenAOE) seamless integration with [LMDeploy Serving Service](./docs/en/serving/restful_api.md).
- \[2024/01\] [OpenAOE](https://github.com/InternLM/OpenAOE) seamless integration with [LMDeploy Serving Service](./docs/en/serving/api_server.md).
- \[2024/01\] Support for multi-model, multi-machine, multi-card inference services. For usage instructions, please refer to [here](./docs/en/serving/proxy_server.md)
- \[2024/01\] Support [PyTorch inference engine](./docs/en/inference/pytorch.md), developed entirely in Python, helping to lower the barriers for developers and enable rapid experimentation with new features and technologies.

@@ -152,7 +152,7 @@ For detailed user guides and advanced guides, please refer to our [tutorials](ht
- [Inference pipeline](./docs/en/inference/pipeline.md)
- [Inference Engine - TurboMind](docs/en/inference/turbomind.md)
- [Inference Engine - PyTorch](docs/en/inference/pytorch.md)
- [Serving](docs/en/serving/restful_api.md)
- [Serving](docs/en/serving/api_server.md)
- [Quantization](docs/en/quantization)
- Advance Guide
- Add chat template
4 changes: 2 additions & 2 deletions README_zh-CN.md
@@ -27,7 +27,7 @@ ______________________________________________________________________
<summary><b>2024</b></summary>

- \[2024/02\] Support Qwen 1.5, Gemma, Mistral, Mixtral, Deepseek-MOE and other models
- \[2024/01\] [OpenAOE](https://github.com/InternLM/OpenAOE) released, enabling seamless integration with [LMDeploy Serving Service](./docs/zh_cn/serving/restful_api.md)
- \[2024/01\] [OpenAOE](https://github.com/InternLM/OpenAOE) released, enabling seamless integration with [LMDeploy Serving Service](./docs/zh_cn/serving/api_server.md)
- \[2024/01\] Support multi-model, multi-machine, multi-card inference services. For usage instructions, please refer to [here](./docs/zh_cn/serving/proxy_server.md)
- \[2024/01\] Add the [PyTorch inference engine](./docs/zh_cn/inference/pytorch.md) as a complement to the TurboMind engine, helping to lower the development barrier and enable rapid experimentation with new features and technologies

@@ -153,7 +153,7 @@ print(response)
- [Inference pipeline](./docs/zh_cn/inference/pipeline.md)
- [Inference Engine - TurboMind](./docs/zh_cn/inference/turbomind.md)
- [Inference Engine - PyTorch](./docs/zh_cn/inference/pytorch.md)
- [Serving](./docs/zh_cn/serving/restful_api.md)
- [Serving](./docs/zh_cn/serving/api_server.md)
- [Quantization](./docs/zh_cn/quantization)
- Advanced Guide
- Add chat template
4 changes: 2 additions & 2 deletions benchmark/profile_serving.py
@@ -83,7 +83,6 @@ def _inference(self, req_queue: Queue, res_queue: Queue, session_id: int,

chatbot = Chatbot(self.server_addr,
                  ignore_eos=True,
                  profile_serving=True,
                  top_k=self.top_k,
                  top_p=self.top_p,
                  temperature=self.temperature,
@@ -153,7 +152,8 @@ def process_request(self,
session_id, _stats = res_queue.get()
# print(f'\n{"-" * 50}\n'
# f'session {session_id} stats: \n{_stats}\n{"-" * 50}\n')
stats.append(np.array(_stats))
if len(_stats) != 0:
    stats.append(np.array(_stats))

stats = np.concatenate(stats).reshape(-1, 5)

2 changes: 1 addition & 1 deletion docs/en/benchmark/profile_api_server.md
@@ -41,7 +41,7 @@ In this section, we take [internlm/internlm-7b](https://huggingface.co/internlm/
lmdeploy serve api_server internlm/internlm-7b
```

If you would like to change the server's port or other parameters, such as the inference engine, max batch size, etc., please run `lmdeploy serve api_server -h` or read [this](../serving/restful_api.md) guide for a detailed explanation.
If you would like to change the server's port or other parameters, such as the inference engine, max batch size, etc., please run `lmdeploy serve api_server -h` or read [this](../serving/api_server.md) guide for a detailed explanation.

### Profile

3 changes: 2 additions & 1 deletion docs/en/index.rst
@@ -39,6 +39,7 @@ Welcome to LMDeploy's tutorials!
:caption: Inference

inference/pipeline.md
inference/vl_pipeline.md
inference/turbomind.md
inference/turbomind_config.md
inference/pytorch.md
@@ -48,7 +49,7 @@ Welcome to LMDeploy's tutorials!
:maxdepth: 1
:caption: serving

serving/restful_api.md
serving/api_server.md
serving/gradio.md
serving/proxy_server.md

146 changes: 146 additions & 0 deletions docs/en/inference/vl_pipeline.md
@@ -0,0 +1,146 @@
# VLM Offline Inference Pipeline

LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference [pipeline](./pipeline.md).
In this article, we take the [liuhaotian/llava-v1.6-vicuna-7b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b) model as an example and showcase the capabilities of the VLM pipeline through various examples.
First, we demonstrate the most basic usage of the pipeline, then progressively unlock additional functionality by configuring engine parameters and generation arguments, such as tensor parallelism, context window size, random sampling, and custom chat templates. Finally, we provide inference examples for scenarios involving multiple images, batch prompts, and more.

## A 'Hello, world' example

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```

If an `ImportError` occurs while running this example, please install the required dependency packages as prompted.

In the above example, the inference prompt is a tuple of (prompt, image). Besides this structure, the pipeline also supports prompts in the OpenAI format:

```python
from lmdeploy import pipeline

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')

prompts = [
    {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url', 'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}}
        ]
    }
]
response = pipe(prompts)
print(response)
```

### Set tensor parallelism

Tensor parallelism can be activated by setting the engine parameter `tp`:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(tp=2))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```

### Set context window size

When creating the pipeline, you can customize the size of the context window by setting the engine parameter `session_len`.

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(session_len=8192))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```

### Set sampling parameters

You can change the default sampling parameters of the pipeline by passing a `GenerationConfig`:

```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(tp=2, session_len=8192))
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.6)
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image), gen_config=gen_config)
print(response)
```

### Set chat template

During inference, LMDeploy identifies an appropriate chat template from its built-in collection based on the model path and applies it to the input prompts. However, when the chat template cannot be inferred from the model path, users have to specify it explicitly. For example, liuhaotian/llava-v1.5-7b uses the 'vicuna' chat template, but the name 'vicuna' cannot be deduced from the model path. We can specify it by passing 'vicuna' to `ChatTemplateConfig` as follows:

```python
from lmdeploy import pipeline, ChatTemplateConfig
from lmdeploy.vl import load_image
pipe = pipeline('liuhaotian/llava-v1.5-7b',
                chat_template_config=ChatTemplateConfig(model_name='vicuna'))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```

For more information about customizing a chat template, please refer to [this](../advance/chat_template.md) guide.

## Multi-image inference

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the [context window](#set-context-window-size) typically needs to be increased.

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(session_len=8192))

image_urls=[
'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
response = pipe(('describe these images', images))
print(response)
```

## Batch prompts inference

Conducting inference with batch prompts is quite straightforward; just place them within a list structure:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(session_len=8192))

image_urls=[
"https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
"https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)
```
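
The pipeline returns one result per prompt when given a list. Below is a minimal sketch of iterating over the batch results from the example above; it assumes each element behaves like the LLM pipeline's response object and exposes a `text` attribute, which may need adapting to your LMDeploy version.

```python
# A hedged sketch (not part of the original example): walk through the batch
# results produced by `pipe(prompts)` above.
# Assumption: each element exposes a `text` attribute, mirroring the LLM pipeline.
for idx, res in enumerate(response):
    print(f'--- prompt {idx} ---')
    print(res.text)
```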
2 changes: 1 addition & 1 deletion docs/en/quantization/w4a16.md
@@ -131,4 +131,4 @@ The default port of `api_server` is `23333`. After the server is launched, you c
lmdeploy serve api_client http://0.0.0.0:23333
```

You can overview and try out the `api_server` APIs online through the Swagger UI at `http://0.0.0.0:23333`, or read the API specification from [here](../serving/restful_api.md).
You can overview and try out the `api_server` APIs online through the Swagger UI at `http://0.0.0.0:23333`, or read the API specification from [here](../serving/api_server.md).
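
As an illustration, the following is a minimal sketch of calling the OpenAI-compatible chat completions route with `requests`; the default port `23333` and the model id are assumptions to adjust to your deployment (served model ids can be listed via `GET /v1/models`).

```python
# A hedged sketch (not from the original doc): call the OpenAI-compatible
# chat completions endpoint exposed by `api_server`.
import requests

url = 'http://0.0.0.0:23333/v1/chat/completions'   # assumption: default port
payload = {
    'model': 'internlm-7b',   # assumption: use an id returned by GET /v1/models
    'messages': [{'role': 'user', 'content': 'Hi, please introduce yourself'}],
    'temperature': 0.8,
}
resp = requests.post(url, json=payload, timeout=60)
print(resp.json()['choices'][0]['message']['content'])
```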
docs/en/serving/restful_api.md → docs/en/serving/api_server.md
@@ -1,4 +1,4 @@
# Serving with OpenAI Compatible Server
# Serving LLM with OpenAI Compatible Server

This article primarily discusses the deployment of a single LLM across multiple GPUs on a single node, providing a service that is compatible with the OpenAI interface, as well as the usage of the service API.
For the sake of convenience, we refer to this service as `api_server`. Regarding parallel services with multiple models, please refer to the guide about [Request Distribution Server](./proxy_server.md).
2 changes: 1 addition & 1 deletion docs/en/supported_models/codellama.md
@@ -108,4 +108,4 @@ or through webui after launching gradio,
lmdeploy serve gradio api_server_url --server-name ${gradio_ui_ip} --server-port ${gradio_ui_port}
```

For detailed information about the RESTful API, please refer to [restful_api.md](../serving/restful_api.md).
For detailed information about the RESTful API, please refer to the [guide](../serving/api_server.md).
2 changes: 1 addition & 1 deletion docs/zh_cn/benchmark/profile_api_server.md
@@ -41,7 +41,7 @@ $$
lmdeploy serve api_server internlm/internlm-7b
```

If you want to change the server's port, or parameters such as the inference engine and the maximum batch size, please run `lmdeploy serve api_server -h` or read [this document](../serving/restful_api.md) for detailed parameter descriptions.
If you want to change the server's port, or parameters such as the inference engine and the maximum batch size, please run `lmdeploy serve api_server -h` or read [this document](../serving/api_server.md) for detailed parameter descriptions.

### Profile

3 changes: 2 additions & 1 deletion docs/zh_cn/index.rst
@@ -40,6 +40,7 @@
:caption: Inference

inference/pipeline.md
inference/vl_pipeline.md
inference/turbomind.md
inference/turbomind_config.md
inference/pytorch.md
@@ -49,7 +50,7 @@
:maxdepth: 1
:caption: Serving

serving/restful_api.md
serving/api_server.md
serving/gradio.md
serving/proxy_server.md
