diff --git a/docs/en/index.rst b/docs/en/index.rst
index d109a494c0..36ded55fd7 100644
--- a/docs/en/index.rst
+++ b/docs/en/index.rst
@@ -39,6 +39,7 @@ Welcome to LMDeploy's tutorials!
    :caption: Inference

    inference/pipeline.md
+   inference/vl_pipeline.md
    inference/turbomind.md
    inference/turbomind_config.md
    inference/pytorch.md
diff --git a/docs/en/inference/vl_pipeline.md b/docs/en/inference/vl_pipeline.md
new file mode 100644
index 0000000000..b48383a64a
--- /dev/null
+++ b/docs/en/inference/vl_pipeline.md
@@ -0,0 +1,146 @@
+# VLM Offline Inference Pipeline
+
+LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference [pipeline](./pipeline.md).
+In this article, we will take the [liuhaotian/llava-v1.6-vicuna-7b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b) model as an example and demonstrate the capabilities of the VLM pipeline through various examples.
+First, we will show the most basic usage of the pipeline, then progressively introduce additional features by configuring the engine parameters and generation arguments, such as tensor parallelism, context window size, random sampling and chat template customization. Next, we will provide inference examples for scenarios involving multiple images, batch prompts, etc.
+
+## A 'Hello, world' example
+
+```python
+from lmdeploy import pipeline
+from lmdeploy.vl import load_image
+
+pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')
+
+image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
+response = pipe(('describe this image', image))
+print(response)
+```
+
+If an `ImportError` occurs while running this example, please install the required dependency packages as prompted.
+
+In the above example, the inference prompt is a (prompt, image) tuple. Besides this structure, the pipeline also supports prompts in the OpenAI format:
+
+```python
+from lmdeploy import pipeline
+
+pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')
+
+prompts = [
+    {
+        'role': 'user',
+        'content': [
+            {'type': 'text', 'text': 'describe this image'},
+            {'type': 'image_url', 'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}}
+        ]
+    }
+]
+response = pipe(prompts)
+print(response)
+```
+
+### Set tensor parallelism
+
+Tensor parallelism can be enabled by setting the engine parameter `tp`:
+
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+from lmdeploy.vl import load_image
+
+pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
+                backend_config=TurbomindEngineConfig(tp=2))
+
+image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
+response = pipe(('describe this image', image))
+print(response)
+```
+
+### Set context window size
+
+When creating the pipeline, you can customize the size of the context window by setting the engine parameter `session_len`.
+
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+from lmdeploy.vl import load_image
+
+pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
+                backend_config=TurbomindEngineConfig(session_len=8192))
+
+image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
+response = pipe(('describe this image', image))
+print(response)
+```
+
+### Set sampling parameters
+
+You can change the pipeline's default sampling parameters by passing a `GenerationConfig`:
+
+```python
+from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
+from lmdeploy.vl import load_image
+
+pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
+                backend_config=TurbomindEngineConfig(tp=2, session_len=8192))
+gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.6)
+image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
+response = pipe(('describe this image', image), gen_config=gen_config)
+print(response)
+```
+
+### Set chat template
+
+During inference, LMDeploy selects an appropriate chat template from its built-in collection based on the model path and applies it to the input prompts. However, when the chat template cannot be inferred from the model path, users have to specify it explicitly. For example, [liuhaotian/llava-v1.5-7b](https://huggingface.co/liuhaotian/llava-v1.5-7b) uses the 'vicuna' chat template, but the name 'vicuna' cannot be deduced from the model path. In this case, we pass 'vicuna' to `ChatTemplateConfig` as follows:
+
+```python
+from lmdeploy import pipeline, ChatTemplateConfig
+from lmdeploy.vl import load_image
+pipe = pipeline('liuhaotian/llava-v1.5-7b',
+                chat_template_config=ChatTemplateConfig(model_name='vicuna'))
+
+image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
+response = pipe(('describe this image', image))
+print(response)
+```
+
+For more information about customizing a chat template, please refer to [this guide](../advance/chat_template.md).
+
+## Multi-image inference
+
+When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images lead to a larger number of input tokens, so the size of the [context window](#set-context-window-size) typically needs to be increased.
+ +```python +from lmdeploy import pipeline, TurbomindEngineConfig +from lmdeploy.vl import load_image + +pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', + backend_config=TurbomindEngineConfig(session_len=8192)) + +image_urls=[ + 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg', + 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg' +] + +images = [load_image(img_url) for img_url in image_urls] +response = pipe(('describe these images', images)) +print(response) +``` + +## Batch prompts inference + +Conducting inference with batch prompts is quite straightforward; just place them within a list structure: + +```python +from lmdeploy import pipeline, TurbomindEngineConfig +from lmdeploy.vl import load_image + +pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', + backend_config=TurbomindEngineConfig(session_len=8192)) + +image_urls=[ + "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg", + "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg" +] +prompts = [('describe this image', load_image(img_url)) for img_url in image_urls] +response = pipe(prompts) +print(response) +``` diff --git a/docs/zh_cn/index.rst b/docs/zh_cn/index.rst index 3c1c0cc492..10c2187df9 100644 --- a/docs/zh_cn/index.rst +++ b/docs/zh_cn/index.rst @@ -40,6 +40,7 @@ :caption: 推理 inference/pipeline.md + inference/vl_pipeline.md inference/turbomind.md inference/turbomind_config.md inference/pytorch.md diff --git a/docs/zh_cn/inference/vl_pipeline.md b/docs/zh_cn/inference/vl_pipeline.md new file mode 100644 index 0000000000..ef763bfe42 --- /dev/null +++ b/docs/zh_cn/inference/vl_pipeline.md @@ -0,0 +1,145 @@ +# VLM 离线推理 pipeline + +LMDeploy 把视觉-语言模型(VLM)复杂的推理过程,抽象为简单好用的 pipeline。它的用法与大语言模型(LLM)推理 [pipeline](./pipeline.md) 类似。本文将以 [liuhaotian/llava-v1.6-vicuna-7b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b) 模型为例,通过若干示例,展示 VLM pipeline 的强大能力。 +首先,我们会展示 pipeline 最基础的用法,并在此基础上,通过引擎的配置和生成条件配置,逐步引出更多能力,比如模型并行、自定义上下文长度、随机采样等等。然后,针对多图、批量提示词等场景,给出对应的推理示例。 + +## "Hello, world" 示例 + +```python +from lmdeploy import pipeline +from lmdeploy.vl import load_image + +pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b') + +image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg') +response = pipe(('describe this image', image)) +print(response) +``` + +如果在执行这个用例时,出现 `ImportError` 的错误,请按照提示安装相关的依赖包。 + +上面的例子中,推理时的提示词是 (prompt, image) 的 tuple 结构。除了这种结构外,pipeline 支持 openai 格式的提示词: + +```python +from lmdeploy import pipeline + +pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b') + +prompts = [ + { + 'role': 'user', + 'content': [ + {'type': 'text', 'text': 'describe this image'}, + {'type': 'image_url', 'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}} + ] + } +] +response = pipe(prompts) +print(response) +``` + +### 设置多卡并行 + +设置引擎参数 `tp`,可激活多卡并行能力 + +```python +from lmdeploy import pipeline, TurbomindEngineConfig +from lmdeploy.vl import load_image + +pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', + backend_config=TurbomindEngineConfig(tp=2)) + +image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg') +response = pipe(('describe this image', image)) +print(response) +``` + +### 设置上下文长度 + +创建 pipeline 时,通过设置引擎参数 `session_len`,可以定制上下文窗口的最大长度 + +```python +from lmdeploy import pipeline, TurbomindEngineConfig +from lmdeploy.vl import load_image + 
+pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', + backend_config=TurbomindEngineConfig(session_len=8192)) + +image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg') +response = pipe(('describe this image', image)) +print(response) +``` + +### 设置随机采样参数 + +可通过传入 `GenerationConfig` 修改 pipeline 的生成接口中的默认采样参数。 + +```python +from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig +from lmdeploy.vl import load_image + +pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', + backend_config=TurbomindEngineConfig(tp=2, session_len=8192)) +gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.6) +image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg') +response = pipe(('describe this image', image), gen_config=gen_config) +print(response) +``` + +### 设置对话模板 + +推理时,LMDeploy 会根据模型路径匹配内置的对话模板,并把对话模板应用到输入的提示词上。但是,对于类似 [llava-v1.5-7b](https://huggingface.co/liuhaotian/llava-v1.5-7b) 视觉-语言模型,它使用的对话模板是 vicuna,但是这个模板名无法从模型路径中获取,所以需要用户指定。具体方式如下: + +```python +from lmdeploy import pipeline, ChatTemplateConfig +from lmdeploy.vl import load_image +pipe = pipeline('liuhaotian/llava-v1.5-7b', + chat_template_config=ChatTemplateConfig(model_name='vicuna')) + +image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg') +response = pipe(('describe this image', image)) +print(response) +``` + +关于如何自定义对话模版,请参考[这里](../advance/chat_template.md) + +## 多图推理 + +对于多图的场景,在推理时,只要把它们放在一个列表中即可。不过,多图意味着输入 token 数更多,所以通常需要[增大推理的上下文长度](#设置上下文长度) + +```python +from lmdeploy import pipeline, TurbomindEngineConfig +from lmdeploy.vl import load_image + +pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', + backend_config=TurbomindEngineConfig(session_len=8192)) + +image_urls=[ + 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg', + 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg' +] + +images = [load_image(img_url) for img_url in image_urls] +response = pipe(('describe these images', images)) +print(response) +``` + +## 提示词批处理 + +做批量提示词推理非常简单,只要把它们放在一个 list 结构中: + +```python +from lmdeploy import pipeline, TurbomindEngineConfig +from lmdeploy.vl import load_image + +pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', + backend_config=TurbomindEngineConfig(session_len=8192)) + +image_urls=[ + "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg", + "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg" +] +prompts = [('describe this image', load_image(img_url)) for img_url in image_urls] +response = pipe(prompts) +print(response) +``` diff --git a/lmdeploy/api.py b/lmdeploy/api.py index 78607bfd73..50af10b41d 100644 --- a/lmdeploy/api.py +++ b/lmdeploy/api.py @@ -2,7 +2,7 @@ import os from typing import List, Literal, Optional, Union -from .archs import autoget_backend_config +from .archs import autoget_backend_config, get_task from .messages import PytorchEngineConfig, TurbomindEngineConfig from .model import ChatTemplateConfig @@ -39,19 +39,36 @@ def pipeline(model_path: str, log_level(str): set log level whose value among [CRITICAL, ERROR, WARNING, INFO, DEBUG] Examples: + >>> # LLM >>> import lmdeploy >>> pipe = lmdeploy.pipeline('internlm/internlm-chat-7b') >>> response = pipe(['hi','say this is a test']) >>> print(response) + >>> + >>> # VLM + >>> from lmdeploy.vl import load_image + >>> from lmdeploy import pipeline, 
TurbomindEngineConfig, ChatTemplateConfig + >>> pipe = pipeline('liuhaotian/llava-v1.5-7b', + ... backend_config=TurbomindEngineConfig(session_len=8192), + ... chat_template_config=ChatTemplateConfig(model_name='vicuna')) + >>> im = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg') + >>> response = pipe([('describe this image', [im])]) + >>> print(response) """ # noqa E501 - from lmdeploy.serve.async_engine import AsyncEngine if os.getenv('TM_LOG_LEVEL') is None: os.environ['TM_LOG_LEVEL'] = log_level from lmdeploy.utils import get_logger logger = get_logger('lmdeploy') logger.setLevel(log_level) - if type(backend_config) is not PytorchEngineConfig: + pipeline_type, pipeline_class = get_task(model_path) + if pipeline_type == 'vlm': + assert (type(backend_config) is TurbomindEngineConfig) or \ + (backend_config is None), \ + f'{pipeline_type} model only support turbomind backend.' + + if pipeline_type == 'llm' and type( + backend_config) is not PytorchEngineConfig: # set auto backend mode backend_config = autoget_backend_config(model_path, backend_config) backend = 'pytorch' if type( @@ -65,13 +82,14 @@ def pipeline(model_path: str, kwargs.pop('tp') else: tp = 1 if backend_config is None else backend_config.tp - return AsyncEngine(model_path, - model_name=model_name, - backend=backend, - backend_config=backend_config, - chat_template_config=chat_template_config, - tp=tp, - **kwargs) + + return pipeline_class(model_path, + model_name=model_name, + backend=backend, + backend_config=backend_config, + chat_template_config=chat_template_config, + tp=tp, + **kwargs) def serve(model_path: str, diff --git a/lmdeploy/archs.py b/lmdeploy/archs.py index 7a8b7f817e..2b945ba399 100644 --- a/lmdeploy/archs.py +++ b/lmdeploy/archs.py @@ -1,9 +1,15 @@ # Copyright (c) OpenMMLab. All rights reserved. from typing import Literal, Optional, Union +from lmdeploy.serve.async_engine import AsyncEngine +from lmdeploy.serve.vl_async_engine import VLAsyncEngine +from lmdeploy.utils import get_hf_config_content + from .messages import PytorchEngineConfig, TurbomindEngineConfig from .utils import get_logger +SUPPORTED_TASKS = {'llm': AsyncEngine, 'vlm': VLAsyncEngine} + logger = get_logger('lmdeploy') @@ -80,3 +86,23 @@ def autoget_backend_config( if v and hasattr(config, k): setattr(config, k, v) return config + + +def check_vl_llm(config: dict) -> bool: + """check if the model is a vl model from model config.""" + arch = config['architectures'][0] + if arch == 'LlavaLlamaForCausalLM': + return True + elif arch == 'QWenLMHeadModel' and 'visual' in config: + return True + return False + + +def get_task(model_path: str): + """get pipeline type and pipeline class from model config.""" + config = get_hf_config_content(model_path) + if check_vl_llm(config): + return 'vlm', VLAsyncEngine + + # default task, pipeline_class + return 'llm', AsyncEngine diff --git a/lmdeploy/model.py b/lmdeploy/model.py index ec81cd453b..d16366eac6 100644 --- a/lmdeploy/model.py +++ b/lmdeploy/model.py @@ -771,7 +771,8 @@ def match(cls, model_path: str) -> Optional[str]: Args: model_path (str): the model path used for matching. 
""" - if 'yi' in model_path.lower(): + path = model_path.lower() + if 'yi' in path and 'vl' not in path: return 'yi' @@ -868,6 +869,38 @@ def match(cls, model_path: str) -> Optional[str]: return 'deepseek' +@MODELS.register_module(name=['yi-vl']) +class YiVL(BaseChatTemplate): + + def __init__( + self, + meta_instruction="""This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。\n\n""", # noqa: E501 + user='### Human: ', + eoh='\n', + assistant='### Assistant: ', + eoa='\n', + stop_words=['###'], + **kwargs): + super().__init__(meta_instruction=meta_instruction, + user=user, + eoh=eoh, + assistant=assistant, + eoa=eoa, + stop_words=stop_words, + **kwargs) + + @classmethod + def match(cls, model_path: str) -> Optional[str]: + """Return the model_name that was registered to MODELS. + + Args: + model_path (str): the model path used for matching. + """ + path = model_path.lower() + if 'yi-vl' in path: + return 'yi-vl' + + def best_match_model(query: str) -> Optional[str]: """Get the model that matches the query. diff --git a/lmdeploy/serve/async_engine.py b/lmdeploy/serve/async_engine.py index 570baf1f07..dddeed7dd9 100644 --- a/lmdeploy/serve/async_engine.py +++ b/lmdeploy/serve/async_engine.py @@ -437,6 +437,13 @@ async def gather(): proc.join() + async def _get_prompt_input(self, prompt: str, do_preprocess: bool, + sequence_start: bool): + if do_preprocess: + prompt = self.chat_template.messages2prompt(prompt, sequence_start) + input_ids = self.tokenizer.encode(prompt, add_bos=sequence_start) + return {'prompt': prompt, 'input_ids': input_ids} + async def generate( self, messages, @@ -478,10 +485,12 @@ async def generate( if gen_config.random_seed is None and sequence_start: gen_config.random_seed = random.getrandbits(64) prompt = messages - if do_preprocess: - prompt = self.chat_template.messages2prompt(prompt, sequence_start) + + prompt_input = await self._get_prompt_input(prompt, do_preprocess, + sequence_start) + prompt = prompt_input['prompt'] logger.info(f'Prompt with applied chat template:\n{prompt}') - input_ids = self.tokenizer.encode(prompt, add_bos=sequence_start) + input_ids = prompt_input['input_ids'] if gen_config.max_new_tokens is None: # for interactive endpoint, will try maximum possible token num gen_config.max_new_tokens = max( @@ -501,7 +510,7 @@ async def generate( state = DetokenizeState() async for outputs in generator.async_stream_infer( session_id=session_id, - input_ids=input_ids, + **prompt_input, gen_config=gen_config, stream_output=stream_response, sequence_start=(sequence_start), diff --git a/lmdeploy/serve/vl_async_engine.py b/lmdeploy/serve/vl_async_engine.py new file mode 100644 index 0000000000..26af134777 --- /dev/null +++ b/lmdeploy/serve/vl_async_engine.py @@ -0,0 +1,95 @@ +# Copyright (c) OpenMMLab. All rights reserved. 
+from typing import Dict, List, Union + +import numpy as np + +from lmdeploy.serve.async_engine import AsyncEngine +from lmdeploy.vl.constants import IMAGE_DUMMY_TOKEN_INDEX, IMAGE_TOKEN +from lmdeploy.vl.engine import ImageEncoder +from lmdeploy.vl.templates import VLPromptType, get_vl_prompt_template + + +class VLAsyncEngine(AsyncEngine): + """Visual Language Async inference engine.""" + + def __init__(self, model_path: str, **kwargs) -> None: + super().__init__(model_path, **kwargs) + self.vl_encoder = ImageEncoder(model_path) + self.vl_prompt_template = get_vl_prompt_template( + model_path, self.chat_template, self.model_name) + + def _convert_prompts(self, + prompts: Union[VLPromptType, List[Dict], + List[VLPromptType], List[List[Dict]]]): + """convert prompts to openai format.""" + if isinstance(prompts, str) or isinstance(prompts, tuple): + _prompts = self.vl_prompt_template.prompt_to_messages(prompts) + elif isinstance(prompts[0], tuple) or isinstance(prompts[0], str): + _prompts = [ + self.vl_prompt_template.prompt_to_messages(x) for x in prompts + ] + else: + _prompts = prompts + return _prompts + + async def _get_prompt_input(self, prompt: Dict, do_preprocess: bool, + sequence_start: bool): + """get input_ids, embeddings and offsets.""" + if do_preprocess: + decorated = self.vl_prompt_template.messages2prompt( + prompt, sequence_start) + else: + decorated = prompt + segs = decorated.split(IMAGE_TOKEN) + + results = {} + input_ids = [] + if len(segs) > 1: + images = await self.vl_prompt_template.async_collect_pil_images( + prompt) + features = await self.vl_encoder.async_infer(images) + features = [x.cpu().numpy() for x in features] + input_ids = [] + begins = [] + ends = [] + for i, seg in enumerate(segs): + if i > 0: + image_dim = features[i - 1].shape[0] + begins.append(len(input_ids)) + ends.append(begins[-1] + image_dim) + input_ids.extend([IMAGE_DUMMY_TOKEN_INDEX] * image_dim) + seg_ids = self.tokenizer.encode(seg, + add_bos=((i == 0) + and sequence_start)) + input_ids.extend(seg_ids) + ranges = np.stack([begins, ends], axis=1).tolist() + results['input_embeddings'] = features + results['input_embedding_ranges'] = ranges + else: + input_ids = self.tokenizer.encode(decorated, + add_bos=sequence_start) + + results['input_ids'] = input_ids + results['prompt'] = decorated + return results + + def batch_infer(self, prompts: Union[VLPromptType, List[Dict], + List[VLPromptType], List[List[Dict]]], + **kwargs): + """Inference a batch of prompts.""" + prompts = self._convert_prompts(prompts) + return super().batch_infer(prompts, **kwargs) + + def stream_infer(self, prompts: Union[VLPromptType, List[Dict], + List[VLPromptType], + List[List[Dict]]], **kwargs): + """Inference a batch of prompts with stream mode.""" + prompts = self._convert_prompts(prompts) + return super().stream_infer(prompts, **kwargs) + + def __call__(self, prompts: Union[VLPromptType, List[Dict], + List[VLPromptType], List[List[Dict]]], + **kwargs): + """Inference a batch of prompts.""" + prompts = self._convert_prompts(prompts) + return super().__call__(prompts, **kwargs) diff --git a/lmdeploy/tokenizer.py b/lmdeploy/tokenizer.py index 0a3b32a392..71a543c460 100644 --- a/lmdeploy/tokenizer.py +++ b/lmdeploy/tokenizer.py @@ -290,10 +290,14 @@ def indexes_containing_token(self, token: str): if self.token2id == {}: # decode is slower than convert_ids_to_tokens if self.maybe_decode_bytes: - self.token2id = { - self.model.decode(i): i - for i in range(self.vocab_size) - } + try: + self.token2id = { + 
self.model.decode(i): i + for i in range(self.vocab_size) + } + except Exception as e: + # qwen-vl + assert str(e) == 'Unclosed image token' else: self.token2id = { self.model.convert_ids_to_tokens(i): i @@ -303,7 +307,9 @@ def indexes_containing_token(self, token: str): token = '▁' indexes = [i for _token, i in self.token2id.items() if token in _token] if len(indexes) > self.max_indexes_num: - indexes = self.encode(token, add_bos=False)[-1:] + # multiple id decode to same token + indexes = [i for i in indexes if self.decode([i]) == token] + indexes = indexes[:self.max_indexes_num] self.logger.warning( f'There are too many(>{self.max_indexes_num}) possible ' f'indexes may decoding {token}, we will use {indexes} only') diff --git a/lmdeploy/turbomind/utils.py b/lmdeploy/turbomind/utils.py index d391272542..bf55490310 100644 --- a/lmdeploy/turbomind/utils.py +++ b/lmdeploy/turbomind/utils.py @@ -1,8 +1,6 @@ # Copyright (c) OpenMMLab. All rights reserved. -import json import os -from huggingface_hub import hf_hub_download from transformers.utils import ExplicitEnum from lmdeploy.utils import get_logger @@ -16,42 +14,6 @@ class ModelSource(ExplicitEnum): HF_MODEL = 'hf_model' -def create_hf_download_args(**kwargs) -> dict: - download_kwargs = { - 'revision': None, - 'cache_dir': None, - 'proxies': None, - 'resume_download': True, - 'force_download': False, - 'token': None, - 'local_files_only': False - } - for k in download_kwargs.keys(): - if k in kwargs: - download_kwargs[k] = kwargs[k] - return download_kwargs - - -def get_hf_config_path(pretrained_model_name_or_path, **kwargs) -> str: - """Get local hf config local file path.""" - if os.path.exists(pretrained_model_name_or_path): - config_path = os.path.join(pretrained_model_name_or_path, - 'config.json') - else: - download_kwargs = create_hf_download_args(**kwargs) - config_path = hf_hub_download(pretrained_model_name_or_path, - 'config.json', **download_kwargs) - return config_path - - -def get_hf_config_content(pretrained_model_name_or_path, **kwargs) -> dict: - """Get config content of a hf model.""" - config_path = get_hf_config_path(pretrained_model_name_or_path, **kwargs) - with open(config_path, 'r') as f: - config = json.load(f) - return config - - def get_model_source(pretrained_model_name_or_path: str, **kwargs) -> ModelSource: """Get model source.""" @@ -62,24 +24,6 @@ def get_model_source(pretrained_model_name_or_path: str, return ModelSource.HF_MODEL -def check_tm_model_input(pretrained_model_name_or_path, **kwargs): - """Check if single input pretrained_model_name_or_path is enough to use.""" - if kwargs.get('model_name', None): - return - - model_source = get_model_source(pretrained_model_name_or_path, **kwargs) - if model_source == ModelSource.WORKSPACE: - return - - config = get_hf_config_content(pretrained_model_name_or_path, **kwargs) - if 'turbomind' in config and config['turbomind']['model_name'] != '': - return - - assert (0), '\nCan not get model name from input model, '\ - 'please supply model name with arg --model-name,' \ - 'you can list supported models by `lmdeploy list`' - - def get_model_from_config(model_dir: str): import json config_file = os.path.join(model_dir, 'config.json') @@ -91,6 +35,7 @@ def get_model_from_config(model_dir: str): config = json.load(f) ARCH_MAP = { + 'LlavaLlamaForCausalLM': default, 'LlamaForCausalLM': default, 'InternLM2ForCausalLM': 'internlm2', 'InternLMForCausalLM': default, diff --git a/lmdeploy/utils.py b/lmdeploy/utils.py index 72823162d6..8779de4511 100644 --- 
a/lmdeploy/utils.py +++ b/lmdeploy/utils.py @@ -1,13 +1,17 @@ # Copyright (c) OpenMMLab. All rights reserved. import asyncio import functools +import json import logging +import os import sys import time from contextlib import contextmanager from logging import Logger, LogRecord from typing import List, Optional +from huggingface_hub import hf_hub_download + logger_initialized = {} @@ -175,6 +179,20 @@ def _stop_words(stop_words: List[str], tokenizer: object): return stop_words +def get_hf_config_content(pretrained_model_name_or_path: str, + **kwargs) -> dict: + """Get config content of a hf model.""" + if os.path.exists(pretrained_model_name_or_path): + config_path = os.path.join(pretrained_model_name_or_path, + 'config.json') + else: + config_path = hf_hub_download(pretrained_model_name_or_path, + 'config.json') + with open(config_path, 'r') as f: + config = json.load(f) + return config + + def get_model(pretrained_model_name_or_path: str, download_dir: str = None, revision: str = None): diff --git a/lmdeploy/vl/__init__.py b/lmdeploy/vl/__init__.py new file mode 100644 index 0000000000..d7f494ccff --- /dev/null +++ b/lmdeploy/vl/__init__.py @@ -0,0 +1,4 @@ +# Copyright (c) OpenMMLab. All rights reserved. +from .utils import load_image + +__all__ = ['load_image'] diff --git a/lmdeploy/vl/constants.py b/lmdeploy/vl/constants.py new file mode 100644 index 0000000000..eeaa697cdf --- /dev/null +++ b/lmdeploy/vl/constants.py @@ -0,0 +1,3 @@ +# Copyright (c) OpenMMLab. All rights reserved. +IMAGE_DUMMY_TOKEN_INDEX = 0 +IMAGE_TOKEN = '' diff --git a/lmdeploy/vl/engine.py b/lmdeploy/vl/engine.py new file mode 100644 index 0000000000..45084b11b7 --- /dev/null +++ b/lmdeploy/vl/engine.py @@ -0,0 +1,122 @@ +# Copyright (c) OpenMMLab. All rights reserved. +import asyncio +import queue +import time +from threading import Thread +from typing import List, Union + +from PIL.Image import Image + +from lmdeploy.utils import get_logger +from lmdeploy.vl.model.builder import load_vl_model + +logger = get_logger('lmdeploy') + + +class Record: + """Batching manager.""" + + def __init__(self): + self.number = [] + self.waiting = [] + self.done = [] + self.res_que = [] + self.total = 0 + + def enqueue(self, images: List[Image], que: Union[queue.Queue, + asyncio.Queue]): + """add ith request to manager.""" + self.number.append(len(images)) + self.waiting.extend(images) + self.res_que.append(que) + self.total += len(images) + self.log('received', len(images)) + + def dequeue(self, max_batch_size): + """try to dequeue max batch size images.""" + inputs = self.waiting[:max_batch_size] + self.waiting = self.waiting[max_batch_size:] + self.total -= len(inputs) + self.log('process', len(inputs)) + return inputs + + def nofify(self): + """set result if request i is finished.""" + if len(self.number) == 0 or self.number[0] > len(self.done): + return False + num_images = self.number.pop(0) + outputs = self.done[:num_images] + self.done = self.done[num_images:] + que = self.res_que.pop(0) + if isinstance(que, queue.Queue): + que.put(outputs) + else: + que._loop.call_soon_threadsafe(que.put_nowait, outputs) + self.log('done', num_images) + return True + + def log(self, task: str, num: int): + logger.info(f'ImageEncoder {task} {num} images, ' + f'left {self.total} images.') + + +class ImageEncoder: + """Image encoder.""" + + def __init__(self, model_path: str, max_batch_size: int = 16): + self.model = load_vl_model(model_path) + self.max_batch_size = max_batch_size + self.loop = asyncio.new_event_loop() + self.work_thread = 
self._start_work_thread() + + def _start_work_thread(self): + """internal thread.""" + + def _work_thread(): + asyncio.set_event_loop(self.loop) + self.que = asyncio.Queue() + self.loop.run_until_complete(self._forward_loop()) + + thread = Thread(target=_work_thread, daemon=True) + thread.start() + return thread + + async def _forward_loop(self): + """working loop to process images.""" + logger.info('start ImageEncoder._forward_loop') + record = Record() + while True: + while record.total == 0 or (self.que.qsize() and + record.total < self.max_batch_size): + item = await self.que.get() + record.enqueue(item[0], item[1]) + inputs = record.dequeue(self.max_batch_size) + outputs = self.forward(inputs) + record.done.extend(outputs) + while record.nofify(): + pass + + def forward(self, inputs: List[Image]): + """Model forward.""" + time_start = time.perf_counter() + outputs = self.model.forward(inputs) + time_end = time.perf_counter() + logger.info(f'ImageEncoder forward {len(inputs)} images, ' + f'cost {time_end - time_start:.3f}s') + return outputs + + def infer(self, inputs: List[Image]): + """infer.""" + outputs = queue.Queue() + item = (inputs, outputs) + self.loop.call_soon_threadsafe(self.que.put_nowait, item) + results = outputs.get() + return results + + async def async_infer(self, inputs: List[Image]): + """async infer.""" + outputs = asyncio.Queue() + item = (inputs, outputs) + self.loop.call_soon_threadsafe(self.que.put_nowait, item) + results = await outputs.get() + return results diff --git a/lmdeploy/vl/model/__init__.py b/lmdeploy/vl/model/__init__.py new file mode 100644 index 0000000000..ef101fec61 --- /dev/null +++ b/lmdeploy/vl/model/__init__.py @@ -0,0 +1 @@ +# Copyright (c) OpenMMLab. All rights reserved. diff --git a/lmdeploy/vl/model/base.py b/lmdeploy/vl/model/base.py new file mode 100644 index 0000000000..a0815044c1 --- /dev/null +++ b/lmdeploy/vl/model/base.py @@ -0,0 +1,22 @@ +# Copyright (c) OpenMMLab. All rights reserved. +from abc import ABC, abstractmethod +from typing import List + +import PIL +import torch + + +class VisonModel(ABC): + """Visual model which extract image feature.""" + + @abstractmethod + def forward(self, images: List[PIL.Image.Image]) -> List[torch.Tensor]: + """extract image feature. + + Args: + images (List[PIL.Image.Image]): input images + + Return: + List[torch.Tensor]: extract image feature for each input image + """ + raise NotImplementedError() diff --git a/lmdeploy/vl/model/builder.py b/lmdeploy/vl/model/builder.py new file mode 100644 index 0000000000..5772621d15 --- /dev/null +++ b/lmdeploy/vl/model/builder.py @@ -0,0 +1,25 @@ +# Copyright (c) OpenMMLab. All rights reserved. 
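+# Choose the vision encoder from the HF config: Qwen-VL is recognized by its
+# QWenLMHeadModel architecture, while LlavaLlamaForCausalLM checkpoints are routed
+# to the Yi-VL or llava encoder depending on mm_projector_type.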
+import os + +from lmdeploy.utils import get_hf_config_content, get_model + +from .llava import LlavaVisionModel +from .qwen import QwenVisionModel +from .yi import YiVisionModel + + +def load_vl_model(model_path: str): + """load visual model.""" + if not os.path.exists(model_path): + model_path = get_model(model_path) + config = get_hf_config_content(model_path) + arch = config['architectures'][0] + if arch == 'QWenLMHeadModel': + return QwenVisionModel(model_path) + elif arch == 'LlavaLlamaForCausalLM': + projector_type = config.get('mm_projector_type', 'linear') + if '_Norm' in projector_type: + return YiVisionModel(model_path) + else: + return LlavaVisionModel(model_path) + raise ValueError(f'unsupported vl model with arch {arch}') diff --git a/lmdeploy/vl/model/llava.py b/lmdeploy/vl/model/llava.py new file mode 100644 index 0000000000..c7918916e9 --- /dev/null +++ b/lmdeploy/vl/model/llava.py @@ -0,0 +1,169 @@ +# Copyright (c) OpenMMLab. All rights reserved. +# Modified from +# https://github.com/haotian-liu/LLaVA.git + +import warnings +from typing import List, Union + +import torch +from PIL.Image import Image + +from lmdeploy.utils import get_logger +from lmdeploy.vl.model.base import VisonModel +from lmdeploy.vl.model.utils import load_model_from_weight_files + +logger = get_logger('lmdeploy') + + +def check_llava_install(): + """check llava install.""" + try: + import llava # noqa: F401 + except ImportError: + raise ImportError( + 'To use LlavaVLModel, please install llava by ' + 'pip install git+https://github.com/haotian-liu/LLaVA.git') + + +class LlavaVisionModel(VisonModel): + """Llava visual model.""" + + def __init__(self, model_path, device='cuda'): + self.model_path = model_path + self.device = device + self.build_model() + + def build_model(self): + """build model & load weights.""" + # check llava install + check_llava_install() + + # currently, only support llava llama + from llava.model.language_model.llava_llama import ( + LlavaConfig, LlavaLlamaForCausalLM) + self.config = LlavaConfig.from_pretrained(self.model_path) + assert self.config.model_type in ['llava', 'llava_llama'], \ + 'currently, only support llava llama' + + # empty init + with torch.device('meta'), warnings.catch_warnings(): + warnings.simplefilter('ignore') + model = LlavaLlamaForCausalLM.from_pretrained(self.model_path) + del model.lm_head + del model.model.embed_tokens + del model.model.layers + del model.model.norm + + # # load weight + with torch.device(self.device): + model.to_empty(device=self.device) + vision_tower = model.get_vision_tower() + vision_tower.is_loaded = False + vision_tower.load_model() + load_model_from_weight_files(model, self.model_path) + model.eval().half() + + self.model = model.model + self.vision_tower = model.model.vision_tower + self.mm_projector = model.model.mm_projector + + def encode_images(self, images: torch.Tensor) -> torch.Tensor: + """encode images.""" + image_features = self.vision_tower(images) + image_features = self.mm_projector(image_features) + return image_features + + def preprocess( + self, + images: List[Image]) -> Union[torch.Tensor, List[torch.Tensor]]: + """preprocess.""" + # TODO: gpu processor + from llava.mm_utils import process_images + images = [x.convert('RGB') for x in images] + image_processor = self.vision_tower.image_processor + outputs = process_images(images, image_processor, self.config) + return outputs + + @torch.no_grad() + def forward(self, images: List[Image]) -> List[torch.Tensor]: + """forward.""" + from llava.model.llava_arch 
import (get_anyres_image_grid_shape, + unpad_image) + image_sizes = [x.size for x in images] + images = self.preprocess(images) + if isinstance(images, list): + images = [x.to(self.device, dtype=torch.float16) for x in images] + else: + images = images.to(self.device, dtype=torch.float16) + if type(images) is list or images.ndim == 5: + if type(images) is list: + images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images] + concat_images = torch.cat([image for image in images], dim=0) + image_features = self.encode_images(concat_images) + split_sizes = [image.shape[0] for image in images] + image_features = torch.split(image_features, split_sizes, dim=0) + mm_patch_merge_type = getattr(self.config, 'mm_patch_merge_type', + 'flat') + image_aspect_ratio = getattr(self.config, 'image_aspect_ratio', + 'square') + if mm_patch_merge_type == 'flat': + image_features = [x.flatten(0, 1) for x in image_features] + elif mm_patch_merge_type.startswith('spatial'): + new_image_features = [] + for image_idx, image_feature in enumerate(image_features): + if image_feature.shape[0] > 1: + base_image_feature = image_feature[0] + image_feature = image_feature[1:] + height = width = self.vision_tower.num_patches_per_side + assert height * width == base_image_feature.shape[0] + if image_aspect_ratio == 'anyres': + num_patch_width, num_patch_height = \ + get_anyres_image_grid_shape( + image_sizes[image_idx], + self.config.image_grid_pinpoints, + self.vision_tower.config.image_size) + image_feature = image_feature.view( + num_patch_height, num_patch_width, height, + width, -1) + else: + raise NotImplementedError + if 'unpad' in mm_patch_merge_type: + image_feature = image_feature.permute( + 4, 0, 2, 1, 3).contiguous() + image_feature = image_feature.flatten(1, + 2).flatten( + 2, 3) + image_feature = unpad_image( + image_feature, image_sizes[image_idx]) + image_feature = torch.cat(( + image_feature, + self.model.image_newline[:, None, None].expand( + *image_feature.shape[:-1], 1).to( + image_feature.device)), + dim=-1) + image_feature = image_feature.flatten(1, + 2).transpose( + 0, 1) + else: + image_feature = image_feature.permute( + 0, 2, 1, 3, 4).contiguous() + image_feature = image_feature.flatten(0, 3) + image_feature = torch.cat( + (base_image_feature, image_feature), dim=0) + else: + image_feature = image_feature[0] + if 'unpad' in mm_patch_merge_type: + image_feature = torch.cat( + (image_feature, + self.model.image_newline[None].to( + image_feature.device)), + dim=0) + new_image_features.append(image_feature) + image_features = new_image_features + else: + raise ValueError('Unexpected mm_patch_merge_type: ' + f'{self.config.mm_patch_merge_type}') + else: + image_features = self.encode_images(images) + image_features = [x for x in image_features] + return image_features diff --git a/lmdeploy/vl/model/qwen.py b/lmdeploy/vl/model/qwen.py new file mode 100644 index 0000000000..134086366c --- /dev/null +++ b/lmdeploy/vl/model/qwen.py @@ -0,0 +1,48 @@ +# Copyright (c) OpenMMLab. All rights reserved. 
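+# Only the visual sub-module of Qwen-VL is kept: the language-model head and
+# transformer blocks are dropped before the remaining weights are loaded from the
+# checkpoint shards.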
+ +from typing import List + +import torch +from accelerate import init_empty_weights +from PIL.Image import Image +from transformers import AutoConfig, AutoModelForCausalLM + +from lmdeploy.vl.model.base import VisonModel +from lmdeploy.vl.model.utils import load_model_from_weight_files + + +class QwenVisionModel(VisonModel): + """Qwen vision model.""" + + def __init__(self, model_path, device='cuda'): + self.model_path = model_path + self.device = device + self.build_model() + + def build_model(self): + with init_empty_weights(): + config = AutoConfig.from_pretrained(self.model_path, + trust_remote_code=True) + model = AutoModelForCausalLM.from_config(config, + trust_remote_code=True) + del model.lm_head + for key in ['wte', 'h', 'ln_f']: + setattr(model.transformer, key, None) + + with torch.device(self.device): + model.to_empty(device=self.device) + load_model_from_weight_files(model, self.model_path) + + self.model = model.transformer.visual + self.model.eval().half() + + @torch.no_grad() + def forward(self, images: List[Image]) -> List[torch.Tensor]: + """forward.""" + outputs = [x.convert('RGB') for x in images] + outputs = [self.model.image_transform(x) for x in outputs] + outputs = torch.stack(outputs, dim=0) + outputs = self.model(outputs) + outputs = torch.split(outputs, 1, dim=0) + outputs = [x.squeeze() for x in outputs] + return outputs diff --git a/lmdeploy/vl/model/utils.py b/lmdeploy/vl/model/utils.py new file mode 100644 index 0000000000..cebe5a8205 --- /dev/null +++ b/lmdeploy/vl/model/utils.py @@ -0,0 +1,45 @@ +# Copyright (c) OpenMMLab. All rights reserved. +import os +from typing import Dict, List + +import torch +import torch.nn as nn +from safetensors.torch import load_file +from transformers.utils import SAFE_WEIGHTS_INDEX_NAME, WEIGHTS_INDEX_NAME +from transformers.utils.hub import get_checkpoint_shard_files + + +def load_weight_ckpt(ckpt: str) -> Dict[str, torch.Tensor]: + """load checkpoint.""" + if ckpt.endswith('.safetensors'): + return load_file(ckpt) + else: + return torch.load(ckpt) + + +def get_used_weight_files(folder: str, + state_dict: Dict[str, torch.Tensor]) -> List[str]: + """get used checkpoint which contains keys in state_dict.""" + _index_file = os.path.join(folder, WEIGHTS_INDEX_NAME) + _safe_index_file = os.path.join(folder, SAFE_WEIGHTS_INDEX_NAME) + if os.path.exists(_index_file): + index_file = _index_file + elif os.path.exists(_safe_index_file): + index_file = _safe_index_file + else: + raise FileNotFoundError + _, sharded_metadata = get_checkpoint_shard_files(folder, index_file) + potential_keys = set(state_dict.keys()) + supplied_keys = set(sharded_metadata['weight_map'].keys()) + shared_keys = potential_keys & supplied_keys + valid_files = set(sharded_metadata['weight_map'][k] for k in shared_keys) + return valid_files + + +def load_model_from_weight_files(model: nn.Module, folder: str) -> None: + """load nn.Module weight from folder.""" + valid_files = get_used_weight_files(folder, model.state_dict()) + for file_name in valid_files: + ckpt = os.path.join(folder, file_name) + state_dict = load_weight_ckpt(ckpt) + model.load_state_dict(state_dict, strict=False) diff --git a/lmdeploy/vl/model/yi.py b/lmdeploy/vl/model/yi.py new file mode 100644 index 0000000000..ca3bfbafa1 --- /dev/null +++ b/lmdeploy/vl/model/yi.py @@ -0,0 +1,126 @@ +# Copyright (c) OpenMMLab. All rights reserved. 
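+# Yi-VL reuses the llava vision model, but temporarily swaps in its own projector
+# and vision tower builders (see init_yi_model below) while the model is built.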
+import os +from contextlib import contextmanager +from typing import MutableSequence + +import torch.nn as nn + +from lmdeploy.vl.model.llava import LlavaVisionModel, check_llava_install + +_model_path = None + + +def _build_vision_projector(config, delay_load=False, **kwargs): + """build yi projector.""" + # copy from https://github.com/01-ai/Yi/blob/main/VL/llava/model/multimodal_projector/builder.py # noqa: E501 + projector_type = getattr(config, 'mm_projector_type', 'linear') + + if projector_type == 'linear': + return nn.Linear(config.mm_hidden_size, config.hidden_size) + + import re + use_norm = False + if '_Norm' in projector_type: + use_norm = True + projector_type = projector_type.replace('_Norm', '') + mlp_gelu_match = re.match(r'^mlp(\d+)x_gelu$', projector_type) + if mlp_gelu_match: + mlp_depth = int(mlp_gelu_match.group(1)) + if use_norm: + modules = [ + nn.Linear(config.mm_hidden_size, config.hidden_size), + nn.LayerNorm(config.hidden_size), + ] + else: + modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)] + for _ in range(1, mlp_depth): + modules.append(nn.GELU()) + if use_norm: + modules.append( + nn.Linear(config.hidden_size, config.hidden_size)) + modules.append(nn.LayerNorm(config.hidden_size)) + else: + modules.append( + nn.Linear(config.hidden_size, config.hidden_size)) + return nn.Sequential(*modules) + + if projector_type == 'identity': + return nn.Identity() + + raise ValueError(f'Unknown projector type: {projector_type}') + + +def _build_vision_tower(vision_tower_cfg, **kwargs): + """build yi vision tower.""" + cfg = vision_tower_cfg + vision_tower = getattr(cfg, 'mm_vision_tower', + getattr(cfg, 'vision_tower', None)) + if os.path.exists(os.path.join(_model_path, vision_tower)): + vision_tower = os.path.join(_model_path, vision_tower) + + from llava.model.multimodal_encoder.clip_encoder import CLIPVisionTower + is_absolute_path_exists = os.path.exists(vision_tower) + if is_absolute_path_exists or vision_tower.startswith( + 'openai') or vision_tower.startswith( + 'laion') or 'ShareGPT4V' in vision_tower: + return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs) + + raise ValueError(f'Unknown vision tower: {vision_tower}') + + +def _set_function(old_func, new_func): + import gc + refs = gc.get_referrers(old_func) + obj_id = id(old_func) + for ref in refs: + if isinstance(ref, dict): + for x, y in ref.items(): + if id(y) == obj_id: + ref[x] = new_func + elif isinstance(ref, MutableSequence): + for i, v in enumerate(ref): + if id(v) == obj_id: + ref[i] = new_func + + +@contextmanager +def init_yi_model(): + import llava # noqa: F401 + old_projector = eval( + 'llava.model.multimodal_projector.builder.build_vision_projector') + _set_function(old_projector, _build_vision_projector) + old_vision_tower = eval( + 'llava.model.multimodal_encoder.builder.build_vision_tower') + _set_function(old_vision_tower, _build_vision_tower) + yield + _set_function(_build_vision_projector, old_projector) + _set_function(_build_vision_tower, old_vision_tower) + + +@contextmanager +def disable_transformers_logging(): + import transformers + from transformers.utils import logging + previous_level = logging.get_verbosity() + logging.set_verbosity(transformers.logging.ERROR) + yield + logging.set_verbosity(previous_level) + + +class YiVisionModel(LlavaVisionModel): + """Yi visual model.""" + + def __init__(self, model_path, device='cuda'): + self.model_path = model_path + self.device = device + self.build_model() + + def build_model(self): + """build model & load 
weights.""" + check_llava_install() + + global _model_path + _model_path = self.model_path + + with init_yi_model(), disable_transformers_logging(): + super().build_model() diff --git a/lmdeploy/vl/templates.py b/lmdeploy/vl/templates.py new file mode 100644 index 0000000000..976845d1aa --- /dev/null +++ b/lmdeploy/vl/templates.py @@ -0,0 +1,143 @@ +# Copyright (c) OpenMMLab. All rights reserved. +import asyncio +from typing import Dict, List, Tuple, Union + +import PIL + +from lmdeploy.model import BaseModel +from lmdeploy.utils import get_hf_config_content +from lmdeploy.vl.constants import IMAGE_TOKEN +from lmdeploy.vl.utils import encode_image_base64, load_image + +VLPromptType = Union[str, Tuple[str, PIL.Image.Image], + Tuple[str, List[PIL.Image.Image]]] + + +class VLChatTemplateWrapper: + """vl chat template wrapper.""" + + def __init__(self, chat_template: BaseModel): + self.chat_template = chat_template + + def prompt_to_messages(self, prompt: VLPromptType): + """convert prompt to GTP4V format.""" + messages = { + 'role': 'user', + 'content': [{ + 'type': 'text', + 'text': '', + }] + } + if isinstance(prompt, str): + messages['content'][0]['text'] = prompt + else: + prompt, images = prompt + if not isinstance(images, list): + images = [images] + messages['content'][0]['text'] = prompt + for image in images: + if isinstance(image, str): + image = load_image(image) + image_base64_data = encode_image_base64(image) + item = { + 'type': 'image_url', + 'image_url': { + 'url': f'data:image/jpeg;base64,{image_base64_data}' + } + } + messages['content'].append(item) + + return [messages] + + async def async_collect_pil_images( + self, messages: Dict) -> List[PIL.Image.Image]: + """collect image from messages.""" + images = [] + for message in messages: + role = message['role'] + content = message['content'] + if role != 'user' or isinstance(content, str): + continue + for item in content: + if item['type'] != 'image_url': + continue + url = item['image_url']['url'] + images.append(url) + + def _inner_call(i, images): + url = images[i] + images[i] = load_image(url) + + await asyncio.gather(*[ + asyncio.get_event_loop().run_in_executor( + None, _inner_call, i, images) for i in range(len(images)) + ]) + + return images + + def append_image_token(self, prompt, num_images: int): + """append image token to user prompt.""" + return IMAGE_TOKEN * num_images + '\n' + prompt + + def convert_messages(self, messages, sequence_start=True): + """convert GPT4V message format to GPT4 text format.""" + new_messages = [] + for message in messages: + role = message['role'] + content = message['content'] + if role != 'user' or isinstance(content, str): + new_messages.append(message) + num_images = 0 + for item in content: + if item['type'] == 'image_url': + num_images += 1 + elif item['type'] == 'text': + prompt = item['text'] + new_item = { + 'role': 'user', + 'content': self.append_image_token(prompt, num_images) + } + new_messages.append(new_item) + return new_messages + + def messages2prompt(self, messages, sequence_start=True) -> str: + """convert messages to decorated prompt.""" + new_messages = self.convert_messages(messages, sequence_start) + return self.chat_template.messages2prompt(new_messages, sequence_start) + + +class LlavaVLChatTemplateWrapper(VLChatTemplateWrapper): + """Llava vl chat template.""" + pass + + +class YiVLChatTemplateWrapper(VLChatTemplateWrapper): + """Yi vl chat template.""" + pass + + +class QwenVLChatTemplateWrapper(VLChatTemplateWrapper): + """Qwen vl chat template.""" + + 
def append_image_token(self, prompt, num_images: int): + """append image tokens to user prompt.""" + res = '' + for i in range(num_images): + res += f'Picture {str(i)}:{IMAGE_TOKEN}\n' + res = res + prompt + return res + + +def get_vl_prompt_template(model_path: str, chat_template: BaseModel, + model_name: str) -> VLChatTemplateWrapper: + """get vision language prompt template.""" + if model_name == 'yi-vl': + return YiVLChatTemplateWrapper(chat_template) + + config = get_hf_config_content(model_path) + arch = config['architectures'][0] + if arch == 'QWenLMHeadModel': + return QwenVLChatTemplateWrapper(chat_template) + elif arch == 'LlavaLlamaForCausalLM': + return LlavaVLChatTemplateWrapper(chat_template) + raise ValueError(f'unsupported vl_prompt_template with arch {arch}') diff --git a/lmdeploy/vl/utils.py b/lmdeploy/vl/utils.py new file mode 100644 index 0000000000..927cd8cb0a --- /dev/null +++ b/lmdeploy/vl/utils.py @@ -0,0 +1,41 @@ +# Copyright (c) OpenMMLab. All rights reserved. +import base64 +from io import BytesIO +from typing import Union + +import requests +from PIL import Image + + +def encode_image_base64(image: Image.Image) -> str: + """encode image to base64 format.""" + buffered = BytesIO() + image.save(buffered, format='PNG') + return base64.b64encode(buffered.getvalue()).decode('utf-8') + + +def load_image_from_base64(image: Union[bytes, str]) -> Image.Image: + """load image from base64 format.""" + return Image.open(BytesIO(base64.b64decode(image))) + + +def load_image(image_url: str) -> Image.Image: + """load image from url, local path or openai GPT4V.""" + + headers = { + 'User-Agent': + 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' + '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3' + } + if image_url.startswith('http'): + response = requests.get(image_url, headers=headers) + response.raise_for_status() + + # Open the image using PIL + img = Image.open(BytesIO(response.content)) + elif image_url.startswith('data:image'): + img = load_image_from_base64(image_url.split(',')[1]) + else: + img = Image.open(image_url) + + return img diff --git a/requirements/runtime.txt b/requirements/runtime.txt index c40fb11f35..e119fe8593 100644 --- a/requirements/runtime.txt +++ b/requirements/runtime.txt @@ -3,6 +3,7 @@ fire mmengine-lite numpy peft<=0.9.0 +pillow pydantic>2.0.0 pynvml safetensors diff --git a/tests/test_lmdeploy/test_tokenizer.py b/tests/test_lmdeploy/test_tokenizer.py index 341b4b3ae2..55b0d70a85 100644 --- a/tests/test_lmdeploy/test_tokenizer.py +++ b/tests/test_lmdeploy/test_tokenizer.py @@ -46,3 +46,13 @@ def test_tokenizer_with_stop_words(model_path, stop_words): tokenizer = HuggingFaceTokenizer(model_path) indexes = tokenizer.indexes_containing_token(stop_words) assert indexes is not None + + +def test_qwen_vl_decode_special(): + from lmdeploy.tokenizer import Tokenizer + tok = Tokenizer('Qwen/Qwen-VL-Chat') + try: + tok.decode([151857]) + assert (0) + except Exception as e: + assert str(e) == 'Unclosed image token' diff --git a/tests/test_lmdeploy/test_vl_template.py b/tests/test_lmdeploy/test_vl_template.py new file mode 100644 index 0000000000..9c5e80f952 --- /dev/null +++ b/tests/test_lmdeploy/test_vl_template.py @@ -0,0 +1,36 @@ +import PIL + +from lmdeploy.model import MODELS +from lmdeploy.vl.constants import IMAGE_TOKEN +from lmdeploy.vl.templates import VLChatTemplateWrapper + + +def test_prompt_to_messages(): + model = MODELS.get('vicuna')() + templtae = VLChatTemplateWrapper(model) + out = templtae.prompt_to_messages('hi') + 
assert isinstance(out, list) and isinstance(out[0], dict) + im = PIL.Image.new(mode='RGB', size=(200, 200)) + out = templtae.prompt_to_messages(('hi', [im])) + assert isinstance(out, list) and isinstance(out[0], dict) + + +def test_messages2prompt(): + model = MODELS.get('vicuna')() + templtae = VLChatTemplateWrapper(model) + messages = [{ + 'role': + 'user', + 'content': [{ + 'type': 'text', + 'text': 'hi' + }, { + 'type': 'image_url', + 'image_url': { + 'url': 'xxx' + } + }] + }] + prompt = templtae.messages2prompt(messages) + assert isinstance(prompt, str) + assert prompt.count(IMAGE_TOKEN) == 1