Support molmo in turbomind (#2716)
* initial moe support

* dynamic grouped gemm

* benchmark

* moe benchmark

* moe sampling

* split-k

* refactor tuning

* simplify

* n-major weight

* add `num` for `MatrixLayout`

* packed rows

* packed cols

* dispatch for packed rows

* w4a16 moe

* refactor model loading

* fix pytorch loader

* refactor

* dispatch w4a16 moe

* fix loader

* add comment

* fix msvc build

* fix msvc build

* fix msvc build

* fix ut

* fix ut

* fix p-lora

* add all support arches

* minor

* fix lint

* fix lint

* fix lint

* fix ut

* bf16 support

* minor

* checkin molmo conversion

* add chat template

* refactor

* fix lint

* fix ut

* Just for test: hardcode vocab_size

* minor

* minor

* minor

* fix inter_size config

* load with non-standard filenames

* fix loader

* fix missing default param

* defer the loading of misc weights for safetensors

* add embedding_size

* update

* update

* tmp

* tmp

* update molmo template

* vision embedding

* fix

* update

* fix

* fix messages2prompt in templates

* fix order of out_messages

* fix

* add user guide

* update is_supported

---------

Co-authored-by: Li Zhang <[email protected]>
lvhan028 and lzhangzz authored Nov 14, 2024
1 parent a21def9 commit fd8906c
Showing 19 changed files with 653 additions and 8 deletions.
2 changes: 2 additions & 0 deletions docs/en/multi_modal/index.rst
@@ -12,3 +12,5 @@ Vision-Language Models
minicpmv.md
phi3.md
mllama.md
qwen2_vl.md
molmo.md
92 changes: 92 additions & 0 deletions docs/en/multi_modal/molmo.md
@@ -0,0 +1,92 @@
# Molmo

LMDeploy supports the following Molmo series models, detailed in the table below:

| Model | Size | Supported Inference Engine |
| :-------------: | :--: | :------------------------: |
| Molmo-7B-D-0924 | 7B | TurboMind |
| Molmo-72B-0924  | 72B  | TurboMind                  |

The following sections demonstrate how to deploy a Molmo model using LMDeploy, taking [Molmo-7B-D-0924](https://huggingface.co/allenai/Molmo-7B-D-0924) as an example.

## Installation

Please install LMDeploy by following the [installation guide](../get_started/installation.md).

## Offline inference

The following sample code shows the basic usage of the VLM pipeline. For detailed information, please refer to [VLM Offline Inference Pipeline](./vl_pipeline.md).

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('allenai/Molmo-7B-D-0924')

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```

More examples are listed below:

<details>
<summary>
<b>multi-image multi-round conversation, combined images</b>
</summary>

```python
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('allenai/Molmo-7B-D-0924', log_level='INFO')
messages = [
dict(role='user', content=[
dict(type='text', text='Describe the two images in detail.'),
dict(type='image_url', image_url=dict(url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Beijing_Small.jpeg')),
dict(type='image_url', image_url=dict(url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Chongqing_Small.jpeg'))
])
]
out = pipe(messages, gen_config=GenerationConfig(do_sample=False))

messages.append(dict(role='assistant', content=out.text))
messages.append(dict(role='user', content='What are the similarities and differences between these two images?'))
out = pipe(messages, gen_config=GenerationConfig(do_sample=False))
```

</details>

## Online serving

You can launch the server with the `lmdeploy serve api_server` CLI:

```shell
lmdeploy serve api_server allenai/Molmo-7B-D-0924
```

You can also start the service using the docker image:

```shell
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 23333:23333 \
--ipc=host \
openmmlab/lmdeploy:latest \
lmdeploy serve api_server allenai/Molmo-7B-D-0924
```

If you see logs like the following, the service has launched successfully.

```text
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
INFO: Started server process [2439]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
```

The arguments of `lmdeploy serve api_server` can be reviewed in detail by `lmdeploy serve api_server -h`.

More information about `api_server`, as well as how to access the service, can be found [here](api_server_vl.md).
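
Once the service is running, it can be queried with any OpenAI-compatible client. The snippet below is a minimal sketch, assuming the server started above is reachable at `http://0.0.0.0:23333` and serves the OpenAI-compatible `/v1` endpoints; the API key is a placeholder.

```python
# Minimal sketch: query the api_server launched above via its
# OpenAI-compatible /v1 endpoints. The api_key is a placeholder.
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')

# discover the model name registered by the server
model_name = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Describe this image.'},
            {'type': 'image_url', 'image_url': {
                'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'
            }},
        ],
    }],
    temperature=0.8,
    top_p=0.95)
print(response.choices[0].message.content)
```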
2 changes: 2 additions & 0 deletions docs/zh_cn/multi_modal/index.rst
@@ -12,3 +12,5 @@
minicpmv.md
phi3.md
mllama.md
qwen2_vl.md
molmo.md
92 changes: 92 additions & 0 deletions docs/zh_cn/multi_modal/molmo.md
@@ -0,0 +1,92 @@
# Molmo

LMDeploy supports the Molmo series of models, as detailed in the table below:

| Model | Size | Supported Inference Engine |
| :-------------: | :--: | :------------------------: |
| Molmo-7B-D-0924 | 7B | TurboMind |
| Molmo-72B-0924  | 72B  | TurboMind                  |

This article takes [Molmo-7B-D-0924](https://huggingface.co/allenai/Molmo-7B-D-0924) as an example to demonstrate how to deploy Molmo series models with LMDeploy.

## Installation

Please install LMDeploy by following the [installation guide](../get_started/installation.md).

## Offline inference

The following sample code shows the basic usage of the VLM pipeline. For more usage, please refer to [VLM Offline Inference Pipeline](./vl_pipeline.md).

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('allenai/Molmo-7B-D-0924')

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```

More examples are listed below:

<details>
<summary>
<b>multi-image multi-round conversation</b>
</summary>

```python
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('allenai/Molmo-7B-D-0924', log_level='INFO')
messages = [
dict(role='user', content=[
dict(type='text', text='Describe the two images in detail.'),
dict(type='image_url', image_url=dict(url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Beijing_Small.jpeg')),
dict(type='image_url', image_url=dict(url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Chongqing_Small.jpeg'))
])
]
out = pipe(messages, gen_config=GenerationConfig(top_k=1))

messages.append(dict(role='assistant', content=out.text))
messages.append(dict(role='user', content='What are the similarities and differences between these two images?'))
out = pipe(messages, gen_config=GenerationConfig(top_k=1))
```

</details>

## Online serving

You can launch the server with the `lmdeploy serve api_server` CLI:

```shell
lmdeploy serve api_server allenai/Molmo-7B-D-0924
```

You can also start the service using the docker image:

```shell
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 23333:23333 \
--ipc=host \
openmmlab/lmdeploy:latest \
lmdeploy serve api_server allenai/Molmo-7B-D-0924
```

If you see logs like the following, the service has launched successfully.

```text
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
INFO: Started server process [2439]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
```

The arguments of `lmdeploy serve api_server` can be reviewed in detail with `lmdeploy serve api_server -h`.

More information about `api_server`, as well as how to access the service, can be found [here](api_server_vl.md).
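
As a sketch of calling the REST endpoint directly (without the `openai` package), the request below assumes the server above listens on `http://0.0.0.0:23333` and registers the model under the path it was launched with:

```python
# Minimal sketch: POST to the OpenAI-compatible chat completions endpoint.
# The model name is assumed to equal the path passed to `api_server`.
import requests

payload = {
    'model': 'allenai/Molmo-7B-D-0924',
    'messages': [{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Describe this image.'},
            {'type': 'image_url', 'image_url': {
                'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'
            }},
        ],
    }],
}
resp = requests.post('http://0.0.0.0:23333/v1/chat/completions',
                     json=payload, timeout=300)
print(resp.json()['choices'][0]['message']['content'])
```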
3 changes: 2 additions & 1 deletion lmdeploy/archs.py
@@ -121,7 +121,8 @@ def check_vl_llm(config: dict) -> bool:
'InternVLChatModel', 'MiniGeminiLlamaForCausalLM',
'MGMLlamaForCausalLM', 'MiniCPMV', 'LlavaForConditionalGeneration',
'LlavaNextForConditionalGeneration', 'Phi3VForCausalLM',
'Qwen2VLForConditionalGeneration', 'MllamaForConditionalGeneration'
'Qwen2VLForConditionalGeneration', 'MllamaForConditionalGeneration',
'MolmoForCausalLM'
])
if arch == 'QWenLMHeadModel' and 'visual' in config:
return True
31 changes: 31 additions & 0 deletions lmdeploy/model.py
@@ -1747,6 +1747,37 @@ def match(cls, model_path: str) -> Optional[str]:
return 'internvl-phi3'


@MODELS.register_module(name='molmo')
class Molmo(BaseChatTemplate):

def __init__(self,
user=' User: ',
eoh='',
assistant=' Assistant:',
eoa='',
separator=' ',
stop_words=['<|endoftext|>'],
**kwargs):
super().__init__(user=user,
eoh=eoh,
assistant=assistant,
eoa=eoa,
separator=separator,
stop_words=stop_words,
**kwargs)

@classmethod
def match(cls, model_path: str) -> Optional[str]:
"""Return the model_name that was registered to MODELS.
Args:
model_path (str): the model path used for matching.
"""
path = model_path.lower()
if 'molmo' in path:
return 'molmo'


def best_match_model(query: str) -> Optional[str]:
"""Get the model that matches the query.
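As an illustration of how the new `molmo` template is meant to be used, here is a hypothetical sketch; it assumes `MODELS.get('molmo')` returns the class registered above and that `messages2prompt` is inherited from `BaseChatTemplate`, so the exact rendered string is an approximation.

```python
# Hypothetical sketch of exercising the `molmo` chat template registered above.
from lmdeploy.model import MODELS

template = MODELS.get('molmo')()  # instance of the Molmo class shown in the diff
print(template.match('allenai/Molmo-7B-D-0924'))  # -> 'molmo'

prompt = template.messages2prompt([
    dict(role='user', content='describe this image'),
])
# With user=' User: ', assistant=' Assistant:' and separator=' ', the rendered
# prompt is roughly " User: describe this image Assistant:".
print(prompt)
```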
5 changes: 5 additions & 0 deletions lmdeploy/serve/vl_async_engine.py
@@ -64,6 +64,7 @@ async def _get_prompt_input(self,
results = {}
input_ids = []
from lmdeploy.vl.templates import (MllamaTempateWrapper,
MolmoChatTemplateWrapper,
Qwen2VLChatTemplateWrapper)
ranges = None
grid_thws = None
@@ -99,6 +100,10 @@ async def _get_prompt_input(self,
results['cross_attention_states'] = features[0]
return results

if isinstance(self.vl_prompt_template,
MolmoChatTemplateWrapper):
return features[0]

features = [x.cpu().numpy() for x in features]
input_ids = []
begins = []
7 changes: 7 additions & 0 deletions lmdeploy/turbomind/deploy/config.py
@@ -35,6 +35,13 @@ class ModelConfig:
kv_head_num: int = None
hidden_units: int = None
vocab_size: int = None
# Turbomind used to assume that token_embedding and lm_head have the same size
# along the vocab dim, i.e. `vocab_size`.
# But in molmo, embedding.shape is [vocab_size + 128, hidden_units],
# while lm_head.shape is [hidden_units, vocab_size].
# Therefore, we add a new attr `embedding_size` to represent the vocab dim
# of token_embedding.
embedding_size: int = 0
num_layer: int = None
inter_size: int = None
norm_eps: float = None
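To illustrate the comment above, a hypothetical configuration for a Molmo-style checkpoint could look like the sketch below; the concrete numbers are assumptions, only the relation `embedding_size = vocab_size + 128` comes from the comment.

```python
# Hypothetical sketch illustrating vocab_size vs. embedding_size.
# The concrete numbers are assumptions, not values read from a real checkpoint.
from lmdeploy.turbomind.deploy.config import ModelConfig

cfg = ModelConfig(
    hidden_units=3584,
    vocab_size=152064,            # lm_head:         [hidden_units, vocab_size]
    embedding_size=152064 + 128,  # token_embedding: [embedding_size, hidden_units]
)
assert cfg.embedding_size >= cfg.vocab_size
```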
1 change: 1 addition & 0 deletions lmdeploy/turbomind/deploy/source_model/__init__.py
@@ -9,5 +9,6 @@
from .meta_llama import MetaLlamaModel # noqa: F401
from .minicpmv import MiniCPMVModel # noqa: F401
from .mixtral import MixtralModel # noqa: F401
from .molmo import MolmoModel # noqa: F401
from .qwen import QwenModel # noqa: F401
from .xcomposer2 import Xcomposer2Model # noqa: F401