[Feature]: support LlavaForConditionalGeneration with turbomind inference #2710

Merged · 9 commits · Nov 8, 2024
172 changes: 172 additions & 0 deletions docs/en/multi_modal/llava_qwen.md
@@ -0,0 +1,172 @@
# Llava-Qwen2

LMDeploy supports the following llava-qwen2 series of models, which are detailed in the table below:

| Model | Size | Supported Inference Engine |
| :-------------------------: | :--: | :------------------------: |
| Llava-interleave-qwen-7b-hf | 7B | TurboMind |

The following sections demonstrate how to deploy a LlavaQwen2 model using LMDeploy, with [llava-interleave-qwen-7b-hf](https://huggingface.co/llava-hf/llava-interleave-qwen-7b-hf) as an example.

## Installation

Please install LMDeploy by following the [installation guide](../get_started/installation.md).

Or, you can build a docker image to set up the inference environment. If the CUDA version on your host machine is `>=12.4`, you can run:

```shell
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
docker build --build-arg CUDA_VERSION=cu12 -t openmmlab/lmdeploy:llava_qwen2 . -f ./docker/Dockerfile
```

Otherwise, you can go with:

```shell
docker build --build-arg CUDA_VERSION=cu11 -t openmmlab/lmdeploy:llava_qwen2 . -f ./docker/Dockerfile
```
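
Either way, the resulting image is tagged `openmmlab/lmdeploy:llava_qwen2`; the `docker run` and docker compose examples later in this guide assume that tag.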

## Offline inference

The following sample code shows the basic usage of the VLM pipeline. For detailed information, please refer to [VLM Offline Inference Pipeline](./vl_pipeline.md).

```python
from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline
from lmdeploy.vl import load_image


pipe = pipeline('llava-hf/llava-interleave-qwen-7b-hf',
                backend_config=TurbomindEngineConfig(cache_max_entry_count=0.5),
                gen_config=GenerationConfig(max_new_tokens=512))

image = load_image('https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg')
prompt = 'Describe the image.'
print(f'prompt: {prompt}')
response = pipe((prompt, image))
print(response)
```
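
In this example, `cache_max_entry_count=0.5` shrinks the share of GPU memory reserved for the k/v cache, which roughly leaves more headroom for the vision model; lower it further if you run into out-of-memory errors. `max_new_tokens` caps the length of the generated reply.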

More examples are listed below:

<details>
<summary>
<b>multi-image multi-round conversation, combined images</b>
</summary>

```python
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('llava-hf/llava-interleave-qwen-7b-hf', log_level='INFO')
messages = [
    dict(role='user', content=[
        dict(type='text', text='Describe the two images in detail.'),
        dict(type='image_url', image_url=dict(url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Beijing_Small.jpeg')),
        dict(type='image_url', image_url=dict(url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Chongqing_Small.jpeg'))
    ])
]
out = pipe(messages, gen_config=GenerationConfig(top_k=1))

messages.append(dict(role='assistant', content=out.text))
messages.append(dict(role='user', content='What are the similarities and differences between these two images.'))
out = pipe(messages, gen_config=GenerationConfig(top_k=1))
```

</details>

<details>
<summary>
<b>image resolution for performance boost</b>
</summary>

```python
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('llava-hf/llava-interleave-qwen-7b-hf', log_level='INFO')

min_pixels = 64 * 28 * 28
max_pixels = 64 * 28 * 28
messages = [
    dict(role='user', content=[
        dict(type='text', text='Describe the two images in detail.'),
        dict(type='image_url', image_url=dict(min_pixels=min_pixels, max_pixels=max_pixels, url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Beijing_Small.jpeg')),
        dict(type='image_url', image_url=dict(min_pixels=min_pixels, max_pixels=max_pixels, url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Chongqing_Small.jpeg'))
    ])
]
out = pipe(messages, gen_config=GenerationConfig(top_k=1))

messages.append(dict(role='assistant', content=out.text))
messages.append(dict(role='user', content='What are the similarities and differences between these two images.'))
out = pipe(messages, gen_config=GenerationConfig(top_k=1))
```

</details>

## Online serving

You can launch the server with the `lmdeploy serve api_server` CLI:

```shell
lmdeploy serve api_server llava-hf/llava-interleave-qwen-7b-hf
```

You can also start the service using the docker image built earlier:

```shell
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 23333:23333 \
    --ipc=host \
    openmmlab/lmdeploy:llava_qwen2 \
    lmdeploy serve api_server llava-hf/llava-interleave-qwen-7b-hf
```

Docker Compose is another option. Create a `docker-compose.yml` configuration file in the root directory of the lmdeploy project as follows:

```yaml
version: '3.5'

services:
  lmdeploy:
    container_name: lmdeploy
    image: openmmlab/lmdeploy:llava_qwen2
    ports:
      - "23333:23333"
    environment:
      HUGGING_FACE_HUB_TOKEN: <secret>
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    stdin_open: true
    tty: true
    ipc: host
    command: lmdeploy serve api_server llava-hf/llava-interleave-qwen-7b-hf
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: "all"
              capabilities: [gpu]
```

Then, you can execute the startup command as below:

```shell
docker-compose up -d
```

If the following logs appear after running `docker logs -f lmdeploy`, the service has launched successfully.

```text
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
INFO: Started server process [2439]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
```

The arguments of `lmdeploy serve api_server` can be reviewed in detail by `lmdeploy serve api_server -h`.

More information about `api_server`, as well as how to access the service, can be found [here](api_server_vl.md).
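
For a quick check, the service can also be queried through the OpenAI-compatible interface it exposes. The snippet below is a minimal sketch: it assumes the server started above is reachable at `http://0.0.0.0:23333` and uses the model name reported by the `/v1/models` endpoint.

```python
from openai import OpenAI

# Minimal sketch: assumes the api_server launched above is listening on port 23333
# and serves the OpenAI-compatible /v1 routes.
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[
        dict(role='user',
             content=[
                 dict(type='text', text='Describe the image.'),
                 dict(type='image_url',
                      image_url=dict(url='https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'))
             ])
    ],
    temperature=0.8)
print(response.choices[0].message.content)
```

The same request can be sent with `curl` or any other OpenAI-compatible client; refer to the linked page for the full list of endpoints and parameters.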
1 change: 1 addition & 0 deletions lmdeploy/turbomind/deploy/source_model/__init__.py
@@ -5,6 +5,7 @@
from .internlm2 import InternLM2Model # noqa: F401
from .internvl import InternVLModel # noqa: F401
from .llama import LlamaModel # noqa: F401
from .llava_qwen2 import LlavaQwen2Model # noqa: F401
from .meta_llama import MetaLlamaModel # noqa: F401
from .minicpmv import MiniCPMVModel # noqa: F401
from .mixtral import MixtralModel # noqa: F401
77 changes: 77 additions & 0 deletions lmdeploy/turbomind/deploy/source_model/llava_qwen2.py
@@ -0,0 +1,77 @@
# Copyright (c) OpenMMLab. All rights reserved.
import json
import os.path as osp

from .base import INPUT_MODELS
from .llama import LlamaModel, LlamaReader


class LlavaQwen2Reader(LlamaReader):
    """LlavaQwen2Reader for llava-qwen2 model."""

    attn_layer_prefix = 'language_model.model.layers'
    attn_layer_patten = r'language_model.model.layers.([0-9]+).'
    tok_embeddings_key = 'language_model.model.embed_tokens.weight'
    norm_weight_key = 'language_model.model.norm.weight'
    output_weight_key = 'language_model.lm_head.weight'

    def __init__(self, new_params: dict, unused_params: dict, last_bin: bool,
                 model_cfg: dict, policy):
        model_cfg = model_cfg.get('text_config')
        super().__init__(new_params, unused_params, last_bin, model_cfg,
                         policy)


@INPUT_MODELS.register_module(name='llava_qwen2')
class LlavaQwen2Model(LlamaModel):
    """LlavaQwen2 model in hf format."""

    def __init__(self, model_path: str, tokenizer_path: str, **kwargs):
        super().__init__(model_path, tokenizer_path, **kwargs)
        self.Reader = LlavaQwen2Reader

    def model_info(self):
        """Read model info."""
        params_path = osp.join(self.model_path, 'config.json')
        with open(params_path) as f:
            model_arg = json.load(f)['text_config']
        num_layer = model_arg.get('num_hidden_layers', 32)
        norm_eps = model_arg.get('rms_norm_eps', 1e-6)
        attn_head_num = model_arg.get('num_attention_heads', 32)
        if 'num_key_value_heads' in model_arg:
            kv_head_num = model_arg.get('num_key_value_heads', 32)
        else:
            kv_head_num = model_arg.get('num_attention_heads', 32)
        rope_theta = float(model_arg.get('rope_theta', 10000.0))
        max_position_embeddings = int(
            model_arg.get('max_position_embeddings', 0))
        rope_scaling = model_arg.get('rope_scaling', None)
        scaling_factor = 0.0
        use_dynamic_ntk = 0

        # special for the model: llava-hf/llava-interleave-qwen-7b-hf
        hidden_units = model_arg.get('hidden_size', 4096)
        vocab_size = model_arg.get('vocab_size', 152000)
        intermediate_size = model_arg.get('intermediate_size', 11008)
        attn_bias = int(model_arg.get('attn_bias', 1))
        use_logn_attn = int(model_arg.get('use_logn_attn', 0))

        if isinstance(rope_scaling, dict):
            scaling_type = model_arg['rope_scaling'].get('type', '')
            scaling_factor = model_arg['rope_scaling'].get('factor', '')
            if scaling_type == 'dynamic':
                use_dynamic_ntk = 1

        return dict(num_layer=num_layer,
                    norm_eps=norm_eps,
                    head_num=attn_head_num,
                    hidden_units=hidden_units,
                    kv_head_num=kv_head_num,
                    rope_theta=rope_theta,
                    max_position_embeddings=max_position_embeddings,
                    use_dynamic_ntk=use_dynamic_ntk,
                    rope_scaling_factor=scaling_factor,
                    inter_size=intermediate_size,
                    use_logn_attn=use_logn_attn,
                    attn_bias=attn_bias,
                    vocab_size=vocab_size)
21 changes: 17 additions & 4 deletions lmdeploy/turbomind/generate_gemm_config.py
@@ -54,10 +54,23 @@ def main(head_num: int = 32,
        from transformers import AutoConfig
        config = AutoConfig.from_pretrained(model_path,
                                            trust_remote_code=True)
        try:
            head_num = config.num_attention_heads
            size_per_head = config.hidden_size // head_num
            inter_size = config.intermediate_size
            vocab_size = config.vocab_size
        except AttributeError as e:
            if hasattr(config, 'text_config'):
                config = config.text_config
            elif hasattr(config, 'llm_config'):
                config = config.llm_config
            else:
                raise AttributeError(f'not found attribute in {config},\
                    please check your model config file.{e}')
            head_num = config.num_attention_heads
            size_per_head = config.hidden_size // head_num
            inter_size = config.intermediate_size
            vocab_size = config.vocab_size
    for bsz in range(1, max_batch_size + 1):
        subprocess.call(
            f'{get_llama_gemm()} {bsz} 1 1 {head_num} {size_per_head}'
4 changes: 4 additions & 0 deletions lmdeploy/turbomind/supported_models.py
@@ -25,6 +25,8 @@
    # llava
    LlavaLlamaForCausalLM='llama',
    LlavaMistralForCausalLM='llama',
    # Llava_interleave
    LlavaForConditionalGeneration='llava_qwen2',
    # xcomposer2
    InternLMXComposer2ForCausalLM='xcomposer2',
    # internvl
@@ -99,5 +101,7 @@ def _is_head_dim_128(cfg):
        elif arch == 'InternVLChatModel':
            # internvl2-4b,internlm2-1b are not working yet
            support_by_turbomind = _is_head_dim_128(cfg.llm_config)
        elif arch == 'LlavaForConditionalGeneration':
            support_by_turbomind = _is_head_dim_128(cfg.text_config)

    return support_by_turbomind