
Merge branch 'main' into bump-version
lvhan028 committed Oct 25, 2024
2 parents 7c93a54 + 962e760 commit a5fd662
Showing 119 changed files with 6,035 additions and 1,809 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -44,7 +44,7 @@ repos:
rev: v2.1.0
hooks:
- id: codespell
args: ["--skip=third_party/*,*.ipynb,*.proto,src/turbomind/kernels/gemm/transform.h,docker/Dockerfile_aarch64_ascend,docs/en/get_started/ascend/get_started.md,docs/zh_cn/get_started/ascend/get_started.md"]
args: ["--skip=third_party/*,*.ipynb,*.proto,src/turbomind/*,docker/Dockerfile_aarch64_ascend,docs/en/get_started/ascend/get_started.md,docs/zh_cn/get_started/ascend/get_started.md"]


- repo: https://github.com/myint/docformatter
13 changes: 8 additions & 5 deletions docker/Dockerfile_aarch64_ascend
@@ -73,6 +73,7 @@ $LD_LIBRARY_PATH
ARG CHIP=all
ARG TOOLKIT_PKG=Ascend-cann-toolkit_*.run
ARG KERNELS_PKG=Ascend-cann-kernels-*.run
ARG NNAL_PKG=Ascend-nnal_*.run

RUN --mount=type=cache,target=/tmp,from=build_temp,source=/tmp \
umask 0022 && \
@@ -83,10 +84,11 @@ RUN --mount=type=cache,target=/tmp,from=build_temp,source=/tmp \
else \
CHIPOPTION=""; \
fi && \
chmod +x $TOOLKIT_PKG $KERNELS_PKG && \
chmod +x $TOOLKIT_PKG $KERNELS_PKG $NNAL_PKG && \
./$TOOLKIT_PKG --quiet --install --install-path=$ASCEND_BASE --install-for-all $CHIPOPTION && \
./$KERNELS_PKG --quiet --install --install-path=$ASCEND_BASE --install-for-all && \
rm -f $TOOLKIT_PKG $KERNELS_PKG
./$NNAL_PKG --quiet --install --install-path=$ASCEND_BASE && \
rm -f $TOOLKIT_PKG $KERNELS_PKG $NNAL_PKG

ENV GLOG_v=2 \
LD_LIBRARY_PATH=$TOOLKIT_PATH/lib64:$LD_LIBRARY_PATH \
@@ -99,14 +101,15 @@ ENV PYTHONPATH=$TBE_IMPL_PATH:$PYTHONPATH

SHELL ["/bin/bash", "-c"]
RUN echo "source /usr/local/Ascend/ascend-toolkit/set_env.sh" >> ~/.bashrc && \
echo "source /usr/local/Ascend/nnal/atb/set_env.sh --cxx_abi=0" >> ~/.bashrc && \
. ~/.bashrc

# dlinfer
# transformers>=4.41.0 is required for internlm2 model
# timm is required for internvl2 model
RUN --mount=type=cache,target=/root/.cache/pip \
pip3 install transformers>=4.41.0 timm && \
pip3 install dlinfer-ascend==0.1.0.post1
pip3 install torch==2.3.1 torchvision==0.18.1 torch-npu==2.3.1 && \
pip3 install transformers timm && \
pip3 install dlinfer-ascend==0.1.1

# lmdeploy
FROM build_temp as copy_temp
2 changes: 1 addition & 1 deletion docs/en/get_started/ascend/get_started.md
@@ -27,7 +27,7 @@ The target machine needs to install the Huawei driver and firmware version 23.0.
[CANN Driver and Firmware Installation](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC1alpha003/softwareinst/instg/instg_0019.html)
and [download resources](https://www.hiascend.com/hardware/firmware-drivers/community?product=4&model=26&cann=8.0.RC2.beta1&driver=1.0.25.alpha).

And the CANN (version 8.0.RC2.beta1) software packages should also be downloaded from the [Ascend Resource Download Center](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC2.beta1&product=4&model=26). Make sure to place `Ascend-cann-kernels-910b*.run` and `Ascend-cann-toolkit*-aarch64.run` under the root directory of the lmdeploy source code.
And the CANN (version 8.0.RC2.beta1) software packages should also be downloaded from the [Ascend Resource Download Center](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC2.beta1&product=4&model=26). Make sure to place `Ascend-cann-kernels-910b*.run`, `Ascend-cann-nnal_*.run`, and `Ascend-cann-toolkit*-aarch64.run` under the root directory of the lmdeploy source code.

#### Build Docker Image

2 changes: 1 addition & 1 deletion docs/zh_cn/get_started/ascend/get_started.md
@@ -27,7 +27,7 @@ The Docker version should be no lower than 18.03, and you need to follow the [official guide](https://www.hias
[Download resources](https://www.hiascend.com/hardware/firmware-drivers/community?product=4&model=26&cann=8.0.RC2.beta1&driver=1.0.25.alpha)

In addition, `docker/Dockerfile_aarch64_ascend` does not provide the CANN installation packages; users need to download the CANN (version 8.0.RC2.beta1) packages themselves from the [Ascend Resource Download Center](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC2.beta1&product=4&model=26).
Place ``` Ascend-cann-kernels-910b*.run`` and ```Ascend-cann-toolkit\*-aarch64.run\`\` under the root directory of the lmdeploy source code.
Place `Ascend-cann-kernels-910b*.run`, `Ascend-cann-nnal_*.run`, and `Ascend-cann-toolkit*.run` under the root directory of the lmdeploy source code.

#### Build the Image

2 changes: 1 addition & 1 deletion lmdeploy/api.py
@@ -11,7 +11,7 @@ def pipeline(model_path: str,
backend_config: Optional[Union[TurbomindEngineConfig,
PytorchEngineConfig]] = None,
chat_template_config: Optional[ChatTemplateConfig] = None,
log_level: str = 'ERROR',
log_level: str = 'WARNING',
max_log_len: int = None,
**kwargs):
"""
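For reference, a minimal usage sketch of the new default follows; the model path below is a placeholder, not something taken from this commit.

```python
from lmdeploy import pipeline

# log_level now defaults to 'WARNING'; pass it explicitly to restore the old behaviour.
pipe = pipeline('internlm/internlm2-chat-7b')                      # logs at WARNING by default
quiet_pipe = pipeline('internlm/internlm2-chat-7b', log_level='ERROR')
```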
21 changes: 21 additions & 0 deletions lmdeploy/pytorch/backends/dlinfer/activation.py
@@ -0,0 +1,21 @@
# Copyright (c) OpenMMLab. All rights reserved.
from lmdeploy.pytorch.kernels.dlinfer.activation import silu_and_mul

from ..activation import SiluAndMulBuilder, SiluAndMulImpl


class DlinferSiluAndMulImpl(SiluAndMulImpl):
"""silu + multiple fused implementation."""

def forward(self, x):
"""forward."""
return silu_and_mul(x)


class DlinferSiluAndMulBuilder(SiluAndMulBuilder):
"""silu and mul implementation builder."""

@staticmethod
def build(inplace: bool = False):
"""build."""
return DlinferSiluAndMulImpl()
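A plain-PyTorch sketch of the semantics the fused `silu_and_mul` kernel is assumed to follow; splitting the last dimension into gate/up halves is the usual convention and is an assumption here, not something stated in this diff.

```python
import torch
import torch.nn.functional as F

def silu_and_mul_reference(x: torch.Tensor) -> torch.Tensor:
    # split [..., 2*d] into gate and up halves, then compute SiLU(gate) * up
    gate, up = x.chunk(2, dim=-1)
    return F.silu(gate) * up

out = silu_and_mul_reference(torch.randn(2, 8, 256))  # -> shape [2, 8, 128]
```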
116 changes: 116 additions & 0 deletions lmdeploy/pytorch/backends/dlinfer/ascend/graph_runner.py
@@ -0,0 +1,116 @@
# Copyright (c) OpenMMLab. All rights reserved.
import warnings
from importlib import import_module

import torch
import torch.distributed

from lmdeploy.pytorch.config import BackendConfig, CacheConfig, ModelConfig
from lmdeploy.utils import get_logger

from ...graph_runner import GraphRunner

logger = get_logger('lmdeploy')


class AscendGraphRunner(GraphRunner):
"""ascend graph runner."""

def __init__(self, model: torch.nn.Module, model_config: ModelConfig,
cache_config: CacheConfig, backend_config: BackendConfig,
device: torch.device):
super().__init__(model, model_config, cache_config, backend_config,
device)

self.enable_graph = self.check_enable_graph()
if self.enable_graph:
import dlinfer.graph
dlinfer.graph.config.enable_graph_mode = True
self.patch_kernels_custom_op()
self.patch_kvcache_static_shape()
self.model = torch.compile(self.model,
fullgraph=True,
dynamic=True,
backend='atbgraph')

def check_enable_graph(self):
"""check enable graph."""
# eager_mode
if self.backend_config.eager_mode:
return False
# tp
if torch.distributed.is_initialized():
warnings.warn(
"Graph mode of device_type 'ascend' only supports tp=1 "
'for now, fallback to eager mode', RuntimeWarning)
return False
# model support
self.supported_model = {
'Llama2': 'LlamaConfig',
'InternLM2': 'InternLM2Config',
'Qwen2': 'Qwen2Config',
}
is_model_support = True
model_config_name = str(type(self.model_config.hf_config).__name__)
if model_config_name not in self.supported_model.values():
is_model_support = False
if not is_model_support:
warnings.warn(
"Graph mode of device_type 'ascend' only supports models: "
f"{', '.join(self.supported_model.keys())} when tp=1 for now",
RuntimeWarning)
return True

def patch_kernels_custom_op(self):
from dlinfer.graph.custom_op import register_custom_op
dlinfer_kernels_module = import_module(
'lmdeploy.pytorch.kernels.dlinfer')
dlinfer_backends_module = import_module(
'lmdeploy.pytorch.backends.dlinfer')

# prefill_attention
module_str = 'pagedattention'
paged_attn_module = getattr(dlinfer_kernels_module, module_str)
func_str = 'prefill_attention'
prefill_attn_origin = getattr(paged_attn_module, func_str)
prefill_attn_registered = register_custom_op(
f'lmdeploy::{func_str}', ['attn_output'])(prefill_attn_origin)
setattr(paged_attn_module, func_str, prefill_attn_registered)

# apply_rotary_pos_emb
def apply_rotary_emb_abstract_impl(q, k, cos, sin, q_out, k_out):
result = [q, k]
if q_out is not None:
result[0] = q_out
if k_out is not None:
result[1] = k_out
return tuple(result)

module_str = 'apply_rotary_emb'
apply_rotary_emb_module = getattr(dlinfer_backends_module, module_str)
func_str = 'apply_rotary_pos_emb'
apply_rotary_pos_emb_origin = getattr(apply_rotary_emb_module,
func_str)
apply_rotary_pos_emb_registered = register_custom_op(
f'lmdeploy::{func_str}',
impl_abstract_func=apply_rotary_emb_abstract_impl)(
apply_rotary_pos_emb_origin)
setattr(apply_rotary_emb_module, func_str,
apply_rotary_pos_emb_registered)

def patch_kvcache_static_shape(self):
import torch._dynamo as dynamo
from torch.utils._pytree import tree_map
cache_engine_module = import_module(
'lmdeploy.pytorch.engine.cache_engine')
class_str = 'CacheEngine'
cache_engine_class = getattr(cache_engine_module, class_str)
func_str = 'allocate_gpu_cache'
allocate_gpu_cache_origin = getattr(cache_engine_class, func_str)

def allocate_gpu_cache_mark_static(self):
gpu_cache = allocate_gpu_cache_origin(self)
tree_map(lambda x: dynamo.mark_static(x), gpu_cache)
return gpu_cache

setattr(cache_engine_class, func_str, allocate_gpu_cache_mark_static)
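A self-contained sketch (not lmdeploy code) of the wrapping pattern `patch_kvcache_static_shape` applies: the original allocator is kept, and every tensor in the returned pytree is marked static so `torch.compile` does not guard on cache shapes. The allocator and shapes below are illustrative stand-ins.

```python
import torch
import torch._dynamo as dynamo
from torch.utils._pytree import tree_map

def allocate_gpu_cache_origin():
    # stand-in for CacheEngine.allocate_gpu_cache; shapes are illustrative
    return [(torch.empty(4, 16, 8), torch.empty(4, 16, 8)) for _ in range(2)]

def allocate_gpu_cache_mark_static():
    gpu_cache = allocate_gpu_cache_origin()
    tree_map(lambda x: dynamo.mark_static(x), gpu_cache)  # mark every cache tensor static
    return gpu_cache

cache = allocate_gpu_cache_mark_static()
```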
99 changes: 74 additions & 25 deletions lmdeploy/pytorch/backends/dlinfer/ascend/op_backend.py
@@ -3,6 +3,7 @@

import torch

from lmdeploy.pytorch.config import BackendConfig, CacheConfig, ModelConfig
from lmdeploy.utils import get_logger

from ..op_backend import DlinferOpsBackend
@@ -12,6 +13,9 @@

class AscendOpsBackend(DlinferOpsBackend):
"""ascend layer backend."""
enable_graph = False
half_negative_inf = torch.finfo(torch.float16).min
total_slots = None

@staticmethod
def get_name() -> str:
@@ -45,21 +49,23 @@ def get_v_block_shape(
@classmethod
def update_step_context(cls, step_context):
"""update step context."""

def get_total_slots():
if cls.total_slots is None:
cls.total_slots = torch.arange(
block_num * block_size,
dtype=torch.long,
device=step_context.block_offsets.device)
cls.total_slots = cls.total_slots.view(block_num, block_size)
return cls.total_slots

kv_start_indices, attention_mask = [], []
block_num, block_size, _ = step_context.kv_caches[0][0].shape
device = step_context.block_offsets.device

is_unpaged_prefill = False
if not step_context.is_decoding:
is_unpaged_prefill = \
all((step_context.q_seqlens ==
step_context.kv_seqlens).tolist())

total_slots = torch.arange(block_num * block_size,
dtype=torch.long,
device=device)
total_slots = total_slots.view(block_num, block_size)

q_seqlens_list = step_context.q_seqlens.tolist()
kv_seqlens_list = step_context.kv_seqlens.tolist()
max_q_seq_len = max(q_seqlens_list)
@@ -71,9 +77,9 @@ def update_step_context(cls, step_context):

# collect kv start indices.
history_length = kv_seq_len - q_seq_len
slot_tables = total_slots[step_context.block_offsets[i]].flatten()
slot_indices = [p for p in range(history_length, kv_seq_len)]
slots = slot_tables[slot_indices].reshape((-1, 1))
total_slots = get_total_slots()
slot_tables = total_slots[step_context.block_offsets[i]].view(-1)
slots = slot_tables[history_length:kv_seq_len]
kv_start_indices.append(slots)

# collect attention mask of paged_prefill attention stage.
@@ -83,19 +89,19 @@
torch.ones(q_seq_len,
step_context.block_offsets.shape[1] *
block_size,
dtype=torch.bool).cuda(),
dtype=torch.bool,
device=step_context.block_offsets.device),
diagonal=kv_seq_len - q_seq_len,
))
attention_mask.append(single_attention_mask)

kv_start_indices = torch.cat(kv_start_indices)

if step_context.is_decoding:
# prepare somae params of paged_decode attention stage.
# prepare some params of paged_decode attention stage.
q_start_loc_cpu, q_seqlens_cpu = None, None
kv_seqlens_cpu = step_context.kv_seqlens.cpu()
elif is_unpaged_prefill:
# prepare somae params of unpaged_prefill attention stage.
# prepare some params of unpaged_prefill attention stage.
q_start_loc_cpu, kv_seqlens_cpu = None, None
q_seqlens_cpu = step_context.q_seqlens.cpu()
single_attention_mask = torch.logical_not(
@@ -106,24 +112,54 @@
))
attention_mask.append(single_attention_mask)
else:
# prepare somae params of paged_prefill attention stage.
# prepare some params of paged_prefill attention stage.
q_start_loc_cpu, q_seqlens_cpu = None, None
kv_seqlens_cpu = step_context.kv_seqlens.repeat_interleave(
step_context.q_seqlens, 0).cpu()
block_offsets_int32 = step_context.block_offsets.to(torch.int32)
step_context.block_offsets = block_offsets_int32.repeat_interleave(
step_context.q_seqlens, 0)
attention_mask = [
torch.cat([mask for mask in attention_mask]).unsqueeze(1)
]
attention_mask = [torch.cat([mask for mask in attention_mask])]

if cls.enable_graph:
kv_start_indices = kv_start_indices.view(-1).to(torch.int32)
import torch._dynamo as dynamo
if not is_unpaged_prefill:
step_context.block_offsets = step_context.block_offsets.to(
torch.int32)
if not step_context.is_decoding:
step_context.block_offsets = step_context.block_offsets\
.repeat_interleave(step_context.q_seqlens, 0)
dynamo.mark_dynamic(step_context.block_offsets, [0, 1])
kv_seqlens = step_context.kv_seqlens.to(torch.int32)
if not step_context.is_decoding:
if is_unpaged_prefill:
attention_mask = [mask.half() for mask in attention_mask]
else:
attention_mask = [
torch.cat([
mask.half() * cls.half_negative_inf
for mask in attention_mask
]).unsqueeze(1)
]
kv_seqlens = kv_seqlens.repeat_interleave(
step_context.q_seqlens, 0)
else:
if step_context.is_decoding:
kv_seqlens_cpu = step_context.kv_seqlens.cpu()
elif is_unpaged_prefill:
pass
else:
kv_seqlens_cpu = step_context.kv_seqlens.repeat_interleave(
step_context.q_seqlens, 0).cpu()
block_offsets_int32 = step_context.block_offsets.to(
torch.int32)
step_context.block_offsets = block_offsets_int32\
.repeat_interleave(step_context.q_seqlens, 0)
kv_seqlens = kv_seqlens_cpu

attn_meta_cls = cls.get_attention_metadata_cls()
attn_metadata = attn_meta_cls(
step_context.is_decoding,
step_context.block_offsets,
q_start_loc=q_start_loc_cpu,
q_seqlens=q_seqlens_cpu,
kv_seqlens=kv_seqlens_cpu,
kv_seqlens=kv_seqlens,
kv_start_indices=kv_start_indices,
block_size=block_size,
attention_mask=attention_mask,
@@ -134,3 +170,16 @@ def update_step_context(cls, step_context):

step_context.attn_metadata = attn_metadata
return step_context

@staticmethod
def build_graph_runner(model: torch.nn.Module, model_config: ModelConfig,
cache_config: CacheConfig,
backend_config: BackendConfig,
device: torch.device):
"""build graph runner."""
from .graph_runner import AscendGraphRunner
ascend_graph_runner = AscendGraphRunner(model, model_config,
cache_config, backend_config,
device)
AscendOpsBackend.enable_graph = ascend_graph_runner.enable_graph
return ascend_graph_runner
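A minimal sketch of the additive-mask conversion the graph-mode path performs for paged prefill: a boolean "masked out" matrix is scaled by `half_negative_inf` so it can be added to float16 attention scores. The sequence lengths and shapes below are illustrative.

```python
import torch

half_negative_inf = torch.finfo(torch.float16).min
q_seq_len, kv_seq_len = 4, 6

# True where attention is NOT allowed (beyond the causal boundary)
bool_mask = torch.logical_not(
    torch.tril(torch.ones(q_seq_len, kv_seq_len, dtype=torch.bool),
               diagonal=kv_seq_len - q_seq_len))

additive_mask = bool_mask.half() * half_negative_inf  # 0 where allowed, ~-65504 where masked
```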
(Diffs for the remaining changed files were not loaded.)
