QNN backend support is currently preliminary and under active development, aiming at better performance and a wider range of supported models.
The QNN backend relies on the Qualcomm QNN framework and the Hexagon SDK to compile LLM-specific operators. Please download both using QPM (a download sketch follows the version requirements below). The build environment currently supports Linux only.
Version requirements:
- QNN: Linux v2.20
- Hexagon SDK: Linux 5.5.0.1 (Some accounts may not have permission to access this SDK; contact Qualcomm for support if needed.)
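As a sketch, the QPM download flow typically looks like the following. The package identifiers and .qik file names vary by account and release, so treat the names below as assumptions and substitute the ones QPM actually offers you:

```bash
# Package and file names are assumptions; substitute the exact
# identifiers and .qik files available to your Qualcomm account.
qpm-cli --login <your-qualcomm-id>
qpm-cli --license-activate qualcomm_ai_engine_direct
qpm-cli --extract qualcomm_ai_engine_direct.<version>.Linux-AnyCPU.qik
qpm-cli --license-activate hexagonsdk5.x
qpm-cli --extract hexagonsdk5.x.<version>.Linux-AnyCPU.qik
```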
After downloading and installing the two SDKs with qpm-cli, copy the SDK directories into the following paths (an example copy follows the list):
- mllm/src/backends/qnn/qualcomm_ai_engine_direct_220/
- mllm/src/backends/qnn/HexagonSDK/
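For instance, assuming qpm-cli extracted the SDKs under /opt/qcom (its usual default; your location may differ), the copy could look like:

```bash
# The /opt/qcom source paths are assumptions; point them at wherever
# qpm-cli actually extracted the two SDKs on your machine.
cp -r /opt/qcom/aistack/qnn/<version>/* mllm/src/backends/qnn/qualcomm_ai_engine_direct_220/
cp -r /opt/qcom/Hexagon_SDK/<version>/* mllm/src/backends/qnn/HexagonSDK/
```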
To use QNN offload, the CPU and HTP QNN op packages are needed. The following commands build the op packages used by the project:
```bash
export QNN_SDK_ROOT=mllm/src/backends/qnn/qualcomm_ai_engine_direct_220/
export ANDROID_NDK_ROOT=/path/to/your/ndk
export PATH=$PATH:$ANDROID_NDK_ROOT
source mllm/src/backends/qnn/HexagonSDK/setup_sdk_env.source
source $QNN_SDK_ROOT/bin/envsetup.sh
cd mllm/src/backends/qnn/LLaMAOpPackageHtp/LLaMAPackage/
make htp_aarch64 && make htp_v75
```
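If the two make targets succeed, the op package shared libraries should appear under the package's build directory. A quick sanity check (the exact subdirectory layout depends on the SDK version, so this is only a sketch):

```bash
# List whatever shared libraries the op package build produced;
# expect one aarch64 (CPU-side) and one hexagon-v75 (HTP-side) library.
find build -name "*.so"
```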
Then build the mllm target with the QNN backend:
```bash
cd ../script
./build_qnn_android.sh
```
Currently there are two modeling styles: the Module API and the old implementation. The Module API demo, written in a user-friendly style, is in examples/demo_qwen_npu.cpp; the old implementation, which supports chunked pipeline prefilling, is in examples/main_qwen_npu.cpp.
Download the models from the mllmTeam/qwen-1.5-1.8b-chat-mllm repository on Hugging Face, or use the following commands:
```bash
mkdir ../models && cd ../models
# Download the int8 model used by the NPU and the q4k model used by the CPU
wget https://huggingface.co/mllmTeam/qwen-1.5-1.8b-chat-mllm/resolve/main/qwen-1.5-1.8b-chat-int8.mllm?download=true -O qwen-1.5-1.8b-chat-int8.mllm
wget https://huggingface.co/mllmTeam/qwen-1.5-1.8b-chat-mllm/resolve/main/qwen-1.5-1.8b-chat-q4k.mllm?download=true -O qwen-1.5-1.8b-chat-q4k.mllm
```
Run on an Android phone with at least 16 GB of memory:
```bash
cd ../script
./run_qwen_npu.sh
```
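The script is expected to handle pushing the binary and models to the device over adb. If you need to deploy manually instead, a sketch along the following lines should work; the /data/local/tmp layout and the bin-arm output directory are assumptions, so adjust them to your build:

```bash
# Device paths and the host-side output directory are assumptions.
adb shell mkdir -p /data/local/tmp/mllm/bin /data/local/tmp/mllm/models
adb push ../bin-arm/main_qwen_npu /data/local/tmp/mllm/bin/
adb push ../models/qwen-1.5-1.8b-chat-int8.mllm /data/local/tmp/mllm/models/
adb push ../models/qwen-1.5-1.8b-chat-q4k.mllm /data/local/tmp/mllm/models/
adb shell "cd /data/local/tmp/mllm/bin && ./main_qwen_npu -s 64 -c 0"
```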
The executable takes two arguments:
- -s sets the prefill sequence length; the default in the provided demo is 64.
- -c selects the QNN prefill mode: when set to 1, the input is split into 256-token chunks and executed as a pipeline; when set to 0, the input is executed in a single chunk.
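For example, to prefill a 512-token prompt in the two modes:

```bash
# Chunked pipeline prefill: the input is split into 256-token chunks.
./main_qwen_npu -s 512 -c 1
# Single-chunk prefill: the whole input is executed at once.
./main_qwen_npu -s 512 -c 0
```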
The result is as follows:
```
> ./main_qwen_npu -s 512 -c 1
[Q] <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Give me a short introduction to large language model.<|im_end|>
<|im_start|>assistant
[A] The large language model is a type of artificial intelligence that is designed to generate human-like text based on the input it receives It is typically trained on large datasets of text, such as books, articles, and web pages, and uses statistical models to learn patterns and relationships in the data The goal of a large language model is to generate text that is coherent
```