mllm

fast and lightweight multimodal LLM inference engine for mobile and edge devices

| Arm CPU | X86 CPU | Qualcomm NPU(QNN) |

Plain C/C++ implementation without dependencies
Optimized for multimodal LLMs like Qwen2-VL and LLaVA
Supported: ARM NEON, x86 AVX2, Qualcomm NPU (QNN), etc.
Various quantization schemes
End-to-end Android app demo
Advanced support: MoE, Prompt Cache, etc.

mllm is a lightweight, fast, and easy-to-use (multimodal) on-device LLM inference engine for mobile devices (mainly supporting CPU/NPU), initiated by the research groups led by Mengwei Xu (BUPT) and Xuanzhe Liu (PKU).

Recent updates

[2024 November 21] Support new model: Phi 3 Vision #186
[2024 August 30] Support new model: MiniCPM 2B #132
[2024 August 15] Support new model: Phi 3 mini #119
[2024 August 10] Support Qualcomm NPU: #112 | try it out | paper

Android Demo

Android Intent Invocation	Image Understanding
PhoneLM_Call.mp4	Fuyu.mp4
Chat CPU	Chat NPU
QWen1.5_Chat_CPU.mp4	QWen1.5_Chat_NPU.mp4

Supported models

Language models

Model	CPU FP32	CPU INT4	Hexagon NPU INT8
LLaMA 2 7B	✔️	✔️
LLaMA 3 1B	✔️	✔️
LLaMA 3 3B	✔️	✔️
Alpaca 7B	✔️	✔️
TinyLLaMA 1.1B	✔️	✔️
LLaVA 7B	✔️	✔️
Gemma 2B	✔️	✔️
Gemma 2 2B	✔️	✔️
Qwen 1.5 0.5B	✔️	✔️
Qwen 1.5 1.8B	✔️	✔️	✔️
Qwen 2.5 1.5B	✔️	✔️
Qwen 3 0.6B	✔️	✔️
Mistral 7B	✔️	✔️
Yi 6B	✔️	✔️
StableLM 2 1.6B	✔️	✔️
OPT 1.3B	✔️	✔️
Phi 3 mini 3.8B	✔️	✔️
MiniCPM 2B	✔️	✔️
MiniCPM 3 4B	✔️	✔️
MiniCPM MoE 8x2B	✔️	✔️
SmolLM 1.7B	✔️	✔️
DCLM 1B	✔️	✔️
OpenELM 1.1B	✔️	✔️
PhoneLM 1.5B	✔️	✔️	✔️

Multimodal models

Model	CPU FP32	CPU INT4
Fuyu 8B	✔️	✔️
Vision Transformer	✔️	✔️
CLIP	✔️	✔️
ImageBind (3 modalities)	✔️	✔️
LLaVA 7B	✔️	✔️
Phi-3-Vision	✔️	✔️
Qwen2-VL 2B	✔️	✔️

Quick Start

Get the Code

git clone https://github.com/UbiquitousLearning/mllm
cd mllm

Check prerequisites

Building mllm requires the following tools:

gcc (11.4+) / clang (11.0+)
CMake >= 3.18
Android NDK Toolchains >= 26

Note that building the OpenMP libs on macOS may fail with the Apple LLVM compiler, so OpenMP is disabled on macOS by default and you may see slower performance there. We recommend building mllm on Linux.

Run Qwen with Hexagon NPU acceleration using QNN

NOTE: The QNN backend is a preliminary version that can do end-to-end inference. It is still under active development for better performance and more supported models.

We support running Qwen-1.5-1.8B-Chat using Qualcomm QNN to get Hexagon NPU acceleration on devices with Snapdragon 8 Gen3. Details of the QNN environment setup and design are here. The prefilling stage is performed by QNN and CPU, and the inference stage is performed by CPU.

Build the target with QNN backend.

cd scripts
./build_qnn_android.sh

Download the model from here, or use the following commands:

mkdir ../models && cd ../models
# Download the int8 model used by the NPU and the q4k model used by the CPU
wget "https://huggingface.co/mllmTeam/qwen-1.5-1.8b-chat-mllm/resolve/main/qwen-1.5-1.8b-chat-int8.mllm?download=true" -O qwen-1.5-1.8b-chat-int8.mllm
wget "https://huggingface.co/mllmTeam/qwen-1.5-1.8b-chat-mllm/resolve/main/qwen-1.5-1.8b-chat-q4k.mllm?download=true" -O qwen-1.5-1.8b-chat-q4k.mllm

Run on an Android phone with at least 16GB of memory.

cd ../scripts
./run_qwen_npu.sh

The executable takes two arguments. -s sets the sequence length for prefilling; the default value is 64 in the demo we provide. -c selects the QNN prefilling mode: when it is set to 1, the input is split into chunks of sequence length 32 that are executed in a pipeline; when it is set to 0, the input is executed as a single chunk.
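The difference between the two -c modes can be illustrated with a small Python sketch. It only models how an input would be divided; `plan_prefill` is a hypothetical helper and not part of mllm, and it does not model padding or the actual pipelining between QNN and CPU.

```python
from typing import List, Sequence


def plan_prefill(tokens: Sequence[int], chunked: bool,
                 chunk_size: int = 32) -> List[List[int]]:
    """Return the chunks a prefill pass would process.

    chunked=True mirrors `-c 1` (fixed-size chunks);
    chunked=False mirrors `-c 0` (the whole input as one chunk).
    """
    tokens = list(tokens)
    if not chunked:
        return [tokens]
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


if __name__ == "__main__":
    prompt = list(range(64))  # stand-in for 64 token ids (the demo's -s default)
    print([len(c) for c in plan_prefill(prompt, chunked=True)])   # [32, 32]
    print([len(c) for c in plan_prefill(prompt, chunked=False)])  # [64]
```

With the default -s 64, chunked mode therefore yields two chunks of 32, which is what gives a pipeline something to overlap.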

The results are as follows:

> ./main_qwen_npu -s 64 -c 1
[Q] <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Give me a short introduction to large language model.<|im_end|>
<|im_start|>assistant

[A] A short introduction to a large language model is a type of artificial intelligence language model that is designed to understand and generate human language text. These models are typically trained on large amounts of text data, such as books, articles, and other written materials, to learn the patterns and structures of human language. They use a combination of natural language processing (NLP)

Run with the CPU of Android

Build

export ANDROID_NDK=/path/to/your/ndk
cd scripts
./build_android.sh

Run Fuyu-8B

Download the model from here, or use the following commands:

mkdir ../models && cd ../models
# Download fuyu-8b-q4_k.mllm
wget "https://huggingface.co/mllmTeam/fuyu-8b-mllm/resolve/main/fuyu-8b-q4_k.mllm?download=true" -O fuyu-8b-q4_k.mllm

Run on an Android phone with at least 12GB of memory.

cd ../scripts
./run_fuyu.sh

The results are as follows:

> ./demo_fuyu
[Q] [../assets/bus.png]Generate a coco-style caption.

[A]  A blue bus driving down▁the▁road next▁to a sidewalk.
[Q] [../assets/two_cats.jpg]What's this?

[A]  Two cats laying on▁top▁of a pink blanket.

Run LLaMA-2-7B

Download the model from here, or use the following commands:

mkdir ../models && cd ../models
# Download llama-2-7b-chat-q4_k.mllm
wget "https://huggingface.co/mllmTeam/llama-2-7b-mllm/resolve/main/llama-2-7b-chat-q4_k.mllm?download=true" -O llama-2-7b-chat-q4_k.mllm

Run on an Android phone with at least 12GB of memory.

cd ../scripts
./run_llama.sh

The results are as follows:

> ./demo_llama
[Q]  Hello, who are you?
[A]

I am a bot, I'm here to help answer your questions and provide information. I'm just an AI, I don't have personal experiences or feelings like humans do, but I'm here to help you with any questions you may have. Is there something specific you would like to know or talk about?
[Q]  What can you do?
[A]
I can help with a wide range of tasks, including:
  
* Answering questions on various topics, such as science, history, technology, and more.
* Providing information on different subjects, such as culture, entertainment, and lifestyle.
* Helping with language-related tasks, such as translating text or improving grammar and syntax.
* Generating text, such as articles, stories, or even entire books.
* Assisting with coding
[Q]  Please introduce Beijing University of Posts and Telecommunications.
[A]
Beijing University of Posts and Telecommunications (BUPT) is a public research university located in Beijing, China. It was established in 1955 and is one of the top universities in China, with a strong focus on engineering, technology, and telecommunications.
BUPT offers a wide range of undergraduate and graduate programs in fields such as computer science, electrical engineering, software engineering, and communication engineering. The university

Run ImageBind

Download the model from here, or use the following commands:

mkdir ../models && cd ../models
# Download imagebind_huge-q4_k.mllm
wget "https://huggingface.co/mllmTeam/imagebind_huge-mllm/resolve/main/imagebind_huge-q4_k.mllm?download=true" -O imagebind_huge-q4_k.mllm

Run on an Android phone with at least 4GB of memory.

cd ../scripts
./run_imagebind.sh

The results are as follows:

> ./demo_imagebind 
vision X text :
0.9985647 0.0013827 0.0000526 
0.0000365 0.9998636 0.0000999 
0.0000115 0.0083149 0.9916736 
vision X audio :
0.8054272 0.1228001 0.0717727 
0.0673458 0.8429284 0.0897258 
0.0021967 0.0015335 0.9962698
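
Each row of the matrices above sums to 1, which is consistent with a row-wise softmax over cross-modal similarity scores. The Python sketch below shows that computation on made-up embeddings. The embedding values and the helper names are hypothetical, and the demo's actual scoring code may differ.

```python
import math
from typing import List

Matrix = List[List[float]]


def similarity(a: Matrix, b: Matrix) -> Matrix:
    """Dot product between every row of `a` and every row of `b`."""
    return [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]


def softmax_rows(scores: Matrix) -> Matrix:
    """Normalize each row into a probability distribution."""
    out = []
    for row in scores:
        peak = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


# First matrix printed by the demo ("vision X text").
VISION_X_TEXT = [
    [0.9985647, 0.0013827, 0.0000526],
    [0.0000365, 0.9998636, 0.0000999],
    [0.0000115, 0.0083149, 0.9916736],
]

if __name__ == "__main__":
    # Each printed row sums to 1 (up to rounding).
    print([round(sum(row), 6) for row in VISION_X_TEXT])

    # Hypothetical embeddings: 3 images and 3 captions, 3 dimensions each.
    images = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
    texts = [[2.0, 0.1, 0.0], [0.0, 2.0, 0.2], [0.1, 0.0, 2.0]]
    for row in softmax_rows(similarity(images, texts)):
        print(" ".join(f"{p:.7f}" for p in row))
```

Read this way, a large diagonal entry means the i-th image is matched most strongly with the i-th text or audio input.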

Run for Linux

Build

cd scripts
./build.sh

Run Fuyu-8B

cd ./bin
./demo_fuyu -m ../models/fuyu-8b-q4_k.mllm -v ../vocab/fuyu_vocab.mllm

Run LLaMA-2-7B

cd ./bin
./demo_llama -m ../models/llama-2-7b-chat-q4_k.mllm -v ../vocab/llama2_vocab.mllm

Run ImageBind

cd ./bin
./demo_imagebind -m ../models/imagebind_huge-q4_k.mllm -v ../vocab/clip_vocab.mllm

Customization

Convert models

You can download models from here, or convert a PyTorch/safetensors model to an mllm model yourself.

cd tools/convertor
pip install -r ./requirements.txt

# for one file pytorch model
python converter.py --input_model=model.pth --output_model=model.mllm --type=torch

# for multi-file pytorch model
python converter.py --input_model=pytorch_model.bin.index.json --output_model=model.mllm --type=torch

# for one file safetensor model
python converter.py --input_model=model.bin --output_model=model.mllm --type=safetensor

# for multi-file safetensor model
python converter.py --input_model=model.safetensors.index.json --output_model=model.mllm --type=safetensor

Convert vocabulary

You can convert a vocabulary to an mllm vocabulary as follows.

cd tools/convertor
python vocab.py --input_file=tokenizer.json --output_file=vocab.mllm --type=Unigram

Quantize models

You can quantize an mllm model to an int4 model yourself. mllm supports only two quantization modes: Q4_0 and Q4_K.

cd bin
./quantize model.mllm model_q4_k.mllm Q4_K
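
For intuition about what a block-wise 4-bit mode such as Q4_0 does, here is a simplified Python sketch: weights are grouped into blocks, and each block stores one scale plus a 4-bit code per weight. The block size of 32 and the symmetric rounding rule are assumptions modeled on ggml-style Q4_0. The function names are hypothetical, and mllm's actual kernels and file layout may differ.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # assumption: 32 weights share one scale, as in ggml's Q4_0


def quantize_block(values: List[float]) -> Tuple[float, List[int]]:
    """Map one block of floats to (scale, 4-bit codes in 0..15)."""
    amax = max((abs(v) for v in values), default=0.0)
    scale = amax / 7.0 if amax > 0.0 else 1.0  # avoid dividing by zero
    codes = [max(0, min(15, round(v / scale) + 8)) for v in values]
    return scale, codes


def dequantize_block(scale: float, codes: List[int]) -> List[float]:
    """Recover approximate floats from a scale and its 4-bit codes."""
    return [(c - 8) * scale for c in codes]


def quantize(weights: List[float]) -> List[Tuple[float, List[int]]]:
    """Quantize a flat weight list block by block."""
    return [quantize_block(weights[i:i + BLOCK_SIZE])
            for i in range(0, len(weights), BLOCK_SIZE)]


if __name__ == "__main__":
    weights = [((i * 37) % 100 - 50) / 100.0 for i in range(64)]  # made-up data
    blocks = quantize(weights)
    restored = [w for scale, codes in blocks
                for w in dequantize_block(scale, codes)]
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{len(blocks)} blocks, worst absolute error {worst:.4f}")
```

In this sketch the reconstruction error of any weight is at most half of its block's scale. If each block stored 32 four-bit codes and one 16-bit scale, the cost would be (32 * 4 + 16) / 32 = 4.5 bits per weight.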

Roadmap

More backends like QNN
More models like PandaGPT
More optimizations like LUT-GEMM
More...

Documentation

See the documentation here for more information.

Contribution

Read the contribution guidelines before you contribute.

Acknowledgments

mllm reuses many low-level kernel implementations from ggml on ARM CPUs. It also uses stb and wenet for pre-processing images and audio. mllm has also benefited from the following projects: llama.cpp and MNN.

License

Overall Project License

This project is licensed under the terms of the MIT License. Please see the LICENSE file in the root directory for the full text of the MIT License.

Apache 2.0 Licensed Components

Certain components (wenet) of this project are licensed under the Apache License 2.0. These components are clearly identified in their respective subdirectories along with a copy of the Apache License 2.0. For the full text of the Apache License 2.0, please refer to the LICENSE-APACHE file located in the relevant subdirectories.

Citation

@inproceedings{xu2025fast,
  title={Fast On-device LLM Inference with NPUs},
  author={Xu, Daliang and Zhang, Hao and Yang, Liming and Liu, Ruiqi and Huang, Gang and Xu, Mengwei and Liu, Xuanzhe},
  booktitle={International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
  year={2025}
}
@misc{yi2023mllm,
  title = {mllm: fast and lightweight multimodal LLM inference engine for mobile and edge devices},
  author = {Rongjie Yi and Xiang Li and Zhenyan Lu and Hao Zhang and Daliang Xu and Liming Yang and Weikai Xie and Chenghua Wang and Xuanzhe Liu and Mengwei Xu},
  year = {2023},
  publisher = {mllm Team},
  url = {https://github.com/UbiquitousLearning/mllm}
}
