For an operator to be eligible for fusion, it must meet the following conditions:
- It has only one input, excluding `Constant` and `initializer` type tensors.
- It has only one output.
- The first dimension of both input and output shapes is annotated with "batch_size".
Therefore, we must first perform more accurate shape inference, i.e., symbolic shape inference. Run the following command:

```shell
python ./tools/symbolic_shape_infer.py --input [input model path] --output [output model path]
```
- Download the onnxruntime project from https://github.com/microsoft/onnxruntime and build it from source. First, clone the repository and apply the patch:

```shell
git clone https://github.com/microsoft/onnxruntime.git
cd onnxruntime
git apply ./runtime/ort/changes.patches
```
- Install the Python package:

```shell
pip install -e .
```
We have currently implemented custom CPU ops (`Merge` and `Route`) for onnxruntime.
In the `./example/micro` directory, follow these instructions to run the microbenchmark:

```shell
cd example/micro
python generate.py
./convert.sh
python fuse.py --num 2
python fuse.py
python test_runtime.py
```
In the `./example/transformer` directory, follow these instructions to test the functionality. We use two decoder layers of the LLaMA model and its LoRA variant as our test models:

```shell
cd example/transformer
python generate.py
./convert.sh
python fuse.py
python test_runtime.py
```
- Generalize input assumptions to handle multiple inputs.
- Refactor the single Route Op into multiple specialized Route Ops.
- Fix height = 256 and width = 256 to observe the effect.