
Commit b2e89a3

Arm AArch64: Documentation updates (ggml-org#9321)
* Arm AArch64: Documentation updates
* Update docs/build.md to include information on how to enable the Arm optimized gemm/gemv kernels
* Update examples/quantize/README.md with information on the Q4_0_4_4, Q4_0_4_8 and Q4_0_8_8 formats
* Add newline to the end of docs/build.md
1 parent daa9623 commit b2e89a3

2 files changed (+8, -0)


docs/build.md (+6)

@@ -380,3 +380,9 @@ For detailed info, such as model/device supports, CANN install, please refer to
 ### Android
 
 To read documentation for how to build on Android, [click here](./android.md)
+
+### Arm CPU optimized mulmat kernels
+
+Llama.cpp includes a set of optimized mulmat kernels for the Arm architecture, leveraging Arm® Neon™, int8mm and SVE instructions. These kernels are enabled at build time through the appropriate compiler cpu-type flags, such as `-DCMAKE_C_FLAGS=-march=armv8.2-a+i8mm+sve`. Note that these optimized kernels require the model to be quantized into one of the formats: `Q4_0_4_4` (Arm Neon), `Q4_0_4_8` (int8mm) or `Q4_0_8_8` (SVE). The SVE mulmat kernel specifically requires a vector width of 256 bits. When running on devices with a different vector width, it is recommended to use the `Q4_0_4_8` (int8mm) or `Q4_0_4_4` (Arm Neon) formats for better performance. Refer to [examples/quantize/README.md](../examples/quantize/README.md) for more information on the quantization formats.
+
+To support `Q4_0_4_4`, you must build with `GGML_NO_LLAMAFILE=1` (`make`) or `-DGGML_LLAMAFILE=OFF` (`cmake`).
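For illustration, a minimal build-and-quantize sketch of the above, assuming a CMake build on an i8mm-capable AArch64 host (model filenames are placeholders):

```bash
# Hypothetical example: enable the int8mm kernel path at build time.
# Adjust -march to match your hardware (add +sve only on machines with
# a 256-bit SVE vector width).
cmake -B build \
    -DCMAKE_C_FLAGS="-march=armv8.2-a+i8mm" \
    -DCMAKE_CXX_FLAGS="-march=armv8.2-a+i8mm"
cmake --build build --config Release

# The model must be quantized to the format matching the kernel,
# here Q4_0_4_8 for int8mm:
./build/bin/llama-quantize ggml-model-f16.gguf ggml-model-q4_0_4_8.gguf Q4_0_4_8
```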

examples/quantize/README.md (+2)

@@ -54,6 +54,8 @@ As the models are currently fully loaded into memory, you will need adequate dis
 
 Several quantization methods are supported. They differ in the resulting model disk size and inference speed.
 
+The quantization formats `Q4_0_4_4`, `Q4_0_4_8` and `Q4_0_8_8` are block interleaved variants of the `Q4_0` format, providing a data layout that is better suited for specific implementations of optimized mulmat kernels. Since these formats differ only in data layout, they have the same quantized size as the `Q4_0` format.
+
 *(outdated)*
 
 | Model | Measure | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
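As a quick illustration of the size claim, a hypothetical session (filenames are placeholders) quantizing one model into `Q4_0` and its interleaved variants should yield files of the same size:

```bash
# Hypothetical example; all four outputs should have the same size,
# since the interleaved formats only rearrange the Q4_0 data layout.
./llama-quantize ggml-model-f16.gguf model-q4_0.gguf     Q4_0
./llama-quantize ggml-model-f16.gguf model-q4_0_4_4.gguf Q4_0_4_4
./llama-quantize ggml-model-f16.gguf model-q4_0_4_8.gguf Q4_0_4_8
./llama-quantize ggml-model-f16.gguf model-q4_0_8_8.gguf Q4_0_8_8
ls -l model-q4_0*.gguf
```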
