Why Is bitnet.cpp Not Running My 1.58-Bit Model at the Expected Speed on CPU? #182
-
I'm using bitnet.cpp to run inference with a 1.58-bit LLM (BitNet-b1.58-2B-4T) on my x86 CPU. I followed the setup instructions and ran the model with run_inference.py, but I'm getting much slower speeds than reported: barely 2 tokens/sec instead of the 5-7 tokens/sec mentioned in the documentation. I've already tried adjusting the thread count with the --threads flag. Could this be caused by missing kernel optimizations, a quant-type mismatch, or CPU compatibility issues? How do I troubleshoot this and get the advertised performance?
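For reference, this is roughly what I ran, following the README (the repo name and model path are the defaults from setup_env.py; the prompt and token counts are just examples):

```bash
# One-time setup: download the GGUF model and build the kernels
python setup_env.py --hf-repo microsoft/BitNet-b1.58-2B-4T-gguf -q i2_s

# Inference run where I see ~2 tokens/sec; -t is the thread count I varied
python run_inference.py \
    -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -p "You are a helpful assistant" \
    -t 8 -n 128
```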
Replies: 1 comment
-
If you're seeing slower-than-expected inference speeds, the usual culprits are a kernel mismatch or missing build optimizations. A few things to check:

- Quantization type: make sure the --quant-type you pass (i2_s or tl1) matches the model you downloaded.
- Pre-tuned kernels: verify you're on the latest release of bitnet.cpp and that you built with the pre-tuned kernel parameters (--use-pretuned); these significantly affect performance.
- Kernel compatibility: on x86 CPUs, only certain models support the fastest kernel (TL2), so check the kernel compatibility table in the README.
- Instruction sets: confirm your CPU supports AVX2 or AVX512; performance drops sharply without them.
- Compiler and threads: rebuilding with Clang 18+ and running with 4 or more threads usually gets you close to the advertised speeds.

A minimal sketch of these steps is below.
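For concreteness, here's a rough sketch of a clean re-setup and run, assuming the 2B-4T GGUF repo name and the default model directory layout from the README (adjust the paths, quant type, and thread count for your machine):

```bash
# setup_env.py also rebuilds the kernels, so make sure Clang 18+ is the
# compiler it picks up before running it.
# Re-download/convert the model and build with pre-tuned kernel parameters;
# the quant type must match the model variant you fetched.
python setup_env.py --hf-repo microsoft/BitNet-b1.58-2B-4T-gguf \
    -q i2_s --use-pretuned

# Confirm the CPU actually exposes AVX2/AVX512.
grep -oE 'avx2|avx512[a-z]*' /proc/cpuinfo | sort -u

# Run with an explicit thread count (4+ recommended).
python run_inference.py \
    -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -p "You are a helpful assistant" \
    -t 4 -n 128
```

If it's still slow after that, the benchmark script bundled with the repo (utils/e2e_benchmark.py on recent checkouts) gives a cleaner tokens/sec number than eyeballing chat output, which makes it easier to compare against the figures in the README.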