# New llama.cpp build - 0.0.3989 (upstream tag: b3989)

This is a major update to our llama.cpp package and build, and includes the addition of the llama.cpp-tools and gguf packages to the feedstock.

- Feedstock: https://github.com/AnacondaRecipes/llama.cpp-feedstock
- Upstream: https://github.com/ggerganov/llama.cpp

## High level changes from 0.0.3747

- Fixed the Windows build scripts to work with the new GGML-style arguments and added missing flags to the Windows CMake build.
- Moved from CMAKE_CUDA_ARCHITECTURES=all to CMAKE_CUDA_ARCHITECTURES=all-major to cut down on size and build time (a configure sketch follows this list).
  - The all target builds micro architecture versions in addition to the major ones; this does not increase package compatibility and is unnecessary.
- Moved to using GitHub tags instead of build tarballs. This allows the feedstock to align more easily with upstream releases, and it lets the git-dependent make logic in upstream work correctly for version number injection.
- Added v2 (AVX) builds for Linux and Windows to improve our compatibility and performance story.
  - This is important for Windows: AVX2 can imply f16c support, but f16c is not universally available on all chipsets that support AVX2, so machines without f16c may need to fall back to the AVX (v2) build.
- Moved the feedstock to a meta-package build with multiple outputs, so the feedstock now emits the llama.cpp, llama.cpp-tools and gguf packages.
  - All three packages come from the same upstream source/tag, which is more in line with the intended design of llama.cpp as a single monorepo with minimal external dependencies.
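For reference, a minimal configure sketch showing the new GGML-style arguments and the all-major architecture list. This is illustrative only; the feedstock's actual build scripts set additional flags beyond these:

```bash
# Illustrative configure step only; the feedstock's real build scripts
# pass more flags than this.
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=all-major
cmake --build build --config Release
```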

## Packages included in llama.cpp feedstock (llama.cpp-meta)

- llama.cpp is the main package for the core binaries that ship with llama.cpp (a quick usage example follows this package list), including:

  | Binary | Description |
  | --- | --- |
  | llama-batched[.exe] | Batched inference interface |
  | llama-batched-bench[.exe] | Batched inference benchmarking tool |
  | llama-bench[.exe] | Benchmarking interface |
  | llama-cli[.exe] | Command line interface |
  | llama-convert-llama2c[.exe] | llama2.c to GGML conversion tool |
  | llama-cvector-generator[.exe] | Control vector generation tool |
  | llama-embedding[.exe] | Embedding generation interface |
  | llama-eval-callback[.exe] | Evaluation callback testing tool |
  | llama-export-lora[.exe] | LoRA export tool |
  | llama-gbnf-validator[.exe] | GBNF grammar validation tool |
  | llama-gen-docs[.exe] | Documentation generation tool |
  | llama-gguf[.exe] | GGUF conversion interface |
  | llama-gguf-hash[.exe] | GGUF hash generation tool |
  | llama-gguf-split[.exe] | GGUF file splitting tool |
  | llama-gritlm[.exe] | GritLM interface |
  | llama-imatrix[.exe] | Importance matrix (imatrix) generation tool for quantization |
  | llama-infill[.exe] | Text infilling interface |
  | llama-llava-cli[.exe] | LLaVA command line interface |
  | llama-lookahead[.exe] | Lookahead inference tool |
  | llama-lookup[.exe] | Lookup table interface |
  | llama-lookup-create[.exe] | Lookup table creation tool |
  | llama-lookup-merge[.exe] | Lookup table merging tool |
  | llama-lookup-stats[.exe] | Lookup table statistics tool |
  | llama-minicpmv-cli[.exe] | MiniCPM-V command line interface |
  | llama-parallel[.exe] | Parallel inference interface |
  | llama-passkey[.exe] | Passkey retrieval test tool |
  | llama-perplexity[.exe] | Perplexity calculation tool |
  | llama-q8dot[.exe] | Q8 dot product calculation tool |
  | llama-quantize[.exe] | Model quantization interface |
  | llama-quantize-stats[.exe] | Quantization statistics tool |
  | llama-retrieval[.exe] | Retrieval interface |
  | llama-save-load-state[.exe] | State saving and loading tool |
  | llama-server[.exe] | Server interface |
  | llama-simple[.exe] | Simple inference interface |
  | llama-simple-chat[.exe] | Simple chat interface |
  | llama-speculative[.exe] | Speculative inference tool |
  | llama-tokenize[.exe] | Tokenization tool |
  | llama-vdot[.exe] | Vector dot product calculation tool |
- llama.cpp-tools is a package that includes useful scripts and model conversion tools that ship with llama.cpp (a conversion example follows this package list), including:

  | Script | Description |
  | --- | --- |
  | llama-convert-hf-to-gguf | Convert from Hugging Face safetensors to GGUF |
  | llama-convert-llama-ggml-to-gguf | Convert from GGML to GGUF |
  | llama-convert-lora-to-gguf | Convert from LoRA to GGUF |
  | llama-lava-surgery | LLaVA surgery tool |
  | llama-lava-surgery_v2 | LLaVA surgery tool v2 |
  | llama-convert-image-encoder-to-gguf | Convert from image encoder to GGUF |
- gguf is a Python package for writing binary files in the GGUF (GGML Universal File) format, and includes the following tools:

  | Script | Description |
  | --- | --- |
  | gguf-convert-endian | Convert the endianness of a GGUF file |
  | gguf-dump | Dump the contents of a GGUF file |
  | gguf-set-metadata | Set the metadata of a GGUF file |
  | gguf-new-metadata | Create a new metadata object for a GGUF file |
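As a quick smoke test after installing llama.cpp, llama-cli can run a short generation against a local GGUF model; the model path below is a placeholder:

```bash
# Placeholder model path; substitute any local GGUF model.
llama-cli -m ./models/model.gguf -p "Hello, my name is" -n 32
```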
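Similarly, converting a local Hugging Face checkout to a 16-bit GGUF file with llama.cpp-tools might look like the following sketch; the paths are placeholders, and the --outfile/--outtype flags come from the upstream convert_hf_to_gguf.py script that this wrapper exposes:

```bash
# Placeholder paths; point at a directory containing a Hugging Face model.
llama-convert-hf-to-gguf ./models/my-hf-model --outfile my-model-f16.gguf --outtype f16
```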

Note: The llama.cpp-tools and gguf packages are pure Python, so there are no CUDA-enabled or hardware-optimized builds of them.

| Package | Minimum Python Version | Spec |
| --- | --- | --- |
| llama.cpp-tools | 3.9 | llama.cpp-tools==0.0.3989* |
| gguf | 3.9 | gguf==0.10.0* |
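Example install command for the Python packages, assuming the same ai-staging channel used for the llama.cpp examples below:

```bash
conda install -c ai-staging llama.cpp-tools=0.0.3989 gguf=0.10.0
```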

## llama.cpp builds available

For llama.cpp, the following build types are available; this includes hardware-optimized CPU builds as well as GPU (CUDA) and Metal (MPS) builds.

Note: GPU-enabled builds depend on CUDA 12.4 and will automatically install the CUDA toolkit from the defaults channel.

### macOS

| Build Type | Architecture | Spec | Notes |
| --- | --- | --- | --- |
| cpu_v1_accelerate | x86-64 | llama.cpp==0.0.3989=cpu_v1_accelerate* | Pre-M1/M2 (Intel) Macs |
| mps | Apple MPS (arm64) | llama.cpp==0.0.3989=mps* | M1/M2+ Macs |
- For macOS, the Metal-enabled mps build is recommended for all users running M1, M2 or later Macs.
- For pre-M1/M2 (Intel) Macs, the cpu_v1_accelerate build is recommended.
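If you are unsure which architecture your Mac reports, a quick check (standard macOS, not specific to this package):

```bash
# Prints arm64 on M1/M2+ Macs (use the mps build) and x86_64 on
# Intel Macs (use the cpu_v1_accelerate build).
uname -m
```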

Example install command for the metal-enabled mps build:

```
conda install -c ai-staging llama.cpp=0.0.3989=mps*
```

### Linux-64

| Build Type | Architecture | Spec |
| --- | --- | --- |
| cpu_v1_mkl | Intel MKL, SSE2 | llama.cpp==0.0.3989=cpu_v1_mkl* |
| cpu_v1_openblas | OpenBLAS, SSE2 | llama.cpp==0.0.3989=cpu_v1_openblas* |
| cpu_v2_mkl | Intel MKL, AVX | llama.cpp==0.0.3989=cpu_v2_mkl* |
| cpu_v2_openblas | OpenBLAS, AVX | llama.cpp==0.0.3989=cpu_v2_openblas* |
| cpu_v3_mkl | Intel MKL, AVX2 | llama.cpp==0.0.3989=cpu_v3_mkl* |
| cpu_v3_openblas | OpenBLAS, AVX2 | llama.cpp==0.0.3989=cpu_v3_openblas* |
| cuda124_v1 | CUDA 12.4, SSE2 | llama.cpp==0.0.3989=cuda124_v1* |
| cuda124_v2 | CUDA 12.4, AVX | llama.cpp==0.0.3989=cuda124_v2* |
| cuda124_v3 | CUDA 12.4, AVX2 | llama.cpp==0.0.3989=cuda124_v3* |
- The cpu_v* builds are optimized for different CPU instruction sets (SSE2, AVX, AVX2) and use the Intel MKL or OpenBLAS libraries.
- The cuda124_v* builds are optimized for NVIDIA GPUs and use CUDA 12.4.

#### Determining which CPU-optimized build to use on Linux

To determine which CPU-optimized build to use on Linux, you can use the following command to show the CPU architecture flags:

```
lscpu | grep -Eo 'sse2|avx2|avx512[^ ]*|avx\s'
```

You should see output like the following:

```
# lscpu | grep -Eo 'sse2|avx2|avx512[^ ]*|avx\s'
sse2
avx
avx2
avx512f
avx512dq
avx512cd
avx512bw
avx512vl
avx512_vnni
```
- If you see sse2 in the output, but not avx, you should use the cpu_v1_mkl or cpu_v1_openblas variant.
- If you see avx but not avx2 in the output, you should use the cpu_v2_mkl or cpu_v2_openblas variant.
- If you see avx2 in the output, you should use the cpu_v3_mkl or cpu_v3_openblas variant.
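Putting the rules above together, a minimal sketch that maps /proc/cpuinfo flags to the recommended variant; it also reports f16c, which is relevant to the AVX fallback note in the changes above:

```bash
# Minimal sketch: pick the most optimized CPU variant this machine supports.
# grep -w matches whole words, so 'avx' will not match 'avx2' or 'avx512f'.
if grep -qw avx2 /proc/cpuinfo; then
  echo "use cpu_v3_mkl or cpu_v3_openblas"
elif grep -qw avx /proc/cpuinfo; then
  echo "use cpu_v2_mkl or cpu_v2_openblas"
else
  echo "use cpu_v1_mkl or cpu_v1_openblas"
fi
grep -qw f16c /proc/cpuinfo && echo "f16c: supported" || echo "f16c: not supported"
```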

Note: The _openblas builds may improve prompt-processing performance at batch sizes above 32 (the default is 512). CPU-only BLAS implementations do not affect normal generation performance, so do not use these builds unless you are optimizing or benchmarking prompt processing.
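If you do want to compare the _mkl and _openblas variants on prompt processing, one option is llama-bench, which ships in the llama.cpp package. A hedged example (the model path is a placeholder; -p sets the prompt length, -n 0 skips generation, and -b sets the batch size):

```bash
# Placeholder model path; run under each installed variant and compare
# the prompt-processing (pp) results.
llama-bench -m ./models/model.gguf -p 512 -n 0 -b 256
```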

#### Determining which GPU-optimized build to use on Linux

If you are using an NVIDIA GPU supported by CUDA 12.4, you should use the cuda124_v* variants. You can confirm and view your GPU information using the nvidia-smi command if you already have the CUDA drivers/utilities installed:

```
[root@localhost ~]# nvidia-smi
Thu Nov 14 17:00:39 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:1E.0 Off |                    0 |
| N/A   26C    P0             27W /   70W |       0MiB /  15360MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

If you do not have the CUDA drivers / utilities installed, you can check your GPU by looking at the lspci output:

```
# lspci | grep -i nvidia
0000:00:1e.0: NVIDIA Corporation Device 1eb0 (rev a1)
```

For this Linux machine, the GPU is a Tesla T4 and the CUDA version is 12.4, and the CPU supports SSE2, AVX, AVX2, and AVX512, so we would install the cuda124_v3 variant to enable both GPU acceleration and AVX2 instructions.

Example install command for the cuda124_v3 build:

```
conda install -c ai-staging llama.cpp=0.0.3989=cuda124_v3*
```

### Windows-64

| Build Type | Architecture | Spec |
| --- | --- | --- |
| cpu_v1_mkl | Intel MKL, SSE2 | llama.cpp==0.0.3989=cpu_v1_mkl* |
| cpu_v2_mkl | Intel MKL, AVX | llama.cpp==0.0.3989=cpu_v2_mkl* |
| cpu_v3_mkl | Intel MKL, AVX2 | llama.cpp==0.0.3989=cpu_v3_mkl* |
| cuda124_v1 | CUDA 12.4, SSE2 | llama.cpp==0.0.3989=cuda124_v1* |
| cuda124_v2 | CUDA 12.4, AVX | llama.cpp==0.0.3989=cuda124_v2* |
| cuda124_v3 | CUDA 12.4, AVX2 | llama.cpp==0.0.3989=cuda124_v3* |

#### Determining which CPU-optimized build to use on Windows

To determine which CPU-optimized build to use on Windows, you can use the following commands to show the CPU architecture flags on the command line:

```
conda install py-cpuinfo
python -c "import cpuinfo; flags=cpuinfo.get_cpu_info()['flags']; print('\n'.join(f for f in flags if f in ['sse2', 'avx', 'avx2'] or f.startswith('avx512')))"
```

You should see output like the following:

```
python -c "import cpuinfo; flags=cpuinfo.get_cpu_info()['flags']; print('\n'.join(f for f in flags if f in ['sse2', 'avx', 'avx2'] or f.startswith('avx512')))"
avx
avx2
avx512bw
avx512cd
avx512dq
avx512f
avx512vl
avx512vnni
sse2
```
- If you see sse2 in the output, but not avx, you should use the cpu_v1_mkl variant.
- If you see avx but not avx2 in the output, you should use the cpu_v2_mkl variant.
- If you see avx2 in the output, you should use the cpu_v3_mkl variant.
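The same py-cpuinfo output can drive the choice directly; a small sketch mirroring the rules above:

```
python -c "import cpuinfo; f=set(cpuinfo.get_cpu_info()['flags']); print('use cpu_v3_mkl' if 'avx2' in f else 'use cpu_v2_mkl' if 'avx' in f else 'use cpu_v1_mkl')"
```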

#### Determining which GPU-optimized build to use on Windows

If you are using an NVIDIA GPU supported by CUDA 12.4, you should use the cuda124_v* variants. You can confirm and view your GPU information using the nvidia-smi.exe command if you already have the CUDA drivers/utilities installed, or by using wmic:

```
wmic path win32_VideoController get name
```

Example output:

```
wmic path win32_VideoController get name
Name
Microsoft Basic Display Adapter
NVIDIA Tesla T4
```

For this Windows machine, the GPU is an NVIDIA Tesla T4 and the CUDA version is 12.4, and the CPU supports SSE2, AVX, AVX2, and AVX512, so we would install the cuda124_v3 variant to enable both GPU acceleration and AVX2 instructions.

Example install command for the cuda124_v3 build:

```
conda install -c ai-staging llama.cpp=0.0.3989=cuda124_v3*
```

## Why not use the upstream package(s) directly?

Within llama.cpp upstream there are several pyproject.toml files:

```
llama.cpp/
  pyproject.toml
  gguf-py/
    pyproject.toml
```

These packages are not well-maintained within upstream:

- The PyPI version of gguf is frequently out of date with what is in the llama.cpp repo.
- Changes to the underlying GGUF format exposed by the gguf package are not always reflected in the pyproject.toml file as a new version.
- llama.cpp-tools has a hard dependency on the gguf package, and the two must be pinned, built and updated together.
  - The versions of the tools and gguf packages determine which AI models we can support across our portfolio, so we must ensure they are in lock-step from the same release tag.
- The main pyproject.toml defines a llama.cpp-scripts package that is out of date relative to the rest of the repository.
  - It is missing dependencies, and the ranges for the dependencies it does list are not aligned with what is needed.
  - It omits several useful scripts that are present in the repository.

llama.cpp has evolved over time into more of a monorepo structure that tries to minimize the entry points into the project for setting up environments and resolving dependencies. The recommended way to install the Python tooling is:

```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
```

The pip install -r requirements.txt step then installs all requirements listed in the /requirements/ directory, including:

```
requirements-all.txt
requirements-compare-llama-bench.txt
requirements-convert_hf_to_gguf.txt
requirements-convert_hf_to_gguf_update.txt
requirements-convert_legacy_llama.txt
requirements-convert_llama_ggml_to_gguf.txt
requirements-convert_lora_to_gguf.txt
requirements-pydantic.txt
requirements-test-tokenizer-random.txt
```

This installs all dependencies for all Python code within the llama.cpp repo, including gguf, the conversion tools, and everything in the /examples and /tests directories.

To work around this, we have added the llama.cpp-tools and gguf packages to the feedstock, which include the necessary dependencies for the Python tooling and the gguf package. The llama.cpp-tools package ignores the upstream pyproject.toml and defines a corrected -tools package that includes the conversion tools and pins to the version of the gguf package built as part of the feedstock, ensuring the two stay in sync.

As other tools and scripts are added to the repo, the current approach will allow us to add them ad-hoc to the llama.cpp-tools package.

## TODO

- CHORE: Update to the latest upstream tag and build - more breaking changes came in after b3989.
- BUG: Identify and root-cause why the upstream make logic that injects the version number from the git repo is not matching the tag that conda-build is using to check out the source.
  - For example, the build number should be 3989, but for some reason the make logic is setting it to 3991.
- DOCS: Add a formal CHANGELOG.md and collapse the existing release notes into it.
- DOCS: Add a /docs/ directory to the feedstock to document the packages, processes and thinking.
- DOCS: Update the README.md; note that it no longer reflects the conda-forge version(s).
- FEATURE: Add the v4 (AVX512) builds for Linux and Windows (this may restrict build hosts to only AVX512-capable machines).
  - On AMD chips, AVX512 performance is supposed to be much better than on Intel.
- FEATURE: Add the /server/bench and /server/html assets to the llama.cpp-tools package (but this would depend on llama.cpp).
  - Where do html and other assets belong, especially if the server binary is in llama.cpp?
- INVESTIGATE: Patch the upstream pyproject.toml files to align them with the rest of the repository and add the missing scripts/tools.
  - This would allow us to remove the llama.cpp-tools package from the feedstock and just execute the install via pip.
  - We would have to regenerate the patch for any new tool/script we wanted to add to -tools|-scripts.
  - It would need to pull in several directories under /examples.
  - We may be able to extend this to add in the html/server assets and /examples/server/bench/... assets.