CPU Support

Aphrodite supports CPU-only inference at relatively fast speeds. Currently, only CPUs with the AVX512 instruction set are supported. You can check whether your CPU has AVX512 by running the following in a terminal:

cat /proc/cpuinfo | grep avx512

If your CPU does not support AVX512 instructions, the command will not output anything.

Building

  1. Install system-wide dependencies
$ sudo apt-get update -y
$ sudo apt-get install -y gcc-12 g++-12  # you can skip this if you already have gcc/g++ >= 12.3.0 installed
$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
  2. Install the Python dependencies
$ pip install -U pip
$ pip install wheel packaging ninja "setuptools>=49.4.0" numpy
$ pip install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
  3. Build Aphrodite Engine
$ APHRODITE_TARGET_DEVICE=cpu python setup.py install
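
To sanity-check the CPU build, you can run a quick offline generation with a small model. This is a minimal sketch: it assumes the package exposes the usual offline-inference LLM class at the top level, and facebook/opt-125m is used purely as an example model.

$ python -c "from aphrodite import LLM; print(LLM(model='facebook/opt-125m').generate(['Hello, my name is']))"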

Usage

You can run the engine as usual. There are a few points to note:

  1. Use the environment variable APHRODITE_CPU_KVCACHE_SPACE to specify the amount of memory (in GiB) allocated for the KV cache. Larger values allow a higher degree of request parallelism but use more memory.

  2. The CPU backend uses OpenMP for thread-parallel computation. For the best performance, isolate the CPU cores used by the OpenMP threads from other thread pools (such as a web service's event loop) to avoid CPU oversubscription.

  3. If running on bare metal, you should probably disable hyper-threading.

  4. If you're on a multi-socket machine with NUMA, make sure the process uses only a single socket to avoid remote memory access. You can use numactl to do this; a combined example is shown after this list.
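
For illustration, here is how these settings might be combined for a run bound to NUMA node 0. The 40 GiB cache size, the 16-thread OpenMP cap, and the OpenAI-compatible server entrypoint are example assumptions; adjust them (and the model) for your machine.

# Reserve 40 GiB of memory for the KV cache (example value)
$ export APHRODITE_CPU_KVCACHE_SPACE=40
# Cap OpenMP to the cores set aside for inference (example value)
$ export OMP_NUM_THREADS=16
# Bind both CPU and memory allocation to NUMA node 0
$ numactl --cpunodebind=0 --membind=0 python -m aphrodite.endpoints.openai.api_server --model <your model>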
