ZEnfrence

A toy project for learning how inference works in Transformer LLMs.

Usage

Install the requirements:

poetry install

Download the model weights (Llama-2-7b-chat-hf is the only model supported at the moment) using the Hugging Face CLI:

huggingface-cli download \
  meta-llama/Llama-2-7b-chat-hf \
  --include "*.safetensors" \
  --local-dir . \
  --local-dir-use-symlinks False
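
Equivalently, the weights can be fetched from Python via huggingface_hub's snapshot_download (a sketch of an alternative, not part of this repo's code):

```python
from huggingface_hub import snapshot_download

# Download only the safetensors weight shards into the current directory.
snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    allow_patterns=["*.safetensors"],
    local_dir=".",
)
```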

Run inference:

poetry run python3 main.py inference --prompt "What is the meaning of life?"

This will be very slow; the implementation is not optimized in any way. It takes around 30 seconds per token on my machine (an M1 MacBook Pro).
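
For context, the simplest possible decode loop looks roughly like the sketch below: it re-runs the full forward pass over the entire sequence for every generated token, which is one reason unoptimized implementations are slow. This is illustrative only; `model`, `tokenizer`, and greedy decoding here are assumptions, not this repo's actual API.

```python
import torch

@torch.no_grad()
def naive_generate(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    # Hypothetical HF-style model/tokenizer objects, for illustration.
    ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        # No KV cache: attention is recomputed over the entire
        # sequence at every step, so each token gets more expensive.
        logits = model(ids).logits            # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()      # greedy pick at the last position
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0])
```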

Roadmap

  • Get a model to run inference (Llama-2-7b-chat-hf)
  • Implement top-k sampling (see the sketch after this list)
  • Implement top-p (nucleus) sampling (see the sketch after this list)
  • Optimize inference (CUDA Kernels, MPS Kernels, batching, sharding, etc.)
  • Implement Flash Attention 2
  • Add support for more models (MoE models, etc.)
  • Implement inference for embedding models
  • Implement inference using another library (e.g. JAX)
  • Implement inference using another language (e.g. Rust)
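
For the sampling items above: top-k keeps only the k most likely tokens, while top-p (nucleus) keeps the smallest set of tokens whose cumulative probability exceeds p; both renormalize and then sample. A minimal standalone sketch (PyTorch, not tied to this repo's code):

```python
import torch

def top_k_sample(logits: torch.Tensor, k: int = 50) -> int:
    # Keep the k highest-logit tokens, renormalize, and sample one.
    topk = torch.topk(logits, k)
    probs = torch.softmax(topk.values, dim=-1)
    return topk.indices[torch.multinomial(probs, 1)].item()

def top_p_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    # Nucleus sampling: keep the smallest set of tokens whose
    # cumulative probability exceeds p, renormalize, and sample one.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens whose exclusive cumulative probability already
    # exceeds p; the highest-probability token is always kept.
    sorted_probs[cumulative - sorted_probs > p] = 0.0
    sorted_probs /= sorted_probs.sum()
    return sorted_idx[torch.multinomial(sorted_probs, 1)].item()
```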
