- vllm: A high-throughput and memory-efficient inference and serving engine for LLMs (a minimal usage sketch follows this list). [link][paper]
- bitsandbytes: 8-bit CUDA functions for PyTorch (see the 8-bit loading sketch after this list). [link]
- GPTQ-for-LLaMa: 4-bit quantization of LLaMA using GPTQ. [link]
- TinyChatEngine: On-Device LLM Inference Library. [link]
- LMOps: General technology for enabling AI capabilities with LLMs and MLLMs. [link]
- lit-gpt: Hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, and pre-training. Apache 2.0-licensed. [link]
- fastllm: A pure C++, cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models reach 10,000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models and runs smoothly on mobile devices. [link]
- llmtools: 4-Bit Finetuning of Large Language Models on One Consumer GPU. [link]
- torchdistill: A coding-free framework built on PyTorch for reproducible deep learning studies. 🏆 20 knowledge distillation methods presented at CVPR, ICLR, ECCV, NeurIPS, ICCV, etc. are implemented so far. 🎁 Trained models, training logs, and configurations are available to ensure reproducibility and support benchmarking. [link][paper]
- gpt4all: Open-source LLM chatbots that you can run anywhere. [link][paper]
- low_bit_llama: Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs. [link]
- exllama: A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights. [link]
- TinyLlama: The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens. [link]
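
As a quick illustration of the vllm entry above, here is a minimal offline-inference sketch using vLLM's `LLM` and `SamplingParams` API. The model name is only an example; any Hugging Face causal LM supported by vLLM can be substituted.

```python
from vllm import LLM, SamplingParams

# Example model; substitute any Hugging Face causal LM supported by vLLM.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["The capital of France is"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```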
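
Similarly, the bitsandbytes entry above is most commonly used through Hugging Face transformers' `BitsAndBytesConfig` to load a model with 8-bit weights. This is a minimal sketch assuming transformers and bitsandbytes are installed; the model name is only illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # illustrative model choice
quant_config = BitsAndBytesConfig(load_in_8bit=True)  # 8-bit weight quantization via bitsandbytes

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Quantized inference with 8-bit weights:", return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```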