Collect information about the necessary tools and datasets for building a high-quality customized LLM
- FlashAttention: Github, Paper (fast, memory-efficient exact attention; usage sketch after this list)
- ALiBi: Paper (attention with linear position biases instead of positional embeddings; bias sketch after this list)
- Faster Inference
  - FasterTransformer: Github
- bitsandbytes: 8-bit optimizers and quantization, Github (optimizer sketch after this list)
- PEFT: Github, LoRA Paper (parameter-efficient fine-tuning with small trainable adapters such as LoRA; sketch after this list)
- QLoRA: GPU-memory-efficient fine-tuning by training LoRA adapters on top of a 4-bit quantized base model (sketch after this list)
- BLIP-2/LAVIS: connects a frozen image encoder to an LLM to enable multi-modal input, Paper (captioning sketch after this list)
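
A minimal sketch of calling the FlashAttention kernel directly through the flash-attn package. The `flash_attn_func` interface and the (batch, seqlen, heads, headdim) layout follow the flash-attn 2.x docs, so check them against the version you install; tensor sizes are illustrative.

```python
import torch
from flash_attn import flash_attn_func  # flash-attn 2.x interface; requires a CUDA GPU

batch, seqlen, nheads, headdim = 2, 1024, 16, 64
# FlashAttention expects (batch, seqlen, nheads, headdim) tensors in fp16/bf16 on the GPU.
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Exact attention computed tile-by-tile, without materializing the full seqlen x seqlen score matrix.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (batch, seqlen, nheads, headdim)
```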
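ALiBi drops positional embeddings and instead subtracts a per-head linear penalty from the attention logits. A small sketch of that bias using the geometric head slopes from the paper; the helper names are illustrative, not from any library, and the slope formula assumes the head count is a power of two.

```python
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    """Geometric slopes from the ALiBi paper, m_i = 2^(-8*i/n_heads), for n_heads a power of two."""
    start = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Bias added to attention logits: -m_h * (query_pos - key_pos); zero on and above the diagonal."""
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)    # (seq_len, seq_len)
    return -alibi_slopes(n_heads)[:, None, None] * distance  # (n_heads, seq_len, seq_len)

# scores: (batch, n_heads, seq_len, seq_len) attention logits before softmax
# scores = scores + alibi_bias(n_heads, seq_len)  # then apply the causal mask and softmax as usual
```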
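bitsandbytes' 8-bit Adam is a drop-in replacement for `torch.optim.Adam` that keeps the optimizer state quantized. A minimal sketch; the tiny Linear stands in for a real model and a CUDA GPU is assumed.

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()  # placeholder for a real model

# Optimizer state (exp_avg, exp_avg_sq) is stored in 8-bit, cutting optimizer memory roughly 4x vs. fp32 Adam.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)

loss = model(torch.randn(8, 4096, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```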
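With PEFT, LoRA freezes the base model and trains only small low-rank adapter matrices injected into selected Linear layers. A sketch against a Hugging Face causal LM; the model name, target module names, and hyperparameters are illustrative and depend on the architecture.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")  # illustrative base model

lora_cfg = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which Linear layers get adapters (architecture-dependent)
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the base parameters
```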
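QLoRA combines the two ideas above: the frozen base model is loaded in 4-bit NF4 via bitsandbytes and LoRA adapters are trained on top. A sketch using the transformers/peft integration; the checkpoint name is illustrative and `prepare_model_for_kbit_training` exists only in recent peft releases.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization from the QLoRA paper
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while weights stay 4-bit
)
base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                  # illustrative checkpoint
    quantization_config=bnb_cfg,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora_cfg)
# Only the LoRA adapters receive gradients; the 4-bit base weights stay frozen.
```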
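BLIP-2 bridges a frozen image encoder and a frozen LLM with a small trained Q-Former. A captioning sketch through the LAVIS API; the model/checkpoint names follow the LAVIS README, and the image path is a placeholder.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
# ViT image encoder + Q-Former + frozen OPT-2.7B; only the Q-Former was trained to align the two.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # placeholder image
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
print(model.generate({"image": image}))               # e.g. ["a photo of ..."]
```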
- Datasets (pre-training, instruction-tuning, and RLHF data; a loading sketch follows the table):
Name | Type | Language | License | Note |
---|---|---|---|---|
mc4 | Raw Text | multilingual (100+) | ODC-By | |
bloom | Raw Text | multilingual (46) | Depend on data | |
wmt22 | translation | multilingual | Depend on data | |
RedPajama | Raw Text | mostly EN | Apache 2.0 | |
WuDaoCorpora | Raw Text | zh-CN | | 5TB of raw text |
The Stack | Code | |||
The Flan Collection | Instruction | |||
openwebtext | Raw Text | EN | CC0 1.0 | used to train GPT-2 |
self-instruct-seed | Instruction | EN | Apache 2.0 | |
Stanford Alpaca | Instruction | EN | CC BY-NC 4.0 | |
Alpaca Cleaned | Instruction | EN | CC BY-NC 4.0 | |
ShareGPT Vicuna | Instruction | EN | 🤐 | Collected from ShareGPT |
evol_instruct_70k | Instruction | EN | CC BY-NC ? | Generated by Evol-Instruct |
HH-RLHF | RLHF | EN | MIT | |
databricks-dolly-15k | Instruction | EN | CC BY-SA 3.0 | |
GuanacoDataset | Instruction | EN, zh-CN, zh-TW, JA, DE | GPL 3.0 | designed for multilingual use |
dolly_hhrlhf | Instruction | EN | CC BY-SA 3.0 | MosaicML's filtered combination of HH-RLHF and databricks-dolly-15k |
HC3, Chinese | RLHF | EN, CN | CC BY-SA 3.0 | Paper, Github |
Lamini | Instruction | EN | CC-BY-4.0 | |
the_pile_books3 | Raw Text | mostly EN | MIT | part of the pile |
CoNaLa | Coding | EN | MIT | |
GPTeacher | Instruction | EN | MIT | Generated by GPT-4 |
Alpaca-CoT | Instruction | EN, CN | Apache 2.0 | Collection of instruction datasets |
OpenAssistant/oasst1 | Conversation | multilingual | Apache 2.0 | Paper, Collected from Open Assistant |
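
Most of the instruction datasets above are mirrored on the Hugging Face Hub and load with the `datasets` library; a sketch using databricks-dolly-15k (the hub ID and the Alpaca-style prompt template below are common conventions, not prescribed by the table):

```python
from datasets import load_dataset

# databricks-dolly-15k: ~15k human-written instruction/response pairs (CC BY-SA 3.0)
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(dolly[0]["instruction"], dolly[0]["response"], sep="\n---\n")

def to_prompt(example):
    # Flatten one record into a single training string; this template is just one common convention.
    context = f"\n{example['context']}" if example["context"] else ""
    return {"text": f"### Instruction:\n{example['instruction']}{context}\n\n### Response:\n{example['response']}"}

train = dolly.map(to_prompt)
```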
- Stack Exchange
- wikihow
- Pushshift Reddit API
- awesome-legal-nlp
- Papers on data-selection / instruction fine-tuning
- huggingface/text-generation-inference: production LLM inference server (client sketch after this list)
- LMSYS Arena & Leaderboard
- text-generation-webui
- LlamaChat MacOS Client
- llama.cpp: plain C/C++ inference for LLaMA-family models (Python-bindings sketch below)
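
text-generation-inference exposes an HTTP `/generate` endpoint once the server (usually its Docker image) is running; a minimal client sketch, with host, port, and sampling parameters chosen purely for illustration:

```python
import requests

# Assumes a text-generation-inference server is already serving a model on localhost:8080.
resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "Explain LoRA in one sentence.",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
print(resp.json()["generated_text"])
```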
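llama.cpp itself is a C/C++ command-line tool, but the separate llama-cpp-python bindings wrap it in a small Python API; a sketch assuming a model already converted and quantized with llama.cpp's scripts (the file path is a placeholder):

```python
from llama_cpp import Llama  # llama-cpp-python bindings, a separate project wrapping llama.cpp

# Path to a quantized model produced by llama.cpp's conversion scripts; placeholder name.
llm = Llama(model_path="./models/llama-7b-q4_0.gguf", n_ctx=2048)

out = llm("Q: Name three uses of a fine-tuned LLM. A:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```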