Collect information about the necessary tools and datasets for building a high-quality customized LLM
- FlashAttention: Github, Paper (fast, memory-efficient exact attention; usage sketch after this list)
- ALiBi: Paper (attention with linear position biases instead of positional embeddings; bias sketch after this list)
- Faster Inference
  - FasterTransformer: Github
- bitsandbytes: 8-bit optimizers and quantization, Github (optimizer sketch after this list)
- PEFT: Github, LoRA Paper (parameter-efficient fine-tuning with small trainable adapters such as LoRA; sketch after this list)
- QLoRA: GPU-memory-efficient fine-tuning by training LoRA adapters on top of a 4-bit quantized base model (sketch after this list)
- BLIP-2/LAVIS: connects a frozen image encoder to an LLM to enable multi-modal input, Paper (captioning sketch after this list)
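
A minimal sketch of calling the FlashAttention kernel directly through the flash-attn package. The `flash_attn_func` interface and the (batch, seqlen, heads, headdim) layout follow the flash-attn 2.x docs, so check them against the version you install; tensor sizes are illustrative.

```python
import torch
from flash_attn import flash_attn_func  # flash-attn 2.x interface; requires a CUDA GPU

batch, seqlen, nheads, headdim = 2, 1024, 16, 64
# FlashAttention expects (batch, seqlen, nheads, headdim) tensors in fp16/bf16 on the GPU.
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Exact attention computed tile-by-tile, without materializing the full seqlen x seqlen score matrix.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (batch, seqlen, nheads, headdim)
```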
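ALiBi drops positional embeddings and instead subtracts a per-head linear penalty from the attention logits. A small sketch of that bias using the geometric head slopes from the paper; the helper names are illustrative, not from any library, and the slope formula assumes the head count is a power of two.

```python
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    """Geometric slopes from the ALiBi paper, m_i = 2^(-8*i/n_heads), for n_heads a power of two."""
    start = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Bias added to attention logits: -m_h * (query_pos - key_pos); zero on and above the diagonal."""
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)    # (seq_len, seq_len)
    return -alibi_slopes(n_heads)[:, None, None] * distance  # (n_heads, seq_len, seq_len)

# scores: (batch, n_heads, seq_len, seq_len) attention logits before softmax
# scores = scores + alibi_bias(n_heads, seq_len)  # then apply the causal mask and softmax as usual
```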
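bitsandbytes' 8-bit Adam is a drop-in replacement for `torch.optim.Adam` that keeps the optimizer state quantized. A minimal sketch; the tiny Linear stands in for a real model and a CUDA GPU is assumed.

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()  # placeholder for a real model

# Optimizer state (exp_avg, exp_avg_sq) is stored in 8-bit, cutting optimizer memory roughly 4x vs. fp32 Adam.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)

loss = model(torch.randn(8, 4096, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```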
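With PEFT, LoRA freezes the base model and trains only small low-rank adapter matrices injected into selected Linear layers. A sketch against a Hugging Face causal LM; the model name, target module names, and hyperparameters are illustrative and depend on the architecture.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")  # illustrative base model

lora_cfg = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which Linear layers get adapters (architecture-dependent)
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the base parameters
```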
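QLoRA combines the two ideas above: the frozen base model is loaded in 4-bit NF4 via bitsandbytes and LoRA adapters are trained on top. A sketch using the transformers/peft integration; the checkpoint name is illustrative and `prepare_model_for_kbit_training` exists only in recent peft releases.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization from the QLoRA paper
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while weights stay 4-bit
)
base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                  # illustrative checkpoint
    quantization_config=bnb_cfg,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora_cfg)
# Only the LoRA adapters receive gradients; the 4-bit base weights stay frozen.
```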
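BLIP-2 bridges a frozen image encoder and a frozen LLM with a small trained Q-Former. A captioning sketch through the LAVIS API; the model/checkpoint names follow the LAVIS README, and the image path is a placeholder.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
# ViT image encoder + Q-Former + frozen OPT-2.7B; only the Q-Former was trained to align the two.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # placeholder image
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
print(model.generate({"image": image}))               # e.g. ["a photo of ..."]
```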
- Datasets (pre-training, instruction-tuning, and RLHF data; a loading sketch follows the table):
Name | Type | Language | License | Note |
---|---|---|---|---|
mc4 | Raw Text | multilingual (100+) | ODC-By | |
bloom | Raw Text | multilingual (46) | Depend on data | |
wmt22 | translation | multilingual | Depend on data | |
RedPajama | Raw Text | mostly EN | Apache 2.0 | |
WuDaoCorpora | Raw Text | zh-CN | | 5TB of raw text |
The Stack | Code | |||
The Flan Collection | Instruction | |||
openwebtext | Raw Text | EN | CC0 1.0 | used to train GPT-2 |
self-instruct-seed | Instruction | EN | Apache 2.0 | |
Stanford Alpaca | Instruction | EN | CC BY-NC 4.0 | |
Alpaca Cleaned | Instruction | EN | CC BY-NC 4.0 | |
ShareGPT Vicuna | Instruction | EN | 🤐 | Collected from ShareGPT |
evol_instruct_70k | Instruction | EN | CC BY-NC ? | Generated by Evol-Instruct |
HH-RLHF | RLHF | EN | MIT | |
databricks-dolly-15k | Instruction | EN | CC BY-SA 3.0 | |
GuanacoDataset | Instruction | EN, zh-CN, zh-TW, JA, DE | GPL 3.0 | designed for multilingual use |
dolly_hhrlhf | Instruction | EN | CC BY-SA 3.0 | MosaicML's filtered combination of HH-RLHF and databricks-dolly-15k |
HC3, Chinese | RLHF | EN, CN | CC BY-SA 3.0 | Paper, Github |
Lamini | Instruction | EN | CC-BY-4.0 | |
the_pile_books3 | Raw Text | mostly EN | MIT | part of the pile |
CoNaLa | Coding | EN | MIT | |
GPTeacher | Instruction | EN | MIT | Generated by GPT-4 |
Alpaca-CoT | Instruction | EN, CN | Apache 2.0 | Collection of instruction datasets |
OpenAssistant/oasst1 | Conversation | multilingual | Apache 2.0 | Paper, Collected from Open Assistant |
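
Most of the instruction datasets above are mirrored on the Hugging Face Hub and load with the `datasets` library; a sketch using databricks-dolly-15k (the hub ID and the Alpaca-style prompt template below are common conventions, not prescribed by the table):

```python
from datasets import load_dataset

# databricks-dolly-15k: ~15k human-written instruction/response pairs (CC BY-SA 3.0)
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(dolly[0]["instruction"], dolly[0]["response"], sep="\n---\n")

def to_prompt(example):
    # Flatten one record into a single training string; this template is just one common convention.
    context = f"\n{example['context']}" if example["context"] else ""
    return {"text": f"### Instruction:\n{example['instruction']}{context}\n\n### Response:\n{example['response']}"}

train = dolly.map(to_prompt)
```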
- Stack Exchange
- wikihow
- Pushshift Reddit API
- awesome-legal-nlp
- Papers on data-selection / instruction fine-tuning
- huggingface/text-generation-inference: production LLM inference server (client sketch after this list)
- LMSYS Arena & Leaderboard
- text-generation-webui
- LlamaChat MacOS Client
- llama.cpp: plain C/C++ inference for LLaMA-family models (Python-bindings sketch below)
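
text-generation-inference exposes an HTTP `/generate` endpoint once the server (usually its Docker image) is running; a minimal client sketch, with host, port, and sampling parameters chosen purely for illustration:

```python
import requests

# Assumes a text-generation-inference server is already serving a model on localhost:8080.
resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "Explain LoRA in one sentence.",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
print(resp.json()["generated_text"])
```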
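llama.cpp itself is a C/C++ command-line tool, but the separate llama-cpp-python bindings wrap it in a small Python API; a sketch assuming a model already converted and quantized with llama.cpp's scripts (the file path is a placeholder):

```python
from llama_cpp import Llama  # llama-cpp-python bindings, a separate project wrapping llama.cpp

# Path to a quantized model produced by llama.cpp's conversion scripts; placeholder name.
llm = Llama(model_path="./models/llama-7b-q4_0.gguf", n_ctx=2048)

out = llm("Q: Name three uses of a fine-tuned LLM. A:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```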