y-lan/awesome-LLM-toolkit

awesome-LLM-toolkit

Collects information about the tools and datasets needed to build a high-quality customized LLM.

Core Model/Training Techniques

Data

| Name | Type | Language | License | Note |
| --- | --- | --- | --- | --- |
| mc4 | Raw Text | multilingual (100+) | ODC-By | |
| bloom | Raw Text | multilingual (46) | Depends on data | |
| wmt22 | Translation | multilingual | Depends on data | |
| RedPajama | Raw Text | mostly EN | Apache 2.0 | |
| WuDaoCorpora | Raw Text | zh-CN | | 5TB |
| The Stack | Code | | | |
| The Flan Collection | Instruction | | | |
| openwebtext | Raw Text | EN | CC0 1.0 | Used to train GPT-2 |
| self-instruct-seed | Instruction | EN | Apache 2.0 | |
| Stanford Alpaca | Instruction | EN | CC BY-NC 4.0 | |
| Alpaca Cleaned | Instruction | EN | CC BY-NC 4.0 | |
| ShareGPT Vicuna | Instruction | EN | 🤐 | Collected from ShareGPT |
| evol_instruct_70k | Instruction | EN | CC BY-NC ? | Generated by Evol-Instruct |
| HH-RLHF | RLHF | EN | MIT | |
| databricks-dolly-15k | Instruction | EN | CC BY-SA 3.0 | |
| GuanacoDataset | Instruction | EN, zh-CN, zh-TW, JA, DE | GPL 3.0 | Designed for multilingual use |
| dolly_hhrlhf | Instruction | EN | CC BY-SA 3.0 | MosaicML's filtered version of HH-RLHF and databricks-dolly-15k |
| HC3, Chinese | RLHF | EN, CN | CC BY-SA 3.0 | Paper, Github |
| Lamini | Instruction | EN | CC-BY-4.0 | |
| the_pile_books3 | Raw Text | mostly EN | MIT | Part of The Pile |
| CoNaLa | Coding | EN | MIT | |
| GPTeacher | Instruction | EN | MIT | Generated by GPT-4 |
| Alpaca-CoT | Instruction | EN, CN | Apache 2.0 | Collection of instruction datasets |
| OpenAssistant/oasst1 | Conversation | multilingual | Apache 2.0 | Paper, collected from Open Assistant |
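
When choosing datasets from the table above, the license column is often the first filter. The following is a minimal sketch of shortlisting by type and license. The `Dataset` record, `is_noncommercial`, and `by_type` are hypothetical helpers written for this illustration, and only a handful of rows are copied in. The "NC" check is a simplifying assumption: share-alike, GPL, and "depends on data" terms still need their own review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical record holding three columns of the table."""
    name: str
    kind: str
    license: str


# A few rows copied from the table; extend as needed.
DATASETS = [
    Dataset("RedPajama", "Raw Text", "Apache 2.0"),
    Dataset("Stanford Alpaca", "Instruction", "CC BY-NC 4.0"),
    Dataset("databricks-dolly-15k", "Instruction", "CC BY-SA 3.0"),
    Dataset("HH-RLHF", "RLHF", "MIT"),
    Dataset("evol_instruct_70k", "Instruction", "CC BY-NC ?"),
    Dataset("OpenAssistant/oasst1", "Conversation", "Apache 2.0"),
]


def is_noncommercial(license_name: str) -> bool:
    """Assumed heuristic: a Creative Commons 'NC' token means non-commercial."""
    tokens = license_name.upper().replace("-", " ").split()
    return "NC" in tokens


def by_type(datasets, kind, allow_noncommercial=False):
    """Return names of datasets of the given type, optionally dropping NC licenses."""
    return [
        d.name
        for d in datasets
        if d.kind == kind
        and (allow_noncommercial or not is_noncommercial(d.license))
    ]


if __name__ == "__main__":
    print(by_type(DATASETS, "Instruction"))
    print(by_type(DATASETS, "Instruction", allow_noncommercial=True))
```

Keeping the license as a plain string mirrors the table, so uncertain entries such as `CC BY-NC ?` pass through unchanged and can be flagged for manual checking.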

Others

Evaluation

Serving

Community

Others
