
Commit debfce2: Add files via upload (initial commit, 0 parents)

63 files changed: +219,553 −0 lines

LEGAL.md

+7
@@ -0,0 +1,7 @@
Legal Disclaimer

Within this source code, the comments in Chinese shall be the original, governing version. Any comments in other languages are for reference only. In the event of any conflict between the Chinese-language comments and comments in other languages, the Chinese-language version shall prevail.

法律免责声明 (Chinese original)

Regarding the code comments, the Chinese comments are the official version and comments in other languages are for reference only. The Chinese comments may be inconsistent with comments in other languages; where any inconsistency exists, the Chinese comments shall prevail.

README.md

+119
@@ -0,0 +1,119 @@
# CodeFuse-MFTCoder: Multitask Fine-Tuned Code LLMs

<p align="center">
  <img src="./assets/codefuse_logo_blue.png" width="100%" />
</p>

<div align="center">

<p>
    <a href="https://github.com/codefuse-ai/MFTCoder">
        <img alt="stars" src="https://img.shields.io/github/stars/codefuse-ai/mftcoder?style=social" />
    </a>
    <a href="https://github.com/codefuse-ai/MFTCoder">
        <img alt="forks" src="https://img.shields.io/github/forks/codefuse-ai/mftcoder?style=social" />
    </a>
    <a href="https://github.com/codefuse-ai/MFTCoder/LICENCE">
        <img alt="License: Apache-2.0" src="https://badgen.net/badge/license/apache2.0/blue" />
    </a>
    <a href="https://github.com/codefuse-ai/MFTCoder/releases">
        <img alt="Release Notes" src="https://img.shields.io/github/release/codefuse-ai/MFTCoder" />
    </a>
    <a href="https://github.com/codefuse-ai/MFTCoder/issues">
        <img alt="Open Issues" src="https://img.shields.io/github/issues-raw/codefuse-ai/MFTCoder" />
    </a>
</p>

[[中文]](README_cn.md) [**English**]

</div>
## Contents
- [News](#news)
- [Articles](#articles)
- [Introduction](#introduction)
- [Requirements](#requirements)
- [Training](#training)
- [Models](#models)
- [Datasets](#datasets)
## News

🔥🔥🔥 [2023/09/07] We released **CodeFuse-CodeLlama-34B**, which achieves **74.4% pass@1** (greedy decoding) and surpasses GPT-4 (2023/03/15), ChatGPT-3.5, and Claude 2 on the [HumanEval benchmark](https://github.com/openai/human-eval).

🔥 [2023/08/26] We released MFTCoder, which supports fine-tuning CodeLlama, Llama, Llama 2, StarCoder, ChatGLM2, CodeGeeX2, Qwen, and GPT-NeoX models with LoRA/QLoRA.

| Model                       | HumanEval (pass@1) |
|:----------------------------|:------------------:|
| CodeLlama-34b               |       48.8%        |
| CodeLlama-34b-Python        |       53.7%        |
| WizardCoder-Python-34B-V1.0 |       73.2%        |
| **CodeFuse-CodeLlama-34B**  |     **74.4%**      |

## Articles
## Introduction

**CodeFuse-MFTCoder** is an open-source project from CodeFuse for multi-task Code LLMs (large language models for code tasks), which includes models, datasets, training codebases, and inference guides.

In MFTCoder, we release two codebases for fine-tuning large language models:
- ```mft_peft_hf``` is based on the Hugging Face Accelerate and DeepSpeed frameworks.
- ```mft_atorch``` is based on the [ATorch framework](https://github.com/intelligent-machine-learning/dlrover), a fast distributed training framework for LLMs.

The project aims to share and collaborate on advances in large language models, specifically in the domain of code.

### Frameworks

![img.png](./assets/img.png)
### Highlights

1. [x] **Multi-task**: It can train a model on multiple tasks at once, keeping the tasks balanced and even generalizing to new, unseen tasks.
2. [x] **Multi-model**: It supports various state-of-the-art open-source models, including gpt-neox, llama, llama-2, baichuan, Qwen, chatglm2, and more.
3. [x] **Multi-framework**: It supports both Hugging Face Accelerate (with DeepSpeed) and the [ATorch framework](https://github.com/intelligent-machine-learning/dlrover).
4. [x] **Efficient fine-tuning**: It supports LoRA and QLoRA, enabling the fine-tuning of large models with minimal resources, at a training speed that meets the demands of almost all fine-tuning scenarios (see the QLoRA sketch after this list).

The main content of this project includes:
- Support for both SFT (Supervised FineTuning) and MFT (Multi-task FineTuning). The current MFTCoder achieves data balance among multiple tasks; future releases will add difficulty and convergence balancing during training.
- Support for QLoRA instruction fine-tuning as well as LoRA fine-tuning.
- Support for most mainstream open-source large models, particularly potential Code LLMs such as Code-LLaMA, Starcoder, Codegeex2, Qwen, GPT-NeoX, and more.
- Support for merging the weights of a LoRA adaptor into its base model, making inference more convenient (see the merge sketch below).
- Release of 2 high-quality code-related instruction fine-tuning datasets: CodeFuse13B-evol-instruction-4K and CodeFuse-CodeExercise-Python-27k.
- Release of 2 model weights in the [CodeFuse series model weights](https://huggingface.co/codefuse-ai).
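As a rough illustration of the QLoRA setup referenced above, here is a minimal sketch using Hugging Face Transformers, PEFT, and bitsandbytes. The base model name, LoRA hyperparameters, and target modules are illustrative assumptions, not the exact MFTCoder configuration; see ```mft_peft_hf``` for the actual training entry points.

```python
# Hypothetical QLoRA setup sketch (not MFTCoder's exact configuration).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "codellama/CodeLlama-34b-Python-hf"  # assumed example base model

# 4-bit NF4 quantization, the usual QLoRA setting.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters; only the adapter weights are trained.
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```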
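In the same hedged spirit, merging a trained LoRA adapter back into its base model with PEFT typically looks like the sketch below; all paths are placeholders.

```python
# Hypothetical LoRA weight-merging sketch; paths are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_path = "path/to/base-model"   # placeholder
adapter_path = "path/to/lora-adapter"    # placeholder
merged_path = "path/to/merged-model"     # placeholder

base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype="auto")
model = PeftModel.from_pretrained(base, adapter_path)

# Fold the low-rank adapter weights into the base weights and drop the PEFT wrapper.
merged = model.merge_and_unload()
merged.save_pretrained(merged_path)
AutoTokenizer.from_pretrained(base_model_path).save_pretrained(merged_path)
```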
## Requirements

First, make sure you have successfully installed CUDA (>= 11.4; we used 11.7) together with its drivers, as well as torch (2.0.1).

Then we provide init_env.sh to install the required packages:

```bash
sh init_env.sh
```

If you need FlashAttention, please install it by following https://github.com/Dao-AILab/flash-attention
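As an optional sanity check (illustrative only, not part of the repository), you can confirm that the CUDA and torch versions assumed above are visible from Python:

```python
# Illustrative environment check for the versions assumed above.
import torch

print("torch version:", torch.__version__)             # expected: 2.0.1
print("CUDA available:", torch.cuda.is_available())    # should be True for GPU training
print("CUDA (compiled against):", torch.version.cuda)  # expected: >= 11.4, e.g. 11.7
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```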
## Training

🚀 [Hugging Face Accelerate + DeepSpeed codebase for MFT (Multi-task Finetuning)](./mft_peft_hf/README.md)

🚀 [ATorch codebase for MFT (Multi-task Finetuning)](./mft_atorch/README.md)
## Models

We are releasing the following 2 Code LLMs, trained with MFTCoder, on Hugging Face.

| Model                                                                | Base Model           | Num of examples trained | Batch Size | Seq Length | Licence |
|----------------------------------------------------------------------|----------------------|-------------------------|------------|------------|---------|
| [🔥🔥🔥 CodeFuse-CodeLlama-34B](https://huggingface.co/codefuse-ai/) | CodeLlama-34b-Python | 600k                    | 80         | 4096       |         |
| [🔥 CodeFuse-13B](https://huggingface.co/codefuse-ai/)               | CodeFuse-13B         | 66k                     | 64         | 4096       |         |
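For reference, the sketch below shows one hedged way to load such a model with Transformers and run greedy decoding, the setting behind the pass@1 numbers above. The model ID is an assumption based on the table; check the [codefuse-ai organization page](https://huggingface.co/codefuse-ai) for the exact name.

```python
# Hedged inference sketch; the model ID is assumed, not confirmed by this README.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codefuse-ai/CodeFuse-CodeLlama-34B"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
)

prompt = "def quick_sort(arr):\n    "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=False gives greedy decoding, matching the pass@1 (greedy) setting.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```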
## Datasets

We are releasing the following 2 code-related instruction datasets on Hugging Face.

| Dataset                                                       | Introduction                                                                                                                                                                                      | Licence |
|---------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|
| [⭐ Evol-instruction-66k](https://huggingface.co/datasets/)    | Based on open-evol-instruction-80k, with low-quality instructions, repeated instructions, and instructions similar to HumanEval filtered out, yielding a high-quality code instruction dataset.   |         |
| [⭐ CodeExercise-Python-27k](https://huggingface.co/datasets/) | A Python code exercise instruction dataset generated by ChatGPT.                                                                                                                                  |         |

README_cn.md

+116
@@ -0,0 +1,116 @@
# CodeFuse-MFTCoder: Multi-task Fine-Tuned Code LLMs

<p align="center">
  <img src="./assets/codefuse_logo_blue.png" width="80%" />
</p>

<div align="center">

<p>
    <a href="https://github.com/codefuse-ai/MFTCoder">
        <img alt="stars" src="https://img.shields.io/github/stars/codefuse-ai/mftcoder?style=social" />
    </a>
    <a href="https://github.com/codefuse-ai/MFTCoder">
        <img alt="forks" src="https://img.shields.io/github/forks/codefuse-ai/mftcoder?style=social" />
    </a>
    <a href="https://github.com/codefuse-ai/MFTCoder/LICENCE">
        <img alt="License: Apache-2.0" src="https://badgen.net/badge/license/apache2.0/blue" />
    </a>
    <a href="https://github.com/codefuse-ai/MFTCoder/releases">
        <img alt="Release Notes" src="https://img.shields.io/github/release/codefuse-ai/MFTCoder" />
    </a>
    <a href="https://github.com/codefuse-ai/MFTCoder/issues">
        <img alt="Open Issues" src="https://img.shields.io/github/issues-raw/codefuse-ai/MFTCoder" />
    </a>
</p>

[**中文**] [[English]](README.md)

</div>
## Contents
- [News](#news)
- [Articles](#articles)
- [Introduction](#introduction)
- [Environment](#environment)
- [Training](#training)
- [Models](#models)
- [Datasets](#datasets)
## News

🔥🔥🔥 [2023/09/07] **CodeFuse-CodeLlama-34B**, fine-tuned with MFTCoder, achieved an open-source SOTA **pass@1** of **74.4%** (greedy decoding) on the [HumanEval benchmark](https://github.com/openai/human-eval).

🔥 [2023/08/26] MFTCoder supports fine-tuning CodeLlama, Chatglm2, Codegeex2, Llama, Llama2, StarCoder, Qwen, and GPT-NeoX models with LoRA/QLoRA.

| Model                       | HumanEval (pass@1) |
|:----------------------------|:------------------:|
| CodeLlama-34b               |       48.8%        |
| CodeLlama-34b-Python        |       53.7%        |
| WizardCoder-Python-34B-V1.0 |       73.2%        |
| **CodeFuse-CodeLlama-34B**  |     **74.4%**      |

## Articles
## Introduction

**Codefuse-MFTCoder** is an open-source project for multi-task code LLMs, covering the models, data, and training of code LLMs. By open-sourcing it, we hope to share and exchange progress on large language models in the code domain.

### Framework

![img_1.png](./assets/img_1.png)
### Highlights

1. [x] **Multi-task**: A single model supports multiple tasks at the same time, keeps the tasks balanced, and can even generalize to new, unseen tasks;
2. [x] **Multi-model**: Supports multiple recent open-source models, including gpt-neox, llama, llama-2, baichuan, Qwen, chatglm2, and more;
3. [x] **Multi-framework**: Supports both HuggingFace and the [ATorch framework](https://github.com/intelligent-machine-learning/dlrover);
4. [x] **Efficient fine-tuning**: Supports LoRA and QLoRA, so very large models can be fine-tuned with few resources, at a training speed that meets almost all fine-tuning scenarios;

The main content of this project:
- Supports both single-task SFT (Supervised FineTuning) and MFT (Multi-task FineTuning); the current release supports data balancing, and difficulty balancing and convergence balancing will be open-sourced in the future.
- Supports low-cost, efficient QLoRA instruction fine-tuning as well as efficient LoRA instruction fine-tuning.
- Supports most mainstream open-source large models, focusing on those with strong code abilities, such as Qwen, GPT-Neox, Starcoder, Codegeex2, Code-LLaMA, and more.
- Supports merging LoRA weights into the base model for more convenient inference.
- Curates and open-sources instruction fine-tuning datasets: CodeFuse13B-evol-instruction-4K and CodeFuse-CodeExercise-Python-27k.
- Open-sources the [Codefuse series of instruction fine-tuned model weights](https://huggingface.co/codefuse-ai).
## Environment

First, install CUDA (>= 11.4; 11.7 recommended) and its drivers, make sure they work correctly, and install a basic torch (>= 2.0.0).

The versions of the main Python packages are pinned in requirements.txt; just run the following script:

```bash
sh init_env.sh
```

If you want to use flash attention, please refer to https://github.com/Dao-AILab/flash-attention for installation.
## Training

🚀 [Hugging Face Accelerate + DeepSpeed codebase for MFT (Multi-task Finetuning)](./mft_peft_hf/README.md)

🚀 [ATorch codebase for MFT (Multi-task Finetuning)](./mft_atorch/README.md)
## Models

Using the training code of this project and the training data described above, we trained and open-sourced the following models on Hugging Face.

| Model                                                                | Base Model           | Training Data (examples) | Batch Size | Seq Length |
|----------------------------------------------------------------------|----------------------|--------------------------|------------|------------|
| [🔥🔥🔥 CodeFuse-CodeLlama-34B](https://huggingface.co/codefuse-ai/) | CodeLlama-34b-Python | 600k                     | 80         | 4096       |
| [🔥 CodeFuse-13B](https://huggingface.co/codefuse-ai/)               | CodeFuse-13B-Base    | 66k                      | 64         | 4096       |
## Datasets

This project currently curates the following instruction datasets and converts them into a unified data format:

| Dataset                                                       | Introduction                                                                                                                                                                             |
|---------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [⭐ Evol-instruction-66k](https://huggingface.co/datasets/)    | A high-quality code instruction fine-tuning dataset obtained from the open-source open-evol-instruction-80k by filtering out low-quality data, duplicates, and data similar to HumanEval.   |
| [⭐ CodeExercise-Python-27k](https://huggingface.co/datasets/) | High-quality Python exercise instruction data generated with ChatGPT.                                                                                                                       |

assets/codefuse_logo_blue.png

51.5 KB

assets/img.png

134 KB

assets/img_1.png

130 KB

init_env.sh

+4
@@ -0,0 +1,4 @@
pip install torch==2.0.1 && \
pip install tensorboard && \
pip install packaging && \
pip install -r requirements.txt
