|
| 1 | +# CodeFuse-MFTCoder: Multitask Fine-Tuned Code LLMs |
| 2 | + |
| 3 | +<p align="center"> |
| 4 | + <img src="./assets/codefuse_logo_blue.png" width="100%" /> |
| 5 | +</p> |
| 6 | + |
| 7 | + |
| 8 | +<div align="center"> |
| 9 | + |
| 10 | +<p> |
| 11 | + <a href="https://github.com/codefuse-ai/MFTCoder"> |
| 12 | + <img alt="stars" src="https://img.shields.io/github/stars/codefuse-ai/mftcoder?style=social" /> |
| 13 | + </a> |
| 14 | + <a href="https://github.com/codefuse-ai/MFTCoder"> |
| 15 | + <img alt="forks" src="https://img.shields.io/github/forks/codefuse-ai/mftcoder?style=social" /> |
| 16 | + </a> |
| 17 | + <a href="https://github.com/codefuse-ai/MFTCoder/LICENCE"> |
| 18 | + <img alt="License: Apache 2.0" src="https://badgen.net/badge/license/apache2.0/blue" /> |
| 19 | + </a> |
| 20 | + <a href="https://github.com/codefuse-ai/MFTCoder/releases"> |
| 21 | + <img alt="Release Notes" src="https://img.shields.io/github/release/codefuse-ai/MFTCoder" /> |
| 22 | + </a> |
| 23 | + <a href="https://github.com/codefuse-ai/MFTCoder/issues"> |
| 24 | + <img alt="Open Issues" src="https://img.shields.io/github/issues-raw/codefuse-ai/MFTCoder" /> |
| 25 | + </a> |
| 26 | +</p> |
| 27 | + |
| 28 | + |
| 29 | +[[中文]](README_cn.md) [**English**] |
| 30 | + |
| 31 | +</div> |
| 32 | + |
| 33 | + |
| 34 | + |
| 35 | +## Contents |
| 36 | +- [News](#News) |
| 37 | +- [Articles](#Articles) |
| 38 | +- [Introduction](#Introduction) |
| 39 | +- [Requirements](#Requirements) |
| 40 | +- [Training](#Training) |
| 41 | +- [Models](#Models) |
| 42 | +- [Datasets](#Datasets) |
| 43 | + |
| 44 | + |
| 45 | +## News |
| 46 | +🔥🔥🔥 [2023/09/07] We released **CodeFuse-CodeLlama-34B**, which achieves **74.4% pass@1** (greedy decoding) on the [HumanEval benchmark](https://github.com/openai/human-eval), surpassing GPT-4 (2023/03/15), ChatGPT-3.5, and Claude2. |
| 47 | + |
| 48 | +🔥 [2023/08/26] We released MFTCoder, which supports fine-tuning CodeLlama, Llama, Llama2, StarCoder, ChatGLM2, CodeGeeX2, Qwen, and GPT-NeoX models with LoRA/QLoRA. |
| 49 | + |
| 50 | +| Model | HumanEval(pass@1) | |
| 51 | +|:-----------------------------------|:-----------------:| |
| 52 | +| CodeLlama-34b | 48.8% | |
| 53 | +| CodeLlama-34b-Python | 53.7% | |
| 54 | +| WizardCoder-Python-34B-V1.0 | 73.2% | |
| 55 | +| **CodeFuse-CodeLlama-34B** | **74.4%** | |
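
For readers unfamiliar with the metric in the table above, the following is a minimal sketch of how pass@1 can be computed. It uses the standard unbiased pass@k estimator associated with HumanEval; the function names are illustrative and are not part of MFTCoder.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # 1 - C(n - c, k) / C(n, k); comb() returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1_greedy(solved: list[bool]) -> float:
    """Greedy decoding yields one sample per problem, so pass@1 is the solved fraction."""
    return sum(pass_at_k(1, int(ok), 1) for ok in solved) / len(solved)
```

As an illustration only, solving 122 of HumanEval's 164 problems would correspond to a pass@1 of roughly 74.4%.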
| 56 | + |
| 57 | +## Articles |
| 58 | + |
| 59 | + |
| 60 | +## Introduction |
| 61 | +**CodeFuse-MFTCoder** is an open-source CodeFuse project for multitask Code LLMs (large language models for code tasks). It includes models, datasets, training codebases, and inference guides. |
| 62 | +MFTCoder provides two codebases for fine-tuning large language models: |
| 63 | +- `mft_peft_hf` is based on the Hugging Face Accelerate and DeepSpeed frameworks. |
| 64 | +- `mft_atorch` is based on the [ATorch framework](https://github.com/intelligent-machine-learning/dlrover), a fast distributed training framework for LLMs. |
| 65 | +The project aims to share and collaborate on advances in large language models, specifically in the domain of code. |
| 66 | + |
| 67 | +### Frameworks |
| 68 | + |
| 69 | + |
| 70 | +### Highlights |
| 71 | +1. [x] **Multi-task**: Trains a single model on multiple tasks, keeping the tasks balanced and even generalizing to new, unseen tasks. |
| 72 | +2. [x] **Multi-model**: Supports various state-of-the-art open-source models, including GPT-NeoX, LLaMA, Llama 2, Baichuan, Qwen, ChatGLM2, and more. |
| 73 | +3. [x] **Multi-framework**: Supports both Hugging Face Accelerate (with DeepSpeed) and the [ATorch framework](https://github.com/intelligent-machine-learning/dlrover). |
| 74 | +4. [x] **Efficient fine-tuning**: Supports LoRA and QLoRA, which make it possible to fine-tune large models with minimal resources. Training is fast enough for almost all fine-tuning scenarios. |
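
To make the idea of balancing tasks concrete, here is a minimal sketch of one possible task-balanced loss: per-token losses are averaged within each task first, then across tasks, so a task with many tokens cannot dominate the batch. This is an assumption for illustration, not necessarily the exact loss MFTCoder implements; `task_balanced_loss` and its arguments are hypothetical names.

```python
from collections import defaultdict


def task_balanced_loss(token_losses, task_ids):
    """Average per-token losses within each task, then average across tasks.

    token_losses: one list of per-token losses per sample (valid tokens only).
    task_ids: the task label of each sample, in the same order.
    """
    loss_sum = defaultdict(float)
    token_count = defaultdict(int)
    for losses, task in zip(token_losses, task_ids):
        loss_sum[task] += sum(losses)
        token_count[task] += len(losses)
    # Skip tasks that contributed no valid tokens to this batch.
    per_task = [loss_sum[t] / token_count[t] for t in loss_sum if token_count[t] > 0]
    if not per_task:
        raise ValueError("no valid tokens in batch")
    return sum(per_task) / len(per_task)
```

With four tokens of loss 1.0 from task `"a"` and a single token of loss 3.0 from task `"b"`, a plain token average gives 1.4, whereas the balanced version weights both tasks equally and gives 2.0.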
| 75 | + |
| 76 | +The main content of this project includes: |
| 77 | +- Support for both SFT (Supervised Fine-Tuning) and MFT (Multi-task Fine-Tuning). The current MFTCoder balances data across multiple tasks; future releases will also balance task difficulty and convergence during training. |
| 78 | +- Support for QLoRA instruction fine-tuning, as well as LoRA fine-tuning. |
| 79 | +- Support for most mainstream open-source large models, especially those suited to Code LLMs, such as Code-LLaMA, StarCoder, CodeGeeX2, Qwen, GPT-NeoX, and more. |
| 80 | +- Support for merging LoRA adapter weights into the base model, making inference more convenient. |
| 81 | +- Release of two high-quality code-related instruction fine-tuning datasets: CodeFuse13B-evol-instruction-4K, CodeFuse-CodeExercise-Python-27k. |
| 82 | +- Release of two models in the [CodeFuse series of model weights](https://huggingface.co/codefuse-ai). |
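
The weight merging mentioned in the list above can be sketched with the standard LoRA update, in which the merged weight is `W + (alpha / r) * B @ A`. The snippet below works on plain nested lists so that it runs without any dependencies; `merge_lora` is a hypothetical helper for illustration, not MFTCoder's own merge script.

```python
def merge_lora(base, lora_a, lora_b, alpha):
    """Return base + (alpha / r) * (lora_b @ lora_a) for nested-list matrices.

    base: d x k weight, lora_b: d x r, lora_a: r x k, where r is the LoRA rank.
    """
    r = len(lora_a)
    scale = alpha / r
    merged = []
    for i, row in enumerate(base):
        merged.append([
            # Add the scaled low-rank update for entry (i, col).
            w + scale * sum(lora_b[i][j] * lora_a[j][col] for j in range(r))
            for col, w in enumerate(row)
        ])
    return merged
```

Once merged, the adapter matrices are no longer needed at inference time, which is why merging makes deployment simpler.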
| 83 | + |
| 84 | + |
| 85 | +## Requirements |
| 86 | +First, make sure that CUDA (>= 11.4; we used 11.7) and its related packages, as well as torch (2.0.1), are installed successfully. |
| 87 | + |
| 88 | +Then run the provided init_env.sh to install the required packages: |
| 89 | +```bash |
| 90 | +sh init_env.sh |
| 91 | +``` |
| 92 | +If you need Flash Attention, install it by following the instructions at https://github.com/Dao-AILab/flash-attention |
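
As a quick sanity check before training, the version requirements above can be verified with a small helper. This is a minimal sketch: `parse_version` and `check_environment` are hypothetical names, and the strict match on torch 2.0.1 is an assumption based on the version stated above.

```python
def parse_version(text):
    """Turn '11.7' or '2.0.1+cu117' into a tuple of ints, ignoring local suffixes."""
    core = text.split("+", 1)[0]
    return tuple(int(part) for part in core.split(".") if part.isdigit())


def check_environment(cuda_version, torch_version):
    """Return a list of problems; an empty list means the versions look fine."""
    problems = []
    if cuda_version is None:
        # torch reports no CUDA version for CPU-only builds.
        problems.append("torch was built without CUDA support")
    elif parse_version(cuda_version) < (11, 4):
        problems.append(f"CUDA {cuda_version} is older than 11.4")
    if parse_version(torch_version) != (2, 0, 1):
        problems.append(f"torch {torch_version} differs from the tested 2.0.1")
    return problems
```

On a machine with PyTorch installed, the inputs can be taken from `torch.version.cuda` and `torch.__version__`.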
| 93 | + |
| 94 | + |
| 95 | +## Training |
| 96 | +🚀 [Hugging Face Accelerate + DeepSpeed codebase for MFT (Multi-task Fine-Tuning)](./mft_peft_hf/README.md) |
| 97 | + |
| 98 | +🚀 [ATorch codebase for MFT (Multi-task Fine-Tuning)](./mft_atorch/README.md) |
| 99 | + |
| 100 | + |
| 101 | +## Models |
| 102 | + |
| 103 | +We are releasing the following two Code LLMs, trained with MFTCoder, on Hugging Face. |
| 104 | + |
| 105 | +| Model | Base Model | Num of examples trained | Batch Size | Seq Length | Licence | |
| 106 | +|----------------------------------------------------------------------|--------------------|-------------------------|------------|------------|-----| |
| 107 | +| [🔥🔥🔥 CodeFuse-CodeLlama-34B](https://huggingface.co/codefuse-ai/) | CodeLlama-34b-Python | 600k | 80 | 4096 | | |
| 108 | +| [🔥 CodeFuse-13B](https://huggingface.co/codefuse-ai/) | CodeFuse-13B | 66k | 64 | 4096 | | |
| 109 | + |
| 110 | + |
| 111 | + |
| 112 | +## Datasets |
| 113 | +We are releasing the following two code-related instruction datasets on Hugging Face. |
| 114 | + |
| 115 | +| Dataset | Introduction | Licence | |
| 116 | +|------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------|-----| |
| 117 | +| [⭐ Evol-instruction-66k](https://huggingface.co/datasets/) | Built from open-evol-instruction-80k by filtering out low-quality instructions, duplicates, and instructions similar to HumanEval, yielding a high-quality code instruction dataset. | | |
| 118 | +| [⭐ CodeExercise-Python-27k](https://huggingface.co/datasets/) | Python code exercise instruction dataset generated by ChatGPT. | | |
| 119 | + |