ChatGPT Data Preparation and Fine-Tuning Pipeline

A project to extract, preprocess, and fine-tune datasets for large language models like CodeQwen using ChatGPT data.

Project Structure

.
├── Makefile                       # Automates tasks like data extraction and export
├── README.md                      # Project documentation
├── chatgpt_export/                # Directory containing exported ChatGPT data
│   ├── conversations/             # Individual conversations in plain text
│   ├── conversations.json         # Exported conversations
│   ├── prepared_data/             # Processed data for fine-tuning
│   └── ...                        # Other exported metadata
├── notebooks/                     # Jupyter notebooks for fine-tuning
│   └── fine_tune_codeqwen.ipynb   # Notebook for fine-tuning CodeQwen
└── scripts/                       # Python scripts for data preparation
    ├── export_to_conversations.py # Converts extracted data into conversational format
    └── extract_prompts.py         # Extracts prompts from ChatGPT data

Getting Started

Prerequisites

Python 3.8 or higher
Required Python packages (install via requirements.txt):
```
pip install -r requirements.txt
```
Exported data from ChatGPT, or any other training data

How to Use

Extract Prompts: Run the extract_prompts.py script to extract prompts from ChatGPT-exported data:
```
make extract
```
Export to Conversations: Format the extracted prompts into a conversational format:
```
make export
```
Fine-Tune: Use the formatted data to fine-tune a large language model. Refer to notebooks/fine_tune_codeqwen.ipynb for detailed steps.
Clean Up: Remove intermediate and output files:
```
make clean
```

License

This project is licensed under the MIT License.

Contributing

We welcome contributions! Feel free to open an issue or submit a pull request.

Once I get my GPU, this will accelerate !

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ChatGPT Data Preparation and Fine-Tuning Pipeline

Project Structure

Getting Started

Prerequisites

How to Use

License

Contributing

Acknowledgments

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
notebooks		notebooks
scripts		scripts
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
requirements.txt		requirements.txt

neatnettech/ollama_chatgpt_private

Folders and files

Latest commit

History

Repository files navigation

ChatGPT Data Preparation and Fine-Tuning Pipeline

Project Structure

Getting Started

Prerequisites

How to Use

License

Contributing

Acknowledgments

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages