A project to extract, preprocess, and fine-tune datasets for large language models like CodeQwen using ChatGPT data.
.
├── Makefile # Automates tasks like data extraction and export
├── README.md # Project documentation
├── chatgpt_export/ # Directory containing exported ChatGPT data
│ ├── conversations/ # Individual conversations in plain text
│ ├── conversations.json # Exported conversations
│ ├── prepared_data/ # Processed data for fine-tuning
│ └── ... # Other exported metadata
├── notebooks/ # Jupyter notebooks for fine-tuning
│ └── fine_tune_codeqwen.ipynb # Notebook for fine-tuning CodeQwen
└── scripts/ # Python scripts for data preparation
├── export_to_conversations.py # Converts extracted data into conversational format
└── extract_prompts.py # Extracts prompts from ChatGPT data
- Python 3.8 or higher
- Required Python packages (install via
requirements.txt
):pip install -r requirements.txt
- Exported data from ChatGPT, or any other training data
-
Extract Prompts: Run the
extract_prompts.py
script to extract prompts from ChatGPT-exported data:make extract
-
Export to Conversations: Format the extracted prompts into a conversational format:
make export
-
Fine-Tune: Use the formatted data to fine-tune a large language model. Refer to
notebooks/fine_tune_codeqwen.ipynb
for detailed steps. -
Clean Up: Remove intermediate and output files:
make clean
This project is licensed under the MIT License.
We welcome contributions! Feel free to open an issue or submit a pull request.
Once I get my GPU, this will accelerate !