This repository contains a script built for my employer, designed with MediaWiki and Jira Cloud in mind. The tool automates fetching, cleaning, and combining data from MediaWiki and Jira, then fine-tunes a GPT language model on the combined dataset. All settings are currently hardcoded in the Python source; externalizing them is planned. The script is written to work within GPU memory constraints and falls back to the CPU when needed.
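For context, the CPU fallback typically reduces to a device check like the following minimal sketch (illustrative, not the script's exact code):

```python
import torch

def select_device() -> torch.device:
    """Prefer a CUDA GPU when one is present; otherwise fall back to the CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

device = select_device()
print(f"Training on: {device}")
```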
## Requirements

- Python 3.x
- Required Python packages (install with `pip install -r requirements.txt`):
  - art
  - colorama
  - MySQLdb (provided by the `mysqlclient` package)
  - beautifulsoup4
  - requests
  - transformers
  - torch
  - datasets
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/cca-gpt-model-trainer.git
   cd cca-gpt-model-trainer
   ```

2. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Update the settings in the script. Open `cca-gpt-model-trainer.py` and edit the following:

   - MySQL database connection:

     ```python
     connection = MySQLdb.connect(
         host="localhost",
         user="grahf",
         password="<password>",
         database="local_wiki",
     )
     ```

     Change `host`, `user`, `password`, and `database` to match your MediaWiki database credentials (a connectivity check appears after these steps).

   - Jira API connection:

     ```python
     url = "https://site.atlassian.net/rest/api/3/search"  # CHANGE THIS TO APPROPRIATE JIRA URL
     params = {
         "jql": "project = CSS",  # CHANGE THIS TO APPROPRIATE PROJECT CODE
         "maxResults": 3000,
         "fields": "summary,description,comment",
     }
     email = "[email protected]"  # CHANGE THIS TO APPROPRIATE JIRA USER
     api_token = "<token>"  # ADD JIRA TOKEN
     ```

     Change `url`, `params["jql"]`, `email`, and `api_token` to match your Jira Cloud instance and credentials (a request sketch appears after these steps).

4. Run the script:

   ```bash
   python cca-gpt-model-trainer.py
   ```
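Before running the full pipeline, it can help to smoke-test the MySQL credentials from step 3. This is a hypothetical check, not part of the script; `page` is a core MediaWiki table:

```python
import MySQLdb

# Hypothetical smoke test for the credentials configured in step 3.
connection = MySQLdb.connect(host="localhost", user="grahf",
                             password="<password>", database="local_wiki")
cursor = connection.cursor()
cursor.execute("SELECT COUNT(*) FROM page")  # `page` is a core MediaWiki table
print(f"Wiki pages found: {cursor.fetchone()[0]}")
connection.close()
```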
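Likewise, the Jira credentials can be exercised directly. Jira Cloud's REST API takes the account email and API token as HTTP basic auth; a minimal sketch using the values from step 3:

```python
import requests
from requests.auth import HTTPBasicAuth

url = "https://site.atlassian.net/rest/api/3/search"
params = {
    "jql": "project = CSS",
    "maxResults": 3000,
    "fields": "summary,description,comment",
}

# Jira Cloud authenticates REST calls with basic auth: account email + API token.
response = requests.get(url, params=params,
                        auth=HTTPBasicAuth("[email protected]", "<token>"),
                        headers={"Accept": "application/json"})
response.raise_for_status()
print(f"Fetched {len(response.json().get('issues', []))} issues")
```

Note that Jira Cloud caps `maxResults` per request (commonly at 100 for this endpoint), so large projects may need to paginate with `startAt`.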
## Functions

- `fetch_mediawiki_data()`: Fetches MediaWiki data and saves it to a text file. Ensure you update the database connection settings.
- `clean_mediawiki_data()`: Cleans the MediaWiki data by removing HTML tags and other unnecessary content (see the sketch after this list).
- `fetch_jira_data()`: Fetches Jira entries using the Jira REST API and saves them to a JSON file. Ensure you update the Jira connection settings.
- `combine_files(jira_file, mediawiki_file, combined_file)`: Combines the Jira and MediaWiki data into a single text file for training (see the sketch after this list).
- `download_tokenizer_files(model_name, output_dir)`: Downloads the tokenizer files for the specified model.
- `fine_tune_gpt_model(data_file, output_dir)`: Fine-tunes a GPT model on the combined data (see the sketch after this list).
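As an illustration of the cleaning step, `clean_mediawiki_data()` could strip markup with beautifulsoup4 along these lines (a sketch assuming the raw dump contains HTML; the script's actual cleaning rules may differ):

```python
from bs4 import BeautifulSoup

def clean_text(raw: str) -> str:
    """Strip HTML tags and collapse runs of whitespace."""
    text = BeautifulSoup(raw, "html.parser").get_text(separator=" ")
    return " ".join(text.split())

print(clean_text("<p>Hello <b>world</b></p>"))  # -> "Hello world"
```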
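The combine step is essentially concatenation; a minimal sketch of what `combine_files()` might do, assuming both inputs have already been reduced to plain text:

```python
def combine_files(jira_file: str, mediawiki_file: str, combined_file: str) -> None:
    """Concatenate the Jira and MediaWiki dumps into one training file."""
    with open(combined_file, "w", encoding="utf-8") as out:
        for path in (jira_file, mediawiki_file):
            with open(path, encoding="utf-8") as src:
                out.write(src.read())
                out.write("\n")
```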
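And a hedged sketch of the last two steps using the Hugging Face `transformers` API; the model name and hyperparameters here are illustrative, not the script's actual values:

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, TextDataset,
                          Trainer, TrainingArguments)

def download_tokenizer_files(model_name: str, output_dir: str) -> None:
    """Fetch the tokenizer for `model_name` and save its files locally."""
    AutoTokenizer.from_pretrained(model_name).save_pretrained(output_dir)

def fine_tune_gpt_model(data_file: str, output_dir: str,
                        model_name: str = "gpt2") -> None:
    """Fine-tune a causal language model on a plain-text file."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    # Slice the training file into fixed-size token blocks.
    dataset = TextDataset(tokenizer=tokenizer, file_path=data_file, block_size=128)
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                             per_device_train_batch_size=2)
    Trainer(model=model, args=args, train_dataset=dataset,
            data_collator=collator).train()
    model.save_pretrained(output_dir)
```

Note that `TextDataset` is deprecated in recent `transformers` releases in favor of the `datasets` library (already in the requirements); it is used here only for brevity.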
## Future Improvements

- Externalize settings to a configuration file to avoid hardcoding values in the script (see the sketch below).
- Add logging for better traceability and debugging.
- Improve error handling and add retry mechanisms.
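For the first item, the standard library's `configparser` would suffice; a minimal sketch with hypothetical section and key names:

```python
import configparser

config = configparser.ConfigParser()
config.read("settings.ini")

# Hypothetical [mysql] and [jira] sections mirroring the currently hardcoded values.
db_host = config["mysql"]["host"]
db_user = config["mysql"]["user"]
jira_url = config["jira"]["url"]
jira_token = config["jira"]["api_token"]
```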
## License

This project is licensed under the MIT License; see the LICENSE file for details.
## Author

- Dean Thomson (grahfmusic) - GitHub