## Dataset
We used a number of datasets to train different versions of our model, all of them code-related:

- Our Code Clippy Data: a dataset of text data (mostly programming languages) scraped from GitHub.
- APPS: a dataset of various programming competition problems.
- CodeSearchNet Challenge Data: a dataset of methods from six different programming languages.
To create this dataset, we used https://seart-ghs.si.usi.ch/ to filter and collect GitHub repositories to scrape text data. We used the following filters:

- `star_count > 10`
- `exclude_forks`
- `commit_count > 2`
From these repositories, we further filtered based on their size (`size_bytes`). Specifically, we removed any repositories with a size greater than the 95th percentile (70,708 bytes) to avoid downloading large binaries or repositories with lots of autogenerated content.
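As a rough illustration, both filtering steps could be reproduced over a CSV export of the repository list from https://seart-ghs.si.usi.ch/ with a few lines of pandas. The file name and column names (`star_count`, `is_fork`, `commit_count`, `size_bytes`) are assumptions for this sketch and may not match the actual export:

```python
import pandas as pd

# Hypothetical export from https://seart-ghs.si.usi.ch/; column names are assumptions.
repos = pd.read_csv("seart_ghs_export.csv")

# Same criteria as the search filters: more than 10 stars, no forks, more than 2 commits.
filtered = repos[
    (repos["star_count"] > 10)
    & (~repos["is_fork"])          # assumes a boolean fork flag
    & (repos["commit_count"] > 2)
]

# Drop everything above the 95th percentile of repository size (~70,708 bytes in our case)
# to avoid large binaries and heavily autogenerated content.
size_cutoff = filtered["size_bytes"].quantile(0.95)
filtered = filtered[filtered["size_bytes"] <= size_cutoff]

filtered.to_csv("filtered_repositories.csv", index=False)
```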
Next, we combined these repositories with the repositories collected in the GitHub section of the Pile, making sure to remove duplicates. Finally, we used EleutherAI's git-downloader tool to download the repositories into the LM_Dataformat format.
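A minimal sketch of the merge step, assuming both repository lists are plain text files with one repository URL per line (the file names here are hypothetical); the combined, de-duplicated list is what would then be handed to git-downloader:

```python
def load_repo_list(path: str) -> set[str]:
    """Read one repository URL per line, ignoring blank lines."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

# Hypothetical file names for the two repository lists.
our_repos = load_repo_list("filtered_repositories.txt")
pile_repos = load_repo_list("pile_github_repositories.txt")

# Union of both lists with exact duplicates removed, ready for git-downloader.
combined = sorted(our_repos | pile_repos)
with open("repositories_to_download.txt", "w") as f:
    f.write("\n".join(combined) + "\n")
```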
This resulted in a dataset of ~670,000 unique repositories, or ~209GB of compressed text data. We split this dataset into training, validation, and testing sets using a 95/2.5/2.5 split.

Previous work has shown that GitHub contains a large amount of duplicated code and that duplicated text can impact the training of large language models, both for natural language and for code. Therefore, we also created a deduplicated version of our dataset where near-duplicates are removed. This process was inspired by this tool; however, we found that it could not scale to the size of our dataset, so we created a simpler near-duplicate removal process: the text of each file is tokenized and the hash of these tokens (unordered) is stored. Anytime a new file is added to the dataset, its hash is computed and compared to the stored hashes. If the hash matches one already stored, the file is considered a near-duplicate and removed. Our script for doing this can be found here. This reduced the dataset to ~132GB of compressed text data, which was then split in the same way as the original dataset.
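The near-duplicate check described above amounts to fingerprinting each file by the unordered set of its tokens. Below is a minimal sketch of that idea, assuming a simple word-level tokenizer; the actual script may tokenize and hash differently:

```python
import hashlib
import re

seen_hashes: set[str] = set()

def content_hash(text: str) -> str:
    """Hash the unordered, de-duplicated tokens of a file so that token order
    and repetition do not affect the fingerprint."""
    tokens = re.findall(r"\w+", text)          # crude word-level tokenizer (assumption)
    fingerprint = " ".join(sorted(set(tokens)))
    return hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()

def is_near_duplicate(text: str) -> bool:
    """Return True if a file with the same fingerprint was already added."""
    h = content_hash(text)
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False

# Example: keep only the first file seen for each fingerprint.
files = {"a.py": "x = 1\ny = 2\n", "b.py": "y = 2\nx = 1\n"}   # b.py collapses onto a.py
unique = {name: text for name, text in files.items() if not is_near_duplicate(text)}
```

Because tokens are sorted and de-duplicated before hashing, files that differ only in whitespace, token order, or repeated lines map to the same fingerprint and are dropped.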
We are currently working on getting these datasets (both the original and deduplicated versions) into HuggingFace's Datasets library. Stay tuned!
The pre-trained model is fine-tuned on the APPS dataset. The APPS benchmark includes 10,000 problems, ranging from simple one-line solutions to substantial algorithmic challenges. Fine-tuning is done by providing initial context consisting of the natural language prompt alongside the starter code and sample input/output.
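For illustration only, here is one way such a context could be assembled from an APPS problem; the function name and the exact template are assumptions, not necessarily the format used for the released checkpoints:

```python
def build_apps_context(question: str, starter_code: str, sample_io: str) -> str:
    """Concatenate the natural language prompt, starter code, and sample
    input/output into a single fine-tuning context (illustrative template)."""
    parts = [
        "QUESTION:",
        question.strip(),
        "STARTER CODE:",
        starter_code.strip() or "# no starter code provided",
        "SAMPLE INPUT/OUTPUT:",
        sample_io.strip(),
        "ANSWER:",
    ]
    return "\n".join(parts) + "\n"

context = build_apps_context(
    question="Given a list of integers, print the sum of the even numbers.",
    starter_code="def solve(nums):\n    pass",
    sample_io="Input: 1 2 3 4\nOutput: 6",
)
```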
We are currently working on getting this data into HuggingFace's Datasets library. Stay tuned!