
Dataset


We used a number of datasets to train different versions of our model, all of them code-related:

  • Our Code Clippy Data: A dataset of text data (mostly programming languages) scraped from GitHub.
  • APPS: A dataset of various programming competition problems.
  • CodeSearchNet Challenge Data: A dataset of methods from six different programming languages.

Our Code Clippy Dataset:

To create this dataset, we used https://seart-ghs.si.usi.ch/ to filter and collect GitHub repositories from which to scrape text data. We used the following filters:

  • star_count > 10
  • exclude_forks
  • commit_count > 2

From these repositories, we further filtered based on their size (size_bytes). Specifically, we removed any repositories with a size greater than the 95th percentile (70,708 bytes) to avoid downloading large binaries or repositories with lots of autogenerated content.
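
As a rough illustration, the same filtering can be reproduced offline on a CSV export from the SEART-GHS tool. This is only a sketch: the column names ("stargazers", "isFork", "commits", "size") and file names are assumptions about the export format, not the exact script we ran.

```python
import pandas as pd

# Load a CSV exported from https://seart-ghs.si.usi.ch/
# NOTE: the column names below are assumptions about the export format.
repos = pd.read_csv("ghs_export.csv")

# Same filters that were applied when querying the tool.
repos = repos[(repos["stargazers"] > 10)
              & (~repos["isFork"])
              & (repos["commits"] > 2)]

# Drop repositories above the 95th percentile in size
# to avoid large binaries and heavily autogenerated content.
size_cutoff = repos["size"].quantile(0.95)  # ~70,708 bytes in our run
repos = repos[repos["size"] <= size_cutoff]

repos["name"].to_csv("filtered_repos.txt", index=False, header=False)
```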

Next, we combined these repositories with the repositories collected in the GitHub section of the Pile, making sure to remove duplicates. Finally, we used EleutherAI's git-downloader tool to download the repositories into the LM_Dataformat format. To download the data quickly, the list of repositories to download was split among 4 different TPUs and the results were merged back together after the downloads were complete.
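
A minimal sketch of the merge-and-shard step is shown below. The file names and list format are hypothetical; the actual cloning and conversion to LM_Dataformat is handled by the git-downloader tool.

```python
# Merge our filtered repo list with the Pile's GitHub repo list,
# drop duplicates, and split the result into 4 shards (one per TPU).
# The file names below are hypothetical.
def load_repo_list(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

ours = load_repo_list("filtered_repos.txt")
pile = load_repo_list("pile_github_repos.txt")

merged = sorted(set(ours) | set(pile))  # union removes duplicates

num_shards = 4
for i in range(num_shards):
    with open(f"repos_shard_{i}.txt", "w") as f:
        f.write("\n".join(merged[i::num_shards]) + "\n")
```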

This resulted in a dataset of ~670,000 unique repositories, or ~209GB of compressed text data. We split this dataset into training, validation, and testing sets using a 95/2.5/2.5 split.

Previous work has shown that GitHub contains a large amount of duplicate code and that duplicate text can impact the training of large language models, both for natural languages and for code. Therefore, we also created a deduplicated version of our dataset in which near-duplicates are removed. This process was inspired by this tool; however, we found that it was unable to scale to the size of our dataset. We therefore created a simpler near-duplicate removal process: the text in a file is tokenized and the hash of these tokens (unordered) is stored. Whenever a new file is added to the dataset, its hash is computed and compared against the stored hashes; if the hash is already present, the file is treated as a near duplicate and removed. Our script for doing this can be found here. This reduced the dataset to ~132GB of compressed text data, which was then split in the same way as the original dataset.
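
Conceptually, the near-duplicate filter amounts to hashing each file's bag of tokens and skipping any file whose hash has already been seen. The sketch below illustrates the idea only; the whitespace tokenizer and in-memory set are simplifications of the linked script.

```python
import hashlib

seen_hashes = set()

def file_fingerprint(text):
    # Hash the file's tokens in an order-independent way, so files that
    # only differ in token order map to the same fingerprint.
    tokens = sorted(text.split())  # simplistic whitespace tokenizer
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()

def is_near_duplicate(text):
    h = file_fingerprint(text)
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False

# Keep only the first copy of each near-duplicate group:
# deduped_files = [f for f in files if not is_near_duplicate(f)]
```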

We are currently working on getting these datasets (deduplicated and non-deduplicated versions) into HuggingFace's Datasets library. Stay tuned!

Datasets used for Fine-Tuning:

The pre-trained model is fine-tuned on the APPS dataset. The APPS benchmark includes 10,000 problems, ranging from simple one-line solutions to substantial algorithmic challenges. Fine-tuning is done by providing initial context consisting of the natural-language prompt alongside the starter code and sample input/output.
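
As an illustration, a context of this shape can be assembled from an APPS-style problem record as in the sketch below. The field names (question, starter_code, input_output) follow the APPS release, but the exact prompt template is an assumption rather than the one used in our runs.

```python
import json

def build_apps_prompt(problem):
    """Assemble a fine-tuning context from an APPS-style problem record."""
    parts = ["QUESTION:", problem["question"]]

    if problem.get("starter_code"):
        parts += ["STARTER CODE:", problem["starter_code"]]

    # APPS stores sample tests as a JSON string with "inputs"/"outputs" lists.
    io_pairs = json.loads(problem.get("input_output") or "{}")
    if io_pairs.get("inputs"):
        parts += ["SAMPLE INPUT:", str(io_pairs["inputs"][0]),
                  "SAMPLE OUTPUT:", str(io_pairs["outputs"][0])]

    parts.append("ANSWER:")
    return "\n".join(parts)
```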

We are currently working on getting this data into HuggingFace's Datasets library. Stay tuned!
