Check out: https://gpu-mode.github.io/popcorn/
We're training an LLM to write GPU code, and we need data to do it. This repo contains the scripts used to generate that data.
At the moment this repo contains the tools to scrape GitHub for every Triton kernel out there and turn it into a useful folder of JSON files. However, we plan to do much more.
Currently the repo works by writing intermediate JSON files into the datasets/ and github_data/ folders. We have the following files, generated in the following order:
# generated by download_repos.py
github_data/triton/github_queries.json # contains the github queries which are run to collect github data
github_data/triton/github_repos.json # contains the repos / associated hashes we care about downloading
# if github_repos.json exists, then we attempt to download all repos from GitHub into github_downloads/triton (ensure the folder is empty before downloading)
github_data/triton/github_metadata.json # contains metadata for the github repos we're downloading
github_data/triton/github_metadata.json # contains metadata for the github repos filtered to only files containing @triton.jit
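The filtering step above — keeping only files that contain `@triton.jit` — can be sketched roughly as follows. This is a minimal illustration, not the actual code in `download_repos.py`; the function name and the returned record shape are assumptions.

```python
import json
from pathlib import Path


def find_triton_files(download_dir: str) -> list[dict]:
    """Walk the downloaded repos and keep only Python files that
    appear to define a Triton kernel (contain '@triton.jit').

    Hypothetical helper for illustration; the real script may
    store different fields in its metadata records.
    """
    matches = []
    for path in Path(download_dir).rglob("*.py"):
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # skip unreadable files
        if "@triton.jit" in text:
            matches.append({"path": str(path), "content": text})
    return matches


# e.g. dump the filtered metadata alongside the other JSON files:
# json.dump(find_triton_files("github_downloads/triton"),
#           open("github_data/triton/github_metadata.json", "w"))
```

A plain substring check like this can over-match (e.g. the string appearing in a comment), so the real pipeline may apply stricter filtering.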
# generated by create_keyword_dataset.py
datasets/triton/dataset.json # contains the full dataset of triton kernels
datasets/triton/dataset_dedup.json # the dataset with exact-duplicate triton kernels removed
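The exact-match dedup that produces `dataset_dedup.json` can be sketched as below. This is a hedged illustration, not the code in `create_keyword_dataset.py`; the function name and the `"content"` field are assumptions about how kernels are stored.

```python
import hashlib


def dedup_exact(kernels: list[dict]) -> list[dict]:
    """Drop kernels whose source text is byte-for-byte identical,
    keeping the first occurrence of each.

    Assumes each record carries its kernel source under 'content'.
    """
    seen: set[str] = set()
    unique = []
    for kernel in kernels:
        # Hash the source so we only keep one copy of each exact match.
        digest = hashlib.sha256(kernel["content"].encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(kernel)
    return unique
```

Exact matching only removes literal duplicates; kernels differing by whitespace or variable names would survive this pass.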