Check out: https://gpu-mode.github.io/popcorn/
We're training an LLM to write GPU code, and we need data to do it. This repo contains the scripts used to generate that data.
At the moment this repo contains the tools to scrape GitHub for every Triton kernel out there and turn it into a useful folder of JSON files. However, we plan to do much more.
Currently the repo works by writing intermediate JSON files into the datasets/ and github_data/ folders. We have the following files, generated in the following order:
# generated by download_repos.py
github_data/triton/github_queries.json # contains the github queries which are run to collect github data
github_data/triton/github_repos.json # contains the repos / associated hashes we care about downloading
# if github_repos.json exists, then we attempt to download all repos from GitHub into github_downloads/triton (ensure the folder is empty before downloading)
github_data/triton/github_metadata.json # contains metadata for the github repos we're downloading
github_data/triton/github_metadata.json # contains metadata for the github repos filtered to only files containing @triton.jit
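The filtering step above — keeping only files that contain `@triton.jit` — can be sketched roughly as follows. This is a minimal illustration, not the actual code in `download_repos.py`; the function name and the returned record shape are assumptions.

```python
import json
from pathlib import Path


def find_triton_files(download_dir: str) -> list[dict]:
    """Walk the downloaded repos and keep only Python files that
    appear to define a Triton kernel (contain '@triton.jit').

    Hypothetical helper for illustration; the real script may
    store different fields in its metadata records.
    """
    matches = []
    for path in Path(download_dir).rglob("*.py"):
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # skip unreadable files
        if "@triton.jit" in text:
            matches.append({"path": str(path), "content": text})
    return matches


# e.g. dump the filtered metadata alongside the other JSON files:
# json.dump(find_triton_files("github_downloads/triton"),
#           open("github_data/triton/github_metadata.json", "w"))
```

A plain substring check like this can over-match (e.g. the string appearing in a comment), so the real pipeline may apply stricter filtering.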
# generated by create_keyword_dataset.py
datasets/triton/dataset.json # contains the full dataset of triton kernels
datasets/triton/dataset_dedup.json # the dataset with exact-duplicate triton kernels removed
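The exact-match dedup that produces `dataset_dedup.json` can be sketched as below. This is a hedged illustration, not the code in `create_keyword_dataset.py`; the function name and the `"content"` field are assumptions about how kernels are stored.

```python
import hashlib


def dedup_exact(kernels: list[dict]) -> list[dict]:
    """Drop kernels whose source text is byte-for-byte identical,
    keeping the first occurrence of each.

    Assumes each record carries its kernel source under 'content'.
    """
    seen: set[str] = set()
    unique = []
    for kernel in kernels:
        # Hash the source so we only keep one copy of each exact match.
        digest = hashlib.sha256(kernel["content"].encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(kernel)
    return unique
```

Exact matching only removes literal duplicates; kernels differing by whitespace or variable names would survive this pass.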