Extending this codebase #39

Open
chris-ha458 opened this issue Jun 18, 2023 · 0 comments

I was looking at this codebase and encountered this bit:
https://github.com/bigscience-workshop/data-preparation/tree/main/sourcing/code_dataset#code-dataset-sourcing

The query to create the dataset can be found in query.sql. After creation the dataset was preprocessed with processing.py. Note that there is a bug in the script that filters only for GPL licenses instead of filtering them out. There are instructions to remove the bug but it is left there for reproducibility.

This leads me to believe that the code here is meant to be used and investigated "as is" and without modification.
Is this repo primarily meant for reproducibility?

If I wanted to improve and extend it for an independent dataset-building project, should I fork it or work from a branch?
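
To make the bug described in the quoted README note concrete, here is a minimal sketch of how I understand it. This is not the actual code from processing.py: the record layout, the `license` field, the substring check, and the function names `keep_only_gpl` and `drop_gpl` are all hypothetical and only illustrate an inverted filter condition.

```python
# Hypothetical records; the real dataset schema may differ.
SAMPLES = [
    {"path": "a.py", "license": "mit"},
    {"path": "b.py", "license": "gpl-3.0"},
    {"path": "c.py", "license": "apache-2.0"},
    {"path": "d.py", "license": "gpl-2.0"},
]


def is_gpl(license_name: str) -> bool:
    # Assumption: GPL licenses are recognised by a "gpl" substring.
    return "gpl" in license_name.lower()


def keep_only_gpl(records):
    # Behaviour the README describes as the bug: only GPL files survive.
    return [r for r in records if is_gpl(r["license"])]


def drop_gpl(records):
    # Intended behaviour: GPL files are filtered out.
    return [r for r in records if not is_gpl(r["license"])]
```

If the real script works along these lines, the fix would be a one-line change of the condition, which is presumably why it was left in place for reproducibility.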
