The query to create the dataset can be found in query.sql. After creation the dataset was preprocessed with processing.py. Note that there is a bug in the script that filters only for GPL licenses instead of filtering them out. There are instructions to remove the bug but it is left there for reproducibility.
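To make the bug described in that excerpt concrete, here is a minimal sketch of the difference between filtering *for* GPL licenses and filtering them *out*. It is not taken from processing.py: the record layout, the `license` field, the `GPL_LICENSES` set, and both function names are hypothetical and used only for illustration.

```python
# Hypothetical license identifiers; the real script may match licenses differently.
GPL_LICENSES = {"gpl-2.0", "gpl-3.0"}


def keep_only_gpl(records):
    """Behaviour described as the bug: only GPL-licensed records survive."""
    return [r for r in records if r["license"] in GPL_LICENSES]


def drop_gpl(records):
    """Intended behaviour: GPL-licensed records are filtered out."""
    return [r for r in records if r["license"] not in GPL_LICENSES]


# Toy records with an assumed layout, not real dataset rows.
records = [
    {"path": "a.py", "license": "mit"},
    {"path": "b.py", "license": "gpl-3.0"},
    {"path": "c.py", "license": "apache-2.0"},
]

buggy = keep_only_gpl(records)
intended = drop_gpl(records)
```

Under these assumptions the two filters differ only by a negated membership test, which is why keeping the bug in place matters for reproducing the original dataset exactly.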
I was looking at this codebase and encountered this bit:
https://github.com/bigscience-workshop/data-preparation/tree/main/sourcing/code_dataset#code-dataset-sourcing
This leads me to believe that the code here is meant to be used and investigated "as is" and without modification.
Is this repo primarily meant for reproducibility?
If I wanted to improve and extend it for an independent dataset-building project, should I fork it or work from a branch?