Extending this codebase #39

chris-ha458 · 2023-06-18T04:48:29Z

I was looking at this codebase and came across this passage in the code dataset sourcing README:
https://github.com/bigscience-workshop/data-preparation/tree/main/sourcing/code_dataset#code-dataset-sourcing

The query to create the dataset can be found in query.sql. After creation the dataset was preprocessed with processing.py. Note that there is a bug in the script that filters only for GPL licenses instead of filtering them out. There are instructions to remove the bug but it is left there for reproducibility.

This leads me to believe that the code here is meant to be used and investigated "as is" and without modification.
Is this repo primarily meant for reproducibility?

If I wanted to improve and extend it for an independent dataset-building project, should I fork it or work from a branch?

