forked from huggingface/datasets
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Changing the name
* style + quality
* update doc and logo
* clean up
* circle-CI on the branch for now
* fix daily dialog dataset
* fix urls

Co-authored-by: Quentin Lhoest <[email protected]>
Showing 428 changed files with 5,147 additions and 4,898 deletions.
# How to contribute to Datasets?

1. Fork the [repository](https://github.com/huggingface/datasets) by clicking on the 'Fork' button on the repository's page. This creates a copy of the code under your GitHub user account.

2. Clone your fork to your local disk, and add the base repository as a remote:

   ```bash
   git clone git@github.com:<your Github handle>/datasets.git
   cd datasets
   git remote add upstream https://github.com/huggingface/datasets.git
   ```

3. Create a new branch to hold your development changes:
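The checkout command for step 3 falls outside this diff hunk; a minimal sketch, assuming plain git. The branch name `my-new-dataset` is a placeholder, and the throwaway repo below only makes the snippet self-contained — in practice you would run the checkout inside the clone from step 2:

```bash
# Stand-in repository so the snippet runs on its own; in practice,
# cd into your datasets clone instead.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Create the development branch and switch to it
# ("my-new-dataset" is a placeholder name).
git checkout -b my-new-dataset

# Show the branch we are now on.
git symbolic-ref --short HEAD
```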
[...]
   ```bash
   pip install -e ".[dev]"
   ```

   (If datasets was already installed in the virtual environment, remove
   it with `pip uninstall datasets` before reinstalling it in editable
   mode with the `-e` flag.)

5. Develop the features on your branch. If you want to add a dataset, see more in-detail instructions in the section [*How to add a dataset*](#how-to-add-a-dataset). Alternatively, you can follow the steps to [add a dataset](https://huggingface.co/datasets/add_dataset.html) and [share a dataset](https://huggingface.co/datasets/share_dataset.html) in the documentation.

6. Format your code. Run black and isort so that your newly added files look nice with the following command:
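The formatting command itself is cut off by the diff hunk here; a sketch under the assumption that plain `black` and `isort` invocations are used (the project may instead expose a Makefile shortcut), demonstrated on a throwaway file:

```bash
# Throwaway file standing in for a new dataset script; in the repository
# you would target the real sources (e.g. the datasets/ directory).
tmp=$(mktemp -d)
printf 'import sys\nimport os\nx = ( 1 )\n' > "$tmp/sample.py"

# Run the formatters in place if they are installed; -q keeps output quiet.
command -v black >/dev/null 2>&1 && black -q "$tmp/sample.py"
command -v isort >/dev/null 2>&1 && isort -q "$tmp/sample.py"

cat "$tmp/sample.py"
```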
[...]
8. Once you are satisfied, go to the webpage of your fork on GitHub. Click on "Pull request" to send your contribution to the project maintainers for review.

## How to add a dataset

1. Make sure you followed steps 1-4 of the section [*How to contribute to Datasets?*](#how-to-contribute-to-datasets).

2. Create your dataset folder under `datasets/<your_dataset_name>` and create your dataset script under `datasets/<your_dataset_name>/<your_dataset_name>.py`. You can check out other dataset scripts under `datasets` for some inspiration. Note on naming: the dataset class should be camel case, while the dataset name is its snake case equivalent (ex: `class BookCorpus(datasets.GeneratorBasedBuilder)` for the dataset `book_corpus`).
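The naming rule above (snake case dataset name, camel case class name) can be sketched with a small shell helper; `to_class_name` is illustrative only, not part of the repository:

```bash
# Convert a snake_case dataset name to its CamelCase class name.
to_class_name() {
  printf '%s\n' "$1" |
    awk -F_ '{ for (i = 1; i <= NF; i++)
                 printf "%s%s", toupper(substr($i, 1, 1)), substr($i, 2)
               print "" }'
}

to_class_name "book_corpus"   # prints BookCorpus
```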
3. **Make sure you run all of the following commands from the root of your `datasets` git clone.** To check that your dataset works correctly and to create its `dataset_infos.json` file run the command:

   ```bash
   python datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
   ```
4. If the command was successful, you should now create some dummy data. Use the following command to get in-detail instructions on how to create the dummy data:

   ```bash
   python datasets-cli dummy_data datasets/<your-dataset-folder>
   ```

5. Now test that both the real data and the dummy data work correctly using the following commands:
[...]

   ```bash
   RUN_SLOW=1 pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_dataset_all_configs_<your-dataset-name>
   ```
6. If all tests pass, your dataset works correctly. Awesome! You can now follow steps 6, 7 and 8 of the section [*How to contribute to Datasets?*](#how-to-contribute-to-datasets). If you experience problems with the dummy data tests, you might want to take a look at the section *Help for dummy data tests* below.

### Help for dummy data tests
[...]

Follow these steps in case the dummy data test keeps failing:
- Verify that all filenames are spelled correctly. Rerun the command

  ```bash
  python datasets-cli dummy_data datasets/<your-dataset-folder>
  ```

  and make sure you follow the exact instructions provided by the command of step 5).