From 767493538412e0cbaa60f5deadd08d70ce7ec0f0 Mon Sep 17 00:00:00 2001
From: Helen Qu <8826297+helenqu@users.noreply.github.com>
Date: Thu, 28 Mar 2024 16:28:59 -0400
Subject: [PATCH] Update CONTRIBUTING.md

---
 CONTRIBUTING.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index adc5a4af..7ae2a7f9 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -15,3 +15,16 @@ If you have a question, roadmap suggestion, or an idea for the AstroPile please
 If you can implement your proposed feature then [fork the AstroPile](https://docs.github.com/en/get-started/quickstart/fork-a-repo) and create a branch with a descriptive name. Once you have your feature implemented, [open up a pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request) and one of the AstroPile admins will review the code and merge to main or come back with comments.
 
 If your pull request is connected to an issue or roadmap item please do not forget to link it.
+
+## How to test your new dataset (HuggingFace)
+
+Suppose you want to add data from a new source `my_data_source` (e.g. a survey or a simulation set). First, make a directory `Astropile_prototype/scripts/my_data_source` and populate it with at least `build_parent_sample.py` and `my_data_source.py`.
+- `build_parent_sample.py` should download the data and save it in the standard HDF5 file format.
+- `my_data_source.py` is a HuggingFace dataset loading script for this data.
+
+To test, there are two options:
+
+1. Run `build_parent_sample.py` with `output_dir` pointing to `Astropile_prototype/scripts/my_data_source`, which downloads the data into the AstroPile scripts directory. When opening the PR, you will then have to add a `.gitignore` file that excludes the data files so they don't get pushed to the remote.
+2. Run `build_parent_sample.py` with `output_dir` pointing elsewhere (e.g. to a scratch directory) and symlink `my_data_source.py` there. The symlink is needed because the dataset loading script must be in the same directory as the HDF5 data (note that the dataset loading script must have the same name as the directory)!
+
+Then, run `load_dataset('/path/to/output_dir')` to ensure the dataset loading works properly.
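
As an illustration of option 2, here is a minimal sketch of the symlink step followed by the `load_dataset` check. The helper names `link_loading_script` and `check_loading` are hypothetical and not part of the AstroPile repository. The sketch assumes that "named the same as the directory" means the link should be called `<directory name>.py`, and that the `datasets` package is installed when `check_loading` is called.

```python
from pathlib import Path


def link_loading_script(script_path, output_dir):
    """Symlink a dataset loading script into output_dir (hypothetical helper).

    The link is named after the directory, e.g. my_data_source/my_data_source.py,
    so that the script name matches the directory name.
    """
    script_path = Path(script_path).resolve()
    output_dir = Path(output_dir).resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    link = output_dir / f"{output_dir.name}.py"
    # Replace a stale symlink only; a regular file with the same name is
    # left untouched and makes symlink_to raise FileExistsError.
    if link.is_symlink():
        link.unlink()
    link.symlink_to(script_path)
    return link


def check_loading(output_dir):
    """Load the dataset from output_dir to confirm the script works."""
    # Imported lazily so the symlink helper works without `datasets` installed.
    from datasets import load_dataset

    # Assumption: some versions of `datasets` additionally expect
    # trust_remote_code=True for script-based datasets.
    return load_dataset(str(output_dir))
```

A typical session would call `link_loading_script("Astropile_prototype/scripts/my_data_source/my_data_source.py", "/path/to/output_dir")` after `build_parent_sample.py` has written its HDF5 files, then `check_loading("/path/to/output_dir")`.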