Merge branch 'main' into mastercat
lhparker1 authored Apr 11, 2024
2 parents 3def81e + 1877c79 commit 3d4f8e4
Showing 12 changed files with 761 additions and 9 deletions.
49 changes: 49 additions & 0 deletions .github/workflows/tiny_dset_test.yml
@@ -0,0 +1,49 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Testing tiny datasets

on:
  push:
    branches: ["main"]
    paths-ignore:
      - "**.md"
  pull_request:
    branches: ["main"]

permissions:
  contents: read

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3
      - name: Set up Python 3.10
        uses: actions/setup-python@v3
        with:
          python-version: "3.10"
          cache: "pip"
          cache-dependency-path: "**/*requirements*.txt"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install flake8 pytest
          pip install -r dset-requirements.txt
      - name: Find all the scripts subfolders and execute the testing script
        env:
          SSP_PDR_USR: ${{ secrets.SSP_PDR_USR }}
          SSP_PDR_PWD: ${{ secrets.SSP_PDR_PWD }}
        run: |
          cd scripts
          for folder in */; do
            echo "Entering $folder"  # Print that we are entering this particular folder
            cd "$folder"
            if [ -f "test.sh" ]; then
              bash test.sh
            fi
            echo "---------- Done --------"
            cd ..
          done
5 changes: 5 additions & 0 deletions .gitignore
@@ -166,3 +166,8 @@ notebooks/*.jpg

__pycache__
lightning_logs

# Excluding data files
scripts/hsc/**/*.hdf5
scripts/hsc/*.hdf
scripts/hsc/*.fits
13 changes: 13 additions & 0 deletions CONTRIBUTING.md
@@ -15,3 +15,16 @@ If you have a question, roadmap suggestion, or an idea for the AstroPile please
If you can implement your proposed feature then [fork the AstroPile](https://docs.github.com/en/get-started/quickstart/fork-a-repo) and create a branch with a descriptive name.

Once you have your feature implemented, [open up a pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request) and one of the AstroPile admins will review the code and merge to main or come back with comments. If your pull request is connected to an issue or roadmap item please do not forget to link it.

## How to test your new dataset (HuggingFace)

Let's pretend you're trying to add data from a new source `my_data_source` (e.g. a survey or simulation set). First, make a directory `Astropile_prototype/scripts/my_data_source` and populate it with at least `build_parent_sample.py` and `my_data_source.py`.
- `build_parent_sample.py` should download the data and save it in the standard HDF5 file format.
- `my_data_source.py` is a HuggingFace dataset loading script for this data.

To test, there are two options:

1. Run `build_parent_sample.py` with `output_dir` pointing to `Astropile_prototype/scripts/my_data_source`, which downloads the data into the AstroPile scripts location. Then, when opening the PR, you'll have to add a `.gitignore` file indicating that the data files should be ignored, so they don't get pushed to the remote.
2. Run `build_parent_sample.py` with `output_dir` pointing elsewhere (e.g. to a scratch directory) and symlink `my_data_source.py` into that directory. The dataset loading script must live in the same directory as the HDF5 data, and it must have the same name as that directory!

Then, run `load_dataset('/path/to/output_dir')` to ensure the dataset loading works properly.
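For concreteness, here is a minimal sketch of option 1 run from the repository root. The command-line interface of `build_parent_sample.py` differs between surveys, so treat the flags and paths below as placeholders rather than the exact invocation:

```python
# Sketch of option 1: build the parent sample into the scripts directory, then
# check that the HuggingFace loading script works. The --output_dir flag is a
# placeholder; check your build_parent_sample.py for its actual arguments.
#
#   python scripts/my_data_source/build_parent_sample.py --output_dir scripts/my_data_source
#
from datasets import load_dataset

# The loading script my_data_source.py must sit next to the HDF5 files and
# share the name of its directory, as noted above.
dset = load_dataset(
    "scripts/my_data_source",
    trust_remote_code=True,  # needed to execute the custom loading script
)
print(dset)
```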
231 changes: 231 additions & 0 deletions DESIGN.md
Original file line number Diff line number Diff line change
@@ -116,6 +116,237 @@ Optional fields can include:

- extra: an object with survey specific extra data or metadata not strictly necessary but perhaps useful

## Illustrated HuggingFace Dataset generator

The easiest way to add data to the AstroPile is via a [HuggingFace-style dataset generator](https://huggingface.co/docs/datasets/image_dataset#loading-script). Here we'll briefly go over the main parts of the generator, using the [DESI dataloading script](https://github.com/AstroPile/AstroPile_prototype/blob/main/scripts/desi/desi.py) as an example.

First we import the usual suspects (`h5py` and `numpy` for data processing, and `itertools` for chaining the per-file iterators). From HuggingFace we import the `datasets` module, along with some 'features' that we will later use to define the data type of each column. You may need different columnar features for your dataset; a list is [available here](https://huggingface.co/docs/datasets/v2.18.0/en/package_reference/main_classes#datasets.Features).

```python
import datasets
from datasets import Features, Value, Array2D, Sequence
from datasets.data_files import DataFilesPatternsDict
import itertools
import h5py
import numpy as np
```

Optionally, in the script preamble we can add some metadata to our dataset, such as a citation pointing to an upstream source, a dataset description, a web link, a code license, and a version number. These values are passed into the `datasets.DatasetInfo` object returned by the `_info` method of our dataloader.

```python
_CITATION = """\
@InProceedings{huggingface:dataset,
title = {A great new dataset},
author={huggingface, Inc.
},
year={2020}
}
"""

_DESCRIPTION = """\
Spectra dataset based on DESI EDR SV3.
"""

_HOMEPAGE = ""

_LICENSE = ""

_VERSION = "0.0.1"
```

We can also add our columnar features in the preamble, to be incorporated into our dataloader later in the script:

```python
_BOOL_FEATURES = [
    "ZWARN"
]

_FLOAT_FEATURES = [
    "Z",
    "ZERR",
    "EBV",
    "FLUX_G",
    "FLUX_R",
    "FLUX_Z",
    "FLUX_IVAR_G",
    "FLUX_IVAR_R",
    "FLUX_IVAR_Z",
    "FIBERFLUX_G",
    "FIBERFLUX_R",
    "FIBERFLUX_Z",
    "FIBERTOTFLUX_G",
    "FIBERTOTFLUX_R",
    "FIBERTOTFLUX_Z",
]
```

Now the fun begins :rocket:. Here we set up a `GeneratorBasedBuilder` subclass, and we'll go over each part of this class step by step.

```python
class DESI(datasets.GeneratorBasedBuilder):
"""TODO: Short description of my dataset."""

VERSION = _VERSION
```

The [builder config](https://huggingface.co/docs/datasets/v2.18.0/en/package_reference/builder_classes#datasets.BuilderConfig) defines parameters used in the dataset building process. In AstroPile we work with `*.hdf5` files, so we search for these in our dataset directory with `DataFilesPatternsDict.from_patterns`:

```python
BUILDER_CONFIGS = [
    datasets.BuilderConfig(
        name="edr_sv3",
        version=VERSION,
        data_files=DataFilesPatternsDict.from_patterns(
            {"train": ["edr_sv3/healpix=*/*.hdf5"]}
        ),
        description="One percent survey of the DESI Early Data Release.",
    ),
]

DEFAULT_CONFIG_NAME = "edr_sv3"

_spectrum_length = 7781
```

The `_info` function defines the columnar features and other information about our dataset; we have added in-line comment explanations so that the function flow is obvious.

```python
@classmethod
def _info(self):
    # First we add all features common to spectral datasets.
    # Note that a Sequence requires sub-features so that we can parse it!
    # For the spectrum sequence we have added four float32 Value features
    features = {
        "spectrum": Sequence({
            "flux": Value(dtype="float32"),
            "ivar": Value(dtype="float32"),
            "lsf_sigma": Value(dtype="float32"),
            "lambda": Value(dtype="float32"),
        }, length=self._spectrum_length)
    }

    # Now we add all the values from the catalog that we defined earlier
    # in the script; we can add them just like we would to a normal python
    # dict
    for f in _FLOAT_FEATURES:
        features[f] = Value("float32")

    # Adding all boolean flags
    for f in _BOOL_FEATURES:
        features[f] = Value("bool")

    # Finally we add an object ID for later cross matching and search
    features["object_id"] = Value("string")

    # And we return the above information as a DatasetInfo object,
    # alongside some of the global params we defined in the preamble
    return datasets.DatasetInfo(
        # This is the description that will appear on the datasets page.
        description=_DESCRIPTION,
        # This defines the different columns of the dataset and their types
        features=Features(features),
        # Homepage of the dataset for documentation
        homepage=_HOMEPAGE,
        # License for the dataset if available
        license=_LICENSE,
        # Citation for the dataset
        citation=_CITATION,
    )
```

The [split generator](https://huggingface.co/docs/datasets/image_dataset#download-and-define-the-dataset-splits) function splits our dataset into multiple chunks. Usually this is used for train/test/validation split, but here we define a single 'train' split.

```python
def _split_generators(self, dl_manager):
    """We handle string, list and dicts in datafiles"""
    if not self.config.data_files:
        raise ValueError(
            f"At least one data file must be specified, but got data_files={self.config.data_files}"
        )
    data_files = dl_manager.download_and_extract(self.config.data_files)
    if isinstance(data_files, (str, list, tuple)):
        files = data_files
        if isinstance(files, str):
            files = [files]
        # Use `dl_manager.iter_files` to skip hidden files in an extracted archive
        files = [dl_manager.iter_files(file) for file in files]
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"files": files}
            )
        ]
    splits = []
    for split_name, files in data_files.items():
        if isinstance(files, str):
            files = [files]
        # Use `dl_manager.iter_files` to skip hidden files in an extracted archive
        files = [dl_manager.iter_files(file) for file in files]
        splits.append(
            datasets.SplitGenerator(name=split_name, gen_kwargs={"files": files})
        )
    return splits
```

Finally we define the example generator. This is a generator that yields rows of our dataset according to the structure we defined in our features dict, keying each row with its object ID string.

```python
def _generate_examples(self, files, object_ids=None):
    """Yields examples as (key, example) tuples."""
    for j, file in enumerate(itertools.chain.from_iterable(files)):
        with h5py.File(file, "r") as data:
            if object_ids is not None:
                keys = object_ids[j]
            else:
                keys = data["object_id"][:]

            # Preparing an index for fast searching through the catalog
            sort_index = np.argsort(data["object_id"][:])
            sorted_ids = data["object_id"][:][sort_index]

            for k in keys:
                # Extract the indices of requested ids in the catalog
                i = sort_index[np.searchsorted(sorted_ids, k)]

                # Parse spectrum data
                example = {
                    "spectrum": {
                        "flux": data['spectrum_flux'][i],
                        "ivar": data['spectrum_ivar'][i],
                        "lsf_sigma": data['spectrum_lsf_sigma'][i],
                        "lambda": data['spectrum_lambda'][i],
                    }
                }
                # Add all other requested features
                for f in _FLOAT_FEATURES:
                    example[f] = data[f][i].astype("float32")

                # Add all boolean flags
                for f in _BOOL_FEATURES:
                    example[f] = not bool(data[f][i])  # if the flag is 0, there is no problem

                # Add object_id
                example["object_id"] = str(data["object_id"][i])

                yield str(data["object_id"][i]), example
```

To load our newly generated dataset into a downstream script we can again use a HuggingFace tool (`datasets.load_dataset`):

```python
from datasets import load_dataset

print('Preparing DESI dataset')
dset_desi = load_dataset(
    'AstroPile/desi',        # this is the path to the directory containing our dataloading script
    trust_remote_code=True,  # we need to enable this so that we can run a custom dataloading script
    num_proc=32,             # this is the number of parallel processes
    cache_dir='./hf_cache'   # this is the directory where HuggingFace caches the dataset
)
```

Now we are working with a normal HF dataset object, so we can use [all the upstream code as-is](https://huggingface.co/docs/datasets/index).
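For example, here is a short sketch of pulling a single row out of the dataset built above (assuming the `edr_sv3` configuration; the column names follow the features we defined in `_info`):

```python
import numpy as np

# Grab the first example from the 'train' split and inspect the spectrum
# columns defined in the features dict above.
example = dset_desi['train'][0]

flux = np.asarray(example['spectrum']['flux'])        # length _spectrum_length
wavelength = np.asarray(example['spectrum']['lambda'])
print(example['object_id'], example['Z'], flux.shape)
```

From here the usual HuggingFace machinery (`with_format('torch')`, `map`, `filter`, streaming, and so on) applies unchanged.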

## Data Architecture

AstroPile will provide utilities to easily generate cross-matched or concatenated versions of these datasets. Below is an example of how the user may generate a cross-matched dataset from the HSC and DECaLS surveys:
2 changes: 2 additions & 0 deletions README.md
@@ -1,4 +1,6 @@
# Prototype Implementation for AstroPile
[![Testing tiny datasets](https://github.com/AstroPile/AstroPile_prototype/actions/workflows/tiny_dset_test.yml/badge.svg)](https://github.com/AstroPile/AstroPile_prototype/actions/workflows/tiny_dset_test.yml)

Project to collect all the data!

For a lightweight prototype of the functionality included in this repository, please see the [Lightweight Prototype](https://colab.research.google.com/drive/1t9dXqqeozrGjsx02q14a4Kmmp6GEhBYq?usp=sharing#scrollTo=yMKtJVxWlx24).
76 changes: 76 additions & 0 deletions baselines/photo_z/photo_z_wrapper.py
@@ -0,0 +1,76 @@
import sys
sys.path.append('../')
import torch
from pytorch_lightning import LightningDataModule
from torch.utils.data import Dataset, DataLoader
from utils import split_dataset, compute_dataset_statistics, normalize_sample, get_nested
from typing import Any

class PhotoZWrapper(LightningDataModule):
    def __init__(self,
                 train_dataset: Dataset,
                 test_dataset: Dataset,
                 batch_size: int = 128,
                 num_workers: int = 8,
                 test_size: float = 0.2,
                 split_method: str = 'naive',
                 loading: str = 'iterated',
                 feature_flag: str = 'image',
                 label_flag: str = 'z',
                 dynamic_range: bool = True):
        """
        Initializes the data module with datasets that are already loaded, setting up parameters
        for data processing and batch loading.

        Parameters:
        - train_dataset / test_dataset (Dataset): The pre-loaded datasets, expected to be torch.utils.data.Dataset objects with images of size B x C x H x W.
        - batch_size (int): The size of each data batch for loading.
        - num_workers (int): Number of subprocesses to use for data loading.
        - test_size (float): The proportion of the training dataset to reserve for validation.
        - split_method (str): Strategy for splitting the dataset ('naive' implemented).
        - loading (str): Approach for loading the dataset ('full' or 'iterated').
        - feature_flag (str): The key in the dataset corresponding to the image data.
        - label_flag (str): The key in the dataset corresponding to the redshift data.
        - dynamic_range (bool): Flag indicating whether dynamic range compression should be applied.
        """
        super().__init__()  # initialize the LightningDataModule internals

        self.train_dataset = train_dataset
        self.test_dataset = test_dataset
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.test_size = test_size
        self.loading = loading
        self.dynamic_range = dynamic_range
        self.feature_flag = feature_flag
        self.label_flag = label_flag

    def prepare_data(self):
        # Compute the dataset statistics
        self.img_mean, self.img_std = compute_dataset_statistics(self.train_dataset, flag=self.feature_flag, loading=self.loading)
        self.z_mean, self.z_std = compute_dataset_statistics(self.train_dataset, flag=self.label_flag, loading=self.loading)

        # For correct broadcasting
        self.img_mean, self.img_std = self.img_mean[:,None,None], self.img_std[:,None,None]

        # Split the dataset
        train_test_split = self.train_dataset.train_test_split(test_size=self.test_size)
        self.train_dataset = train_test_split['train']
        self.val_dataset = train_test_split['test']

    def setup(self, stage=None):
        pass

    def collate_fn(self, batch):
        batch = torch.utils.data.default_collate(batch)
        x = normalize_sample(get_nested(batch, self.feature_flag), self.img_mean, self.img_std, dynamic_range=self.dynamic_range)  # dynamic range compression and z-score normalization
        y = normalize_sample(get_nested(batch, self.label_flag), self.z_mean, self.z_std, dynamic_range=False)
        return x, y

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size, num_workers=self.num_workers, collate_fn=self.collate_fn)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=self.batch_size, num_workers=self.num_workers, collate_fn=self.collate_fn)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.batch_size, num_workers=self.num_workers, collate_fn=self.collate_fn)