Merge branch 'main' into mastercat
lhparker1 authored Apr 11, 2024
2 parents 3def81e + 1877c79 commit 3d4f8e4
Showing 12 changed files with 761 additions and 9 deletions.
49 changes: 49 additions & 0 deletions .github/workflows/tiny_dset_test.yml
@@ -0,0 +1,49 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Testing tiny datasets

on:
  push:
    branches: ["main"]
    paths-ignore:
      - "**.md"
  pull_request:
    branches: ["main"]

permissions:
  contents: read

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3
      - name: Set up Python 3.10
        uses: actions/setup-python@v3
        with:
          python-version: "3.10"
          cache: "pip"
          cache-dependency-path: "**/*requirements*.txt"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install flake8 pytest
          pip install -r dset-requirements.txt
      - name: Find all the scripts subfolders and execute the testing script
        env:
          SSP_PDR_USR: ${{ secrets.SSP_PDR_USR }}
          SSP_PDR_PWD: ${{ secrets.SSP_PDR_PWD }}
        run: |
          cd scripts
          for folder in */; do
            echo "Entering $folder"  # Print that we are entering this particular folder
            cd "$folder"
            if [ -f "test.sh" ]; then
              bash test.sh
            fi
            echo "---------- Done --------"
            cd ..
          done
5 changes: 5 additions & 0 deletions .gitignore
@@ -166,3 +166,8 @@ notebooks/*.jpg

__pycache__
lightning_logs

# Excluding data files
scripts/hsc/**/*.hdf5
scripts/hsc/*.hdf
scripts/hsc/*.fits
13 changes: 13 additions & 0 deletions CONTRIBUTING.md
@@ -15,3 +15,16 @@ If you have a question, roadmap suggestion, or an idea for the AstroPile please
If you can implement your proposed feature then [fork the AstroPile](https://docs.github.com/en/get-started/quickstart/fork-a-repo) and create a branch with a descriptive name.

Once you have your feature implemented, [open up a pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request) and one of the AstroPile admins will review the code and merge to main or come back with comments. If your pull request is connected to an issue or roadmap item please do not forget to link it.

## How to test your new dataset (HuggingFace)

Let's pretend you're trying to add data from a new source `my_data_source` (e.g. a survey or simulation set). First, make a directory `Astropile_prototype/scripts/my_data_source` and populate it with at least `build_parent_sample.py` and `my_data_source.py`.
- `build_parent_sample.py` should download the data and save it in the standard HDF5 file format.
- `my_data_source.py` is a HuggingFace dataset loading script for this data.

To test, there are two options:

1. Run `build_parent_sample.py` with `output_dir` pointing to `Astropile_prototype/scripts/my_data_source`, which downloads the data into the AstroPile scripts location. Then, when opening the PR, you'll have to add a `.gitignore` file indicating that the data files should be ignored, so they don't get pushed to the remote.
2. Run `build_parent_sample.py` with `output_dir` pointing elsewhere (e.g. to a scratch directory) and symlink `my_data_source.py` into that directory. The dataset loading script must live in the same directory as the HDF5 data, and it must have the same name as that directory!

Then, run `load_dataset('/path/to/output_dir')` to ensure the dataset loading works properly.
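For concreteness, here is a minimal sketch of option 1 run from the repository root. The command-line interface of `build_parent_sample.py` differs between surveys, so treat the flags and paths below as placeholders rather than the exact invocation:

```python
# Sketch of option 1: build the parent sample into the scripts directory, then
# check that the HuggingFace loading script works. The --output_dir flag is a
# placeholder; check your build_parent_sample.py for its actual arguments.
#
#   python scripts/my_data_source/build_parent_sample.py --output_dir scripts/my_data_source
#
from datasets import load_dataset

# The loading script my_data_source.py must sit next to the HDF5 files and
# share the name of its directory, as noted above.
dset = load_dataset(
    "scripts/my_data_source",
    trust_remote_code=True,  # needed to execute the custom loading script
)
print(dset)
```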
231 changes: 231 additions & 0 deletions DESIGN.md
Original file line number Diff line number Diff line change
@@ -116,6 +116,237 @@ Optional fields can include:

- extra: an object with survey specific extra data or metadata not strictly necessary but perhaps useful

## Illustrated HuggingFace Dataset generator

The easiest way to add data to the AstroPile is via a [HuggingFace-style dataset generator](https://huggingface.co/docs/datasets/image_dataset#loading-script). Here we'll briefly go over the main parts of the generator, using the [DESI dataloading script](https://github.com/AstroPile/AstroPile_prototype/blob/main/scripts/desi/desi.py) as an example.

First we import the usual suspects (`h5py` and `numpy` for data processing, and `itertools` for chaining the per-file iterators). From HuggingFace we import the `datasets` module, along with some 'features' that we will later use to define the data type of each column. You may need different columnar features for your dataset; a list is [available here](https://huggingface.co/docs/datasets/v2.18.0/en/package_reference/main_classes#datasets.Features).

```python
import datasets
from datasets import Features, Value, Array2D, Sequence
from datasets.data_files import DataFilesPatternsDict
import itertools
import h5py
import numpy as np
```

Optionally, in the script preamble we can add some metadata to our dataset, such as a citation pointing to an upstream source, a dataset description, a web link, a code license, and a version number. These values are passed into the `datasets.DatasetInfo` object returned by the `_info` method of our dataloader.

```python
_CITATION = """\
@InProceedings{huggingface:dataset,
title = {A great new dataset},
author={huggingface, Inc.
},
year={2020}
}
"""

_DESCRIPTION = """\
Spectra dataset based on DESI EDR SV3.
"""

_HOMEPAGE = ""

_LICENSE = ""

_VERSION = "0.0.1"
```

We can also add our columnar features in the preamble, to be incorporated into our dataloader later in the script:

```python
_BOOL_FEATURES = [
    "ZWARN"
]

_FLOAT_FEATURES = [
    "Z",
    "ZERR",
    "EBV",
    "FLUX_G",
    "FLUX_R",
    "FLUX_Z",
    "FLUX_IVAR_G",
    "FLUX_IVAR_R",
    "FLUX_IVAR_Z",
    "FIBERFLUX_G",
    "FIBERFLUX_R",
    "FIBERFLUX_Z",
    "FIBERTOTFLUX_G",
    "FIBERTOTFLUX_R",
    "FIBERTOTFLUX_Z",
]
```

Now the fun begins :rocket:. Here we set up a `GeneratorBasedBuilder` subclass, and we'll go over each part of this class step by step.

```python
class DESI(datasets.GeneratorBasedBuilder):
"""TODO: Short description of my dataset."""

VERSION = _VERSION
```

The [builder config](https://huggingface.co/docs/datasets/v2.18.0/en/package_reference/builder_classes#datasets.BuilderConfig) defines parameters used in the dataset building process. In AstroPile we work with `*.hdf5` files, so we search for these in our dataset directory with `DataFilesPatternsDict.from_patterns`:

```python
BUILDER_CONFIGS = [
    datasets.BuilderConfig(
        name="edr_sv3",
        version=VERSION,
        data_files=DataFilesPatternsDict.from_patterns(
            {"train": ["edr_sv3/healpix=*/*.hdf5"]}
        ),
        description="One percent survey of the DESI Early Data Release.",
    ),
]

DEFAULT_CONFIG_NAME = "edr_sv3"

_spectrum_length = 7781
```

The `_info` function defines the columnar features and other information about our dataset; we have added in-line comment explanations so that the function flow is obvious.

```python
@classmethod
def _info(self):
    # First we add all features common to spectral datasets.
    # Note that a Sequence requires sub-features so that we can parse it!
    # For the spectrum sequence we have added four float32 Value features
    features = {
        "spectrum": Sequence({
            "flux": Value(dtype="float32"),
            "ivar": Value(dtype="float32"),
            "lsf_sigma": Value(dtype="float32"),
            "lambda": Value(dtype="float32"),
        }, length=self._spectrum_length)
    }

    # Now we add all the values from the catalog that we defined earlier
    # in the script; we can add them just like we would to a normal python
    # dict
    for f in _FLOAT_FEATURES:
        features[f] = Value("float32")

    # Adding all boolean flags
    for f in _BOOL_FEATURES:
        features[f] = Value("bool")

    # Finally we add an object ID for later cross matching and search
    features["object_id"] = Value("string")

    # And we return the above information as a DatasetInfo object,
    # alongside some of the global params we defined in the preamble
    return datasets.DatasetInfo(
        # This is the description that will appear on the datasets page.
        description=_DESCRIPTION,
        # This defines the different columns of the dataset and their types
        features=Features(features),
        # Homepage of the dataset for documentation
        homepage=_HOMEPAGE,
        # License for the dataset if available
        license=_LICENSE,
        # Citation for the dataset
        citation=_CITATION,
    )
```

The [split generator](https://huggingface.co/docs/datasets/image_dataset#download-and-define-the-dataset-splits) function splits our dataset into multiple chunks. Usually this is used for train/test/validation split, but here we define a single 'train' split.

```python
def _split_generators(self, dl_manager):
    """We handle string, list and dicts in datafiles"""
    if not self.config.data_files:
        raise ValueError(
            f"At least one data file must be specified, but got data_files={self.config.data_files}"
        )
    data_files = dl_manager.download_and_extract(self.config.data_files)
    if isinstance(data_files, (str, list, tuple)):
        files = data_files
        if isinstance(files, str):
            files = [files]
        # Use `dl_manager.iter_files` to skip hidden files in an extracted archive
        files = [dl_manager.iter_files(file) for file in files]
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"files": files}
            )
        ]
    splits = []
    for split_name, files in data_files.items():
        if isinstance(files, str):
            files = [files]
        # Use `dl_manager.iter_files` to skip hidden files in an extracted archive
        files = [dl_manager.iter_files(file) for file in files]
        splits.append(
            datasets.SplitGenerator(name=split_name, gen_kwargs={"files": files})
        )
    return splits
```

Finally we define the example generator. This is a generator that yields rows of our dataset according to the structure we defined in our features dict, keying each row with its object ID string.

```python
def _generate_examples(self, files, object_ids=None):
    """Yields examples as (key, example) tuples."""
    for j, file in enumerate(itertools.chain.from_iterable(files)):
        with h5py.File(file, "r") as data:
            if object_ids is not None:
                keys = object_ids[j]
            else:
                keys = data["object_id"][:]

            # Preparing an index for fast searching through the catalog
            sort_index = np.argsort(data["object_id"][:])
            sorted_ids = data["object_id"][:][sort_index]

            for k in keys:
                # Extract the indices of requested ids in the catalog
                i = sort_index[np.searchsorted(sorted_ids, k)]

                # Parse spectrum data
                example = {
                    "spectrum": {
                        "flux": data['spectrum_flux'][i],
                        "ivar": data['spectrum_ivar'][i],
                        "lsf_sigma": data['spectrum_lsf_sigma'][i],
                        "lambda": data['spectrum_lambda'][i],
                    }
                }
                # Add all other requested features
                for f in _FLOAT_FEATURES:
                    example[f] = data[f][i].astype("float32")

                # Add all boolean flags
                for f in _BOOL_FEATURES:
                    example[f] = not bool(data[f][i])  # if the flag is 0, there is no problem

                # Add object_id
                example["object_id"] = str(data["object_id"][i])

                yield str(data["object_id"][i]), example
```

To load our newly generated dataset into a downstream script we can again use a HuggingFace tool (`datasets.load_dataset`):

```python
from datasets import load_dataset

print('Preparing DESI dataset')
dset_desi = load_dataset(
    'AstroPile/desi',        # this is the path to the directory containing our dataloading script
    trust_remote_code=True,  # we need to enable this so that we can run a custom dataloading script
    num_proc=32,             # this is the number of parallel processes
    cache_dir='./hf_cache'   # this is the directory where HuggingFace caches the dataset
)
```

Now we are working with a normal HF dataset object, so we can use [all the upstream code as-is](https://huggingface.co/docs/datasets/index).
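For example, here is a short sketch of pulling a single row out of the dataset built above (assuming the `edr_sv3` configuration; the column names follow the features we defined in `_info`):

```python
import numpy as np

# Grab the first example from the 'train' split and inspect the spectrum
# columns defined in the features dict above.
example = dset_desi['train'][0]

flux = np.asarray(example['spectrum']['flux'])        # length _spectrum_length
wavelength = np.asarray(example['spectrum']['lambda'])
print(example['object_id'], example['Z'], flux.shape)
```

From here the usual HuggingFace machinery (`with_format('torch')`, `map`, `filter`, streaming, and so on) applies unchanged.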

## Data Architecture

AstroPile will provide utilities to easily generate cross-matched or concatenated versions of these datasets. Below is an example of how the user may generate a cross-matched dataset from the HSC and DECaLS surveys:
2 changes: 2 additions & 0 deletions README.md
@@ -1,4 +1,6 @@
# Prototype Implementation for AstroPile
[![Testing tiny datasets](https://github.com/AstroPile/AstroPile_prototype/actions/workflows/tiny_dset_test.yml/badge.svg)](https://github.com/AstroPile/AstroPile_prototype/actions/workflows/tiny_dset_test.yml)

Project to collect all the data!

For a lightweight prototype of the functionality included in this repository, please see the [Lightweight Prototype](https://colab.research.google.com/drive/1t9dXqqeozrGjsx02q14a4Kmmp6GEhBYq?usp=sharing#scrollTo=yMKtJVxWlx24).
76 changes: 76 additions & 0 deletions baselines/photo_z/photo_z_wrapper.py
@@ -0,0 +1,76 @@
import sys
sys.path.append('../')
import torch
from pytorch_lightning import LightningDataModule
from torch.utils.data import Dataset, DataLoader
from utils import split_dataset, compute_dataset_statistics, normalize_sample, get_nested
from typing import Any

class PhotoZWrapper(LightningDataModule):
    def __init__(self,
                 train_dataset: Dataset,
                 test_dataset: Dataset,
                 batch_size: int = 128,
                 num_workers: int = 8,
                 test_size: float = 0.2,
                 split_method: str = 'naive',
                 loading: str = 'iterated',
                 feature_flag: str = 'image',
                 label_flag: str = 'z',
                 dynamic_range: bool = True):
        """
        Initializes the data module with datasets that are already loaded, setting up parameters
        for data processing and batch loading.

        Parameters:
        - train_dataset / test_dataset (Dataset): The pre-loaded datasets, expected to be torch.utils.data.Dataset objects with images of size B x C x H x W.
        - batch_size (int): The size of each data batch for loading.
        - num_workers (int): Number of subprocesses to use for data loading.
        - test_size (float): The proportion of the training dataset to reserve for validation.
        - split_method (str): Strategy for splitting the dataset ('naive' implemented).
        - loading (str): Approach for loading the dataset ('full' or 'iterated').
        - feature_flag (str): The key in the dataset corresponding to the image data.
        - label_flag (str): The key in the dataset corresponding to the redshift data.
        - dynamic_range (bool): Flag indicating whether dynamic range compression should be applied.
        """
        super().__init__()  # initialize the LightningDataModule internals

        self.train_dataset = train_dataset
        self.test_dataset = test_dataset
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.test_size = test_size
        self.loading = loading
        self.dynamic_range = dynamic_range
        self.feature_flag = feature_flag
        self.label_flag = label_flag

    def prepare_data(self):
        # Compute the dataset statistics
        self.img_mean, self.img_std = compute_dataset_statistics(self.train_dataset, flag=self.feature_flag, loading=self.loading)
        self.z_mean, self.z_std = compute_dataset_statistics(self.train_dataset, flag=self.label_flag, loading=self.loading)

        # For correct broadcasting
        self.img_mean, self.img_std = self.img_mean[:,None,None], self.img_std[:,None,None]

        # Split the dataset
        train_test_split = self.train_dataset.train_test_split(test_size=self.test_size)
        self.train_dataset = train_test_split['train']
        self.val_dataset = train_test_split['test']

    def setup(self, stage=None):
        pass

    def collate_fn(self, batch):
        batch = torch.utils.data.default_collate(batch)
        x = normalize_sample(get_nested(batch, self.feature_flag), self.img_mean, self.img_std, dynamic_range=self.dynamic_range)  # dynamic range compression and z-score normalization
        y = normalize_sample(get_nested(batch, self.label_flag), self.z_mean, self.z_std, dynamic_range=False)
        return x, y

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size, num_workers=self.num_workers, collate_fn=self.collate_fn)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=self.batch_size, num_workers=self.num_workers, collate_fn=self.collate_fn)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.batch_size, num_workers=self.num_workers, collate_fn=self.collate_fn)