
stream documentation
JordanLaserGit committed Dec 11, 2023
1 parent 3a40649 commit abc2d8d
Showing 2 changed files with 100 additions and 15 deletions.
86 changes: 79 additions & 7 deletions README.md
@@ -1,11 +1,83 @@
# Data Access
Running ngen requires building a standard run directory complete with the necessary files. Below is an explanation of the standard; an example can be found [here](https://github.com/CIROH-UA/ngen-datastream/tree/main/data/standard_run), and a reference discussion of the standard is [here](https://github.com/CIROH-UA/NGIAB-CloudInfra/pull/17).
# NextGen Datastream
The datastream automates the process of collecting and formatting input data for NextGen, orchestrating the NextGen run through NextGen In a Box (NGIAB), and handling outputs. In its current implementation, the datastream is a shell script that orchestrates each step in the process.

An ngen run directory `data_dir` is composed of three necessary subfolders `config, forcings, outputs` and an optional fourth subfolder `metadata`. `data_dir` may have any name, but the subfolders must follow this naming convention.
## Install
Just clone this repo; the stream handles initialization and installation of the datastream tools.

## Run it
```
/ngen-datastream/scripts/stream.sh ./configs/conf_datastream.json
```

## Formatting `conf_datastream.json`
### globals
| Field | Description | Required |
|-------------------|--------------------------|------|
| start_time | Start simulation time (YYYYMMDDHHMM) | :white_check_mark: |
| end_time | End simulation time (YYYYMMDDHHMM) | :white_check_mark: |
| data_dir | Name used in constructing the parent directory of the datastream. Must not exist prior to datastream run | :white_check_mark: |
| resource_dir | Folder name that contains the datastream resources. If not provided, datastream will create this folder with default options | |
| relative_to | Absolute path prepended to any other relative path given in the configuration file | |
| subset_id | Catchment ID to subset. If not provided, the geopackage in `resource_dir` defines the spatial domain in its entirety | Required only if `resource_dir` is not given |

### Example `conf_datastream.json`
```
{
  "globals" : {
    "start_time"   : "",
    "end_time"     : "",
    "data_dir"     : "ngen-datastream-test",
    "resource_dir" : "datastream-resources-dev",
    "relative_to"  : "/home/jlaser/code/CIROH/ngen-datastream/data",
    "subset_id"    : ""
  }
}
```
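Before launching a run, a quick sanity check on the configuration can save a failed invocation. The snippet below is a minimal, illustrative pre-flight check, not part of the datastream itself; the required field names come from the table above.

```sh
#!/usr/bin/env bash
# Illustrative pre-flight check: confirm the required globals appear in
# conf_datastream.json before invoking stream.sh.
check_conf() {
  local conf="$1"
  local key
  for key in start_time end_time data_dir; do
    if ! grep -q "\"$key\"" "$conf"; then
      echo "missing required field: $key"
      return 1
    fi
  done
  echo "conf ok"
}

# Demo on a throwaway config file.
conf=$(mktemp)
cat > "$conf" <<'EOF'
{
  "globals" : {
    "start_time" : "202312010000",
    "end_time"   : "202312020000",
    "data_dir"   : "ngen-datastream-test"
  }
}
EOF
check_conf "$conf"   # prints "conf ok"
```

This only checks for key presence, not value validity; a real check might also verify the `YYYYMMDDHHMM` timestamp format.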

## NextGen Datastream Directory Structure
```
data_dir/
├── datastream-configs/
├── datastream-resources/
└── ngen-run/
```
`datastream-configs/` holds all of the configuration files the datastream needs in order to run. Note: the datastream may modify `conf_datastream.json` and generate its own internal configs. `datastream-configs/` is the first place to look to confirm that a datastream run was executed according to the user's specifications.
Example directory:
```
datastream-configs/
├── conf_datastream.json
├── conf_forcingprocessor.json
└── conf_nwmurl.json
```
`datastream-resources/` holds the data files needed for the computations the datastream performs. The user can supply this directory by pointing the configuration file's `resource_dir` at it. If not given by the user, the datastream will generate this folder with these [defaults](#resource_dir).
`ngen-run` follows the directory structure described [here](#nextgen-run-directory-structure).
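The layout above can be scaffolded by hand for experimentation. A minimal sketch follows; the `data_dir` name is illustrative, since the real script derives it from `conf_datastream.json`.

```sh
#!/usr/bin/env bash
# Sketch of the top-level scaffold the datastream builds.
# "data_dir" here is an example name; stream.sh reads it from the config.
data_dir="ngen-datastream-test"
mkdir -p "$data_dir"/datastream-configs \
         "$data_dir"/datastream-resources \
         "$data_dir"/ngen-run
ls "$data_dir"
```

Note that `stream.sh` requires `data_dir` to not exist before a run, so a scaffold like this is only useful for inspecting the layout, not as input to the datastream.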

### resource_dir
TODO: explain defaults used in the automated build

### Useful Hacks
TODO: Daily

## NextGen Run Directory Structure
Running ngen requires building a standard run directory complete with the necessary files; the datastream constructs this automatically. Below is an explanation of the standard. A reference discussion of the standard is [here](https://github.com/CIROH-UA/NGIAB-CloudInfra/pull/17).

An ngen run directory, `ngen-run`, is composed of three necessary subfolders (`config`, `forcings`, `outputs`) and an optional fourth subfolder (`metadata`).

```
ngen-run/
├── config/
├── forcings/
@@ -15,15 +87,15 @@ data_dir/
├── outputs/
```

The `data_dir` directory contains the following subfolders:
The `ngen-run` directory contains the following subfolders:

- `config`: model configuration files and hydrofabric configuration files. A deeper explanation can be found [here](#configuration-directory)
- `forcings`: catchment-level forcing timeseries files. These can be generated with the [forcingprocessor](https://github.com/CIROH-UA/ngen-datastream/tree/main/forcingprocessor). Forcing files contain variables like wind speed, temperature, precipitation, and solar radiation.
- `metadata`: an optional subfolder. This is programmatically generated and used internally by ngen. Do not edit this folder.
- `outputs`: This is where ngen will place the output files.
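A quick structural check of a run directory can catch a missing subfolder before ngen is invoked. The helper below is illustrative and not part of the datastream.

```sh
#!/usr/bin/env bash
# Illustrative check that an ngen run directory has the required subfolders.
check_run_dir() {
  local run_dir="$1" d missing=0
  for d in config forcings outputs; do
    if [ ! -d "$run_dir/$d" ]; then
      echo "missing required subfolder: $d"
      missing=1
    fi
  done
  return "$missing"
}

# Demo on a freshly created, complete run directory.
mkdir -p demo-run/config demo-run/forcings demo-run/outputs
check_run_dir demo-run && echo "run directory looks complete"
```

`metadata` is deliberately not checked, since it is optional and generated programmatically.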

### Configuration directory
`data_dir/config/`
`ngen-run/config/`
`realization.json` :
The realization file serves as the primary model configuration for the ngen framework. An example can be found [here](https://github.com/CIROH-UA/ngen-datastream/tree/main/data/standard_run/config/realization.json). This file specifies which models/modules to run and with which parameters, run parameters like date and time, and hydrofabric specifications.
@@ -34,6 +106,6 @@ These files contain the [hydrofabric](https://mikejohnson51.github.io/hyAggregat
Other files relating to ngen-internal models/modules may be placed in this subdirectory. It is common to define variables like soil parameters in these files for ngen modules to use.

## Versioning
The ngen framework uses a Merkle-tree hashing algorithm to version each ngen run. This means that any changes a user makes to input files in `data_dir` will be tracked and diffed against previous input directories. While an explanation of how awesome this is can be found elsewhere, the important thing to know is that the user must prepare a clean input directory (`data_dir`) for each run they want to make.
The ngen framework uses a Merkle-tree hashing algorithm to version each ngen run. This means that any changes a user makes to input files in `ngen-run` will be tracked and diffed against previous input directories. While an explanation of how awesome this is can be found elsewhere, the important thing to know is that the user must prepare a clean input directory (`ngen-run`) for each run they want to make.

"Clean" here means that every file in `data_dir` is required for the immediate run the user intends to make. For instance, if the user creates a new realization configuration file, the old file must be removed before using `data_dir` as an input directory for ngen. In other words, each configuration file type (realization, catchment, nexus, etc.) must be unique within `data_dir`.
"Clean" here means that every file in `ngen-run` is required for the immediate run the user intends to make. For instance, if the user creates a new realization configuration file, the old file must be removed before using `ngen-run` as an input directory for ngen. In other words, each configuration file type (realization, catchment, nexus, etc.) must be unique within `ngen-run`.
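The "one file per configuration type" rule can be checked mechanically. A hedged sketch follows; the `*realization*.json` file-name pattern is an assumption about naming, not something ngen mandates.

```sh
#!/usr/bin/env bash
# Illustrative "clean directory" check: a run's config folder should hold at
# most one realization config. The *realization*.json glob is an assumed
# naming convention; adapt it to your own files.
assert_clean() {
  local cfg_dir="$1" n
  n=$(find "$cfg_dir" -maxdepth 1 -name '*realization*.json' | wc -l)
  if [ "$n" -gt 1 ]; then
    echo "not clean: multiple realization files in $cfg_dir"
    return 1
  fi
  echo "clean"
}

# Demo: a config folder with a single realization file passes.
mkdir -p clean-demo/config
touch clean-demo/config/realization.json
assert_clean clean-demo/config   # prints "clean"
```

The same pattern extends to catchment and nexus configs by adding further globs.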
29 changes: 21 additions & 8 deletions scripts/stream.sh
@@ -34,6 +34,20 @@ RELATIVE_TO=$(echo "$config" | jq -r '.globals.relative_to')
GEOPACKAGE=$(echo "$config" | jq -r '.hydrofabric.geopackage')
SUBSET_ID=$(echo "$config" | jq -r '.hydrofabric.subset_id')

if [ -z "$RESOURCE_PATH" ]; then
echo "Generating datastream resources with defaults"
if [ -n "$SUBSET_ID" ]; then
echo "Subsetting with $SUBSET_ID"
else
echo "If no resource_path is provided, user must supply subset_id (Set to CONUS if desired)"
exit 1
fi
echo "Not implemented, need to make your own for now."
#wget nwm_example_grid_file.nc
#wget geopackage (probably using hfsubset)
#wget ngen-configs (pull from s3 for now, generate these dynamically in the future.)
fi

if [ -n "$RELATIVE_TO" ] && [ -n "$DATA_PATH" ]; then
echo "Prepending ${RELATIVE_TO} to ${DATA_PATH#/}"
DATA_PATH="${RELATIVE_TO%/}/${DATA_PATH%/}"
@@ -158,23 +172,22 @@ docker run -it --rm -v "$DATA_PATH":"$DOCKER_MOUNT" \
-w "$DOCKER_RESOURCES" $DOCKER_TAG \
python "$DOCKER_FP_PATH"forcingprocessor.py "$DOCKER_CONFIGS"/conf_fp.json

TAR_NAME="ngen-run.tar.gz"
TAR_PATH="${DATA_PATH%/}/$TAR_NAME"
tar -czf $TAR_PATH -C $NGEN_RUN_PATH .

DOCKER_TAG="validator"
VAL_DOCKER="${DOCKER_DIR%/}/validator"
build_docker_container "$DOCKER_TAG" "$VAL_DOCKER"

TARBALL_DOCKER="${DOCKER_MOUNT%/}/$TAR_NAME"
docker run -it --rm -v "$DATA_PATH":"$DOCKER_MOUNT" \
docker run -it --rm -v "$NGEN_RUN_PATH":"$DOCKER_MOUNT" \
validator python /ngen-cal/python/run_validator.py \
--tarball $TARBALL_DOCKER
--data_dir $DOCKER_MOUNT

# ngen run

# hashing
# docker run --rm -it -v "$DATA_PATH":/data zwills/ht ./ht --fmt=tree /data
# docker run --rm -it -v "$NGEN_RUN_PATH":/data zwills/ht ./ht --fmt=tree /data

TAR_NAME="ngen-run.tar.gz"
TAR_PATH="${DATA_PATH%/}/$TAR_NAME"
tar -czf $TAR_PATH -C $NGEN_RUN_PATH .

# manage outputs
# aws s3 sync $DATA_PATH $SOME_BUCKET_NAME
