diff --git a/README.md b/README.md
index bc46c90b..4d88e2e7 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,83 @@
-# Data Access
-Running ngen requires building a standard run directory complete with the necessary files. Below is an explanation of the standard and an example that can be found [here](https://github.com/CIROH-UA/ngen-datastream/tree/main/data/standard_run) Reference for discussion of the standard [here](https://github.com/CIROH-UA/NGIAB-CloudInfra/pull/17).
+# NextGen Datastream
+The datastream automates the process of collecting and formatting input data for NextGen, orchestrating the NextGen run through NextGen In a Box (NGIAB), and handling outputs. In its current implementation, the datastream is a shell script that orchestrates each step in the process.
 
-An ngen run directory `data_dir` is composed of three necessary subfolders `config, forcings, outputs` and an optional fourth subfolder `metadata`. `data_dir` may have any name, but the subfolders must follow this naming convention.
+## Install
+Just clone this repo; the stream will handle initialization and installation of the datastream tools.
+
+## Run it
+```
+/ngen-datastream/scripts/stream.sh ./configs/conf_datastream.json
+```
+
+## Formatting `conf_datastream.json`
+### globals
+| Field | Description | Required |
+|-------------------|--------------------------|------|
+| start_time | Start simulation time (YYYYMMDDHHMM) | :white_check_mark: |
+| end_time | End simulation time (YYYYMMDDHHMM) | :white_check_mark: |
+| data_dir | Name used in constructing the parent directory of the datastream. Must not exist prior to the datastream run | :white_check_mark: |
+| resource_dir | Folder name that contains the datastream resources. If not provided, the datastream will create this folder with default options | |
+| relative_to | Absolute path to be prepended to any other path given in the configuration file | |
+| subset_id | Catchment id to subset. If not provided, the geopackage in `resource_dir` will define the spatial domain in its entirety | Required only if `resource_dir` is not given |
+
+### Example `conf_datastream.json`
+```
+{
+    "globals" : {
+        "start_time"   : "",
+        "end_time"     : "",
+        "data_dir"     : "ngen-datastream-test",
+        "resource_dir" : "datastream-resources-dev",
+        "relative_to"  : "/home/jlaser/code/CIROH/ngen-datastream/data",
+        "subset_id"    : ""
+    }
+}
+```
+
+## NextGen Datastream Directory Structure
 ```
 data_dir/
 │
+├── datastream-configs/
+│
+├── datastream-resources/
+│
+├── ngen-run/
+```
+`datastream-configs/` holds all the configuration files the datastream needs in order to run. Note: the datastream can modify `conf_datastream.json` and generate its own internal configs, so `datastream-configs/` is the first place to look to confirm that a datastream run has been executed according to the user's specifications.
+Example directory:
+```
+datastream-configs/
+│
+├── conf_datastream.json
+│
+├── conf_forcingprocessor.json
+│
+├── conf_nwmurl.json
+```
+`datastream-resources/` holds the data files required for the computations the datastream performs. The user can supply this directory by setting `resource_dir` in the configuration file. If not given by the user, the datastream will generate this folder with these [defaults](#resource_dir).
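+As a hypothetical sketch only (the exact defaults are still TODO; the stubbed-out steps in `scripts/stream.sh` point to an example NWM grid file, a geopackage, and ngen configs), an auto-generated resource directory might look like:
+```
+datastream-resources/
+│
+├── nwm_example_grid_file.nc
+│
+├── <hydrofabric-subset>.gpkg
+│
+├── ngen-configs/
+```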
+`ngen-run` follows the directory structure described [here](#nextgen-run-directory-structure).
+
+### resource_dir
+TODO: explain the defaults used in the automated build
+
+### Useful Hacks
+TODO: Daily
+
+## NextGen Run Directory Structure
+Running ngen requires building a standard run directory complete with the necessary files. The datastream constructs this directory automatically. Below is an explanation of the standard; a reference discussion can be found [here](https://github.com/CIROH-UA/NGIAB-CloudInfra/pull/17).
+
+An ngen run directory `ngen-run` is composed of three necessary subfolders, `config`, `forcings`, and `outputs`, and an optional fourth subfolder, `metadata`.
+
+```
+ngen-run/
+│
 ├── config/
 │
 ├── forcings/
@@ -15,7 +87,7 @@ data_dir/
 ├── outputs/
 ```
 
-The `data_dir` directory contains the following subfolders:
+The `ngen-run` directory contains the following subfolders:
 
 - `config`: model configuration files and hydrofabric configuration files. A deeper explanation [here](#Configuration-directory)
 - `forcings`: catchment-level forcing timeseries files. These can be generated with the [forcingprocessor](https://github.com/CIROH-UA/ngen-datastream/tree/main/forcingprocessor). Forcing files contain variables like wind speed, temperature, precipitation, and solar radiation.
@@ -23,7 +95,7 @@ The `data_dir` directory contains the following subfolders:
 - `outputs`: This is where ngen will place the output files.
 
 ### Configuration directory
-`data_dir/config/`
+`ngen-run/config/`
 
 `realization.json` : The realization file serves as the primary model configuration for the ngen framework. An example can be found [here](https://github.com/CIROH-UA/ngen-datastream/tree/main/data/standard_run/config/realization.json). This file specifies which models/modules to run and with which parameters, run parameters like date and time, and hydrofabric specifications.
@@ -34,6 +106,6 @@ These files contain the [hydrofabric](https://mikejohnson51.github.io/hyAggregat
 Other files may be placed in this subdirectory that relate to internal-ngen-models/modules. It is common to define variables like soil parameters in these files for ngen modules to use.
 
 ## Versioning
-The ngen framework uses a merkel tree hashing algorithm to version each ngen run. This means that the changes a user makes to any input files in `data_dir` will be tracked and diff'd against previous input directories. While an explaination of how awesome this is can be found elsewhere, the important thing to know is the user must prepare a clean input directory (`data_dir`) for each run they want to make.
+The ngen framework uses a Merkle tree hashing algorithm to version each ngen run. This means that the changes a user makes to any input files in `ngen-run` will be tracked and diff'd against previous input directories. While an explanation of how awesome this is can be found elsewhere, the important thing to know is that the user must prepare a clean input directory (`ngen-run`) for each run they want to make.
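+
+For intuition, here is a minimal sketch of Merkle-style directory hashing in Python (illustrative only; the datastream's actual hashing is handled by a dedicated tool, the `zwills/ht` container referenced in `scripts/stream.sh`):
+```
+import hashlib
+from pathlib import Path
+
+def merkle_hash(path: Path) -> str:
+    # Files hash their bytes; directories hash the sorted (name, hash)
+    # pairs of their children, so any change below bubbles up to the root.
+    h = hashlib.sha256()
+    if path.is_file():
+        h.update(path.read_bytes())
+    else:
+        for child in sorted(path.iterdir()):
+            h.update(child.name.encode())
+            h.update(merkle_hash(child).encode())
+    return h.hexdigest()
+
+# Two runs share a root hash only if every input file is identical.
+print(merkle_hash(Path("ngen-run")))
+```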
+"Clean" means here that every file in the `ngen-run` is required for the immediate run the user intends to make. For instance, if the user creates a new realization configuration file, the old file must be removed before using `ngen-run` as an input directory to ngen. In other words, each configuration file type (realization, catchment, nexus, etc.) must be unique within `ngen-run`. diff --git a/scripts/stream.sh b/scripts/stream.sh index bdeb0957..4897131e 100755 --- a/scripts/stream.sh +++ b/scripts/stream.sh @@ -34,6 +34,20 @@ RELATIVE_TO=$(echo "$config" | jq -r '.globals.relative_to') GEOPACKAGE=$(echo "$config" | jq -r '.hydrofabric.geopackage') SUBSET_ID=$(echo "$config" | jq -r '.hydrofabric.subset_id') +if [ ! -n "$RESOURCE_PATH" ]; then + echo "Generating datastream resources with defaults" + if [ -n "$SUBSET_ID" ]; then + echo "Subsetting with $SUBSET_ID" + else + echo "If no resource_path is provided, user must supply subset_id (Set to CONUS if desired)" + exit 1 + fi + echo "Not implemented, need to make your own for now." + #wget nwm_example_grid_file.nc + #wget geopackage (probably using hfsubset) + #wget ngen-configs (pull from s3 for now, generate these dynamically in the future.) +fi + if [ -n "$RELATIVE_TO" ] && [ -n "$DATA_PATH" ]; then echo "Prepending ${RELATIVE_TO} to ${DATA_PATH#/}" DATA_PATH="${RELATIVE_TO%/}/${DATA_PATH%/}" @@ -158,23 +172,22 @@ docker run -it --rm -v "$DATA_PATH:"$DOCKER_MOUNT"" \ -w "$DOCKER_RESOURCES" $DOCKER_TAG \ python "$DOCKER_FP_PATH"forcingprocessor.py "$DOCKER_CONFIGS"/conf_fp.json -TAR_NAME="ngen-run.tar.gz" -TAR_PATH="${DATA_PATH%/}/$TAR_NAME" -tar -czf $TAR_PATH -C $NGEN_RUN_PATH . - DOCKER_TAG="validator" VAL_DOCKER="${DOCKER_DIR%/}/validator" build_docker_container "$DOCKER_TAG" "$VAL_DOCKER" -TARBALL_DOCKER="${DOCKER_MOUNT%/}""/$TAR_NAME" -docker run -it --rm -v "$DATA_PATH":"$DOCKER_MOUNT" \ +docker run -it --rm -v "$NGEN_RUN_PATH":"$DOCKER_MOUNT" \ validator python /ngen-cal/python/run_validator.py \ - --tarball $TARBALL_DOCKER + --data_dir $DOCKER_MOUNT # ngen run # hashing -# docker run --rm -it -v "$DATA_PATH":/data zwills/ht ./ht --fmt=tree /data +# docker run --rm -it -v "$NGEN_RUN_PATH":/data zwills/ht ./ht --fmt=tree /data + +TAR_NAME="ngen-run.tar.gz" +TAR_PATH="${DATA_PATH%/}/$TAR_NAME" +tar -czf $TAR_PATH -C $NGEN_RUN_PATH . # manage outputs # aws s3 sync $DATA_PATH $SOME_BUCKET_NAME