
Commit

update readme and process defaults
JordanLaserGit committed Jan 11, 2024
1 parent 0640fe2 commit 9e52c4b
Showing 4 changed files with 27 additions and 17 deletions.
25 changes: 16 additions & 9 deletions README.md
@@ -2,14 +2,13 @@
The datastream automates the process of collecting and formatting input data for NextGen, orchestrating the NextGen run through NextGen In a Box (NGIAB), and handling outputs. In its current implementation, the datastream is a shell script that orchestrates each step in the process.

## Install
If you'd like to just run the stream, clone this repo. The stream will handle initialization and installation of the datastream tools. To utilize the individual tools in the stream, see their respective READMEs for installation instructions.
If you'd like to run the stream, clone this repo and execute the command below. The stream will handle initialization and installation of the datastream tools. To utilize the individual tools in the stream, see their respective READMEs for installation instructions.

## Run it
```
/ngen-datastream/scripts/stream.sh /ngen-datastream/configs/conf_datastream_daily.json
```
requires `jq` and `wget`
also requires `pip install pytz`

### Example `conf_datastream.json`
```
@@ -20,7 +19,12 @@ also requires `pip install pytz`
"data_dir" : "ngen-datastream-test",
"resource_dir" : "datastream-resources-dev",
"relative_to" : "/ngen-datastream/data"
"subset_id" : ""
},
"subset" :{
"id_type" : "",
"id" : "",
    "version" : ""
}
}
```
@@ -32,9 +36,11 @@ also requires `pip install pytz`
| start_time | Start simulation time (YYYYMMDDHHMM) | :white_check_mark: |
| end_time | End simulation time (YYYYMMDDHHMM) | :white_check_mark: |
| data_dir | Name used in constructing the parent directory of the datastream. Must not exist prior to datastream run | :white_check_mark: |
| resource_dir | Folder name that contains the datastream resources. If not provided, datastream will create this folder with default options | |
| resource_dir | Folder name that contains the datastream resources. If not provided, datastream will create this folder with [default options](#datastream-resources-defaults) | |
| relative_path | Absolute path to be prepended to any other path given in configuration file | |
| subset_id | catchment id to subset. If not provided, the geopackage in the resource_dir will define the spatial domain in entirety | |
| id_type | id type corresponding to "id" [See hfsubset for options](https://github.com/LynkerIntel/hfsubset) | |
| id | catchment id to subset. If not provided, spatial domain is set to CONUS [See hfsubset for options](https://github.com/LynkerIntel/hfsubset) | |
| version | hydrofabric version [See hfsubset for options](https://github.com/LynkerIntel/hfsubset) | |
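
Putting the options in the table together, a complete configuration might look like the sketch below. The `subset` values (`id_type`, `id`, `version`) are illustrative placeholders only, not tested values; see hfsubset for the accepted options.
```
{
    "globals" : {
        "start_time"   : "202401110000",
        "end_time"     : "202401120000",
        "data_dir"     : "ngen-datastream-test",
        "resource_dir" : "datastream-resources-dev",
        "relative_to"  : "/ngen-datastream/data"
    },
    "subset" : {
        "id_type" : "hl",
        "id"      : "Gages-06752260",
        "version" : "v20.1"
    }
}
```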

## NextGen Datastream Directory Structure
```
@@ -61,7 +67,7 @@ datastream-configs/
├── conf_nwmurl.json
```
### `datastream-resources/`
Copied into `data_dir` if user supplied, generated if not. Holds the data files required to perform computations required by the datastream. The user can supply this directory by pointing the configuration file to `resource_dir`. If not given by the user, datastream will generate this folder with these [defaults](#resource_dir). If the user executes the stream in this way, there is no control over the spatial domain. This option is intended for demonstration purposes only.
Copied into `data_dir` if user supplied, generated with defaults if not. Holds the data files the datastream needs for its computations. The user can supply this directory by pointing the configuration file to `resource_dir`. If not given by the user, datastream will generate this folder with these [defaults](#resource_dir). If the user executes the stream in this way, there is no control over the spatial domain.
```
datastream-resources/
@@ -71,14 +77,15 @@ datastream-resources/
|
├── <nwm-example-grid-file>.nc
```
`ngen-configs/` holds all non-hydrofabric configuration files for NextGen (`realization.json`, `config.ini`)
#### `ngen-configs/` holds all non-hydrofabric configuration files for NextGen (`realization.json`, `config.ini`)

`datastream-resources/` Defaults
#### `datastream-resources/` Defaults
```
GRID_FILE_DEFAULT="https://ngenresourcesdev.s3.us-east-2.amazonaws.com/nwm.t00z.short_range.forcing.f001.conus.nc"
NGEN_CONF_DEFAULT="https://ngenresourcesdev.s3.us-east-2.amazonaws.com/ngen-run-pass/configs/config.ini"
NGEN_REAL_DEFAULT="https://ngenresourcesdev.s3.us-east-2.amazonaws.com/ngen-run-pass/configs/realization.json"
GEOPACKAGE_DEFAULT="https://lynker-spatial.s3.amazonaws.com/v20.1/gpkg/nextgen_01.gpkg"
WEIGHTS_DEFAULT="https://ngenresourcesdev.s3.us-east-2.amazonaws.com/weights_conus_v21.json"
```

### Useful Hacks
8 changes: 4 additions & 4 deletions forcingprocessor/README.md
@@ -59,8 +59,8 @@ See the docker README for example run commands from the container.
|-------------------|--------------------------------|----------|
| verbose | Get print statements, defaults to false | :white_check_mark: |
| collect_stats | Collect forcing metadata, defaults to true | :white_check_mark: |
| proc_process | Number of data processing threads, defaults to 80% available cores | |
| write_process | Number of writing threads, defaults to 100% available cores | |
| proc_process | Number of data processing processes, defaults to 50% available cores | |
| write_process | Number of writing processes, defaults to 100% available cores | |
| nfile_chunk | Number of files to process each write, defaults to 1000000. Only set this if experiencing memory constraints due to large number of nwm forcing files | |
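
Taken together, a `run` block using the options above might look like this sketch (all values illustrative; any key omitted falls back to the defaults listed in the table):
```
"run" : {
    "verbose"       : true,
    "collect_stats" : true,
    "proc_process"  : 4,
    "write_process" : 8,
    "nfile_chunk"   : 1000000
}
```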

## nwm_file
@@ -91,7 +91,7 @@ In order to retrieve forcing data from a NWM grid for a given catchment, the ind
python weight_generator.py <path to geopackage> <path to output weights to> <path to example NWM forcing file>
```

The weight generator will input an example NWM forcing netcdf to reference the NWM grid, a geopackage that contains all of the catchments the user wants weights for, and a file name for the weight file. Subsetted geopackages can be made with [hfsubset](https://github.com/LynkerIntel/hfsubset). Python based subsetting tools are available [here](https://github.com/CIROH-UA/ngen-datastream/tree/main/subsetting), but plans exist to deprecate this as functionality is built out in hfsubset.
The weight generator takes as input an example NWM forcing NetCDF to reference the NWM grid, a geopackage containing all of the catchments the user wants weights for, and a file name for the output weight file. Subsetted geopackages can be made with [hfsubset](https://github.com/LynkerIntel/hfsubset).
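
Conceptually, the weights let downstream code collapse a gridded NWM field into one value per catchment. The sketch below illustrates the idea with a toy grid; the weight layout used here (catchment id mapped to flat grid-cell indices with coverage fractions) is an assumed format for illustration, not the tool's documented schema.

```
# Sketch: area-weighted catchment means from a flattened grid.
# Weight layout (cat id -> [(cell index, coverage fraction), ...]) is ASSUMED.
def catchment_means(flat_grid, weights):
    out = {}
    for cat_id, cells in weights.items():
        num = sum(flat_grid[i] * w for i, w in cells)
        den = sum(w for _, w in cells)
        out[cat_id] = num / den
    return out

grid = [float(v) for v in range(16)]           # toy flattened 4x4 "NWM grid"
weights = {"cat-1": [(0, 1.0), (1, 1.0)],      # fully covers cells 0 and 1
           "cat-2": [(5, 0.5), (6, 1.0)]}      # partial coverage of cell 5
means = catchment_means(grid, weights)          # cat-1 -> 0.5
```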

## Run Notes
This tool is CPU, memory, and I/O intensive. For the best performance, run with `proc_threads` equal to than half of available cores and `write_threads` equal to the number of available cores. Best to experiment with your resources to find out what works best. These options default to 80% and 100% available cores respectively.
This tool is CPU, memory, and I/O intensive. For the best performance, run with `proc_process` equal to half of the available cores and `write_process` equal to the number of available cores. Experiment with your resources to find what works best. These options default to 50% and 100% of available cores respectively.
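
A minimal sketch of those defaults (the `max(1, ...)` floor is an addition here for safety on small machines, not part of the tool):

```
# Sketch of the documented defaults: processing workers get 50% of cores,
# writers get 100%.
import os

def default_workers(cpu_count=None):
    n = cpu_count or os.cpu_count()
    proc_process = max(1, int(n * 0.5))   # data processing workers
    write_process = n                     # writer workers
    return proc_process, write_process

proc, write = default_workers(cpu_count=8)  # -> (4, 8)
```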
2 changes: 1 addition & 1 deletion forcingprocessor/src/forcingprocessor/forcingprocessor.py
@@ -461,7 +461,7 @@ def prep_ngen_data(conf):
write_process = conf["run"].get("write_process",None)
nfile_chunk = conf["run"].get("nfile_chunk",None)

if proc_process is None: proc_process = int(os.cpu_count() * 0.8)
if proc_process is None: proc_process = int(os.cpu_count() * 0.5)
if write_process is None: write_process = os.cpu_count()
if nfile_chunk is None: nfile_chunk = 100000

9 changes: 6 additions & 3 deletions python/configure-datastream.py
@@ -28,8 +28,10 @@ def create_ds_confs_daily(conf, today, tomorrow):
"output_file_type" : "csv",
},
"run" : {
"verbose" : True,
"collect_stats" : True
"verbose" : True,
"collect_stats" : True,
"proc_process" : int(os.cpu_count() * 0.8),
"write_process" : os.cpu_count()
}
}

@@ -43,7 +45,8 @@ def create_ds_confs_daily(conf, today, tomorrow):
"meminput" : 0,
"urlbaseinput" : 7,
"fcst_cycle" : [0],
"lead_time" : [x+1 for x in range(24)]
"lead_time" : [1]
# "lead_time" : [x+1 for x in range(24)]
}

conf['forcingprocessor'] = fp_conf
Expand Down
