All key outputs from this pipeline are on google drive.
The following files are dependencies in pipeline #2 lake-temperature-process-models:

- `feature_crosswalk.rds` and `.ind`
- `lakes_sf.rds` and `.ind`
- NLDAS_grid coords and cell resolution
The following files are dependencies in pipeline #3 lake-temperature-neural-networks:
If new cooperator data is added and you'd like to trigger a rebuild:

- Run `scmake('coop_all_files', force = TRUE)` and then `scmake('6_temp_coop_fetch')`. This should download the new files that have yet to be parsed.
- Run `scmake('7a_temp_coop_munge/log/7a_temp_coop_munge_tasks.ind', force = TRUE)`. This will fail if a cooperator data file does not have a parser, and will tell you which files need parsers. Add parsers to `7a_temp_coop_munge/src/data_parser`. Parser functions must match the file name, or must define rules to match files to parsers in the `find_parsers` function; this is useful if several files can use the same parser. If you add new scripts that contain parsers (i.e., you don't add new functions to current scripts), then be sure to add the new files to the makefile sources. A sketch of what a parser might look like appears after this list.
- Once `scmake('7a_temp_coop_munge/log/7a_temp_coop_munge_tasks.ind')` runs successfully, the final step is to run `scmake('7b_temp_merge')`.
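For orientation, a parser here is just an R function that reads one raw cooperator file and writes a standardized table. The sketch below is hypothetical (the file name, argument names, and output columns are assumptions); copy the signature and column conventions from an existing parser in `7a_temp_coop_munge/src/` rather than from this example.

```r
# Hypothetical parser sketch. The function name must correspond to the data
# file it parses (an imagined 'Example_Lake_Temps.xlsx' here), and the output
# columns below are illustrative -- check an existing parser for the real
# required signature and column names.
parse_Example_Lake_Temps <- function(infile, outfile) {
  raw <- readxl::read_excel(infile)

  cleaned <- dplyr::transmute(
    raw,
    DateTime = as.Date(date),      # sample date
    depth = as.numeric(depth_m),   # depth in meters
    temp = as.numeric(temp_degC),  # water temperature in degrees C
    id = as.character(lake_id)     # cooperator's (non-NHDHR) lake identifier
  )

  saveRDS(cleaned, outfile)
}
```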
If any of these files are changed / updated in this pipeline, remember to:

- copy the updated .rds file to the dependent pipeline's drive (which is hyperlinked above), and ALSO
- copy the updated .ind file to the dependent pipeline's github repository (which is also hyperlinked above)
When data (such as temperature, depth, clarity, etc.) are referenced to a non-NHDHR identifier, we need to generate a new crosswalk file that shows the code how to link the `site_id` (NHD high-res ID) to the other identifier (often a state agency ID). The two common ways this crosswalk is generated are by using an approximate polygon overlap to consider two lakes "the same", or by using a lat/lon sampling point from a lake with a point-in-polygon analysis.

Often a state agency has a shapefile (or other spatial file) that has polygons for each lake they study, as well as attributes that include the agency ID, which we'll want to crosswalk. Commonly, this shapefile doesn't come from an easy-to-code web service, as it is either emailed to us or downloaded from a UI that doesn't make it easy to code against. (If, however, there is a service to code against, you can skip steps 1 and 2 and go right to step 3, but write your function to access the remote data and write it to an `sf` object saved as .rds instead of opening a local file.)
1. First, we manually upload the zip file that contains the raw/original data to google drive under `1_crosswalk_fetch/in`, and then, in the console, create an indicator file for that remote dataset using `gd_confirm_posted("1_crosswalk_fetch/in/{my_filename}.zip.ind")`. This only needs to be done once; the .ind file and the corresponding build file will appear locally and must now be checked into version control so other collaborators will have access to this information.
2. Next, we add the data_file (`1_crosswalk_fetch/in/{my_filename}.zip`) to `getters.yml`, using the format

   ```yaml
   1_crosswalk_fetch/in/{my_filename}.zip:
     command: gd_get('1_crosswalk_fetch/in/{my_filename}.zip.ind')
   ```

   This will allow you or others to access this dataset to help build other targets that use it.
3. Next, convert this file into an `sf` object and save it as an .rds file via a target in `1_crosswalk_fetch.yml`, using something along the lines of

   ```yaml
   1_crosswalk_fetch/out/{stateXY_lakes}_sf.rds.ind:
     command: stateXY_zip_to_sf(
       out_ind = target_name,
       layer = I("stateXY_Lakes"),
       zip_ind = '1_crosswalk_fetch/in/{my_filename}.zip.ind')
   ```

   You may need to create a custom function to do this, since each of these files seems to be different enough; a rough sketch of such a function appears after this list. This function should write a single file, which will be a MULTIPOLYGON `sf` simple feature collection with only one field (other than the spatial information in `geometry`), which will be `site_id`. `site_id` will be the data source's alternative (non-NHDHR) identifier as a character. Inspect one of the existing similar targets in the repo to see this format if there is confusion (e.g., `readRDS('1_crosswalk_fetch/out/LAGOS_lakes_sf.rds')`).
4. Next, add the data_file to `getters.yml`, using the format

   ```yaml
   1_crosswalk_fetch/out/{stateXY_lakes}_sf.rds:
     command: gd_get('1_crosswalk_fetch/out/{stateXY_lakes}_sf.rds.ind')
   ```
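As promised in step 3, here is a rough sketch of what a custom `stateXY_zip_to_sf()`-style function might look like. The scipiper helper calls (`sc_retrieve()`, `as_data_file()`, `gd_put()`) and the placeholder agency ID column (`AGENCY_LAKE_ID`) are assumptions; mirror an existing fetch function in the repo for the actual conventions.

```r
# Sketch only: unzip a shapefile, standardize it to a MULTIPOLYGON sf object
# with a single character 'site_id' field, and push it to the shared cache.
stateXY_zip_to_sf <- function(out_ind, layer, zip_ind) {
  # pull the raw zip from google drive and unzip to a temporary directory
  zip_file <- scipiper::sc_retrieve(zip_ind)
  tmp_dir <- tempfile()
  dir.create(tmp_dir)
  unzip(zip_file, exdir = tmp_dir)

  # read the requested polygon layer
  lakes <- sf::st_read(tmp_dir, layer = layer, stringsAsFactors = FALSE)

  # keep only the agency identifier (renamed to site_id, as character) plus geometry
  lakes <- dplyr::select(lakes, site_id = AGENCY_LAKE_ID)  # placeholder column name
  lakes$site_id <- as.character(lakes$site_id)
  lakes <- sf::st_cast(lakes, "MULTIPOLYGON")

  # write the .rds locally and post it + its indicator file to the shared cache
  data_file <- scipiper::as_data_file(out_ind)
  saveRDS(lakes, data_file)
  gd_put(out_ind, data_file)
}
```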
The next set of steps will cover the creation of the ID crosswalk for this dataset, `2_crosswalk_munge/out/{stateXY}_nhdhr_xwalk.rds.ind`:
5. Now, take advantage of a couple of generic functions that use the standardized outputs of steps 3 & 4 to evaluate polygon overlap. The `crosswalk_poly_intersect_poly()` function is the first of these, and has inputs for the outgoing indicator file, the two polygon files needed (one for your specific new dataset, and the other the canonical source of NHDHR lakes), and the CRS that you'd like to perform the overlap analysis in. The `poly1_ID_name` argument is special - this is where you will define the site identifier naming convention for this crosswalk. This is the field/column name that will show up in the `{stateXY}_nhdhr_xwalk.rds` alongside the `site_id` for the NHDHR ID. I've been using uppercase letters for this argument, and it is going to be filled with the values of the alternative (non-NHDHR) identifier found in the `site_id` column of the step #3 output. Here is what that new target recipe would look like:

   ```yaml
   2_crosswalk_munge/out/{stateXY}_nhdhr_intersect.rds.ind:
     command: crosswalk_poly_intersect_poly(
       target_name,
       poly1_ind_file = '1_crosswalk_fetch/out/{stateXY_lakes}_sf.rds.ind',
       poly2_ind_file = '1_crosswalk_fetch/out/canonical_lakes_sf.rds.ind',
       poly1_ID_name = I("{stateXY}_ID"),
       crs = I(26915))
   ```
6. Use the second generic polygon overlap function to choose a single polygon from the intersected polygons in the file you just created a target for:

   ```yaml
   2_crosswalk_munge/out/{stateXY}_nhdhr_xwalk.rds.ind:
     command: choose1_poly_intersect_poly(
       target_name,
       intersect_ind_file = '2_crosswalk_munge/out/{stateXY}_nhdhr_intersect.rds.ind',
       poly1_ID_name = I("{stateXY}_ID"))
   ```
   This is broken up into two functions because the expensive analysis happens in `crosswalk_poly_intersect_poly()`, so if we ever want to change the selection criteria in `choose1_poly_intersect_poly()`, doing so would be pretty quick. When we wrote these functions, I think we thought it would be likely we'd revisit this matching criteria in the future. (A conceptual sketch of what these two functions do appears after this list.)
7. As before, we're using a shared cache, so add both targets to `getters.yml`:

   ```yaml
   2_crosswalk_munge/out/{stateXY}_nhdhr_intersect.rds:
     command: gd_get('2_crosswalk_munge/out/{stateXY}_nhdhr_intersect.rds.ind')
   2_crosswalk_munge/out/{stateXY}_nhdhr_xwalk.rds:
     command: gd_get('2_crosswalk_munge/out/{stateXY}_nhdhr_xwalk.rds.ind')
   ```
8. Add the new indicator file target from #6 to the `depends` of the `2_crosswalk_munge` target:

   ```yaml
   targets:
     2_crosswalk_munge:
       depends:
         ...
         - 2_crosswalk_munge/out/{stateXY}_nhdhr_xwalk.rds.ind
   ```
9. Lastly, build with `scmake('2_crosswalk_munge')`, check the new built files (and code files) into version control, and create a PR. You will now be able to use the crosswalk file to link data from this set of identifiers into the modeling system. The original file you added in step #1 (in `1_crosswalk_fetch/in/`) should already be gitignored, but make sure it doesn't show up in your commits, because we don't want to check large files (or really any data files) into github.
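For reference, here is a conceptual sketch of what the two generic functions in steps 5 and 6 are doing. This is not the repo's implementation (the real functions work from indicator files and may use different selection criteria); it is just an illustration of the expensive-intersection / cheap-selection split using `sf` and `dplyr`:

```r
library(sf)
library(dplyr)

# Expensive step (analogous to crosswalk_poly_intersect_poly()): compute every
# overlapping pair of polygons and the area of each overlap. Assumes poly1's ID
# column has already been renamed to poly1_ID_name and poly2 carries site_id.
intersect_polys <- function(poly1, poly2, crs = 26915) {
  p1 <- st_transform(poly1, crs)
  p2 <- st_transform(poly2, crs)
  overlaps <- st_intersection(p1, p2)
  overlaps$overlap_area <- as.numeric(st_area(overlaps))
  st_drop_geometry(overlaps)
}

# Cheap step (analogous to choose1_poly_intersect_poly()): keep the single
# NHDHR match with the largest overlap for each non-NHDHR lake. Changing this
# selection rule wouldn't require redoing the intersection above.
choose_best_match <- function(overlaps, poly1_ID_name) {
  overlaps %>%
    group_by(.data[[poly1_ID_name]]) %>%
    slice_max(overlap_area, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    select(all_of(poly1_ID_name), site_id)
}
```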
Point-based data that contain non-NHDHR identifiers are a bit simpler to set up. Similar to vector polygon data, some datasets need to be manually uploaded to drive and indicated prior to being able to process them (steps 1 and 2 from above), and other point-based data can be directly accessed via services, such as the water quality portal monitoring locations.

Skipping ahead to step #3 above (assuming you're building an `sf` object and saving it as an .rds), create a POINT simple feature collection with the required field `site_id`, other optional fields (e.g., "OrganizationIdentifier" and "resultCount" are in the `1_crosswalk_fetch/out/wqp_lake_secchi_sites_sf.rds` file for diagnostic purposes), and the required `geometry` as well. Creating this .rds file (assumed here to be `1_crosswalk_fetch/out/{pointXY}_sf.rds`) and its indicator file likely requires a custom function; see `fetch_navico_points()` for a simple example.
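The core of such a fetch function is just turning a table of sites into a POINT `sf` object with a character `site_id`. Here is a minimal sketch; the input column names (`lon`, `lat`, `AGENCY_SITE_ID`) are placeholders, and the indicator-file bookkeeping would follow the same pattern as the polygon fetch functions:

```r
# Minimal sketch: build a POINT simple feature collection from a site table.
# Column names are placeholders; WGS84 (EPSG:4326) is assumed for lat/lon.
points_table_to_sf <- function(sites) {
  sites$site_id <- as.character(sites$AGENCY_SITE_ID)
  sf::st_as_sf(sites, coords = c("lon", "lat"), crs = 4326)
}
```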
- After this file and its .ind entry in `getters.yml` are created, you can move on to the generic function `crosswalk_points_in_poly()` to create the crosswalk file:

  ```yaml
  2_crosswalk_munge/out/{pointXY}_nhdhr_xwalk.rds.ind:
    command: crosswalk_points_in_poly(target_name,
      poly_ind_file = '1_crosswalk_fetch/out/canonical_lakes_sf.rds.ind',
      points_ind_file = '1_crosswalk_fetch/out/{pointXY}_sf.rds.ind',
      points_ID_name = I("{POINTXY}_ID"))
  ```

  As above, `points_ID_name` is the name you want to use for the crosswalk field in future use; it does not need to match an existing field name. (A simplified sketch of the point-in-polygon join behind this function appears after this list.)
- Add the indicator file(s) to `getters.yml`:

  ```yaml
  1_crosswalk_fetch/out/{pointXY}_sf.rds:
    command: gd_get('1_crosswalk_fetch/out/{pointXY}_sf.rds.ind')
  2_crosswalk_munge/out/{pointXY}_nhdhr_xwalk.rds:
    command: gd_get('2_crosswalk_munge/out/{pointXY}_nhdhr_xwalk.rds.ind')
  ```
- Add the new indicator file target (`2_crosswalk_munge/out/{pointXY}_nhdhr_xwalk.rds.ind`) to the `depends` of the `2_crosswalk_munge` target:

  ```yaml
  targets:
    2_crosswalk_munge:
      depends:
        ...
        - 2_crosswalk_munge/out/{pointXY}_nhdhr_xwalk.rds.ind
  ```
- Lastly, build with `scmake('2_crosswalk_munge')`, check the new built files (and code files) into version control, and create a PR. You will now be able to use the crosswalk file to link data from this set of identifiers into the modeling system.
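As referenced above, here is a simplified sketch of the point-in-polygon matching that `crosswalk_points_in_poly()` performs. It is an illustration only (the real function reads from indicator files and writes an .rds plus its indicator); it shows how each point inherits the NHDHR `site_id` of the lake polygon that contains it:

```r
# Illustration of point-in-polygon crosswalking with sf::st_join.
# points_sf: POINT sf with a character 'site_id' (the non-NHDHR identifier)
# lakes_sf:  MULTIPOLYGON sf of canonical NHDHR lakes, also with 'site_id'
points_in_poly_xwalk <- function(points_sf, lakes_sf, points_ID_name) {
  pts <- sf::st_transform(points_sf, sf::st_crs(lakes_sf))

  # keep only points that fall inside a lake polygon (inner join)
  joined <- sf::st_join(pts, lakes_sf, join = sf::st_within, left = FALSE)

  xwalk <- sf::st_drop_geometry(joined)
  # after the join, the points' ID is site_id.x and the lakes' ID is site_id.y
  names(xwalk)[names(xwalk) == "site_id.x"] <- points_ID_name
  names(xwalk)[names(xwalk) == "site_id.y"] <- "site_id"
  xwalk
}
```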
Some targets need to be prepared here to support execution of `mntoha-data-release`:

- `1_crosswalk_fetch/out/canonical_lakes_sf.rds`
- `7b_temp_merge/out/temp_data_with_sources.feather`
To build these targets, we need certain R packages. Here's a recipe for creating a sufficient conda environment:
```sh
conda create -n lakes_prep
source activate lakes_prep
conda install -c conda-forge r-raster r-readxl r-doMC r-leaflet r-sys r-e1071 r-class r-KernSmooth r-askpass r-classInt r-DBI r-fs r-openssl r-sf r-units r-curl r-gargle r-httr r-purrr r-uuid r-devtools r-dplyr r-tidyselect r-BH r-plogr r-optparse r-storr r-getopt r-readr r-tidyr r-feather r-lwgeom r-maps r-ncdf4 r-lubridate r-generics
R
```

Then, in the R session:

```r
install.packages(c('smoothr', 'googledrive'))
devtools::install_github('USGS-R/lakeattributes')
devtools::install_github('richfitz/remake')
devtools::install_github('USGS-R/scipiper')
install.packages(c('dataRetrieval', 'sbtools'))
```
In subsequent sessions, we can get going with:
```sh
ssh tallgrass.cr.usgs.gov
cd /caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-model-prep
source activate lakes_prep
```

...but after all that, I'm stuck now on a need for Drive authentication and am just going to build locally and push back up to this repo. Locally, in R:

```r
gd_get('1_crosswalk_fetch/out/canonical_lakes_sf.rds.ind')
gd_get('7b_temp_merge/out/temp_data_with_sources.feather.ind')
```

then copy the files up to Caldera:

```sh
scp 1_crosswalk_fetch/out/canonical_lakes_sf.rds tallgrass.cr.usgs.gov:/caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-model-prep/1_crosswalk_fetch/out/
scp 7b_temp_merge/out/temp_data_with_sources.feather tallgrass.cr.usgs.gov:/caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-model-prep/7b_temp_merge/out/
```
The GCM drivers data pipeline is a separate `{targets}` pipeline within this bigger scipiper one (see `7_drivers_munge/out/7_GCM_driver_files.ind`). For the full footprint, we are running this pipeline on Tallgrass because it takes about 15 hours to complete.
Logon to Tallgrass and navigate to the `lake-temperature-model-prep` directory on Caldera. From there, you can either 1) kick off a non-interactive job using the singularity container by running `sbatch gcm_drivers.slurm` and then watch the progress of your build by running `tail -f gcm_sing.out`, or 2) work in an interactive RStudio session by running `sbatch launch-rstudio-container.slurm`, running `cat tmp/rstudio_[username].out`, and then following the startup instructions (note that you may need to manually create the tmp dir by running `mkdir tmp`).
Before either the interactive or non-interactive versions would work, I needed to first run `googledrive::drive_auth()` in R to authorize a Google account outside of the build, since `{targets}` doesn't allow any interactive inputs during its pipeline build.
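For example (the cache path and options below are assumptions, not project requirements; the important part is that an interactive `drive_auth()` call caches a token that the non-interactive build can reuse):

```r
# Run once in an interactive R session before submitting the slurm job.
# The gargle options just make the cached token discoverable later;
# the cache path is an example only.
options(
  gargle_oauth_cache = "~/.cache/gargle",
  gargle_oauth_email = TRUE
)
googledrive::drive_auth()  # interactive OAuth flow; token is cached for reuse
```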