Skip to content

Latest commit

 

History

History
136 lines (119 loc) · 15 KB

README.md

File metadata and controls

136 lines (119 loc) · 15 KB

ET_agriculture

This repository is for an analysis of the effect of ET on agriculture as witnessed by differences in ECOSTRESS ET estimates over land cover and crop types. See report from JPL 2021 internship: https://docs.google.com/document/d/1xOjTGxKnkD_1tZrXiKZeYXzrA0Pmw3cboQdNA6kfm3U/edit?usp=sharing

Structure and Contents

This repository is organized into two folders. The data folder contains raw and intermediate datasets, as well as final datasets used in the analysis. The code folder consists of code to (1) download data, (2) process the data into the intermediate and final datasets, (3) conduct analyses on the final data. The code folder is numbered in order of folders that are completed sequentially, and the numbers listed next to the data folders and data files correspond to the number of the code file used to create it.

  • data (1,0) : all data is listed in the .gitignore and is thus not stored on github. To retrieve data in its final form without using the given code to direct the raw data from its original source, please refer to this drive folder

  • code: all code for data download, processing, and analysis.

    • 1_download_data:download data found in data/raw/

      • 0_repo_structure.R: create the basic repository structure
      • 1_download_shapefiles.R: Download county shapefile and create the required individual county shapefiles
      • 2_ECOSTRESS
        • 0_repo_structure.R: create the basic repository structure of data/raw/ECOSTRESS
        • 1_APPEEARS_requests: documentation on the appeears (https://lpdaacsvc.cr.usgs.gov/appeears/task/area) requests which can be copied to make the requests again
        • 2_download_links: links provided by appeears that need to be downloaded
        • 3_download_scripts:
          • by_request: a folder of scripts to download for each request
      • 2_ECOSTRESS_cv: same as 2_ECOSTRESS but for the smaller area of only the central valley
        • 0_repo_structure.R: create the basic repository structure of data/raw/ECOSTRESS
        • 1_APPEEARS_requests: documentation on the appeears (https://lpdaacsvc.cr.usgs.gov/appeears/task/area) requests which can be copied to make the requests again
        • 2_download_links: links provided by appeears that need to be downloaded
        • 3_download_scripts:
          • generic-download.sh: template to download the links listed in 2_download_links. Replace all instances of "https://..." with the correct links. Note to position oneself in the correct download folder when running these scripts (ex cd data/raw/ECOSTRESS) and then calling something like "bash ../../../code/1_download_data/2_ECOSTRESS/3_download_scripts/by_request/California-inst-PT-JPL-2-19-5-19.sh". Appeears account and password is needed.
          • by_request: a folder of scripts to download for each request
      • 3_CDL.R: Download years 2018-2020 of the Cropland Data Layer
      • 4_USGS_waterdata.R: Supposed to download the county level California irrigation data from the USGS. SCRIPT NOT FUNCTIONAL: DOWNLOAD MANUALLY and then break up into metadata (first few lines of data) and data
      • 5_DWR_crop.R: Download the 2018 crop map shapefile of California in 18 from the Department of Water Resources
    • 2_build_dataset: create data in data/intermediate and data/for_analysis

      • 1_intermediate: create data in data/intermediate
        • x_CDL_code_dictionary.R: create the code dictionary for crop types
        • 1_consistent_grid.R: create one consistent 70m grid for all data to be resampled to.
        • 2_agriculture.R: create a shapefile and raster of agriculture based on DWR data
        • 3_vegetation.R: create a shapefile and raster of the natural vegetation counterfacutal
        • 4_elevation_aspect_slope.R: create rasters of elevation, aspect, and slope from NED sampled to the consistent grid.
        • 5_soils.R: create a raster for the storie index resampled to the consistent grid
        • 6_PET.R: create a geotif rasterbrick of all the available PET data
        • 6.5_PET.py: take the output of 6_PET.R and resample it temporally to aggregate to necessary timesteps. Resample it to the common CA grid from 3_consistent_grid.R
        • 6.75_PET_grouped_avg.ipynb: average the output of 6.5_PET_yeargrouped.py across years such that you end up with a stack of only 6 images
        • 7_ECOSTRESS_resample_cv_parallel.R: resample each ET tif (central valley only) and its corresponding uncertainties to the CA_grid. Remove all data that don't have uncertainties.
        • 7.5_ECOSTRESS_subyears_cv_parallel.R: similar to 6.5_PET_yeargrouped_average.py, this takes the average of all tifs in a given time period.
        • 7_ECOSTRESS_scratch: I ended up trying to process the ECOSTRESS data in so many different ways that I created a scratch folder for the ways that didn't work out.
          • 7_ECOSTRESS.R: This file takes the ECOSTRESS data, resamples it to the consistent CA grid, and stacks it. It also creates an accompanying brick of uncertainties. If the uncertainties are missing, then that layer is simply NA. Note that this file can error out because it uses a lot of compute and memory, so I ended up running it in pieces as shown in the 7_ECOSTRESS folder.
          • 7.5_ECOSTRESS.py:take the output of 7_ECOSTRESS.R and resample it temporally to aggregate to necessary timesteps. Resample it to the common CA grid from 3_consistent_grid.R
          • 7_ECOSTRESS_OGmethod.R: This file generates a csv from all the ECOSTRESS data
          • 7_ECOSTRESS_resample.R: resample each ET tif and its corresponding uncertainties to the CA_grid. Remove all data that don't have uncertainties.
          • 7.5_ECOSTRESS_subyears.R: similar to 6.5_PET_yeargrouped_average.py, this takes the average of all tifs in a given time period.
        • 8_counties.R: make df of pixels with county
        • 9_crops.R: make df of pixels with crop type
      • 2_for_analysis: create data in data/for_analysis
        • 0_CV_clip.R: This script takes the full CA datasets and clips and masks them to just the central valley
        • 1_combine_intermediate-cv.py This script combines all the scripts created in 2,1 to make (1) a dataset of the full grid, (2) a dataset of only the counterfactual pixels, and (3) a dataset of only the ag pixels. The -cv flag indicates that it was done only for the central valley, clipped in 0_CV_clip.R.
        • 1.5_tidy_dataset_cv.R: This script takes the non-tidy full datasets created in 1_combine_intermediate-cv.py (the central valley cropped data) and makes them tidy for use in the random forest model. This is also the step where monthgroups 0, 1, and 5 are removed, so the study timeframe is only mid-may through mid-november. It also converts the ET unity to mm.
        • 2_random_sample_performance.py: This script takes random samples of the counterfactual dataset and tests the performance of the sklearn RF after leaving out a random 20% of the data. The goal of this script is to determine how much data are necessary in order to train a model without loosing too much information, mostly for hyperparameter tuning since that's super computationally expensive. It's encouraging that there isn't much of a dropoff because that means that we can tune on a small subset without loosing much information, but also this means that having neighnoring pixels might not matter that much for performance, which is encouraging for our counterfactual.
        • 3_hyperparameter_tuning.py: This script takes a subset of the counterfactual dataset (determined in 2_random_sample_performance.py) and finds better hyperparameters that we will use. Because tuning hyperparameters doesn't appear to add value, we do not tune hyperparameters for our model.
        • 4_model_validation.py: This script evaluates the model with leave out grid cells of various sizes (spatial cv)
        • 5_calculate_counterfactual.py: This script trains the model on the entire natural data (counterfactual) dataset. It is then applied to the agriculture dataset in order to get the counterfactual :)
    • 3_analysis: conduct analyses on final dataset(s)

      • 0_scratch: figures and tests done along the way that do not make it into the final results
        • 1_USGS_county_irrigation.Rmd: make some maps of the USGS county irrigation data
        • 2_CDL_DWR_USGS_crop_compare.Rmd: compare different land use data
      • 1_final: final results for the resulting paper. Written in Rmd; datasets used listed.