
Use past PC search results and saved imagery #17

Merged: 23 commits into main from 14-16-use-past-results (Aug 10, 2023)

Conversation

@klwetstone (Collaborator) commented Aug 9, 2023

closes #14
closes #16

14 - Using past PC search results

Uses past results from searching the planetary computer if available.

Config: There is now a pc_search_results_dir argument in the FeaturesConfig for users to point to a directory with PC search results.

  • We can point to our PC search results on S3, and avoid having to copy those files into the current cache directory.
  • The PC search results directory must contain sentinel_metadata.csv and sample_item_map.json
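A quick way to enforce that requirement might look like the helper below (hypothetical, not part of the actual package; a cloud `cache_dir` would need cloudpathlib's `AnyPath`, but `pathlib` shows the idea):

```python
from pathlib import Path

# Files the PC search results directory must contain (per the requirement above)
REQUIRED_FILES = ("sentinel_metadata.csv", "sample_item_map.json")


def validate_pc_search_results_dir(results_dir: str) -> None:
    """Raise if any required PC search result file is missing."""
    missing = [f for f in REQUIRED_FILES if not (Path(results_dir) / f).exists()]
    if missing:
        raise FileNotFoundError(
            f"PC search results dir {results_dir} is missing: {missing}"
        )
```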

Example with directory specified

# imports for this example (module paths inferred from the log output below;
# AnyPath comes from cloudpathlib)
from cloudpathlib import AnyPath

from cyano.config import FeaturesConfig, ModelTrainingConfig
from cyano.data.satellite_data import identify_satellite_data
from cyano.pipeline import CyanoModelPipeline

pc_results_dir = (
    AnyPath("s3://drivendata-competition-nasa-cyanobacteria")
    / "data/interim/full_pc_search"
)

pipeline = CyanoModelPipeline(
    features_config=FeaturesConfig(pc_search_results_dir=str(pc_results_dir)),
    model_training_config=ModelTrainingConfig(),
)
pipeline._prep_train_data(train_data_path, debug=True)
sat_meta = identify_satellite_data(
    pipeline.train_samples, pipeline.features_config, pipeline.cache_dir
)

Output:
2023-08-08 15:17:25.444 | INFO     | cyano.pipeline:_prep_train_data:49 - Loaded 10 samples for training
2023-08-08 15:17:25.445 | INFO     | cyano.data.satellite_data:generate_candidate_metadata:159 - Generating metadata for all satellite item candidates
2023-08-08 15:17:30.480 | INFO     | cyano.data.satellite_data:generate_candidate_metadata:171 - Loaded 56,173 rows of Sentinel candidate metadata from s3://drivendata-competition-nasa-cyanobacteria/data/interim/full_pc_search
2023-08-08 15:17:31.331 | INFO     | cyano.data.satellite_data:identify_satellite_data:277 - Selecting which items to use for feature generation
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 118.85it/s]
2023-08-08 15:17:31.421 | INFO     | cyano.data.satellite_data:identify_satellite_data:297 - Identified satellite imagery for 9 samples

Example without directory specified

The below searches the PC. It does not save out the PC search results anywhere to the cache_dir by default, because we often don't want this behavior. E.g., If we run prediction immediately after training as part of an experiment, we want to re-search the PC for the set of prediction samples. We don't want to use the results of searching for the training samples.

pipeline = CyanoModelPipeline(
    features_config=FeaturesConfig(),
    model_training_config=ModelTrainingConfig(),
)
pipeline._prep_train_data(train_data_path, debug=True)
sat_meta = identify_satellite_data(
    pipeline.train_samples, pipeline.features_config, pipeline.cache_dir
)

Output:
2023-08-08 15:50:12.560 | INFO     | cyano.pipeline:_prep_train_data:49 - Loaded 10 samples for training
2023-08-08 15:50:12.561 | INFO     | cyano.data.satellite_data:generate_candidate_metadata:159 - Generating metadata for all satellite item candidates
2023-08-08 15:50:12.563 | INFO     | cyano.data.satellite_data:generate_candidate_metadata:172 - Searching ['sentinel-2-l2a'] within 30 days and 1000 meters
100%|███████████████████████████████████████████████████████████████████████████████████| 10/10 [00:09<00:00,  1.08it/s]
2023-08-08 15:50:21.854 | INFO     | cyano.data.satellite_data:generate_candidate_metadata:205 - Generated metadata for 67 Sentinel item candidates
2023-08-08 15:50:21.855 | INFO     | cyano.data.satellite_data:identify_satellite_data:263 - Selecting which items to use for feature generation
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 180.43it/s]
2023-08-08 15:50:21.917 | INFO     | cyano.data.satellite_data:identify_satellite_data:283 - Identified satellite imagery for 9 samples

16 - Using saved image arrays

Saves image arrays in an organized way so we can check whether each image already exists, and only download it if not. Each array is identified by a unique combination of the feature window size (image_feature_meter_window in the features config), sample ID, pystac item ID, and band name.

New organization of image arrays is:

cache_dir 
├── sentinel_{image_feature_meter_window}
│   ├── {sample ID}
│   │   └── {item ID}
│   │       ├── {band name 1}.npy
│   │       ├── {band name 2}.npy
│   │       └── {band name 3}.npy
...

E.g., if image_feature_meter_window is 200 and use_sentinel_bands is ["B02", "B03"], the folder for sample 0856c3740614b5ee606f82d6c3a215a0 might look like:

cache_dir
├── sentinel_200
│   ├── 0856c3740614b5ee606f82d6c3a215a0
│   │   └── S2B_MSIL2A_20201104T160439_R097_T17SPV_20201106T105150
│   │       ├── B02.npy
│   │       └── B03.npy
...
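The existence check this layout enables could be sketched as follows (hypothetical helpers; the actual function names in cyano differ, and a cloud `cache_dir` would need cloudpathlib rather than `pathlib`):

```python
from pathlib import Path


def image_array_path(cache_dir, meter_window, sample_id, item_id, band):
    """Build the cache path for one band array under the layout above."""
    return (
        Path(cache_dir)
        / f"sentinel_{meter_window}"
        / sample_id
        / item_id
        / f"{band}.npy"
    )


def bands_to_download(cache_dir, meter_window, sample_id, item_id, bands):
    """Return only the bands whose arrays are not already cached."""
    return [
        band
        for band in bands
        if not image_array_path(
            cache_dir, meter_window, sample_id, item_id, band
        ).exists()
    ]
```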

My thinking on where we raise errors: any pystac item we use must have all of the required bands, because some of the features use multiple bands. However, a sample with no imagery does not raise an error.

Bonus

  • Updates the code to use only imagery from before a sample was collected
  • Adds a debug mode to ExperimentConfig.run_experiment, and to pipeline.run_prediction
  • Updates satellite_data.get_items_metadata based on learnings from running the full PC search.
  • Implements 3rd place's method of selecting satellite imagery (minus the time-intensive "quality" calculation)
  • Adds a test for saving image arrays and gets rid of extra test asset file

@klwetstone (Collaborator, Author)

@ejm714 New PR that includes both using past PC search results and using saved image arrays

@ejm714 (Collaborator) commented Aug 9, 2023

> It does not save out the PC search results anywhere to the cache_dir by default, because we often don't want this behavior. E.g., If we run prediction immediately after training as part of an experiment, we want to re-search the PC for the set of prediction samples. We don't want to use the results of searching for the training samples.

We need to think more about this as this feels particularly fragile. It would be much preferred if we can do some logic in the code to figure out if a new search file needs to be generated, rather than relying on the user (for experiments: us) to reference files / move files around correctly.

Can we put something in the filename of the search results that contains info about the search parameters (e.g. a hash of the relevant config parameters)? Then we can check if a file exists with the same search parameters (e.g. generate the expected filename and see if it exists) and if the UIDs in that search file are the same as those in the input csv (e.g. train_csv or predict_csv), and then decide if we need to write out a new file. That way we can just put the search results file in the cache_dir, which 1) lets us avoid adding a new param to the config and 2) keeps the logic more consistent (e.g. if you don't specify a cache dir, everything is re-done from scratch).
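The hashing idea could be sketched like this (illustrative only; the parameter names are made up, not the actual FeaturesConfig fields):

```python
import hashlib
import json


def search_results_filename(search_params: dict) -> str:
    """Derive a deterministic filename from the search parameters so a
    saved search file can be matched against the current config."""
    # sort_keys makes the same parameters always hash to the same name
    digest = hashlib.md5(
        json.dumps(search_params, sort_keys=True).encode()
    ).hexdigest()[:12]
    return f"sentinel_metadata_{digest}.csv"
```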

I don't want to spend too much time on this since this ability to use a pre-saved file is likely just for us and won't stay in the production version. However what I've laid out above should be a small change. Another option is to put this parameter in the experiment config (a good place for things we don't want to keep in production) but I think that will be a clunkier implementation.

@ejm714 (Collaborator) left a comment

Took a quick skim for now and flagged a couple needed changes

Review comments on: cyano/config.py (outdated), cyano/experiment.py (outdated), tests/test_features.py (outdated), cyano/data/satellite_data.py
@klwetstone (Collaborator, Author)

All good points - I wasn't sure about this either. I'm still going through the full comment, but sharing one initial thought on:

> That way we can just put the search results file in the cache_dir which 1) lets us not need to add a new param to the config and 2) keeps the logic more consistent (e.g. if you don't specify a cache dir, everything is re-done from scratch)

I think we may still want a parameter in the config. We either:

  1. Have a config parameter, which we can use to point to the directory on S3 with our full PC search results (s3://drivendata-competition-nasa-cyanobacteria/data/interim/full_pc_search), or
  2. Expect PC search results to always be in the same location within the cache dir. Then we don't need a new parameter, but we have to copy files from S3 to the current cache directory anytime we want to refer to the results of the full PC search.

I leaned towards option 1 because the files are large -- what do you think?

@ejm714 (Collaborator) commented Aug 9, 2023

I think part of the problem is that the csv may have been the wrong unit. The unit is "{search_params}_{uid}". If we had saved out results in that unit, then when we go to make queries, we can say "did we do this already" and then either load from disk or make that query. This sort of structure is nice in that if we had 100 points in our first experiment and then add 20 more in the next experiment, we don't have to repeat the first 100.

Again, wary of over-optimizing for internal workflows, just sharing the thoughts this brings up for me.
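The "{search_params}_{uid}" unit could be sketched as below (hypothetical; `run_search` stands in for the actual PC query, and `cache` for whatever on-disk store we'd use):

```python
def search_with_cache(sample_uids, search_params, cache, run_search):
    """Look up each sample's result by its "{search_params}_{uid}" key,
    querying the PC only for samples not seen before. Adding 20 new
    samples to 100 cached ones triggers only 20 new queries."""
    results = {}
    for uid in sample_uids:
        key = f"{search_params}_{uid}"
        if key not in cache:
            cache[key] = run_search(uid)
        results[uid] = cache[key]
    return results
```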

@klwetstone (Collaborator, Author)

Agreed with all of that! It didn't seem worth optimizing further yet, because currently our only use case for this functionality is experimentation, where everything will be a subset of the competition data anyway. There are certainly lots of ways we could do it, and I agree that we'd want to save results at a by-sample level rather than as a whole csv.

One option is to keep as is for now, and file an issue for making the functionality (of using past PC search results) more broadly available to users down the road.

@ejm714 (Collaborator) commented Aug 9, 2023

Given that this is purely for our own needs and since that file on s3 is not changing, let's just add a single line in the relevant location that says

if {path_to_metadata_on_s3}.exists():
   # download that to whatever location the query would output to
else:
   # do the pull for metadata

and then the rest of the code can be the same.

We can then file an issue to resolve that to its final state (either remove the check on s3 or build out functionality to support this for users).

In short, my preference is to hard code a link to that file rather than trying to configure it since it makes it easier to use and easier to remove. What do you think?

EDIT: it looks like we only load from disk once so that first if block can just load the data, not copy it to disk, so

if {path_to_metadata_on_s3}.exists():
   # load the data from s3
else:
   # do the pull for metadata

Review comment on: cyano/data/satellite_data.py (outdated)
@klwetstone (Collaborator, Author)

I was thinking along the same lines of having a flag instead of specifying an exact path. What about if instead of checking whether the file exists, we have a flag in CyanoModelPipeline and ExperimentConfig for whether we are in "experimentation" mode?

If we use this if/else:

if {path_to_metadata_on_s3}.exists():
   # load the data from s3
else:
   # do the pull for metadata

It will always be true, because the file on s3 will always exist. Instead, we could hard-code the path to our past PC search results, and pull from it if we indicate that we're in "experiment" mode.

Thoughts?
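The flag proposed above might look like this (a sketch, not the merged implementation; `get_candidate_metadata`, `load_saved`, and `search_pc` are hypothetical names):

```python
# Hard-coded location of our full PC search results (from the discussion above)
FULL_PC_SEARCH_DIR = (
    "s3://drivendata-competition-nasa-cyanobacteria/data/interim/full_pc_search"
)


def get_candidate_metadata(samples, experiment_mode, load_saved, search_pc):
    """In experiment mode, load the pre-saved PC search results;
    otherwise query the Planetary Computer for these samples."""
    if experiment_mode:
        return load_saved(FULL_PC_SEARCH_DIR)
    return search_pc(samples)
```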

@klwetstone (Collaborator, Author)

@ejm714 Updated based on our pair earlier, ready for another look!

  • I changed the code to always download past PC results from s3, and found that our tests run a LOT slower (more than 3x slower). Searching the PC is much faster than downloading past results for small amounts of data. Given the amount of time it will add every time we run tests, I think it's worth adding a config argument. The FeaturesConfig is the easiest place to do this because it's already being passed to satellite_data.generate_candidate_metadata. We can still remove this later (per #18: Remove functionality to use past PC search in package)

    • I also updated the relevant test assets
  • I also updated the test assets for predict_data.csv and evaluate_data.csv so they are subsets of our actual competition data. Otherwise the process will error if we do use past PC results with these assets.

@ejm714 (Collaborator) left a comment

This is basically there! Just going to be a stickler about something I got burned by in the past

Review comments on: cyano/config.py (outdated), cyano/data/features.py, cyano/data/satellite_data.py (several threads, some outdated), tests/assets/evaluate_data.csv, tests/assets/experiment_config.yaml (outdated), tests/test_features.py (outdated)
@ejm714 self-requested a review on August 10, 2023 19:28
@ejm714 (Collaborator) left a comment


🎉

@ejm714 merged commit 4e3f2a5 into main on Aug 10, 2023
5 checks passed
@ejm714 deleted the 14-16-use-past-results branch on August 10, 2023 19:29
Successfully merging this pull request may close these issues.

Use locally saved imagery if available (#16)
Use past PC results data if available (#14)