
Use past PC search results and saved imagery #17

Merged: 23 commits into main from 14-16-use-past-results (Aug 10, 2023)

Conversation

@klwetstone (Collaborator) commented Aug 9, 2023

closes #14
closes #16

14 - Using past PC search results

Uses past results from searching the planetary computer if available.

Config: There is now a pc_search_results_dir argument in the FeaturesConfig for users to point to a directory with PC search results.

  • We can point to our PC search results on S3, and avoid having to copy those files into the current cache directory.
  • The PC search results directory must contain sentinel_metadata.csv and sample_item_map.json
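A quick way to enforce that requirement might look like the helper below (hypothetical, not part of the actual package; a cloud `cache_dir` would need cloudpathlib's `AnyPath`, but `pathlib` shows the idea):

```python
from pathlib import Path

# Files the PC search results directory must contain (per the requirement above)
REQUIRED_FILES = ("sentinel_metadata.csv", "sample_item_map.json")


def validate_pc_search_results_dir(results_dir: str) -> None:
    """Raise if any required PC search result file is missing."""
    missing = [f for f in REQUIRED_FILES if not (Path(results_dir) / f).exists()]
    if missing:
        raise FileNotFoundError(
            f"PC search results dir {results_dir} is missing: {missing}"
        )
```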

Example with directory specified

# imports for this example (module paths inferred from the log output below;
# AnyPath comes from cloudpathlib)
from cloudpathlib import AnyPath

from cyano.config import FeaturesConfig, ModelTrainingConfig
from cyano.data.satellite_data import identify_satellite_data
from cyano.pipeline import CyanoModelPipeline

pc_results_dir = (
    AnyPath("s3://drivendata-competition-nasa-cyanobacteria")
    / "data/interim/full_pc_search"
)

pipeline = CyanoModelPipeline(
    features_config=FeaturesConfig(pc_search_results_dir=str(pc_results_dir)),
    model_training_config=ModelTrainingConfig(),
)
pipeline._prep_train_data(train_data_path, debug=True)
sat_meta = identify_satellite_data(
    pipeline.train_samples, pipeline.features_config, pipeline.cache_dir
)

Output:
2023-08-08 15:17:25.444 | INFO     | cyano.pipeline:_prep_train_data:49 - Loaded 10 samples for training
2023-08-08 15:17:25.445 | INFO     | cyano.data.satellite_data:generate_candidate_metadata:159 - Generating metadata for all satellite item candidates
2023-08-08 15:17:30.480 | INFO     | cyano.data.satellite_data:generate_candidate_metadata:171 - Loaded 56,173 rows of Sentinel candidate metadata from s3://drivendata-competition-nasa-cyanobacteria/data/interim/full_pc_search
2023-08-08 15:17:31.331 | INFO     | cyano.data.satellite_data:identify_satellite_data:277 - Selecting which items to use for feature generation
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 118.85it/s]
2023-08-08 15:17:31.421 | INFO     | cyano.data.satellite_data:identify_satellite_data:297 - Identified satellite imagery for 9 samples

Example without directory specified

The below searches the PC. It does not save out the PC search results anywhere to the cache_dir by default, because we often don't want this behavior. E.g., If we run prediction immediately after training as part of an experiment, we want to re-search the PC for the set of prediction samples. We don't want to use the results of searching for the training samples.

pipeline = CyanoModelPipeline(
    features_config=FeaturesConfig(),
    model_training_config=ModelTrainingConfig(),
)
pipeline._prep_train_data(train_data_path, debug=True)
sat_meta = identify_satellite_data(
    pipeline.train_samples, pipeline.features_config, pipeline.cache_dir
)

Output:
2023-08-08 15:50:12.560 | INFO     | cyano.pipeline:_prep_train_data:49 - Loaded 10 samples for training
2023-08-08 15:50:12.561 | INFO     | cyano.data.satellite_data:generate_candidate_metadata:159 - Generating metadata for all satellite item candidates
2023-08-08 15:50:12.563 | INFO     | cyano.data.satellite_data:generate_candidate_metadata:172 - Searching ['sentinel-2-l2a'] within 30 days and 1000 meters
100%|███████████████████████████████████████████████████████████████████████████████████| 10/10 [00:09<00:00,  1.08it/s]
2023-08-08 15:50:21.854 | INFO     | cyano.data.satellite_data:generate_candidate_metadata:205 - Generated metadata for 67 Sentinel item candidates
2023-08-08 15:50:21.855 | INFO     | cyano.data.satellite_data:identify_satellite_data:263 - Selecting which items to use for feature generation
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 180.43it/s]
2023-08-08 15:50:21.917 | INFO     | cyano.data.satellite_data:identify_satellite_data:283 - Identified satellite imagery for 9 samples

16 - Using saved image arrays

Saves image arrays in an organized way so we can check whether each image already exists, and only download it if not. Each array is identified by a unique combination of the feature window size (image_feature_meter_window in the features config), sample ID, pystac item ID, and band name.

New organization of image arrays is:

cache_dir 
├── sentinel_{image_feature_meter_window}
│   ├── {sample ID}
│   │   └── {item ID}
│   │       ├── {band name 1}.npy
│   │       ├── {band name 2}.npy
│   │       └── {band name 3}.npy
...

E.g., if image_feature_meter_window is 200 and use_sentinel_bands is ["B02", "B03"], the folder for sample 0856c3740614b5ee606f82d6c3a215a0 might look like:

cache_dir
├── sentinel_200
│   ├── 0856c3740614b5ee606f82d6c3a215a0
│   │   └── S2B_MSIL2A_20201104T160439_R097_T17SPV_20201106T105150
│   │       ├── B02.npy
│   │       └── B03.npy
...
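The existence check this layout enables could be sketched as follows (hypothetical helpers; the actual function names in cyano differ, and a cloud `cache_dir` would need cloudpathlib rather than `pathlib`):

```python
from pathlib import Path


def image_array_path(cache_dir, meter_window, sample_id, item_id, band):
    """Build the cache path for one band array under the layout above."""
    return (
        Path(cache_dir)
        / f"sentinel_{meter_window}"
        / sample_id
        / item_id
        / f"{band}.npy"
    )


def bands_to_download(cache_dir, meter_window, sample_id, item_id, bands):
    """Return only the bands whose arrays are not already cached."""
    return [
        band
        for band in bands
        if not image_array_path(
            cache_dir, meter_window, sample_id, item_id, band
        ).exists()
    ]
```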

My thinking on where we raise errors: any pystac item we use must have all of the required bands, because some of the features use multiple bands. However, a sample with no imagery does not raise an error.

Bonus

  • Updates the code to use only imagery from before a sample was collected
  • Adds a debug mode to ExperimentConfig.run_experiment, and to pipeline.run_prediction
  • Updates satellite_data.get_items_metadata based on learnings from running the full PC search.
  • Implements 3rd place's method of selecting satellite imagery (minus the time-intensive "quality" calculation)
  • Adds a test for saving image arrays and gets rid of extra test asset file

@klwetstone (Collaborator, Author)

@ejm714 New PR that includes both using past PC search results and using saved image arrays

@ejm714 (Collaborator) commented Aug 9, 2023

> It does not save out the PC search results anywhere to the cache_dir by default, because we often don't want this behavior. E.g., If we run prediction immediately after training as part of an experiment, we want to re-search the PC for the set of prediction samples. We don't want to use the results of searching for the training samples.

We need to think more about this as this feels particularly fragile. It would be much preferred if we can do some logic in the code to figure out if a new search file needs to be generated, rather than relying on the user (for experiments: us) to reference files / move files around correctly.

Can we put something in the filename of the search results that contains info about the search parameters (e.g. a hash of the relevant config parameters)? Then we can check if a file exists with the same search parameters (e.g. generate the expected filename and see if it exists) and if the UIDs in that search file are the same as those in the input csv (e.g. train_csv or predict_csv), and then decide if we need to write out a new file. That way we can just put the search results file in the cache_dir, which 1) lets us avoid adding a new param to the config and 2) keeps the logic more consistent (e.g. if you don't specify a cache dir, everything is re-done from scratch).
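The hashing idea could be sketched like this (illustrative only; the parameter names are made up, not the actual FeaturesConfig fields):

```python
import hashlib
import json


def search_results_filename(search_params: dict) -> str:
    """Derive a deterministic filename from the search parameters so a
    saved search file can be matched against the current config."""
    # sort_keys makes the same parameters always hash to the same name
    digest = hashlib.md5(
        json.dumps(search_params, sort_keys=True).encode()
    ).hexdigest()[:12]
    return f"sentinel_metadata_{digest}.csv"
```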

I don't want to spend too much time on this since this ability to use a pre-saved file is likely just for us and won't stay in the production version. However what I've laid out above should be a small change. Another option is to put this parameter in the experiment config (a good place for things we don't want to keep in production) but I think that will be a clunkier implementation.

@ejm714 (Collaborator) left a comment

Took a quick skim for now and flagged a couple needed changes

Review comments on: cyano/config.py (outdated), cyano/experiment.py (outdated), tests/test_features.py (outdated), cyano/data/satellite_data.py
@klwetstone (Collaborator, Author)

All good points - I wasn't sure about this either. I'm still going through the full comment, but sharing one initial thought on:

> That way we can just put the search results file in the cache_dir which 1) lets us not need to add a new param to the config and 2) keeps the logic more consistent (e.g. if you don't specify a cache dir, everything is re-done from scratch)

I think we may still want a parameter in the config. We either:

  1. Have a config parameter, which we can use to point to the directory on S3 with our full PC search results (s3://drivendata-competition-nasa-cyanobacteria/data/interim/full_pc_search), or
  2. Expect PC search results to always be in the same location within the cache dir. Then we don't need a new parameter, but we have to copy files from S3 to the current cache directory anytime we want to refer to the results of the full PC search.

I leaned towards option 1 because the files are large -- what do you think?

@ejm714 (Collaborator) commented Aug 9, 2023

I think part of the problem is that the csv may have been the wrong unit. The unit is "{search_params}_{uid}". If we had saved out results in that unit, then when we go to make queries, we can say "did we do this already" and then either load from disk or make that query. This sort of structure is nice in that if we had 100 points in our first experiment and then add 20 more in the next experiment, we don't have to repeat the first 100.

Again, wary of over-optimizing for internal workflows, just sharing the thoughts this brings up for me.
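The "{search_params}_{uid}" unit could be sketched as below (hypothetical; `run_search` stands in for the actual PC query, and `cache` for whatever on-disk store we'd use):

```python
def search_with_cache(sample_uids, search_params, cache, run_search):
    """Look up each sample's result by its "{search_params}_{uid}" key,
    querying the PC only for samples not seen before. Adding 20 new
    samples to 100 cached ones triggers only 20 new queries."""
    results = {}
    for uid in sample_uids:
        key = f"{search_params}_{uid}"
        if key not in cache:
            cache[key] = run_search(uid)
        results[uid] = cache[key]
    return results
```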

@klwetstone (Collaborator, Author)

Agreed with all of that! It didn't seem worth optimizing further yet, because currently our only use case for this functionality is experimentation, where everything will be a subset of the competition data anyway. There are certainly lots of ways we could do it, and I agree that we'd want to save results at a by-sample level rather than as a whole csv.

One option is to keep as is for now, and file an issue for making the functionality (of using past PC search results) more broadly available to users down the road.

@ejm714 (Collaborator) commented Aug 9, 2023

Given that this is purely for our own needs and since that file on s3 is not changing, let's just add a single line in the relevant location that says

if {path_to_metadata_on_s3}.exists():
   # download that to whatever location the query would output to
else:
   # do the pull for metadata

and then the rest of the code can be the same.

We can then file an issue to resolve that to its final state (either remove the check on s3 or build out functionality to support this for users).

In short, my preference is to hard code a link to that file rather than trying to configure it since it makes it easier to use and easier to remove. What do you think?

EDIT: it looks like we only load from disk once so that first if block can just load the data, not copy it to disk, so

if {path_to_metadata_on_s3}.exists():
   # load the data from s3
else:
   # do the pull for metadata

Review comment on: cyano/data/satellite_data.py (outdated)
@klwetstone (Collaborator, Author)

I was thinking along the same lines of having a flag instead of specifying an exact path. What about if instead of checking whether the file exists, we have a flag in CyanoModelPipeline and ExperimentConfig for whether we are in "experimentation" mode?

If we use this if/else:

if {path_to_metadata_on_s3}.exists():
   # load the data from s3
else:
   # do the pull for metadata

It will always be true, because the file on s3 will always exist. Instead, we could hard-code the path to our past PC search results, and pull from it if we indicate that we're in "experiment" mode.

Thoughts?
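The flag proposed above might look like this (a sketch, not the merged implementation; `get_candidate_metadata`, `load_saved`, and `search_pc` are hypothetical names):

```python
# Hard-coded location of our full PC search results (from the discussion above)
FULL_PC_SEARCH_DIR = (
    "s3://drivendata-competition-nasa-cyanobacteria/data/interim/full_pc_search"
)


def get_candidate_metadata(samples, experiment_mode, load_saved, search_pc):
    """In experiment mode, load the pre-saved PC search results;
    otherwise query the Planetary Computer for these samples."""
    if experiment_mode:
        return load_saved(FULL_PC_SEARCH_DIR)
    return search_pc(samples)
```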

@klwetstone (Collaborator, Author)

@ejm714 Updated based on our pair earlier, ready for another look!

  • I changed the code to always download past PC results from s3, and found that our tests run a LOT slower (more than 3x slower). Searching the PC is much faster than downloading past results for small amounts of data. Given the amount of time it will add every time we run tests, I think it's worth adding a config argument. The FeaturesConfig is the easiest place to do this because it's already being passed to satellite_data.generate_candidate_metadata. We can still remove this later (per #18: Remove functionality to use past PC search in package)

    • I also updated the relevant test assets
  • I also updated the test assets for predict_data.csv and evaluate_data.csv so they are subsets of our actual competition data. Otherwise the process will error if we do use past PC results with these assets.

@ejm714 (Collaborator) left a comment

This is basically there! Just going to be a stickler about something I got burned by in the past

Review comments on: cyano/config.py (outdated), cyano/data/features.py, cyano/data/satellite_data.py (several threads, some outdated), tests/assets/evaluate_data.csv, tests/assets/experiment_config.yaml (outdated), tests/test_features.py (outdated)
@ejm714 self-requested a review on August 10, 2023 19:28
@ejm714 (Collaborator) left a comment


🎉

@ejm714 merged commit 4e3f2a5 into main on Aug 10, 2023
5 checks passed
@ejm714 deleted the 14-16-use-past-results branch on August 10, 2023 19:29
Successfully merging this pull request may close these issues.

Use locally saved imagery if available (#16)
Use past PC results data if available (#14)