Use past PC search results and saved imagery #17
Conversation
@ejm714 New PR that includes both using past PC search results and using saved image arrays.

We need to think more about this as it feels particularly fragile. It would be much preferred if we can do some logic in the code to figure out whether a new search file needs to be generated, rather than relying on the user (for experiments: us) to reference and move files around correctly. Can we put something in the filename of the search results that encodes the search parameters (e.g. a hash of the relevant config parameters)? Then we can check whether a file with the same search parameters exists (e.g. generate the expected filename and see if it exists) and whether the UIDs in that search file match those in the input csv (e.g. train_csv or predict_csv), and decide from that whether we need to write out a new file. That way we can just put the search results file in the cache_dir, which 1) lets us avoid adding a new param to the config and 2) keeps the logic more consistent (e.g. if you don't specify a cache dir, everything is re-done from scratch).

I don't want to spend too much time on this since the ability to use a pre-saved file is likely just for us and won't stay in the production version. However, what I've laid out above should be a small change. Another option is to put this parameter in the experiment config (a good place for things we don't want to keep in production), but I think that would be a clunkier implementation.
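For concreteness, a minimal sketch of the hash-in-filename idea described above (the helper names and the `uid` column are assumptions, not the repo's actual code):

```python
import hashlib
import json
from pathlib import Path

import pandas as pd


def search_results_path(cache_dir: Path, search_params: dict) -> Path:
    # Hash only the config parameters that affect the PC search, so the
    # filename changes whenever the search itself would change.
    key = hashlib.md5(json.dumps(search_params, sort_keys=True).encode()).hexdigest()
    return cache_dir / f"sentinel_metadata_{key}.csv"


def load_cached_search(cache_dir: Path, search_params: dict, samples: pd.DataFrame):
    """Return cached search results if they cover the same UIDs, else None."""
    path = search_results_path(cache_dir, search_params)
    if path.exists():
        cached = pd.read_csv(path)
        # Re-use the file only if it was generated for the same set of samples.
        if set(cached["uid"]) == set(samples["uid"]):
            return cached
    return None  # caller runs a fresh PC search and writes it to `path`
```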
Took a quick skim for now and flagged a couple of needed changes.
All good points - I wasn't sure about this either. I'm still going through the full comment, but sharing one initial thought on:
I think we may still want a parameter in the config. We either:
I leaned towards option 1 because the files are large -- what do you think?
I think part of the problem is that the csv may have been the wrong unit. The unit is "{search_params}_{uid}". If we had saved out results in that unit, then when we go to make queries, we can ask "did we do this already?" and then either load from disk or make that query. This sort of structure is nice in that if we had 100 points in our first experiment and then added 20 more in the next experiment, we don't have to repeat the first 100. Again, wary of over-optimizing for internal workflows, just sharing the thoughts this brings up for me.
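A minimal sketch of that per-sample unit, to make the idea concrete (the file format and `query_fn` are assumptions): with this structure, adding 20 new points to 100 cached ones only queries the PC for the 20.

```python
import json
from pathlib import Path


def item_cache_path(cache_dir: Path, search_params_hash: str, uid: str) -> Path:
    # One file per "{search_params}_{uid}" unit, as described above.
    return cache_dir / f"{search_params_hash}_{uid}.json"


def get_items_for_sample(cache_dir: Path, search_params_hash: str, uid: str, query_fn) -> list:
    """Answer "did we do this already?" per sample: load from disk or query the PC."""
    path = item_cache_path(cache_dir, search_params_hash, uid)
    if path.exists():
        return json.loads(path.read_text())
    items = query_fn(uid)  # placeholder for the real Planetary Computer query
    path.write_text(json.dumps(items))
    return items
```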
Agreed with all of that! It didn't seem worth optimizing further yet, because currently our only use case for this functionality is experimentation, where everything will be a subset of the competition data anyway. There are certainly lots of ways we could do it, though, and I agree that we'd want to save results at the per-sample level rather than as a whole csv. One option is to keep things as they are for now, and file an issue for making the functionality (of using past PC search results) more broadly available to users down the road.
Given that this is purely for our own needs and since that file on s3 is not changing, let's just add a single line in the relevant location that says
and then the rest of the code can be the same. We can then file an issue to resolve that to its final state (either remove the check on s3 or build out functionality to support this for users). In short, my preference is to hard-code a link to that file rather than trying to configure it, since it makes it easier to use and easier to remove. What do you think? EDIT: it looks like we only load from disk once, so that first if block can just load the data, not copy it to disk, so:
I was thinking along the same lines of having a flag instead of specifying an exact path. What about if, instead of checking whether the file exists, we have a flag in the config? If we use this if/else:
It will always be true, because the file on s3 will always exist. Instead, we could hard-code the path to our past PC search results, and pull from it if we indicate that we're in "experiment" mode. Thoughts?
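A hypothetical sketch of that flag-based branch (the flag name, s3 path, and `search_fn` are placeholders, not the actual implementation):

```python
import pandas as pd


def get_search_results(config, samples: pd.DataFrame, search_fn) -> pd.DataFrame:
    """Return PC search results from the saved experiment file, or run a fresh search.

    `use_saved_experiment_results` is an assumed config flag; `search_fn`
    stands in for the real Planetary Computer search.
    """
    if getattr(config, "use_saved_experiment_results", False):
        # Hard-coded path to our past results; reading from s3 requires s3fs.
        return pd.read_csv("s3://<bucket>/<path>/sentinel_metadata.csv")
    return search_fn(samples)
```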
@ejm714 Updated based on our pair earlier, ready for another look!
This is basically there! Just going to be a stickler about something I got burned by in the past
🎉
closes #14
closes #16
14 - Using past PC search results
Uses past results from searching the planetary computer if available.
Config: There is now a `pc_search_results_dir` argument in the `FeaturesConfig` for users to point to a directory with PC search results (`sentinel_metadata.csv` and `sample_item_map.json`).
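As a sketch of what that config field looks like in use (a simplified stand-in; the real `FeaturesConfig` has other fields not shown here):

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class FeaturesConfig:
    """Simplified stand-in for the repo's real FeaturesConfig."""

    # Directory containing sentinel_metadata.csv and sample_item_map.json
    # from a previous Planetary Computer search; None means search from scratch.
    pc_search_results_dir: Optional[Path] = None


config = FeaturesConfig(pc_search_results_dir=Path("experiments/pc_search"))
```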
Example with directory specified
Output:
Example without directory specified
The below searches the PC. It does not save the PC search results to the cache_dir by default, because we often don't want this behavior. E.g., if we run prediction immediately after training as part of an experiment, we want to re-search the PC for the set of prediction samples, not use the results of searching for the training samples.
Output:
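Putting the two examples together, the branch the pipeline takes might look roughly like this (a sketch; everything except `pc_search_results_dir` and the csv filename is an assumption):

```python
from pathlib import Path

import pandas as pd


def get_sentinel_metadata(config, samples: pd.DataFrame, search_fn) -> pd.DataFrame:
    """Use saved PC search results when a directory is configured, else search.

    `search_fn` stands in for the real Planetary Computer query. Fresh results
    are intentionally not written to the cache_dir (see the note above).
    """
    if config.pc_search_results_dir is not None:
        saved = Path(config.pc_search_results_dir) / "sentinel_metadata.csv"
        return pd.read_csv(saved)
    return search_fn(samples)
```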
16 - Using saved image arrays
Saves image arrays in an organized way so we can check whether each image already exists, and only download it if not. Each array is identified by a unique combination of the feature window size (`image_feature_meter_window` in the features config), sample ID, pystac item ID, and band name.

New organization of image arrays is:

E.g., if `image_feature_meter_window` is 200 and `use_sentinel_bands` is `["B02", "B03"]`, the folder for sample `0856c3740614b5ee606f82d6c3a215a0` might look like:

My thought for where we throw errors is that any pystac item we are using must have all of the required bands, because some of the features use multiple bands. However, if a sample has no imagery, that does not raise an error.
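The PR's actual tree listing isn't shown here, but the description implies a layout keyed on those four identifiers; a hypothetical sketch of the check-before-download logic (the nesting order and helper names are assumptions):

```python
from pathlib import Path

import numpy as np


def image_array_path(cache_dir: Path, meter_window: int, sample_id: str,
                     item_id: str, band: str) -> Path:
    # One array per (window, sample, pystac item, band) combination; this
    # particular nesting is an assumed illustration of the organization above.
    return cache_dir / str(meter_window) / sample_id / item_id / f"{band}.npy"


def get_image_array(cache_dir: Path, meter_window: int, sample_id: str,
                    item_id: str, band: str, download_fn) -> np.ndarray:
    """Load a saved image array if present; otherwise download and save it."""
    path = image_array_path(cache_dir, meter_window, sample_id, item_id, band)
    if path.exists():
        return np.load(path)
    array = download_fn(sample_id, item_id, band)  # placeholder for the real download
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, array)
    return array
```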
Bonus
- Changes to `ExperimentConfig.run_experiment`, and to `pipeline.run_prediction`
- Updates to `satellite_data.get_items_metadata` based on learnings from running the full PC search