-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement distance_to_anomaly (Closes #209) #324
Implement distance_to_anomaly (Closes #209) #324
Conversation
847eb53
to
c789da9
Compare
Ready for review. The main function, Anyway, thought it best to keep the If |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, looks really good overall! I wrote a few comments that mainly regard docs and checks. I think we can keep the GDAL version.
anomaly_raster: The raster in which the distances | ||
to the nearest anomalous value are determined. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the parameters, anomaly_raster_profile
and anomaly_raster_data
are separate. I suppose you just forgot to update the docs here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, updating!
raster_profile: Union[profiles.Profile, dict], | ||
anomaly_raster_profile: Union[profiles.Profile, dict], | ||
anomaly_raster_data: np.ndarray, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question about these: Is raster_profile
the target/output profile? For me, it was not (and maybe isn't still) completely clear why we need to specify two raster profiles. In what cases raster_profile
!= anomaly_raster_profile
? If we keep both, the naming/docs could be clarified.
Also, our raster processing functions have taken in Rasterio's DatasetReaders so far and this implementation conflicts that convention. However, I have to say I prefer this way in general, so maybe this does not need to be changed and we can considering refactoring the other tools.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the initial (non-gdal) approach it was possible to calculate the distances for a raster that is not equivalent in profile to the raster in which the anomalies are determined from. So anomaly raster could have e.g. different resolution to the raster in which the distances were calculated to. For simplification sense I have now removed this option, I suppose it is probably not a common use case. Easier to maintain without it 👍
threshold_criteria_value: THRESHOLD_CRITERIA_VALUE_TYPE, | ||
threshold_criteria: THRESHOLD_CRITERIA_TYPE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These input types do not transfer to the docs well. I would rather just put the Union/Literal types directly to the function signature for better user experience and documentation even if I understand they are used repeatedly in this module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I keep trying to minimize code related to type annotations but Python does not make it easy 😢 . I will update to the explicit versions.
anomaly_raster_data: np.ndarray, | ||
threshold_criteria_value: THRESHOLD_CRITERIA_VALUE_TYPE, | ||
threshold_criteria: THRESHOLD_CRITERIA_TYPE, | ||
) -> np.ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have had the convention of always returning raster profile/meta in raster processing tools. I would follow this schema in this tool too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
raster_width = raster_profile.get("width") | ||
raster_height = raster_profile.get("height") | ||
|
||
if not isinstance(raster_width, int) or not isinstance(raster_height, int): | ||
raise InvalidParameterValueException( | ||
f"Expected raster_profile to contain integer width and height. {raster_profile}" | ||
) | ||
|
||
raster_transform = raster_profile.get("transform") | ||
|
||
if not isinstance(raster_transform, transform.Affine): | ||
raise InvalidParameterValueException( | ||
f"Expected raster_profile to contain an affine transformation. {raster_profile}" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it on purpose that we only check raster_profile
but not anomaly_raster_profile
?
While reviewing this I got an idea: should we add a check called check_raster_meta
or check_raster_profile
that allows selecting (or deselecting) profile/meta elements to check? This check could essentially do what you are doing here for the specified elements and raise the exceptions too. Currently many of our raster processing functions do not check if metadata contain the required elements.
If we add such a check (and maybe even if we didn't), we could also consider a new dedicated exception type such as InvalidRasterMetaException
.
I did some testing with the non-gdal version using various raster files and using a randomly generated anomaly profile. Everything has looked good so far 👍 I'm wondering, though, if the function should handle nodata values in some more sophisticated way, or is it ok to assume that the user has replaced nodata values with nan for the input raster? |
Yes, you are correct that if |
Ready for review, let me know about the nodata handling! |
Let's implement nodata handling in this tool now. However, if you have time and interest to suggest/implement a centralized approach for nodata handling across the toolkit in the upcoming weeks/months, please go ahead. |
Performance issues are a problem which can be solved by using the alternative functionality that uses gdal_proximity function instead of existing distance_computation implementation. The GDAL implementation is orders of magnitude faster.
Updated tests according to previous changes and added some new ones to test raster profile check.
334d523
to
5d25fdb
Compare
Uses nodata value from raster profile and falls back to using np.nan if profile does not contain it.
Nodata handling implemented and ready for review again. Nodata only affects the definition of anomalous values in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed one potential error in the code, see my comment below. Also, I think it could be useful to add a check that threshold_criteria_value
and threshold_criteria
match (to check if threshold_criteria_value
is a tuple when "in_between" or "outside" is used – not sure if numeric checks for the threshold_criteria_value
should be included). I couldn't test this locally and check the notebook, but reading the code looks otherwise good to me!
"in_between": lambda anomaly_raster_data: np.where( | ||
np.logical_and(anomaly_raster_data > threshold_criteria[0], anomaly_raster_data < threshold_criteria[1]) | ||
), | ||
"outside": lambda anomaly_raster_data: np.where( | ||
np.logical_or(anomaly_raster_data < threshold_criteria[0], anomaly_raster_data > threshold_criteria[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like threshold_criteria
here should bethreshold_criteria_value
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed and added tests for each criteria!
Thanks for the review, @nmaarnio! Ready for review again 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for leaving my review hanging for so many days. I did some more testing and it looks good to me! I left two minor suggestions, but I'm also ok with merging now. Nice job 👍
threshold_criteria_value: Union[Tuple[Number, Number], Number], | ||
threshold_criteria: Literal["lower", "higher", "in_between", "outside"], | ||
anomaly_raster_data: np.ndarray, | ||
nodata_value: Optional[Number], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this default to Optional[Number] = None?
(The value of the Optional type-hinted parameter doesn't seem to get set to None by default if omitted. The type hint just allows for the type to be either a number or NoneType.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the function is a private one I think it is best to make sure all inputs are explicitly passed when using the function so no functionality is missed. Using Optional
does not mean you need to set the default value, it just means that input is allowed to be passed as None
.
for row in rows | ||
] | ||
all_points = list(chain(*all_points_by_rows)) | ||
all_points_gdf = gpd.GeoDataFrame(geometry=all_points, crs=anomaly_raster_profile["crs"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Just a suggestion, feel free to ignore this if you think best.)
If all_points ends up empty due to nothing matching the criteria, this function could raise an exception with an informative message. The distance_computation does raise an exception when checking the shape, but the message is sort of generic, and it might be good to let the user know why it's happened in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, added a check that the criteria matches at least some of the data!
If the criteria does not match any of the data, an exception is raised to notify the user.
Looks good! |
Merging! |
Performance issues are a problem which can be solved by using the alternative functionality that uses gdal_proximity function instead of existing distance_computation implementation. The GDAL implementation is orders of magnitude faster.
Draft for now until I clean up the implementation.
@tomironkko made an initial draft of this function which might have been better than the current
shapely
distance calculation that uses the existingdistance_computation
one but more than likely much worse than thegdal_proximity
method.