Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement distance_to_anomaly (Closes #209) #324

Merged
merged 17 commits into from
Mar 15, 2024

Conversation

nialov
Copy link
Collaborator

@nialov nialov commented Feb 13, 2024

Performance issues are a problem which can be solved by using the alternative functionality that uses gdal_proximity function instead of existing distance_computation implementation. The GDAL implementation is orders of magnitude faster.

Draft for now until I clean up the implementation.

@tomironkko made an initial draft of this function which might have been better than the current shapely distance calculation that uses the existing distance_computation one but more than likely much worse than the gdal_proximity method.

@nialov nialov force-pushed the 209-add-distance-to-anomaly-v2 branch from 847eb53 to c789da9 Compare February 14, 2024 09:47
@nialov
Copy link
Collaborator Author

nialov commented Feb 14, 2024

Ready for review. The main function, distance_to_anomaly, works on all platforms while distance_to_anomaly_gdal only works on Linux and MacOS. Based on initial tests I thought the GDAL utility functions would work on all platforms but found out later they are inconsistent on Windows. The main problem on Windows is the lack of osgeo_utils module in the conda environment.

Anyway, thought it best to keep the GDAL method for later reference. I just marked the corresponding test as an xfail when run on Windows. It is orders of magnitude faster and at least somewhat more memory-efficient (filtering to anomalous values is still done in-memory).

If distance_computation, which the main function relies on, requires optimization later, you should check out the original branch by @tomironkko here: master...209-add-distance-to-anomaly. The "raster" distance math approach is more efficient but it can obviously only handle raster data while the current distance_computation handles vector data as well. You could rasterize first and then use raster math. Alternatively the missing osgeo_utils could be investigated.

@nialov nialov marked this pull request as ready for review February 14, 2024 10:04
Copy link
Collaborator

@nmaarnio nmaarnio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, looks really good overall! I wrote a few comments that mainly regard docs and checks. I think we can keep the GDAL version.

Comment on lines 39 to 36
anomaly_raster: The raster in which the distances
to the nearest anomalous value are determined.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the parameters, anomaly_raster_profile and anomaly_raster_data are separate. I suppose you just forgot to update the docs here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, updating!

Comment on lines 23 to 21
raster_profile: Union[profiles.Profile, dict],
anomaly_raster_profile: Union[profiles.Profile, dict],
anomaly_raster_data: np.ndarray,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question about these: Is raster_profile the target/output profile? For me, it was not (and maybe isn't still) completely clear why we need to specify two raster profiles. In what cases raster_profile != anomaly_raster_profile? If we keep both, the naming/docs could be clarified.

Also, our raster processing functions have taken in Rasterio's DatasetReaders so far and this implementation conflicts that convention. However, I have to say I prefer this way in general, so maybe this does not need to be changed and we can considering refactoring the other tools.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the initial (non-gdal) approach it was possible to calculate the distances for a raster that is not equivalent in profile to the raster in which the anomalies are determined from. So anomaly raster could have e.g. different resolution to the raster in which the distances were calculated to. For simplification sense I have now removed this option, I suppose it is probably not a common use case. Easier to maintain without it 👍

Comment on lines 26 to 27
threshold_criteria_value: THRESHOLD_CRITERIA_VALUE_TYPE,
threshold_criteria: THRESHOLD_CRITERIA_TYPE,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These input types do not transfer to the docs well. I would rather just put the Union/Literal types directly to the function signature for better user experience and documentation even if I understand they are used repeatedly in this module.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I keep trying to minimize code related to type annotations but Python does not make it easy 😢 . I will update to the explicit versions.

anomaly_raster_data: np.ndarray,
threshold_criteria_value: THRESHOLD_CRITERIA_VALUE_TYPE,
threshold_criteria: THRESHOLD_CRITERIA_TYPE,
) -> np.ndarray:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have had the convention of always returning raster profile/meta in raster processing tools. I would follow this schema in this tool too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 👍

Comment on lines 48 to 61
raster_width = raster_profile.get("width")
raster_height = raster_profile.get("height")

if not isinstance(raster_width, int) or not isinstance(raster_height, int):
raise InvalidParameterValueException(
f"Expected raster_profile to contain integer width and height. {raster_profile}"
)

raster_transform = raster_profile.get("transform")

if not isinstance(raster_transform, transform.Affine):
raise InvalidParameterValueException(
f"Expected raster_profile to contain an affine transformation. {raster_profile}"
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it on purpose that we only check raster_profile but not anomaly_raster_profile?

While reviewing this I got an idea: should we add a check called check_raster_meta or check_raster_profile that allows selecting (or deselecting) profile/meta elements to check? This check could essentially do what you are doing here for the specified elements and raise the exceptions too. Currently many of our raster processing functions do not check if metadata contain the required elements.

If we add such a check (and maybe even if we didn't), we could also consider a new dedicated exception type such as InvalidRasterMetaException.

@em-t
Copy link
Collaborator

em-t commented Feb 15, 2024

I did some testing with the non-gdal version using various raster files and using a randomly generated anomaly profile. Everything has looked good so far 👍 I'm wondering, though, if the function should handle nodata values in some more sophisticated way, or is it ok to assume that the user has replaced nodata values with nan for the input raster?

@nialov
Copy link
Collaborator Author

nialov commented Feb 23, 2024

I did some testing with the non-gdal version using various raster files and using a randomly generated anomaly profile. Everything has looked good so far 👍 I'm wondering, though, if the function should handle nodata values in some more sophisticated way, or is it ok to assume that the user has replaced nodata values with nan for the input raster?

Yes, you are correct that if np.nan is not the true nodata value, then the nodata values will be left among the possible anomaly values. I have personally tried to advocate for a centralized approach to handling of nodata but if there are no current plans for such approach, I can implement nodata handling in this function as well (@nmaarnio)?

@nialov
Copy link
Collaborator Author

nialov commented Feb 23, 2024

Ready for review, let me know about the nodata handling!

@nmaarnio
Copy link
Collaborator

Let's implement nodata handling in this tool now. However, if you have time and interest to suggest/implement a centralized approach for nodata handling across the toolkit in the upcoming weeks/months, please go ahead.

Performance issues are a problem which can be solved by using the
alternative functionality that uses gdal_proximity function instead of
existing distance_computation implementation. The GDAL implementation is
orders of magnitude faster.
Based on review by @nmaarnio and @em-t, updated faulty documentation,
simplified inputs and outputs and made a separate check for raster
profile values.
Updated tests according to previous changes and added some new ones to
test raster profile check.
@nialov nialov force-pushed the 209-add-distance-to-anomaly-v2 branch from 334d523 to 5d25fdb Compare March 7, 2024 11:35
@nialov
Copy link
Collaborator Author

nialov commented Mar 7, 2024

Nodata handling implemented and ready for review again. Nodata only affects the definition of anomalous values in _fits_criteria. The output will never have nodata values in it.

Copy link
Collaborator

@nmaarnio nmaarnio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed one potential error in the code, see my comment below. Also, I think it could be useful to add a check that threshold_criteria_value and threshold_criteria match (to check if threshold_criteria_value is a tuple when "in_between" or "outside" is used – not sure if numeric checks for the threshold_criteria_value should be included). I couldn't test this locally and check the notebook, but reading the code looks otherwise good to me!

Comment on lines 114 to 118
"in_between": lambda anomaly_raster_data: np.where(
np.logical_and(anomaly_raster_data > threshold_criteria[0], anomaly_raster_data < threshold_criteria[1])
),
"outside": lambda anomaly_raster_data: np.where(
np.logical_or(anomaly_raster_data < threshold_criteria[0], anomaly_raster_data > threshold_criteria[1])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like threshold_criteria here should bethreshold_criteria_value.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed and added tests for each criteria!

@nialov
Copy link
Collaborator Author

nialov commented Mar 11, 2024

Thanks for the review, @nmaarnio! Ready for review again 👍

Copy link
Collaborator

@nmaarnio nmaarnio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now!

Copy link
Collaborator

@em-t em-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for leaving my review hanging for so many days. I did some more testing and it looks good to me! I left two minor suggestions, but I'm also ok with merging now. Nice job 👍

threshold_criteria_value: Union[Tuple[Number, Number], Number],
threshold_criteria: Literal["lower", "higher", "in_between", "outside"],
anomaly_raster_data: np.ndarray,
nodata_value: Optional[Number],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this default to Optional[Number] = None?

(The value of the Optional type-hinted parameter doesn't seem to get set to None by default if omitted. The type hint just allows for the type to be either a number or NoneType.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the function is a private one I think it is best to make sure all inputs are explicitly passed when using the function so no functionality is missed. Using Optional does not mean you need to set the default value, it just means that input is allowed to be passed as None.

for row in rows
]
all_points = list(chain(*all_points_by_rows))
all_points_gdf = gpd.GeoDataFrame(geometry=all_points, crs=anomaly_raster_profile["crs"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Just a suggestion, feel free to ignore this if you think best.)

If all_points ends up empty due to nothing matching the criteria, this function could raise an exception with an informative message. The distance_computation does raise an exception when checking the shape, but the message is sort of generic, and it might be good to let the user know why it's happened in this case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, added a check that the criteria matches at least some of the data!

If the criteria does not match any of the data, an exception is raised to notify
the user.
@nialov
Copy link
Collaborator Author

nialov commented Mar 15, 2024

Ready for review/merging 👍 @nmaarnio, @em-t

@em-t
Copy link
Collaborator

em-t commented Mar 15, 2024

Looks good!

@nmaarnio
Copy link
Collaborator

Merging!

@nmaarnio nmaarnio merged commit e01bf58 into GispoCoding:master Mar 15, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants