Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

384 optimize distance to anomaly #423

Merged

Conversation

okolekar
Copy link
Contributor

@okolekar okolekar commented Sep 5, 2024

Hello,
I have added the function "distance_to_anomaly_gdal_ComputeProximity" to reduce the runtime for distance computation. The new function "distance_to_anomaly_gdal_ComputeProximity" is a C++ based API function and hence is faster as compared to "distance_to_anomaly" (which was using the point based scheme) and "distance_to_anomaly_gdal" (which was using the python API for distance computation) function. All the three produce the exact same result which is verified in the python notebook distance_to_anomaly.ipynb.

Below I have attached some graphs inorder showcase the reduction in time achieved by the new function.

POINT TO NOTE: I have kept the precommits off due to errors.

Thankyou.

image

nmaarnio and others added 3 commits May 2, 2024 11:56
…ce_to_anomaly_gdal_ComputeProximity for reduced run time

The earlier functions namely distance_to_anomaly and distance_to_anomaly_gdal are slower as compared to the newly added function. The new function 'distance_to_anomaly_gdal_ComputeProximity' is atleast 97% more efficient than the old Distance_to_anomaly function and is about 75% more efficient than the distance_to_anomaly_gdal function. The output was tested on 5 different raster files and there was no difference in the output file of the three function found.
@okolekar okolekar linked an issue Sep 5, 2024 that may be closed by this pull request
@okolekar okolekar marked this pull request as draft September 5, 2024 07:03
@okolekar okolekar marked this pull request as ready for review September 5, 2024 07:07
@nialov
Copy link
Collaborator

nialov commented Sep 5, 2024

I will do a more in-depth review later but first please add a test to check the implemented functions! See e.g. tests/raster_processing/test_distance_to_anomaly.py for inspiration on how to implement tests or other files in the tests/ directory. The tests are run by pytest. The purpose of the test(s) is to mainly check that the function works with some pre-defined input and the result is as expected (type is correct, value is sensible, etc.). I see you already added examples of use in the notebook but those are not checked automatically on Github actions so it does not replace actual tests.

@nmaarnio
Copy link
Collaborator

@nialov , do you have time to review this sometime this month? It'd be better if you can, but I can too if needed.

@nialov
Copy link
Collaborator

nialov commented Sep 20, 2024

I will try to review next week @nmaarnio though I can not promise it. Might get delayed to 7.10. due to work trip.

Based on a preliminary look, as this uses the GDAL API, I can foresee problems with Windows which will delay merging.

@okolekar The tests (All checks have failed) are not running successfully, see e.g.:

https://github.com/GispoCoding/eis_toolkit/actions/runs/10809648448/job/30251164407?pr=423

You can check the errors and try to see if you can get them fixed. I can help with this during review but as said before that might take time.

@okolekar
Copy link
Contributor Author

okolekar commented Sep 20, 2024

Hello to all,
I checked the error message @nialov. It appears to me that the test fails due to the ---> ModuleNotFoundError: No module named '_gdal_array' which is a problem with GDAL. This is due to the GDAL library. The test passes at my end on my computer.
What I can do is :-
I can add the function @pytest.mark.xfail(
sys.platform == "win32", reason="GDAL utilities are not available on Windows.", raises=ModuleNotFoundError)
Which essentially skips the testing.

Regards!
Omkar Kolekar

@nialov
Copy link
Collaborator

nialov commented Sep 23, 2024

Hello to all, I checked the error message @nialov. It appears to me that the test fails due to the ---> ModuleNotFoundError: No module named '_gdal_array' which is a problem with GDAL. This is due to the GDAL library. The test passes at my end on my computer. What I can do is :- I can add the function @pytest.mark.xfail( sys.platform == "win32", reason="GDAL utilities are not available on Windows.", raises=ModuleNotFoundError) Which essentially skips the testing.

Regards! Omkar Kolekar

Do you use a Windows computer and have you installed eis_toolkit with conda, pip or poetry (or some other method)? If the problem is that Windows gdal library does not have the same bindings as linux, then the optimization is not available for Windows users. The target audience probably mostly uses Windows.

@okolekar
Copy link
Contributor Author

Hello to all, I checked the error message @nialov. It appears to me that the test fails due to the ---> ModuleNotFoundError: No module named '_gdal_array' which is a problem with GDAL. This is due to the GDAL library. The test passes at my end on my computer. What I can do is :- I can add the function @pytest.mark.xfail( sys.platform == "win32", reason="GDAL utilities are not available on Windows.", raises=ModuleNotFoundError) Which essentially skips the testing.
Regards! Omkar Kolekar

Do you use a Windows computer and have you installed eis_toolkit with conda, pip or poetry (or some other method)? If the problem is that Windows gdal library does not have the same bindings as linux, then the optimization is not available for Windows users. The target audience probably mostly uses Windows.

I do use a Windows machine with a conda environment but for development purposes, I have not installed eis_toolkit in this environment. If this still is a problem, I have an idea: we can create a C++ library/function to calculate the distance to anomaly and call this function in python program. The only issue would be to understand how gdal_ComputeProximity works.
Warm Regards,
Omkar

@nialov
Copy link
Collaborator

nialov commented Sep 24, 2024

I think the primary question here is whythe eis_toolkit environment does not have _gdal_array importable in the Linux (and probably Windows environemnt). What version of gdal do you use? To check version:

➜ gdalinfo --version
GDAL 3.9.2, released 2024/08/13

You should try to get your code to work in the environment defined in environment.yaml or in the eis_toolkit environment installed with conda. See instructions/dev_setup_without_docker_with_conda.md and if it does not work in the environment, try to pinpoint why exactly, i.e., what is the difference between your working environment and environments of eis_toolkit.

@nialov
Copy link
Collaborator

nialov commented Sep 24, 2024

Also I misspoke earlier about Windows difficulties, I was correlating tests failing with issues on Windows platform but the checks here on GitHub run on linux (Ubuntu) only currently. So I suppose the issue is the Linux environment which might be difficult for you to solve if you on Windows. I can take a look but it will sadly get delayed to 7.10. or onwards

@okolekar
Copy link
Contributor Author

Also I misspoke earlier about Windows difficulties, I was correlating tests failing with issues on Windows platform but the checks here on GitHub run on linux (Ubuntu) only currently. So I suppose the issue is the Linux environment which might be difficult for you to solve if you on Windows. I can take a look but it will sadly get delayed to 7.10. or onwards

I checked the GDAL version I use
gdalinfo --version
GDAL 3.6.2, released 2023/01/02
I do work on Windows OS. I need to check the possibility to work on Linux (Ubuntu).

@nialov
Copy link
Collaborator

nialov commented Sep 24, 2024

This branch is based on a somewhat old version of eis_toolkit. Can you do a merge and push it. Run the following commands in the repository with your edits but make sure everything is committed first (no unstaged changes):

# Check that upstream points to GispoCoding/eis_toolkit:
git remote -v
# If it does not or it does not exist you can report back here. Otherwise, continue:
git fetch upstream
git merge upstream/master
# origin should point to okolekar/eis_toolkit
git push origin

You should not run into any conflicts but if the merge is not completed successfully and you get a dirty git status, you can do git merge --abort to abort and report back here.

…ote-tracking branch 'upstream/master' into 384-optimize-distance-computation
@nialov
Copy link
Collaborator

nialov commented Sep 24, 2024

The issue seems to be that gdal is not installed with numpy bindings when installed with poetry, disabling some raster behavior. I believe I have ran into this before. Your tests run successfully on both Linux and Windows when using conda to install the environment.

@okolekar
Copy link
Contributor Author

This branch is based on a somewhat old version of eis_toolkit. Can you do a merge and push it. Run the following commands in the repository with your edits but make sure everything is committed first (no unstaged changes):

# Check that upstream points to GispoCoding/eis_toolkit:
git remote -v
# If it does not or it does not exist you can report back here. Otherwise, continue:
git fetch upstream
git merge upstream/master
# origin should point to okolekar/eis_toolkit
git push origin

You should not run into any conflicts but if the merge is not completed successfully and you get a dirty git status, you can do git merge --abort to abort and report back here.

My merge is completed without any conflicts.

@nialov
Copy link
Collaborator

nialov commented Sep 27, 2024

I will continue this after 7.10. There might be ways to circumvent the poetry environment issue but I am going to guess that if the code is used with the current imports it cannot be run in the poetry environment but works on conda across platforms (and with pip).

Copy link
Collaborator

@nialov nialov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code in terms of functionality seems to be working well in tests and in the notebook. My comments mostly relate to style and trying to minimize code duplication, especially in tests. If you have the interest, you could try to refactor some of the code which is identical in functions _distance_to_anomaly_gdal_ComputeProximity and _distance_to_anomaly so that same code is used only once.

@@ -239,3 +240,113 @@ def _distance_to_anomaly(
)

return distance_array

def _distance_to_anomaly_gdal_ComputeProximity(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use snakecase for function names in Python:

Suggested change
def _distance_to_anomaly_gdal_ComputeProximity(
def _distance_to_anomaly_gdal_compute_proximity(

After renaming, check the code, notebook and tests that you use the new name.

)
#converting True False values to binary formant.
converted_values = np.where(data_fits_criteria,1,0)
driver = gdal.GetDriverByName("MEM")
Copy link
Collaborator

@nialov nialov Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gdal does not raise exceptions unless told to do so. I implemented for my own code a function, toggle_gdal_exceptions which is already imported in this file. I suggest to wrap all gdal code with it. E.g.

    with toggle_gdal_exceptions():
        driver = gdal.GetDriverByName("MEM")

        temp_raster = driver.Create("", width, height, 1, gdal.GDT_Float32)
        temp_raster.SetGeoTransform(x_geo + y_geo)
        band = temp_raster.GetRasterBand(1)
        band.WriteArray(converted_values)
        band.SetNoDataValue(nodatavalue)
    
        # Create empty proximity raster
        out_raster = driver.Create("",  width, height, 1, gdal.GDT_Float32)
        out_raster.SetGeoTransform(x_geo + y_geo)   
        out_raster.SetProjection(crs)
        out_band = out_raster.GetRasterBand(1)
        out_band.SetNoDataValue(nodatavalue)
        options = ['values=1','distunits=GEO']
        
        # Compute proximity
        gdal.ComputeProximity(band, out_band, options)

        # Create outputs
        out_array = out_band.ReadAsArray()

Any gdal code run in the block will now raise exceptions.

Alternatively, enable the exceptions as I have done in the function before running your code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have enabled the exceptions globally.

Comment on lines 312 to 316
This function demonstrates superior performance compared to the distance_to_anomaly
and distance_to_anomaly_gdal functions, as it uses a low-level, C++-based API
within the GDAL library. By directly computing the proximity map from the
source dataset, it benefits from the core-level optimizations inherent to GDAL,
ensuring enhanced efficiency and speed.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this below the other paragraph as it is not as relevant to the end-user.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the notebook for extra code you might have put there during development. See e.g.

import sys
sys.path.append("E:/EIS/aus_CMAAS_Projekt")
from beak.utilities.raster_processing import calculate_distance_from_raster

in one of the cells.

@@ -90,3 +92,36 @@ def _distance_computation(
)

return distance_matrix


def _distance_computation_optimized(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is currently not being used anywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not code the function '_distance_computation_optimized'. However, I have removed the function in the new pull request.

"expected_mean",
]
),
[
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you reuse the same pytest parameters exactly as my test function, I recommend creating a variable with the parameters and using it in both pytest.mark.parametrize inputs:

EXPECTED_PYTEST_PARAMS = [
        pytest.param(
            SMALL_RASTER_PROFILE,
            SMALL_RASTER_DATA,
            5.0,
            "lower",
            EXPECTED_SMALL_RASTER_SHAPE,
            5.694903,
            id="small_raster_lower",
        ),
        pytest.param(
            SMALL_RASTER_PROFILE,
            SMALL_RASTER_DATA,
            5.0,
            "higher",
            EXPECTED_SMALL_RASTER_SHAPE,
            6.451948,
            id="small_raster_higher",
        ),
        ...
        ]

Usage:

@pytest.mark.parametrize(
    ",".join(
        [
            "anomaly_raster_profile",
            "anomaly_raster_data",
            "threshold_criteria_value",
            "threshold_criteria",
            "expected_shape",
            "expected_mean",
        ]
    ),
    EXPECTED_PYTEST_PARAMS,
)
def test_distance_to_anomaly_expected(
@pytest.mark.parametrize(
    ",".join(
        [
            "anomaly_raster_profile",
            "anomaly_raster_data",
            "threshold_criteria_value",
            "threshold_criteria",
            "expected_shape",
            "expected_mean",
        ]
    ),
    EXPECTED_PYTEST_PARAMS,

)
def test_distance_to_anomaly_gdal_ComputeProximity_expected(

Also check if you use same parameters elsewhere and do the same there! (test_distance_to_anomaly_gdal_ComputeProximity_expected_check)

),
],
)
def test_distance_to_anomaly_gdal_ComputeProximity_expected_check(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would critically review, if checking most of the same exceptions as test_distance_to_anomaly_check is required here. At the very least, you need to create a variable with the same parameters as instructed in my above review comment.

@okolekar
Copy link
Contributor Author

The code in terms of functionality seems to be working well in tests and in the notebook. My comments mostly relate to style and trying to minimize code duplication, especially in tests. If you have the interest, you could try to refactor some of the code which is identical in functions _distance_to_anomaly_gdal_ComputeProximity and _distance_to_anomaly so that same code is used only once.

To avoid repetition of code, I have created a function named '_validate_threshold_criteria' that contains the common code from both the functions ('_distance_to_anomaly_gdal_ComputeProximity' and '_distance_to_anomaly') and both the functions call '_validate_threshold_criteria' internally.

1) renamed _distance_to_anomaly_gdal_ComputeProximity to _distance_to_anomaly_gdal_compute_proximity
2) enabled toggle_gdal_exceptions globaly in the distance_to_anomaly.py
3) Removed unused code form python notebook
4) Added all the parameters in a single variable in test_distance_to_anomaly_gdal_compute_proximity_expeccted_check
@nialov
Copy link
Collaborator

nialov commented Oct 21, 2024

I suppose this is ready for review again? I will take a look in a week (or two).

@okolekar
Copy link
Contributor Author

okolekar commented Oct 21, 2024

I suppose this is ready for review again? I will take a look in a week (or two).

Yes, it is ready for review again. Thank you for the support.
I have also taken care of precommits this time, as I have solved the issues with the precommits.

@nmaarnio
Copy link
Collaborator

This PR needs to be merged this week if we want to include it in the released planned for Friday this week. If Nikolas is busy, I might jump in and do a review tomorrow so that testers will be able to run the new version before progress meet.

@okolekar
Copy link
Contributor Author

okolekar commented Oct 21, 2024

This PR needs to be merged this week if we want to include it in the released planned for Friday this week. If Nikolas is busy, I might jump in and do a review tomorrow so that testers will be able to run the new version before progress meet.

Sound okay. If there are any changes needed, I am available to modify them as soon as possible to ensure timely delivery of the code. Also, a point to note here is that the changes were mainly cosmetic.

@nmaarnio
Copy link
Collaborator

Hi @okolekar . I only now realized you have been optimizing distance_to_anomaly here, notdistance_computation. The issue and PR name refer to distance_computation which is the vector processing tool that the original implementation of distance_computation was using under the hood. Did you use any time to think about optimizing distance_computation, or only distance_to_anomaly so far?

Either way, it's of course good that you have been optimizing this tool since it's used a lot. However, we direly need also distance_computation optimized since it's a commonly used tool in MPM. Optimization of distance_computation would also optimize vector_density.

When it comes to actual review of this PR, you should mark the failing test functions to fail if platform is not Windows, so like this:
@pytest.mark.xfail( sys.platform != "win32", reason="gdal_array available only on Windows.", raises=ModuleNotFoundError )
Also, I think we could comment out the old distance_to_anomaly_gdal tool if it is not used, the module has now a lot of tools that might confuse users.

@okolekar
Copy link
Contributor Author

Hi @okolekar . I only now realized you have been optimizing distance_to_anomaly here, notdistance_computation. The issue and PR name refer to distance_computation which is the vector processing tool that the original implementation of distance_computation was using under the hood. Did you use any time to think about optimizing distance_computation, or only distance_to_anomaly so far?

Either way, it's of course good that you have been optimizing this tool since it's used a lot. However, we direly need also distance_computation optimized since it's a commonly used tool in MPM. Optimization of distance_computation would also optimize vector_density.

When it comes to actual review of this PR, you should mark the failing test functions to fail if platform is not Windows, so like this: @pytest.mark.xfail( sys.platform != "win32", reason="gdal_array available only on Windows.", raises=ModuleNotFoundError ) Also, I think we could comment out the old distance_to_anomaly_gdal tool if it is not used, the module has now a lot of tools that might confuse users.

Hi @nmaarnio,
I was asked to optimize the distance_to_anomaly only.
However, I can start with that tool as well. But it will take some time.

Regarding the PR, will upload the suggested changes in an hour or two.
Thanks!

@nmaarnio
Copy link
Collaborator

Okay. I think there has been some miscommunication at some point. Anyway, I will rename this PR and relink the issue as this does not concern distance_computation

@nmaarnio nmaarnio linked an issue Oct 22, 2024 that may be closed by this pull request
@nmaarnio nmaarnio changed the title 384 optimize distance computation 384 optimize distance to anomaly Oct 22, 2024
Commented the old distance_to_anomaly_gdal as there were too many functions for the user. Added an exception in the tests for gdal_array as the gdal_array is available only on the Windows system. Further more updates in the notebook are made as the old distance_to_anomaly_gdal is commented and hence is removed here.
@okolekar
Copy link
Contributor Author

@nmaarnio I have made the requested changes. Now I will shift my focus to optimize the Distance_computation in the vector processing tool.

Copy link
Collaborator

@nmaarnio nmaarnio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After you do the small commenting out change then I'll merge :) thanks for your work with this PR

EDIT: I'll actually do this myself since it's such a small cosmetic thing and then merge the PR 👍🏻

@nmaarnio
Copy link
Collaborator

And sorry @nialov for bypassing you in the review process. There is just some pressure to get this, among some other features/improvements, merged before the next progress meeting. You are of course free to review the implementation when you have time and make improvement suggestions even after this version has been merged.

@nmaarnio nmaarnio merged commit c66cd68 into GispoCoding:master Oct 23, 2024
4 checks passed
@nialov
Copy link
Collaborator

nialov commented Oct 24, 2024

Hey, no problem @nmaarnio . The pull request was quite fine already with tests confirming the functionality so second review would not have been too important.

Also to note that the gdal_array error does not work due to poetry, independent of platform. So it would work if the test installed with pip on linux. Of course poetry is only used on Linux so disabling it on that platform works

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize Distance to anomaly
3 participants