
Fix lint errors in datacube script #36

Merged
merged 8 commits into main from lint/datacube
Nov 16, 2023
Conversation

Contributor

@weiji14 commented Nov 15, 2023

Fixing some pre-commit lint errors from #27, mostly by wrapping line-lengths of docstrings to under 88 characters.

Wrapping docstrings in scripts/datacube.py to under 88 characters.
@weiji14 weiji14 added the documentation Improvements or additions to documentation label Nov 15, 2023
@weiji14 weiji14 self-assigned this Nov 15, 2023
@weiji14 weiji14 changed the title 🚨 Fix E501 Line too long Fix lint errors in datacube script Nov 15, 2023
Fixes F841 (local variable `best_nodata` is assigned to but never used). Previously only the best_clouds variable was used and best_nodata was discarded, but both should inform the selection; this is now done in a single pandas sort_values call.
Comment on lines 150 to 179
best_nodata = (
s2_items_gdf[["s2:nodata_pixel_percentage"]]
.groupby(["s2:nodata_pixel_percentage"])
.sum()
.sort_values(by="s2:nodata_pixel_percentage", ascending=True)
.index[0]
)

best_clouds = (
s2_items_gdf[["eo:cloud_cover"]]
.groupby(["eo:cloud_cover"])
.sum()
.sort_values(by="eo:cloud_cover", ascending=True)
.index[0]
)
least_nodata_and_clouds = s2_items_gdf.sort_values(
by=["s2:nodata_pixel_percentage", "eo:cloud_cover"], ascending=True
).index[0]
Contributor Author

@lillythomas, the linter was picking up that the best_nodata variable wasn't used. I've refactored this to sort based on lowest s2:nodata_pixel_percentage first, and then lowest eo:cloud_cover, so the resulting dataframe (searching between dates 2020-01-13/2020-01-19) would look something like this:

                      datetime  s2:nodata_pixel_percentage  eo:cloud_cover
6  2020-01-17T18:47:09.024000Z                    0.000000       33.132727
5  2020-01-22T18:46:51.024000Z                    0.000017       40.740009
1  2020-02-11T18:45:11.024000Z                    0.000043        2.595613
0  2020-02-16T18:44:39.024000Z                    0.000056       10.043734
4  2020-01-27T18:46:29.024000Z                    0.000066       12.765600
2  2020-02-06T18:45:39.024000Z                    0.000136        2.524842
3  2020-02-01T18:46:11.024000Z                    0.000169        1.933268

And we would pick the first row here (datetime: 2020-01-17T18:47:09.024000Z). Does this look ok? Or should we change the logic a bit (e.g. as long as the nodata_pixel_percentage is <20%, we just pick the lowest cloud cover, which would then be datetime: 2020-02-01T18:46:11.024000Z)?
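The two options being weighed can be sketched with plain pandas against the example table above (the DataFrame below is hand-built from that output for illustration; it is not the repo's code):

```python
import pandas as pd

# Hand-built from the example search results above (dates 2020-01-13/2020-01-19)
s2_items_gdf = pd.DataFrame(
    {
        "datetime": [
            "2020-02-16T18:44:39.024000Z",
            "2020-02-11T18:45:11.024000Z",
            "2020-02-06T18:45:39.024000Z",
            "2020-02-01T18:46:11.024000Z",
            "2020-01-27T18:46:29.024000Z",
            "2020-01-22T18:46:51.024000Z",
            "2020-01-17T18:47:09.024000Z",
        ],
        "s2:nodata_pixel_percentage": [
            0.000056, 0.000043, 0.000136, 0.000169, 0.000066, 0.000017, 0.0,
        ],
        "eo:cloud_cover": [
            10.043734, 2.595613, 2.524842, 1.933268, 12.765600, 40.740009, 33.132727,
        ],
    }
)

# Option 1 (this PR's refactor): sort by nodata first, then cloud cover
best_1 = s2_items_gdf.sort_values(
    by=["s2:nodata_pixel_percentage", "eo:cloud_cover"], ascending=True
).iloc[0]

# Option 2 (suggested alternative): keep anything under 20% nodata,
# then just pick the least cloudy scene
candidates = s2_items_gdf[s2_items_gdf["s2:nodata_pixel_percentage"] < 20]
best_2 = candidates.sort_values(by="eo:cloud_cover").iloc[0]

print(best_1["datetime"])  # 2020-01-17T18:47:09.024000Z
print(best_2["datetime"])  # 2020-02-01T18:46:11.024000Z
```

On this example data the two options pick different scenes, which is exactly the trade-off discussed below.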

Contributor

Yes, I knew that best_nodata wasn't being used; I was mainly thinking that optimizing for least cloud cover is our best bet, given that there is a pre-existing filter for nodata by way of {"op": "<=", "args": [{"property": "s2:nodata_pixel_percentage"}, nodata_pixel_percentage]} in the search query. I think this is a good alternative, but in some/many cases it won't optimize for cloud cover (as this example shows, where the first row has a nontrivial amount of cloudy pixels: 33%).
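For reference, the server-side filter quoted above is a CQL2-JSON expression built as a plain dict. A sketch of how it might be assembled is below; the threshold value, collection name, and the commented-out search call are illustrative assumptions, not the repo's exact code:

```python
# Sketch of the pre-existing server-side nodata filter mentioned above.
# The 20% threshold here is an assumed example value.
nodata_pixel_percentage = 20

cql2_filter = {
    "op": "<=",
    "args": [
        {"property": "s2:nodata_pixel_percentage"},
        nodata_pixel_percentage,
    ],
}

# A pystac-client search would pass it along, e.g. (needs network, so
# shown only as a comment):
# catalog.search(collections=["sentinel-2-l2a"], bbox=bbox, filter=cql2_filter)
print(cql2_filter["op"])  # <=
```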

Contributor Author

Ok, so we would just pick the least cloudy image then, as long as the nodata_pixel_percentage is below the 20% threshold? I.e. simply remove the best_nodata variable and stick to having just best_clouds?

Contributor

Yes I think the best call is for us to just remove the best_nodata variable 👍

Contributor Author

Done at f65e34a!

Using ds_ prefix for xr.Dataset objects, and da_ prefix for xr.DataArray objects.
Increase max-args from the default value of 5 to 6.
@@ -13,3 +13,6 @@ select = [
"UP", # pyupgrade
"W", # pycodestyle warnings
]

[tool.ruff.lint.pylint]
max-args = 6
Contributor Author

Increasing the maximum allowed arguments to functions from the default of 5 to 6, to avoid these errors:

scripts/datacube.py:293:5: PLR0913 Too many arguments in function definition (6 > 5)
scripts/datacube.py:428:5: PLR0913 Too many arguments in function definition (6 > 5)

Alternatively, we could look at simplifying those functions to have fewer arguments.
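One hypothetical way to get back under the max-args limit without raising it would be to bundle related parameters into a dataclass. This is a sketch with invented names, not code from scripts/datacube.py:

```python
from dataclasses import dataclass

# Hypothetical grouping of related search parameters; all names here are
# invented for illustration.
@dataclass
class SearchParams:
    bbox: tuple
    year1: int
    year2: int
    cloud_cover_percentage: float
    nodata_pixel_percentage: float

def search_sentinel2(params: SearchParams, catalog=None):
    # Stand-in for a function that previously took 6 separate arguments;
    # PLR0913 counts parameters, so this signature has 2 instead of 6.
    return (params.year1, params.year2)

params = SearchParams((0, 0, 1, 1), 2020, 2021, 50.0, 20.0)
print(search_sentinel2(params))  # (2020, 2021)
```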

@weiji14 weiji14 marked this pull request as ready for review November 15, 2023 22:00
Fixes `FutureWarning: get_all_items() is deprecated, use item_collection() instead`.
No need to sort by `s2:nodata_pixel_percentage` anymore, just get the Sentinel-2 STAC item with the least cloud cover.
Missed a few more da_ to ds_ renames, following from 2af24be
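A minimal sketch of the selection logic the commits above describe: take the STAC item with the least cloud cover, trusting the search-time nodata filter. `FakeItem` is a stand-in for `pystac.Item` (only `.properties` is used); in the real script the items would come from `catalog.search(...).item_collection()`, the non-deprecated replacement for `get_all_items()`:

```python
# Stand-in for pystac.Item; only the .properties attribute is used here.
class FakeItem:
    def __init__(self, properties):
        self.properties = properties

# In the real script these would come from catalog.search(...).item_collection()
items = [
    FakeItem({"datetime": "2020-01-17T18:47:09Z", "eo:cloud_cover": 33.13}),
    FakeItem({"datetime": "2020-02-01T18:46:11Z", "eo:cloud_cover": 1.93}),
]

# Pick the least cloudy item; nodata was already filtered server-side.
best_item = min(items, key=lambda item: item.properties["eo:cloud_cover"])
print(best_item.properties["datetime"])  # 2020-02-01T18:46:11Z
```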
@yellowcap (Member) left a comment

Looks good overall. If I can be extra picky, could we also use lowercase for some variables that are currently uppercase but are not constants?

Examples:

date, YEAR, MONTH, DAY, CLOUD = get_conditions(year1, year2, cloud_cover_percentage)

def search_dem(BBOX, catalog, epsg):
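The suggested renames might look like this (sketch only; `get_conditions` is replaced by a stub with dummy return values so the snippet runs standalone):

```python
def get_conditions(year1, year2, cloud_cover_percentage):
    # Stand-in stub for the real function; returns dummy values.
    return "2020-01-17", year1, 1, 17, cloud_cover_percentage

# Lowercase names, since these are plain locals rather than constants
date, year, month, day, cloud = get_conditions(2020, 2021, 50.0)

def search_dem(bbox, catalog, epsg):  # was: def search_dem(BBOX, catalog, epsg)
    return (bbox, epsg)

print(year, cloud)  # 2020 50.0
```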

@yellowcap (Member) left a comment

Leave the remaining improvements for later. I will merge so I can work on this file.

@yellowcap yellowcap merged commit 120f8b1 into main Nov 16, 2023
1 check passed
@yellowcap yellowcap deleted the lint/datacube branch November 16, 2023 15:37