Fix lint errors in datacube script #36
Conversation
Wrapping docstrings in scripts/datacube.py to under 88 characters.
Fixes `F841 Local variable 'best_nodata' is assigned to but never used`. Previously only the best_clouds variable was used and best_nodata was ignored, but both should factor into the selection. This is now done in a single pandas sort_values call.
scripts/datacube.py (Outdated)

-        best_nodata = (
-            s2_items_gdf[["s2:nodata_pixel_percentage"]]
-            .groupby(["s2:nodata_pixel_percentage"])
-            .sum()
-            .sort_values(by="s2:nodata_pixel_percentage", ascending=True)
-            .index[0]
-        )
-
-        best_clouds = (
-            s2_items_gdf[["eo:cloud_cover"]]
-            .groupby(["eo:cloud_cover"])
-            .sum()
-            .sort_values(by="eo:cloud_cover", ascending=True)
-            .index[0]
-        )
+        least_nodata_and_clouds = s2_items_gdf.sort_values(
+            by=["s2:nodata_pixel_percentage", "eo:cloud_cover"], ascending=True
+        ).index[0]
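The combined sort in the diff above can be sketched in isolation. The toy DataFrame below is made-up illustrative data, not the repository's actual GeoDataFrame; it only shows how sorting on two columns at once replaces the separate best_nodata / best_clouds groupbys:

```python
import pandas as pd

# Toy stand-in for s2_items_gdf: three Sentinel-2 items with differing
# nodata percentages and cloud cover values.
s2_items_gdf = pd.DataFrame(
    {
        "s2:nodata_pixel_percentage": [0.000017, 0.000000, 0.000169],
        "eo:cloud_cover": [40.74, 33.13, 1.93],
    }
)

# Sort by nodata percentage first, then cloud cover as the tie-breaker,
# and take the index of the best (first) row.
least_nodata_and_clouds = s2_items_gdf.sort_values(
    by=["s2:nodata_pixel_percentage", "eo:cloud_cover"], ascending=True
).index[0]

print(least_nodata_and_clouds)  # → 1, the row with zero nodata wins despite higher clouds
```

Note that because nodata is the primary key, a nearly cloud-free item can still lose to one with marginally less nodata, which is exactly the trade-off discussed below.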
@lillythomas, the linter was picking up that the best_nodata variable wasn't used. I've refactored this to sort based on lowest s2:nodata_pixel_percentage first, and then lowest eo:cloud_cover, so the resulting dataframe (searching between dates 2020-01-13/2020-01-19) would look something like this:
datetime s2:nodata_pixel_percentage eo:cloud_cover
6 2020-01-17T18:47:09.024000Z 0.000000 33.132727
5 2020-01-22T18:46:51.024000Z 0.000017 40.740009
1 2020-02-11T18:45:11.024000Z 0.000043 2.595613
0 2020-02-16T18:44:39.024000Z 0.000056 10.043734
4 2020-01-27T18:46:29.024000Z 0.000066 12.765600
2 2020-02-06T18:45:39.024000Z 0.000136 2.524842
3 2020-02-01T18:46:11.024000Z 0.000169 1.933268
And we would pick the first row here (datetime: 2020-01-17T18:47:09.024000Z). Does this look ok? Or should we change the logic a bit (e.g. as long as the nodata_pixel_percentage is <20%, we just pick the lowest cloud cover, which would then be datetime: 2020-02-01T18:46:11.024000Z)?
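The alternative being proposed here could be sketched as follows. This is a hypothetical illustration with made-up data (not code from scripts/datacube.py), assuming a 20% nodata threshold: filter out items above the threshold first, then pick the minimum cloud cover among what remains.

```python
import pandas as pd

# Made-up example items: the third has the lowest cloud cover but
# exceeds the 20% nodata threshold.
s2_items_gdf = pd.DataFrame(
    {
        "s2:nodata_pixel_percentage": [0.000000, 0.000169, 25.0],
        "eo:cloud_cover": [33.13, 1.93, 0.5],
    }
)

# Keep only items under the nodata threshold, then take the index of
# the least cloudy remaining item.
candidates = s2_items_gdf[s2_items_gdf["s2:nodata_pixel_percentage"] < 20]
best_clouds = candidates["eo:cloud_cover"].idxmin()

print(best_clouds)  # → 1: the 25% nodata item is excluded despite its lowest clouds
```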
Yes, I knew that best_nodata wasn't being used, because I was mainly thinking that optimizing for least cloud cover is our best bet, given that there is a pre-existing filter for nodata by way of {"op": "<=", "args": [{"property": "s2:nodata_pixel_percentage"}, nodata_pixel_percentage]} in the search query. I think this is a good alternative, but in some/many cases it won't optimize for cloud cover (as this example shows, where the first row has a nontrivial amount of cloudy pixels, 33%).
Ok, so we would just pick the least cloudy image then, as long as the nodata_pixel_percentage is below the 20% threshold? I.e. simply remove the best_nodata variable and stick to having just best_clouds?
Yes, I think the best call is for us to just remove the best_nodata variable 👍
Done at f65e34a!
Using ds_ prefix for xr.Dataset objects, and da_ prefix for xr.DataArray objects.
Increase ruff's max-args limit from the default value of 5 to 6.
@@ -13,3 +13,6 @@ select = [
     "UP",  # pyupgrade
     "W",   # pycodestyle warnings
 ]
+
+[tool.ruff.lint.pylint]
+max-args = 6
Increasing the maximum allowed arguments to functions from the default of 5 to 6, to avoid these errors:
scripts/datacube.py:293:5: PLR0913 Too many arguments in function definition (6 > 5)
scripts/datacube.py:428:5: PLR0913 Too many arguments in function definition (6 > 5)
Alternatively, we could look at simplifying those functions to take fewer arguments.
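One common way to simplify such signatures, rather than raising max-args, is to bundle related parameters into a dataclass. The names below are purely illustrative and are not the actual signatures in scripts/datacube.py:

```python
from dataclasses import dataclass


@dataclass
class SearchParams:
    """Hypothetical bundle of related search settings (illustrative only)."""

    start_date: str
    end_date: str
    cloud_cover_percentage: float
    nodata_pixel_percentage: float


def search_sentinel2(params: SearchParams, catalog: str, epsg: int) -> tuple:
    # Three parameters instead of six; the related search settings
    # travel together and stay under ruff's PLR0913 limit.
    return (params.start_date, params.end_date, catalog, epsg)


result = search_sentinel2(
    SearchParams("2020-01-13", "2020-01-19", 50.0, 20.0), "earth-search", 32618
)
```

The trade-off is an extra type to maintain, so bumping max-args to 6 (as this PR does) is a reasonable lighter-weight fix.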
Fixes `FutureWarning: get_all_items() is deprecated, use item_collection() instead`.
No need to sort by `s2:nodata_pixel_percentage` anymore, just get the Sentinel-2 STAC item with the least cloud cover.
Missed a few more da_ to ds_ renames, following from 2af24be
Looks good overall. If I can be extra picky, could we also use lowercase for some variables that are currently uppercase but are not constants? Examples:

Line 409 in 3608d6d:
    date, YEAR, MONTH, DAY, CLOUD = get_conditions(year1, year2, cloud_cover_percentage)

Line 239 in 3608d6d:
    def search_dem(BBOX, catalog, epsg):
Let's leave the remaining improvements for later. I will merge this so I can work on the file.
Fixing some pre-commit lint errors from #27, mostly by wrapping line-lengths of docstrings to under 88 characters.