Fix lint errors in datacube script #36

Merged
merged 8 commits into from
Nov 16, 2023
155 changes: 96 additions & 59 deletions scripts/datacube.py
@@ -1,22 +1,36 @@
"""
STAC Data Processing Script

This Python script processes Sentinel-2, Sentinel-1, and DEM (Digital Elevation Model) data. It utilizes the Planetary Computer API for data retrieval and manipulation.
This Python script processes Sentinel-2, Sentinel-1, and Copernicus DEM
(Digital Elevation Model) data. It utilizes Microsoft's Planetary Computer API
for data retrieval and manipulation.

Constants:
- STAC_API: Planetary Computer API endpoint
- S2_BANDS: Bands used in Sentinel-2 data processing

Functions:
- random_date(start_year, end_year): Generate a random date within a specified range.
- get_week(year, month, day): Get the week range for a given date.
- get_conditions(year1, year2, cloud_cover_percentage): Get random conditions (date, year, month, day, cloud cover) within a specified year range.
- search_sentinel2(week, aoi, cloud_cover_percentage, nodata_pixel_percentage): Search for Sentinel-2 items within a given week and area of interest.
- search_sentinel1(BBOX, catalog, week): Search for Sentinel-1 items within a given bounding box, STAC catalog, and week.
- search_dem(BBOX, catalog, epsg): Search for DEM items within a given bounding box.
- make_dataarrays(s2_items, s1_items, dem_items, BBOX, resolution, epsg): Create xarray DataArrays for Sentinel-2, Sentinel-1, and DEM data.
- merge_datarrays(da_sen2, da_sen1, da_dem): Merge xarray DataArrays for Sentinel-2, Sentinel-1, and DEM.
- process(year1, year2, aoi, resolution): Process Sentinel-2, Sentinel-1, and DEM data for a specified time range, area of interest, and resolution.
- random_date(start_year, end_year):
Generate a random date within a specified range.
- get_week(year, month, day):
Get the week range for a given date.
- get_conditions(year1, year2, cloud_cover_percentage):
Get random conditions (date, year, month, day, cloud cover) within a
specified year range.
- search_sentinel2(week, aoi, cloud_cover_percentage, nodata_pixel_percentage):
Search for Sentinel-2 items within a given week and area of interest.
- search_sentinel1(BBOX, catalog, week):
Search for Sentinel-1 items within a given bounding box, STAC catalog,
and week.
- search_dem(BBOX, catalog, epsg):
Search for DEM items within a given bounding box.
- make_dataarrays(s2_items, s1_items, dem_items, BBOX, resolution, epsg):
Create xarray DataArrays for Sentinel-2, Sentinel-1, and DEM data.
- merge_datarrays(da_sen2, da_sen1, da_dem):
Merge xarray DataArrays for Sentinel-2, Sentinel-1, and DEM.
- process(year1, year2, aoi, resolution):
Process Sentinel-2, Sentinel-1, and DEM data for a specified time range,
area of interest, and resolution.
"""

import random
@@ -63,7 +77,8 @@ def get_week(year, month, day):
- day (int): The day of the date.

Returns:
- str: A string representing the start and end dates of the week in the format 'start_date/end_date'.
- str: A string representing the start and end dates of the week in the
format 'start_date/end_date'.
"""
date = datetime(year, month, day)
start_of_week = date - timedelta(days=date.weekday())
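Reviewer note: the hunk above shows only the start of the week computation. A minimal sketch of how the 'start_date/end_date' string could be built is given here, assuming the week runs Monday to Sunday and dates are formatted as YYYY-MM-DD. The name `week_range` is hypothetical, and the end-of-week step is an assumption because that part of the function is not visible in this diff.

```python
from datetime import datetime, timedelta


def week_range(year, month, day):
    """Return the Monday-to-Sunday week containing the date as 'start/end'."""
    date = datetime(year, month, day)
    # weekday() is 0 for Monday, so this steps back to the Monday of that week
    start_of_week = date - timedelta(days=date.weekday())
    # Assumption: the week ends six days after its Monday, i.e. on Sunday
    end_of_week = start_of_week + timedelta(days=6)
    return f"{start_of_week:%Y-%m-%d}/{end_of_week:%Y-%m-%d}"


print(week_range(2020, 1, 17))  # 2020-01-13/2020-01-19
```

Under these assumptions, any day from 2020-01-13 through 2020-01-19 maps to the same string, which matches the date range quoted in the review thread further down.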
@@ -75,15 +90,18 @@ def get_week(year, month, day):

def get_conditions(year1, year2, cloud_cover_percentage):
"""
Get random conditions (date, year, month, day, cloud cover) within the specified year range.
Get random conditions (date, year, month, day, cloud cover) within the
specified year range.

Parameters:
- year1 (int): The starting year of the date range.
- year2 (int): The ending year of the date range.
- cloud_cover_percentage (int): Maximum acceptable cloud cover percentage for Sentinel-2 images.
- cloud_cover_percentage (int): Maximum acceptable cloud cover percentage
for Sentinel-2 images.

Returns:
- tuple: A tuple containing date, year, month, day, and a constant cloud cover value.
- tuple: A tuple containing date, year, month, day, and a constant cloud
cover value.
"""
date = random_date(year1, year2)
YEAR = date.year
@@ -95,20 +113,29 @@ def get_conditions(year1, year2, cloud_cover_percentage):

def search_sentinel2(week, aoi, cloud_cover_percentage, nodata_pixel_percentage):
"""
Search for Sentinel-2 items within a given week and area of interest (AOI) with specified conditions.
Search for Sentinel-2 items within a given week and area of interest (AOI)
with specified conditions.

Parameters:
- week (str): The week in the format 'start_date/end_date'.
- aoi (shapely.geometry.base.BaseGeometry): Geometry object for an Area of Interest (AOI).
- cloud_cover_percentage (int): Maximum acceptable cloud cover percentage for Sentinel-2 images.
- nodata_pixel_percentage (int): Maximum acceptable percentage of nodata pixels in Sentinel-2 images.
- aoi (shapely.geometry.base.BaseGeometry): Geometry object for an Area of
Interest (AOI).
- cloud_cover_percentage (int): Maximum acceptable cloud cover percentage
for Sentinel-2 images.
- nodata_pixel_percentage (int): Maximum acceptable percentage of nodata
pixels in Sentinel-2 images.

Returns:
- tuple: A tuple containing the STAC catalog, Sentinel-2 items, the bounding box (BBOX), and an EPSG code for the coordinate reference system.
- tuple: A tuple containing the STAC catalog, Sentinel-2 items, the
bounding box (BBOX), and an EPSG code for the coordinate reference
system.

Note:
The function filters Sentinel-2 items based on the specified conditions such as geometry, date, cloud cover, and nodata pixel percentage.
The result is returned as a tuple containing the STAC catalog, Sentinel-2 items, the bounding box of the first item, and an EPSG code for the coordinate reference system.
The function filters Sentinel-2 items based on the specified conditions
such as geometry, date, cloud cover, and nodata pixel percentage. The
result is returned as a tuple containing the STAC catalog, Sentinel-2
items, the bounding box of the first item, and an EPSG code for the
coordinate reference system.
"""

CENTROID = aoi.centroid
@@ -147,25 +174,15 @@ def search_sentinel2(week, aoi, cloud_cover_percentage, nodata_pixel_percentage)

s2_items_gdf = gpd.GeoDataFrame.from_features(s2_items.to_dict())

best_nodata = (
s2_items_gdf[["s2:nodata_pixel_percentage"]]
.groupby(["s2:nodata_pixel_percentage"])
.sum()
.sort_values(by="s2:nodata_pixel_percentage", ascending=True)
.index[0]
)

best_clouds = (
s2_items_gdf[["eo:cloud_cover"]]
.groupby(["eo:cloud_cover"])
.sum()
.sort_values(by="eo:cloud_cover", ascending=True)
.index[0]
)
least_nodata_and_clouds = s2_items_gdf.sort_values(
by=["s2:nodata_pixel_percentage", "eo:cloud_cover"], ascending=True
).index[0]
Contributor Author

@lillythomas, the linter was picking up that the best_nodata variable wasn't used. I've refactored this to sort based on lowest s2:nodata_pixel_percentage first, and then lowest eo:cloud_cover, so the resulting dataframe (searching between dates 2020-01-13/2020-01-19) would look something like this:

                      datetime  s2:nodata_pixel_percentage  eo:cloud_cover
6  2020-01-17T18:47:09.024000Z                    0.000000       33.132727
5  2020-01-22T18:46:51.024000Z                    0.000017       40.740009
1  2020-02-11T18:45:11.024000Z                    0.000043        2.595613
0  2020-02-16T18:44:39.024000Z                    0.000056       10.043734
4  2020-01-27T18:46:29.024000Z                    0.000066       12.765600
2  2020-02-06T18:45:39.024000Z                    0.000136        2.524842
3  2020-02-01T18:46:11.024000Z                    0.000169        1.933268

And we would pick the first row here (datetime: 2020-01-17T18:47:09.024000Z). Does this look ok? Or should we change the logic a bit, e.g. as long as the nodata_pixel_percentage is <20%, we just pick the lowest cloud cover, which would then be datetime: 2020-02-01T18:46:11.024000Z?
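To make the two options concrete, here is a plain-Python sketch (no pandas required) that applies both selection rules to the rows from the table above. The function names `pick_sorted` and `pick_least_cloudy` and the `max_nodata` parameter are hypothetical; the script itself works on a GeoDataFrame, so this shows only the selection logic, not the actual implementation.

```python
# Rows copied from the table above: (datetime, nodata %, cloud cover %)
scenes = [
    ("2020-01-17T18:47:09.024000Z", 0.000000, 33.132727),
    ("2020-01-22T18:46:51.024000Z", 0.000017, 40.740009),
    ("2020-02-11T18:45:11.024000Z", 0.000043, 2.595613),
    ("2020-02-16T18:44:39.024000Z", 0.000056, 10.043734),
    ("2020-01-27T18:46:29.024000Z", 0.000066, 12.765600),
    ("2020-02-06T18:45:39.024000Z", 0.000136, 2.524842),
    ("2020-02-01T18:46:11.024000Z", 0.000169, 1.933268),
]


def pick_sorted(scenes):
    """Lowest nodata first; cloud cover only breaks ties."""
    return min(scenes, key=lambda s: (s[1], s[2]))


def pick_least_cloudy(scenes, max_nodata=20.0):
    """Lowest cloud cover among scenes below the nodata threshold."""
    eligible = [s for s in scenes if s[1] < max_nodata]
    return min(eligible, key=lambda s: s[2])


print(pick_sorted(scenes)[0])        # 2020-01-17T18:47:09.024000Z
print(pick_least_cloudy(scenes)[0])  # 2020-02-01T18:46:11.024000Z
```

Because the nodata values are almost never exactly equal, the two-key sort behaves in practice like a sort on nodata alone, which is why it can return a scene with 33% cloud cover.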

Contributor

Yes, I knew that best_nodata wasn't being used. I was mainly thinking that optimizing for least cloud cover is our best bet, given that the search query already filters on nodata by way of {"op": "<=", "args": [{"property": "s2:nodata_pixel_percentage"}, nodata_pixel_percentage]}. I think this is a good alternative, but in some (maybe many) cases it won't optimize for cloud cover, as this example shows: the first row has a nontrivial share of cloudy pixels (33%).
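For context, the clause quoted above is a CQL2-JSON comparison. A rough sketch of how such threshold clauses can be combined into one filter is shown below. The helper name `build_s2_property_filter` is hypothetical, the `eo:cloud_cover` clause is an assumption based on the function's `cloud_cover_percentage` parameter, and the geometry and date clauses of the real query are left out because they are not visible in this diff.

```python
import json


def build_s2_property_filter(cloud_cover_percentage, nodata_pixel_percentage):
    """Hypothetical helper: AND together two threshold clauses as CQL2-JSON."""
    return {
        "op": "and",
        "args": [
            # Assumed clause: keep scenes at or below the cloud cover limit
            {
                "op": "<=",
                "args": [{"property": "eo:cloud_cover"}, cloud_cover_percentage],
            },
            # Clause quoted in the comment above: the nodata pre-filter
            {
                "op": "<=",
                "args": [
                    {"property": "s2:nodata_pixel_percentage"},
                    nodata_pixel_percentage,
                ],
            },
        ],
    }


print(json.dumps(build_s2_property_filter(50, 20), indent=2))
```

With the nodata threshold enforced in the search itself, every returned scene already satisfies it, so ranking the results by cloud cover alone is enough on the client side.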

Contributor Author

Ok, so we would just pick the least cloudy image then, as long as the nodata_pixel_percentage is below the 20% threshold? I.e. simply remove the best_nodata variable and stick to having just best_clouds?

Contributor

Yes I think the best call is for us to just remove the best_nodata variable 👍

Contributor Author

Done at f65e34a!


s2_items_gdf = s2_items_gdf[s2_items_gdf["eo:cloud_cover"] == best_clouds]
s2_items_gdf = s2_items_gdf.iloc[least_nodata_and_clouds]
s2_items_gdf

# Get the item ID for the filtered Sentinel 2 dataframe containing the best cloud free scene
# Get the datetime for the filtered Sentinel 2 dataframe
# containing the least nodata and least cloudy scene
s2_items_gdf_datetime_id = s2_items_gdf["datetime"]
for item in s2_items:
if item.properties["datetime"] == s2_items_gdf_datetime_id[0]:
@@ -174,7 +191,7 @@ def search_sentinel2(week, aoi, cloud_cover_percentage, nodata_pixel_percentage)
else:
continue

BBOX = s2_items_gdf.iloc[0].geometry.bounds
BBOX = s2_items_gdf.iloc[0].bounds

epsg = s2_item.properties["proj:epsg"]
print("EPSG code based on Sentinel-2 item: ", epsg)
@@ -184,19 +201,24 @@ def search_sentinel2(week, aoi, cloud_cover_percentage, nodata_pixel_percentage)

def search_sentinel1(BBOX, catalog, week):
"""
Search for Sentinel-1 items within a given bounding box (BBOX), STAC catalog, and week.
Search for Sentinel-1 items within a given bounding box (BBOX), STAC
catalog, and week.

Parameters:
- BBOX (tuple): Bounding box coordinates in the format (minx, miny, maxx, maxy).
- BBOX (tuple): Bounding box coordinates in the format
(minx, miny, maxx, maxy).
- catalog (pystac.Catalog): STAC catalog containing Sentinel-1 items.
- week (str): The week in the format 'start_date/end_date'.

Returns:
- pystac.Collection: A collection of Sentinel-1 items filtered by specified conditions.
- pystac.Collection: A collection of Sentinel-1 items filtered by specified
conditions.

Note:
This function retrieves Sentinel-1 items from the catalog that intersect with the given bounding box and fall within the provided time window.
The function filters items based on orbit state and returns the collection of Sentinel-1 items that meet the defined criteria.
This function retrieves Sentinel-1 items from the catalog that intersect
with the given bounding box and fall within the provided time window. The
function filters items based on orbit state and returns the collection of
Sentinel-1 items that meet the defined criteria.
"""

geom_BBOX = box(*BBOX) # Create poly geom object from the bbox
@@ -238,15 +260,18 @@ def search_sentinel1(BBOX, catalog, week):

def search_dem(BBOX, catalog, epsg):
"""
Search for Digital Elevation Model (DEM) items within a given bounding box (BBOX), STAC catalog, week, and Sentinel-2 items.
Search for Copernicus Digital Elevation Model (DEM) items within a given
bounding box (BBOX), STAC catalog, and Sentinel-2 items.

Parameters:
- BBOX (tuple): Bounding box coordinates in the format (minx, miny, maxx, maxy).
- BBOX (tuple): Bounding box coordinates in the format
(minx, miny, maxx, maxy).
- catalog (pystac.Catalog): STAC catalog containing DEM items.
- epsg (int): EPSG code for the coordinate reference system.

Returns:
- pystac.Collection: A collection of Digital Elevation Model (DEM) items filtered by specified conditions.
- pystac.Collection: A collection of Digital Elevation Model (DEM) items
filtered by specified conditions.
"""
search = catalog.search(collections=["cop-dem-glo-30"], bbox=BBOX)
dem_items = search.item_collection()
@@ -261,18 +286,21 @@ def search_dem(BBOX, catalog, epsg):

def make_dataarrays(s2_items, s1_items, dem_items, BBOX, resolution, epsg):
"""
Create xarray DataArrays for Sentinel-2, Sentinel-1, and DEM data.
Create xarray DataArrays for Sentinel-2, Sentinel-1, and Copernicus DEM
data.

Parameters:
- s2_items (list): List of Sentinel-2 items.
- s1_items (list): List of Sentinel-1 items.
- dem_items (list): List of DEM items.
- BBOX (tuple): Bounding box coordinates in the format (minx, miny, maxx, maxy).
- BBOX (tuple): Bounding box coordinates in the format
(minx, miny, maxx, maxy).
- resolution (int): Spatial resolution.
- epsg (int): EPSG code for the coordinate reference system.

Returns:
- tuple: A tuple containing xarray DataArrays for Sentinel-2, Sentinel-1, and DEM.
- tuple: A tuple containing xarray DataArrays for Sentinel-2, Sentinel-1,
and Copernicus DEM.
"""
da_sen2: xr.DataArray = stackstac.stack(
items=s2_items,
@@ -286,7 +314,7 @@ def make_dataarrays(s2_items, s1_items, dem_items, BBOX, resolution, epsg):
)

da_sen1: xr.DataArray = stackstac.stack(
items=s1_items, # To only accept the same orbit state and date. Need better way to do this.
items=s1_items,
assets=["vh", "vv"], # SAR polarizations
epsg=epsg,
bounds_latlon=BBOX, # W, S, E, N
@@ -363,17 +391,22 @@ def make_dataarrays(s2_items, s1_items, dem_items, BBOX, resolution, epsg):

def merge_datarrays(da_sen2, da_sen1, da_dem):
"""
Merge xarray DataArrays for Sentinel-2, Sentinel-1, and DEM.
Merge xarray DataArrays for Sentinel-2, Sentinel-1, and Copernicus DEM.

Parameters:
- da_sen2 (xr.DataArray): xarray DataArray for Sentinel-2 data.
- da_sen1 (xr.DataArray): xarray DataArray for Sentinel-1 data.
- da_dem (xr.DataArray): xarray DataArray for DEM data.
- da_dem (xr.DataArray): xarray DataArray for Copernicus DEM data.

Returns:
- xr.DataArray: Merged xarray DataArray.
"""
# print("Platform variables (S2, S1, DEM): ", da_sen2.platform.values, da_sen1.platform.values, da_dem.platform.values)
# print(
# "Platform variables (S2, S1, DEM): ",
# da_sen2.platform.values,
# da_sen1.platform.values,
# da_dem.platform.values,
# )
# da_sen2 = da_sen2.drop(["platform", "constellation"])
# da_sen1 = da_sen1.drop(["platform", "constellation"])
# da_dem = da_dem.drop(["platform"])
@@ -390,17 +423,21 @@ def process(
year1, year2, aoi, resolution, cloud_cover_percentage, nodata_pixel_percentage
):
"""
Process Sentinel-2, Sentinel-1, and DEM data for a specified time range, area of interest (AOI),
resolution, EPSG code, cloud cover percentage, and nodata pixel percentage.
Process Sentinel-2, Sentinel-1, and Copernicus DEM data for a specified
time range, area of interest (AOI), resolution, EPSG code, cloud cover
percentage, and nodata pixel percentage.

Parameters:
- year1 (int): The starting year of the date range.
- year2 (int): The ending year of the date range.
- aoi (shapely.geometry.base.BaseGeometry): Geometry object for an Area of Interest (AOI).
- aoi (shapely.geometry.base.BaseGeometry): Geometry object for an Area of
Interest (AOI).
- resolution (int): Spatial resolution.
- epsg (int): EPSG code for the coordinate reference system.
- cloud_cover_percentage (int): Maximum acceptable cloud cover percentage for Sentinel-2 images.
- nodata_pixel_percentage (int): Maximum acceptable percentage of nodata pixels in Sentinel-2 images.
- cloud_cover_percentage (int): Maximum acceptable cloud cover percentage
for Sentinel-2 images.
- nodata_pixel_percentage (int): Maximum acceptable percentage of nodata
pixels in Sentinel-2 images.

Returns:
- xr.DataArray: Merged xarray DataArray containing processed data.