Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: save COGs with dask to azure #195

Open
wants to merge 6 commits into
base: develop
Choose a base branch
from

Conversation

wietzesuijker
Copy link
Contributor

@wietzesuijker wietzesuijker commented Dec 13, 2024

This PR introduces Azure Blob Storage support for saving COGs with Dask, complementing the existing AWS S3 support.

  • Implements a MultiPartUpload class in _az.py to handle Azure-specific multipart upload operations, including block staging, finalization, and cancellation.
  • Adds MultiPartUploadBase in _multipart.py to provide a shared interface for multipart uploads, applicable to both Azure and S3 implementations.
  • Updates save_cog_with_dask to recognize Azure URLs and integrate the Azure multipart upload workflow.
  • Adds unit tests in tests/test_az.py using mocked Azure SDK to verify functionality specific to Azure Blob Storage.

Here’s a simple test I ran to verify the Azure implementation:

import dask.distributed
import planetary_computer as pc
from odc.geo.cog._tifffile import save_cog_with_dask
from odc.stac import configure_rio, stac_load
from pystac_client import Client

# Setup Dask client for efficient processing
client = dask.distributed.Client()
configure_rio(cloud_defaults=True, client=client)

# Query STAC API for Sentinel-2 data
catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
query = catalog.search(
    collections=["sentinel-2-l2a"],
    datetime="2019-06",
    query={"s2:mgrs_tile": dict(eq="06VVN")},
)
items = list(query.items())
print(f"Found {len(items)} datasets")

# Load Sentinel-2 data with specific bands
resolution = 40 
xx = stac_load(
    items,
    bands=["SCL"],
    resolution=resolution,
    chunks={"x": 2048, "y": 2048},
    patch_url=pc.sign,
    dtype="uint16",
    nodata=0,
)

# Select a single time slice for saving
to_save = xx.SCL.isel(time=3)

# Azure Blob Storage details
blob_name = f"SCL-{to_save.time.dt.strftime('%Y%m%d').item()}.tif"
azure_url = f"az://{container}/{blob_name}"


result = save_cog_with_dask(
    to_save,
    dst=f"az://{container}/{blob_name}",
    compression="DEFLATE",
    predictor=2,
    overview_resampling="nearest",
    azure={
        "account_url": f"https://{storage_account}.blob.core.windows.net",
        "credential": sas_token,  # can be blob specific or container wide
    },
    blocksize=[512, 512],
)

# Compute the result to trigger the actual upload
result.compute()
print(f"COG successfully uploaded to: {azure_url}")

While implementing this, I noticed a couple of quirks in the original code:

  • Spelling Inconsistencies: Words like finalise vs. finalize were used. I used ise rather than ize.
  • Typing Updates: The code inconsistently uses Dict/List from typing vs. built-ins (dict/list). I've changed it to Dict and List since the dev-env.yml specifies python=3.8 (but I hadn't used python 3.8 in ages).

As this is my first PR here, I look forward to feedback and suggestions for improvement!

@wietzesuijker wietzesuijker force-pushed the feat/save-cog-to-azure branch 3 times, most recently from 9bd48c9 to 7b6b553 Compare December 13, 2024 16:50
@Kirill888
Copy link
Member

@wietzesuijker thanks for the PR.

Can I ask you to undo dict -> Dict, list -> List changes, we no longer aim to support 3.8 going forward as most of our dependencies do not support it either. This also blows up review effort. When making changes like this please try to separate "clean up changes" from "actual changes" and use separate commits for each separate logical unit of work, example:

  • fix spelling
  • mypy harmonization
  • add doc comments
  • implement azure back-end

I'll need to find some time to review this properly and to remember what was the original plan for extending to other blob stores. At this point I'm not convinced we need a base class, but my memory of this code is hazy. In the meantime please

  • undo mypy changes
  • separate changes to doc comments from the "real change"
  • make sure to run black formatter, or do the pre-commit thing
  • push new change history into this PR with --force-with-lease option

odc/geo/cog/_s3.py Outdated Show resolved Hide resolved
tests/test_az.py Outdated Show resolved Hide resolved
@wietzesuijker wietzesuijker force-pushed the feat/save-cog-to-azure branch 2 times, most recently from 00d3ff0 to d80b5cd Compare December 16, 2024 21:41
@wietzesuijker
Copy link
Contributor Author

wietzesuijker commented Dec 16, 2024

Thanks @Kirill888; I addressed your suggestions.

If you're not too familiar with Azure and want to give it a spin, you can:

  • Sign up at Azure Free.
  • In Azure Portal, create a storage account and container.
  • Generate a SAS token in the storage account's Shared access signature settings.
  • Use the storage account URL and SAS token in the PR tests.
  • Delete the resource group after testing.

Thanks again, and great to have your input on this!

odc/geo/cog/_tifffile.py Outdated Show resolved Hide resolved
odc/geo/cog/_az.py Outdated Show resolved Hide resolved
@wietzesuijker wietzesuijker force-pushed the feat/save-cog-to-azure branch 4 times, most recently from 69198e5 to a54072f Compare December 17, 2024 14:38
Copy link

codecov bot commented Dec 17, 2024

Codecov Report

Attention: Patch coverage is 35.86207% with 93 lines in your changes missing coverage. Please review.

Project coverage is 94.05%. Comparing base (9439ecc) to head (9fcc5d5).

Files with missing lines Patch % Lines
odc/geo/cog/_az.py 0.00% 60 Missing ⚠️
odc/geo/cog/_tifffile.py 33.33% 16 Missing ⚠️
odc/geo/cog/_mpu.py 16.66% 10 Missing ⚠️
odc/geo/cog/_multipart.py 89.28% 3 Missing ⚠️
odc/geo/_interop.py 66.66% 2 Missing ⚠️
odc/geo/cog/_s3.py 75.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #195      +/-   ##
===========================================
- Coverage    95.48%   94.05%   -1.43%     
===========================================
  Files           31       33       +2     
  Lines         5532     5649     +117     
===========================================
+ Hits          5282     5313      +31     
- Misses         250      336      +86     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Kirill888
Copy link
Member

Optional Dependency Handling

Example:

# pylint: disable=import-outside-toplevel
# pylint: disable=too-many-lines
if have.rasterio:
from ._compress import compress
from ._map import add_to, explore
from .cog import to_cog, write_cog
from .warp import rio_reproject

The have object is here:

@property
def geopandas(self) -> bool:
return self._check("geopandas")
@property
def dask(self) -> bool:
return self._check("dask")

Add azure to that, then use it instead of try imports.

Tests

We use pytest for tests, here is example of skipping whole file when needed dependencies are not present

import pytest
rasterio = pytest.importorskip("rasterio")
gpd = pytest.importorskip("geopandas")

Packaging

Add optional azure feature here:

odc-geo/setup.cfg

Lines 54 to 60 in 62b7c5e

s3 =
boto3
all =
%(warp)s
%(tiff)s
%(s3)s

Run checks locally

mypy and pylint checks are failing, run them locally or setup actions in your fork, actions on the PR require manual approval from maintainers.

Simplifying Review process

There is now quite a bit of cleanup changes here, let's have those in a separate PR that I can merge, then we can focus on actual change without getting bogged down by all the mypy harmonization changes.

@wietzesuijker wietzesuijker mentioned this pull request Dec 18, 2024
@wietzesuijker
Copy link
Contributor Author

Added a PR for the cleanup. I'll give the rest another push tomorrow (UK time). Hopefully, that will give us enough time to review before the holiday season; if not, we can review again in 2025 :).

@Kirill888
Copy link
Member

@wietzesuijker other PR is merged, and some other changes, please rebase this against main.

@wietzesuijker wietzesuijker force-pushed the feat/save-cog-to-azure branch 2 times, most recently from db0c6f1 to 672944d Compare December 19, 2024 22:47
@wietzesuijker
Copy link
Contributor Author

Thanks @Kirill888! I didn't get as much done as I'd liked. I updated the test for az and started using have.azure and have.s3. Could you review the code to see if I'm on the right track? I might need to git rebase and rewrite the history, but I pushed that issue into tomorrow as I was working on this when it wasn't merged yet.
Thanks again.

odc/geo/_interop.py Outdated Show resolved Hide resolved
tests/test_az.py Outdated Show resolved Hide resolved
@Kirill888
Copy link
Member

@wietzesuijker there is currently a conflict after merging your other PR, can you please rebase this branch against develop branch: git pull --rebase upstream/develop or something along these lines.

@wietzesuijker wietzesuijker force-pushed the feat/save-cog-to-azure branch from 7ffe111 to 0802a9f Compare January 6, 2025 19:17
@wietzesuijker wietzesuijker force-pushed the feat/save-cog-to-azure branch from 0802a9f to bc8811c Compare January 6, 2025 19:24
@Kirill888 Kirill888 force-pushed the feat/save-cog-to-azure branch 2 times, most recently from 9db7b9a to ad98235 Compare January 7, 2025 03:58
@wietzesuijker wietzesuijker force-pushed the feat/save-cog-to-azure branch 4 times, most recently from 395f266 to e1b6187 Compare January 7, 2025 20:45
Kirill888 and others added 2 commits January 8, 2025 11:45
- Restored type hints in `upload` for improved type safety.
- Added `get_mpu_kwargs` to centralize shared keyword arguments.
- Simplified `upload` and `mpu_upload` implementations by reusing `get_mpu_kwargs`.
- Reduced code duplication across `_mpu.py` and `_multipart.py`.
@wietzesuijker wietzesuijker force-pushed the feat/save-cog-to-azure branch from e1b6187 to 9fcc5d5 Compare January 8, 2025 11:45
@wietzesuijker
Copy link
Contributor Author

@Kirill888 I've added some changes to fix non-related mypy errors in 3b0223c and squashed that with your commit.
Also, pylint complained about duplicate code, so I pushed e1b6187.

The tests pass (after a few tries), and the basic example in the description of the PR still works.
If you have any other suggestions, let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants