perf(az): optimize non-recursive _list_dir #447

Closed
wants to merge 4 commits
2 changes: 2 additions & 0 deletions HISTORY.md
@@ -8,6 +8,8 @@
- fix: use native `exists()` method in `GSClient`. (PR [#420](https://github.com/drivendataorg/cloudpathlib/pull/420))
- Enhancement: lazy instantiation of default client (PR [#432](https://github.com/drivendataorg/cloudpathlib/issues/432), Issue [#428](https://github.com/drivendataorg/cloudpathlib/issues/428))
- Adds existence check before downloading in `download_to` (Issue [#430](https://github.com/drivendataorg/cloudpathlib/issues/430), PR [#432](https://github.com/drivendataorg/cloudpathlib/pull/432))
- perf(az): optimize non-recursive _list_dir (Issue [#446](https://github.com/drivendataorg/cloudpathlib/issues/446), PR [#447](https://github.com/drivendataorg/cloudpathlib/pull/447))
- fix(az): improve detection of path types (Issue [#444](https://github.com/drivendataorg/cloudpathlib/issues/444), PR [#447](https://github.com/drivendataorg/cloudpathlib/pull/447))

## v0.18.1 (2024-02-26)

86 changes: 48 additions & 38 deletions cloudpathlib/azure/azblobclient.py
@@ -1,9 +1,8 @@
from datetime import datetime, timedelta
import mimetypes
import os
from pathlib import Path, PurePosixPath
from typing import Any, Callable, Dict, Iterable, Optional, Tuple, Union

from pathlib import Path
from typing import Any, Callable, Dict, Iterable, Optional, Tuple, Union, cast

from ..client import Client, register_client_class
from ..cloudpath import implementation_registry
@@ -20,6 +19,7 @@
BlobProperties,
ContentSettings,
generate_blob_sas,
BlobPrefix,
)
except ModuleNotFoundError:
implementation_registry["azure"].dependencies_loaded = False
@@ -155,7 +155,20 @@
return "dir"

try:
self._get_metadata(cloud_path)
# the result of get_blob_properties is a BlobProperties object (dict mixin)
metadata = cast(BlobProperties, self._get_metadata(cloud_path))

# content_type and content_md5 are never None for files in Azure Blob Storage
# if they are None, then the given path is a directory
# https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob-properties?tabs=microsoft-entra-id#response-headers
is_folder = (
metadata.content_settings.content_type is None
and metadata.content_settings.content_md5 is None
)
Member: Can you point to a more specific reference? I don't see the link you sent supporting the claim that this is definitive for folders.

It does seem like `x-ms-resource-type` is useful to answer the question directly, but only for certain configurations, so it could be included as an optimization. It may be the case that all the scenarios where those vars are None are also ones where `x-ms-resource-type` is available.

Author: I agree the reference is somewhat vague. I was mainly going off of these two statements:

> Content-Type: The content type that's specified for the blob. If no content type is specified, the default content type is application/octet-stream.

> Content-MD5: If the Content-MD5 header has been set for the blob, this response header is returned so that the client can check for message content integrity. In version 2012-02-12 and later, Put Blob sets a block blob's MD5 value even when the Put Blob request doesn't include an MD5 header.

Using this, coupled with our observations from real-life usage, we concluded that these parameters are always None for folders and never None for blobs, even if they are empty.

Member: Does your configuration have the `x-ms-resource-type` header for these folder blobs?

Author: That header does not seem to be returned within BlobProperties, nor can I see it in the in-browser Azure Storage Explorer.

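As an aside to the thread above, here is a minimal, hypothetical sketch of how one might compare the two signals being discussed: the `content_settings` heuristic this PR relies on, and the `x-ms-resource-type` response header (captured here via azure-core's per-request `raw_response_hook`, since it is not surfaced on `BlobProperties`). The account URL, credential, container, and blob name are placeholders, and whether `x-ms-resource-type` is returned at all depends on the storage account configuration.

```python
from azure.storage.blob import BlobServiceClient

captured_headers = {}

def capture_headers(pipeline_response):
    # azure-core passes the raw pipeline response to this hook, which is where
    # the unparsed HTTP response headers are visible
    captured_headers.update(pipeline_response.http_response.headers)

service = BlobServiceClient(
    account_url="https://<account>.blob.core.windows.net",  # placeholder
    credential="<sas-token-or-account-key>",                # placeholder
)
blob_client = service.get_blob_client(container="<container>", blob="<path/that/may/be/a/folder>")

props = blob_client.get_blob_properties(raw_response_hook=capture_headers)

# Heuristic used in this PR: "folder" blobs report neither a content type nor
# an MD5, while real files always carry at least the default content type.
is_folder_by_content_settings = (
    props.content_settings.content_type is None
    and props.content_settings.content_md5 is None
)

# Direct signal, when the service returns it: the x-ms-resource-type header
# ("file" or "directory"); header-name casing can vary, so compare case-insensitively.
resource_type = next(
    (v for k, v in captured_headers.items() if k.lower() == "x-ms-resource-type"),
    None,
)

print(is_folder_by_content_settings, resource_type)
```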

if is_folder:
return "dir"

return "file"
except ResourceNotFoundError:
prefix = cloud_path.blob
@@ -181,17 +194,14 @@
def _list_dir(
self, cloud_path: AzureBlobPath, recursive: bool = False
) -> Iterable[Tuple[AzureBlobPath, bool]]:
# shortcut if listing all available containers
if not cloud_path.container:
if recursive:
raise NotImplementedError(
"Cannot recursively list all containers and contents; you can get all the containers then recursively list each separately."
)

yield from (
(self.CloudPath(f"az://{c.name}"), True)
for c in self.service_client.list_containers()
)
for container in self.service_client.list_containers():
yield self.CloudPath(f"az://{container.name}"), True

if not recursive:
continue

yield from self._list_dir(self.CloudPath(f"az://{container.name}"), recursive=True)

[Codecov / codecov/patch: added line cloudpathlib/azure/azblobclient.py#L204 was not covered by tests]
return

container_client = self.service_client.get_container_client(cloud_path.container)
@@ -200,30 +210,30 @@
if prefix and not prefix.endswith("/"):
prefix += "/"

yielded_dirs = set()

# NOTE: Not recursive may be slower than necessary since it just filters
# the recursive implementation
for o in container_client.list_blobs(name_starts_with=prefix):
# get directory from this path
for parent in PurePosixPath(o.name[len(prefix) :]).parents:
# if we haven't surfaced this directory already
if parent not in yielded_dirs and str(parent) != ".":
# skip if not recursive and this is beyond our depth
if not recursive and "/" in str(parent):
continue

yield (
self.CloudPath(f"az://{cloud_path.container}/{prefix}{parent}"),
True, # is a directory
)
yielded_dirs.add(parent)

# skip file if not recursive and this is beyond our depth
if not recursive and "/" in o.name[len(prefix) :]:
continue

yield (self.CloudPath(f"az://{cloud_path.container}/{o.name}"), False) # is a file
if not recursive:
blobs = container_client.walk_blobs(name_starts_with=prefix)
else:
blobs = container_client.list_blobs(name_starts_with=prefix)

def is_dir(blob: Union[BlobProperties, BlobPrefix]) -> bool:
# walk_blobs returns a BlobPrefix object for directories
# https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobprefix?view=azure-python
if isinstance(blob, BlobPrefix):
return True

[Codecov / codecov/patch: added line cloudpathlib/azure/azblobclient.py#L222 was not covered by tests]

# content_type and content_md5 are never None for files in Azure Blob Storage
# if they are None, then the given path is a directory
# https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob-properties?tabs=microsoft-entra-id#response-headers
return (
blob.content_settings.content_md5 is None
and blob.content_settings.content_type is None
)

for blob in blobs:
# walk_blobs returns folders with a trailing slash
blob_path = blob.name.rstrip("/")
Member: Is it problematic to keep the trailing slashes?

Author: I did it to be consistent with the current behaviour, but we can simply not remove the trailing slashes.

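For context on the trailing-slash question, a hedged illustration of the two listing calls used in this diff; the connection string, container name, and prefix below are placeholders. `walk_blobs()` lists a single level and yields `BlobPrefix` items whose names end with the delimiter, while `list_blobs()` yields every blob under the prefix without trailing slashes.

```python
from azure.storage.blob import ContainerClient, BlobPrefix

container = ContainerClient.from_connection_string(
    "<connection-string>",  # placeholder
    "my-container",         # placeholder
)

# Non-recursive: one level only; virtual directories come back as BlobPrefix
for item in container.walk_blobs(name_starts_with="data/"):
    if isinstance(item, BlobPrefix):
        print("dir :", item.name)   # e.g. "data/subdir/"  (note the trailing slash)
    else:
        print("file:", item.name)   # e.g. "data/report.csv"

# Recursive: a flat listing of every blob under the prefix, files only
for blob in container.list_blobs(name_starts_with="data/"):
    print("file:", blob.name)       # e.g. "data/subdir/nested.csv"
```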
blob_cloud_path = self.CloudPath(f"az://{cloud_path.container}/{blob_path}")
yield blob_cloud_path, is_dir(blob)

def _move_file(
self, src: AzureBlobPath, dst: AzureBlobPath, remove_src: bool = True
61 changes: 43 additions & 18 deletions tests/mock_clients/mock_azureblob.py
@@ -1,4 +1,4 @@
from collections import namedtuple
from collections import namedtuple, defaultdict
from datetime import datetime
from pathlib import Path, PurePosixPath
import shutil
@@ -25,7 +25,12 @@ def __init__(self, *args, **kwargs):
self.tmp_path = Path(self.tmp.name) / "test_case_copy"
shutil.copytree(TEST_ASSETS, self.tmp_path / test_dir)

self.metadata_cache = {}
self.metadata_cache = defaultdict(
lambda: {
"content_type": "application/octet-stream",
"content_md5": "some_md5_hash",
}
)

@classmethod
def from_connection_string(cls, *args, **kwargs):
@@ -77,15 +82,22 @@ def url(self):

def get_blob_properties(self):
path = self.root / self.key
if path.exists() and path.is_file():
if path.exists():
# replicates the behavior of the Azure Blob API
# files always have content_type and content_md5
# directories never have content_type and content_md5
# https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob-properties?tabs=microsoft-entra-id#response-headers
if path.is_file():
metadata = self.service_client.metadata_cache[path]
else:
metadata = {"content_type": None, "content_md5": None}

return BlobProperties(
**{
"name": self.key,
"Last-Modified": datetime.fromtimestamp(path.stat().st_mtime),
"ETag": "etag",
"content_type": self.service_client.metadata_cache.get(
self.root / self.key, None
),
**metadata,
}
)
else:
@@ -114,9 +126,13 @@ def upload_blob(self, data, overwrite, content_settings=None):
path.write_bytes(data.read())

if content_settings is not None:
self.service_client.metadata_cache[self.root / self.key] = (
content_settings.content_type
# content_type and content_md5 are never None for files in Azure Blob Storage
# https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob-properties?tabs=microsoft-entra-id#response-headers
content_settings.content_type = (
content_settings.content_type or "application/octet-stream"
)
content_settings.content_md5 = content_settings.content_md5 or "some_md5_hash"
self.service_client.metadata_cache[path].update(content_settings)


class MockStorageStreamDownloader:
@@ -148,24 +164,31 @@ def exists(self):
def list_blobs(self, name_starts_with=None):
return mock_item_paged(self.root, name_starts_with)

def walk_blobs(self, name_starts_with=None):
return mock_item_paged(self.root, name_starts_with, recursive=False)

def delete_blobs(self, *blobs):
for blob in blobs:
(self.root / blob).unlink()
delete_empty_parents_up_to_root(path=self.root / blob, root=self.root)


def mock_item_paged(root, name_starts_with=None):
items = []

def mock_item_paged(root, name_starts_with=None, recursive=True):
if not name_starts_with:
name_starts_with = ""
for f in root.glob("**/*"):
if (
(not f.name.startswith("."))
and f.is_file()
and (root / name_starts_with) in [f, *f.parents]
):
items.append((PurePosixPath(f), f))

if recursive:
items = [
(PurePosixPath(f), f)
for f in root.glob("**/*")
if (
(not f.name.startswith("."))
and f.is_file()
and (root / name_starts_with) in [f, *f.parents]
)
]
else:
items = [(PurePosixPath(f), f) for f in (root / name_starts_with).iterdir()]

for mocked, local in items:
# BlobProperties
@@ -175,5 +198,7 @@ def mock_item_paged(root, name_starts_with=None):
"name": str(mocked.relative_to(PurePosixPath(root))),
"Last-Modified": datetime.fromtimestamp(local.stat().st_mtime),
"ETag": "etag",
"content_type": "application/octet-stream" if local.is_file() else None,
"content_md5": "some_md5_hash" if not local.is_file() else None,
}
)
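
A small, self-contained illustration of the `defaultdict` pattern the mock above relies on (the keys and values here are made up for the example): a blob path that was never uploaded with explicit content settings still reports the Azure defaults, while an uploaded blob keeps whatever was set for it.

```python
from collections import defaultdict

metadata_cache = defaultdict(
    lambda: {
        "content_type": "application/octet-stream",
        "content_md5": "some_md5_hash",
    }
)

# Simulate an upload that sets an explicit content type
metadata_cache["container/report.csv"].update({"content_type": "text/csv"})

print(metadata_cache["container/report.csv"]["content_type"])          # text/csv
print(metadata_cache["container/never_uploaded.bin"]["content_type"])  # application/octet-stream
```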