perf(az): optimize non-recursive _list_dir #447

Closed
M0dEx wants to merge 4 commits

@M0dEx M0dEx commented Jul 2, 2024

As described in #446, the current implementation of _list_dir in the AzureBlobClient can be painfully slow for directories with a high level of nesting, even if the result is supposed to be non-recursive.

This PR significantly improves performance by using the delimiter-based hierarchical listing in the Azure Blob Storage SDK, specifically the ContainerClient.walk_blobs function, which allows much faster non-recursive iteration over the contents of a directory. Additionally, it is now possible to iterate recursively over the root of an account, that is, over all of its containers.
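To illustrate the idea, here is a minimal sketch of a non-recursive listing built on walk_blobs. It is not the code from this PR: the helper name list_dir_shallow is hypothetical, and the sketch assumes a container client object that exposes walk_blobs(name_starts_with=..., delimiter=...) the way azure-storage-blob's ContainerClient does. With a delimiter, the listing returns one entry per immediate child, so the cost should scale with the number of direct children rather than with everything nested below them.

```python
from typing import Any, Iterator, Tuple


def list_dir_shallow(container_client: Any, prefix: str = "") -> Iterator[Tuple[str, bool]]:
    """Yield (path, is_dir) for the immediate children of `prefix`.

    `container_client` is assumed to behave like azure-storage-blob's
    ContainerClient; this helper itself is hypothetical.
    """
    # Make sure the prefix names a "directory" so siblings such as
    # "data2/..." are not matched when listing "data".
    if prefix and not prefix.endswith("/"):
        prefix += "/"

    for item in container_client.walk_blobs(
        name_starts_with=prefix or None, delimiter="/"
    ):
        # With a delimiter, nested content is collapsed into a single
        # prefix entry whose name ends with the delimiter.
        is_dir = item.name.endswith("/")
        yield item.name.rstrip("/"), is_dir
```

A call such as list(list_dir_shallow(client, "data")) would then yield the files and sub-folders directly under data/ without descending into them.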

Furthermore, the detection of whether a blob is a file or a directory has been improved, using content_type and content_md5 properties as described by the Azure Blob API.

Closes #446.
Closes #444.


Performance before:

                                  Performance suite results: (2024-07-02T23:25:47.219870)                                  
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Test Name      ┃ Config Name                ┃ Iterations ┃           Mean ┃              Std ┃            Max ┃ N Items ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ List Folders   │ List shallow recursive     │         10 │ 0:00:01.153302 │ ± 0:00:00.058301 │ 0:00:01.305643 │   5,500 │
│ List Folders   │ List shallow non-recursive │         10 │ 0:00:01.131045 │ ± 0:00:00.019068 │ 0:00:01.168278 │   5,500 │
│ List Folders   │ List normal recursive      │         10 │ 0:00:01.549334 │ ± 0:00:00.016066 │ 0:00:01.581447 │   7,877 │
│ List Folders   │ List normal non-recursive  │         10 │ 0:00:01.463014 │ ± 0:00:00.022756 │ 0:00:01.503776 │     113 │
│ List Folders   │ List deep recursive        │         10 │ 0:00:02.002217 │ ± 0:00:00.032541 │ 0:00:02.059980 │   7,955 │
│ List Folders   │ List deep non-recursive    │         10 │ 0:00:02.002843 │ ± 0:00:00.021002 │ 0:00:02.038868 │      31 │
│ Glob scenarios │ Glob shallow recursive     │         10 │ 0:00:01.233810 │ ± 0:00:00.021087 │ 0:00:01.263742 │   5,500 │
│ Glob scenarios │ Glob shallow non-recursive │         10 │ 0:00:01.223638 │ ± 0:00:00.022161 │ 0:00:01.255687 │   5,500 │
│ Glob scenarios │ Glob normal recursive      │         10 │ 0:00:01.693786 │ ± 0:00:00.017944 │ 0:00:01.719750 │   7,272 │
│ Glob scenarios │ Glob normal non-recursive  │         10 │ 0:00:01.482890 │ ± 0:00:00.013313 │ 0:00:01.495936 │      12 │
│ Glob scenarios │ Glob deep recursive        │         10 │ 0:00:02.246799 │ ± 0:00:00.018432 │ 0:00:02.277186 │   7,650 │
│ Glob scenarios │ Glob deep non-recursive    │         10 │ 0:00:02.019131 │ ± 0:00:00.018071 │ 0:00:02.046828 │      25 │
│ Walk scenarios │ Walk shallow               │         10 │ 0:00:01.130795 │ ± 0:00:00.022412 │ 0:00:01.171678 │   5,500 │
│ Walk scenarios │ Walk normal                │         10 │ 0:00:01.627434 │ ± 0:00:00.018168 │ 0:00:01.650090 │   7,272 │
│ Walk scenarios │ Walk deep                  │         10 │ 0:00:02.088680 │ ± 0:00:00.044334 │ 0:00:02.156224 │   7,650 │
└────────────────┴────────────────────────────┴────────────┴────────────────┴──────────────────┴────────────────┴─────────┘

Performance after:

                                  Performance suite results: (2024-07-03T09:40:35.151549)                                  
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Test Name      ┃ Config Name                ┃ Iterations ┃           Mean ┃              Std ┃            Max ┃ N Items ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ List Folders   │ List shallow recursive     │         10 │ 0:00:01.148837 │ ± 0:00:00.034477 │ 0:00:01.218684 │   5,500 │
│ List Folders   │ List shallow non-recursive │         10 │ 0:00:01.212517 │ ± 0:00:00.033842 │ 0:00:01.276382 │   5,500 │
│ List Folders   │ List normal recursive      │         10 │ 0:00:01.532350 │ ± 0:00:00.017394 │ 0:00:01.550038 │   7,272 │
│ List Folders   │ List normal non-recursive  │         10 │ 0:00:00.023452 │ ± 0:00:00.009190 │ 0:00:00.049427 │     113 │
│ List Folders   │ List deep recursive        │         10 │ 0:00:01.707918 │ ± 0:00:00.036464 │ 0:00:01.782752 │   7,650 │
│ List Folders   │ List deep non-recursive    │         10 │ 0:00:00.028419 │ ± 0:00:00.004629 │ 0:00:00.039337 │      31 │
│ Glob scenarios │ Glob shallow recursive     │         10 │ 0:00:01.261963 │ ± 0:00:00.034387 │ 0:00:01.343164 │   5,500 │
│ Glob scenarios │ Glob shallow non-recursive │         10 │ 0:00:01.299316 │ ± 0:00:00.024798 │ 0:00:01.353754 │   5,500 │
│ Glob scenarios │ Glob normal recursive      │         10 │ 0:00:01.677940 │ ± 0:00:00.021779 │ 0:00:01.717995 │   7,272 │
│ Glob scenarios │ Glob normal non-recursive  │         10 │ 0:00:00.022214 │ ± 0:00:00.000882 │ 0:00:00.023246 │      12 │
│ Glob scenarios │ Glob deep recursive        │         10 │ 0:00:01.900239 │ ± 0:00:00.023626 │ 0:00:01.947988 │   7,650 │
│ Glob scenarios │ Glob deep non-recursive    │         10 │ 0:00:00.026393 │ ± 0:00:00.000928 │ 0:00:00.028201 │      25 │
│ Walk scenarios │ Walk shallow               │         10 │ 0:00:01.148723 │ ± 0:00:00.019138 │ 0:00:01.190942 │   5,500 │
│ Walk scenarios │ Walk normal                │         10 │ 0:00:01.555491 │ ± 0:00:00.016058 │ 0:00:01.580223 │   7,272 │
│ Walk scenarios │ Walk deep                  │         10 │ 0:00:01.707682 │ ± 0:00:00.017401 │ 0:00:01.733078 │   7,650 │
└────────────────┴────────────────────────────┴────────────┴────────────────┴──────────────────┴────────────────┴─────────┘
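As a rough reading aid, the speedup for the four non-recursive normal and deep cases can be computed from the Mean columns of the two tables above. The snippet below is only a sketch: the helper name parse_duration is hypothetical, and the values are copied by hand from the tables. The resulting factors are on the order of 60 to 80 times.

```python
from datetime import timedelta


def parse_duration(text: str) -> timedelta:
    """Parse an 'H:MM:SS.ffffff' string as printed by the performance suite."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


# (mean before, mean after), copied from the two tables above.
MEANS = {
    "List normal non-recursive": ("0:00:01.463014", "0:00:00.023452"),
    "List deep non-recursive": ("0:00:02.002843", "0:00:00.028419"),
    "Glob normal non-recursive": ("0:00:01.482890", "0:00:00.022214"),
    "Glob deep non-recursive": ("0:00:02.019131", "0:00:00.026393"),
}

# Dividing one timedelta by another gives a plain float ratio.
speedups = {
    name: parse_duration(before) / parse_duration(after)
    for name, (before, after) in MEANS.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster")
```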

@pjbull
Member

pjbull commented Jul 2, 2024

Thanks, @M0dEx. Could you also run before and after perf numbers for Azure?

You'll need to run this command but with az instead of s3:
https://github.com/drivendataorg/cloudpathlib/blob/master/Makefile#L89-L90

You should get a table you can paste into the PR like this:
https://cloudpathlib.drivendata.org/stable/contributing/#performance-testing

@M0dEx
Author

M0dEx commented Jul 3, 2024

Added performance tests before and after.

@jayqi
Member

jayqi commented Jul 3, 2024

Looks like performance is generally the same, with the non-recursive normal and deep cases being much faster for List and Glob, as advertised. 👍

@M0dEx
Author

M0dEx commented Jul 3, 2024

Pushed a fix for the failing tests (apparently caused by the |= operator not working as expected in Python 3.8).

@M0dEx
Author

M0dEx commented Jul 3, 2024

Removed the | union as well, since it is only supported from Python 3.10 onward.
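For context, here is a minimal sketch of spellings that also run on Python 3.8. The thread does not say which objects the operators were applied to, so this assumes dict merging (| and |= on dict arrived in Python 3.9) and union type annotations (X | Y arrived in Python 3.10). The names merge_settings and PathOrNone are hypothetical and do not come from the PR.

```python
from typing import Dict, Optional, Union


def merge_settings(
    base: Dict[str, str], extra: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Return a new dict with `extra` layered over `base`."""
    merged = dict(base)         # instead of: merged = base | extra   (3.9+)
    merged.update(extra or {})  # instead of: merged |= extra         (3.9+)
    return merged


# Instead of the annotation `str | None`, which needs Python 3.10+
# when evaluated at runtime.
PathOrNone = Union[str, None]
```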

codecov bot commented Jul 3, 2024

Codecov Report

Attention: Patch coverage is 90.90909% with 2 lines in your changes missing coverage. Please review.

Project coverage is 93.3%. Comparing base (08b018b) to head (2783e38).
Report is 5 commits behind head on master.

Files with missing lines Patch % Lines
cloudpathlib/azure/azblobclient.py 90.9% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master    #447     +/-   ##
========================================
- Coverage    93.7%   93.3%   -0.5%     
========================================
  Files          23      23             
  Lines        1654    1659      +5     
========================================
- Hits         1551    1548      -3     
- Misses        103     111      +8     
Files with missing lines Coverage Δ
cloudpathlib/azure/azblobclient.py 93.5% <90.9%> (-1.3%) ⬇️

... and 3 files with indirect coverage changes

Member

@pjbull pjbull left a comment

I've got a few questions about the approach here. Preferably we can use a more explicitly supported version of things.

cloudpathlib/azure/azblobclient.py

    is_folder = (
        metadata.content_settings.content_type is None
        and metadata.content_settings.content_md5 is None
    )
Member

Can you point to a more specific reference? I don't see the link you sent supporting the claim that this is definitive for folders.

It does seem like x-ms-resource-type is useful to answer the question directly, but only for certain configurations so could be included as an optimization. It may be the case that all the scenarios where those vars are none are also ones where x-ms-resource-type is available.

Author

I agree the reference is somewhat vague. I was mainly going off of these two statements:

Content-Type: The content type that's specified for the blob. If no content type is specified, the default content type is application/octet-stream.

Content-MD5: If the Content-MD5 header has been set for the blob, this response header is returned so that the client can check for message content integrity.
In version 2012-02-12 and later, Put Blob sets a block blob’s MD5 value even when the Put Blob request doesn’t include an MD5 header.

Using this, coupled with our observations from real-life usage, we concluded that these parameters are always None for folders and never None for blobs, even if they are empty.
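The heuristic described above can be sketched in isolation as follows. This is an illustration only: ContentSettings here is a stand-in dataclass with just the two fields discussed, not the SDK class, and looks_like_folder is a hypothetical name. The rule itself rests on the observations reported in this thread, not on a documented guarantee.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContentSettings:
    """Stand-in for a blob's content settings (only the fields used here)."""

    content_type: Optional[str] = None
    content_md5: Optional[bytes] = None


def looks_like_folder(settings: ContentSettings) -> bool:
    """Treat an entry as a folder when both properties are missing.

    Assumption from the thread: real blobs, even empty ones, carry a
    content type and an MD5, while folder entries carry neither.
    """
    return settings.content_type is None and settings.content_md5 is None


# A folder-like entry with neither property set.
folder_entry = ContentSettings()

# An empty file still has a default content type and an MD5 of b"".
empty_file = ContentSettings(
    content_type="application/octet-stream",
    content_md5=hashlib.md5(b"").digest(),
)
```

Under these assumptions, looks_like_folder(folder_entry) is true and looks_like_folder(empty_file) is false. An entry with only one of the two properties set is treated as a file.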

Member

Does your configuration have the x-ms-resource-type header for these folder blobs?

Author

That header does not seem to be returned within BlobProperties, nor can I see it in the in-browser Azure storage explorer.


    for blob in blobs:
        # walk_blobs returns folders with a trailing slash
        blob_path = blob.name.rstrip("/")
Member

Is it problematic to keep the trailing slashes?

Author

I did it to be consistent with the current behaviour, but we can simply not remove the trailing slashes.

@pjbull pjbull mentioned this pull request Jul 17, 2024
4 tasks
@pjbull
Copy link
Member

pjbull commented Aug 28, 2024

Incorporated into #453, thanks!

@pjbull pjbull closed this Aug 28, 2024
3 participants