
DM-42704: Support multiple S3 endpoints #82

Merged: 5 commits into main from tickets/DM-42704 on Feb 2, 2024

Conversation

dhirving (Contributor) commented Jan 29, 2024

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes


codecov bot commented Jan 31, 2024

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison: base (fe12363) 86.79% vs. head (fc81205) 86.99%.

Files                         Patch %   Lines
python/lsst/resources/s3.py   86.11%    2 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #82      +/-   ##
==========================================
+ Coverage   86.79%   86.99%   +0.20%     
==========================================
  Files          27       27              
  Lines        4232     4330      +98     
  Branches      860      880      +20     
==========================================
+ Hits         3673     3767      +94     
- Misses        416      418       +2     
- Partials      143      145       +2     


dhirving force-pushed the tickets/DM-42704 branch 4 times, most recently from 21094e2 to 3dc6344, on January 31, 2024 at 22:22
Allow S3 URLs in the form "s3://profile@bucket/...", with profiles configured via environment variables LSST_RESOURCES_S3_PROFILE_<profile>.  This allows users to access multiple S3 services simultaneously.
Fix an issue where multi-tenant Ceph S3 bucket names would not be parsed correctly, since they include a colon.
The new version of Black generates formatting that flake8 disagrees with.  Ruff refuses to let you add noqa lines to disable flake8 checks if it feels that the noqa lines are unnecessary.  So one of the three tools has to go.
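
For illustration, a minimal usage sketch of the profile mechanism described in the first commit message above. The profile name, bucket, and endpoint URL are made up, and the exact format accepted in the environment variable value is an assumption for this example; the ResourcePath calls use the existing lsst.resources API.

import os

from lsst.resources import ResourcePath

# Hypothetical profile named "embargo"; the value here carries only an
# endpoint URL, so credentials are resolved through boto3's normal lookup.
os.environ["LSST_RESOURCES_S3_PROFILE_embargo"] = "https://s3.example.org"

# The profile is selected through the "username" slot of the S3 URI.
path = ResourcePath("s3://embargo@my-bucket/some/key.txt")
print(path.exists())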
dhirving marked this pull request as ready for review on January 31, 2024 at 22:55
endpoint = os.environ.get(var_name, None)
if not endpoint:
    raise RuntimeError(
        f"No configuration found for requested S3 profile '{profile}'."
Contributor:

boto3 can retrieve profiles from the AWS_SHARED_CREDENTIALS_FILE if all of them have the same S3_ENDPOINT_URL. I think we need to allow this capability.

dhirving (Author):

So you're saying you want support for passing the profile name to boto for looking up the credentials? As-is it will already retrieve the default credentials from there if you specify only the endpoint URL as either LSST_RESOURCES_S3_PROFILE_profile=http://onlytheendpoint-with-no-credentials.com or S3_ENDPOINT_URL=http://onlytheendpoint.com.

Contributor:

I was thinking that since we can already have multiple credentials in the AWS_SHARED_CREDENTIALS_FILE, we should continue to use them, especially if they all share a single S3_ENDPOINT_URL but even if they do not. So if the LSST_RESOURCES_S3_PROFILE_{profile} env var doesn't exist, or if it lacks {access}:{secret}, pass the profile name to boto.

dhirving (Author):

OK, easy enough

Per KT request, also allow S3 profile credentials to be read from AWS credentials files using boto3's built-in lookup logic.
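
A rough sketch of that lookup order (an illustration only, not the code merged in this PR; the helper name and the assumed "access:secret@endpoint" value format are invented for the example):

import os

import boto3
from urllib3.util import parse_url


def client_for_profile(profile: str | None):
    # Hypothetical helper illustrating the agreed precedence.
    endpoint_url = os.environ.get("S3_ENDPOINT_URL")
    if profile is None:
        return boto3.client("s3", endpoint_url=endpoint_url)

    env_value = os.environ.get(f"LSST_RESOURCES_S3_PROFILE_{profile}")
    if env_value:
        url = parse_url(env_value)
        if url.auth:
            # Assumed value format: "https://<access>:<secret>@<endpoint>".
            access, _, secret = url.auth.partition(":")
            endpoint = f"{url.scheme}://{url.host}" + (f":{url.port}" if url.port else "")
            return boto3.client(
                "s3",
                endpoint_url=endpoint,
                aws_access_key_id=access,
                aws_secret_access_key=secret,
            )
        # Endpoint only: remember it and let boto3 supply the credentials.
        endpoint_url = env_value

    # Env var missing or lacking credentials: pass the profile name to boto3
    # so it can read the AWS_SHARED_CREDENTIALS_FILE.
    return boto3.Session(profile_name=profile).client("s3", endpoint_url=endpoint_url)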
timj (Member) left a comment:

Looks great. My main comment (other than caching) concerns these public properties that are really private: should we make them completely private, or add public username/hostname to all of ResourcePath?


@property
def bucket(self) -> str:
    split = self._uri.netloc.split("@")
Member:

You aren't using self._uri.hostname because you are worried that two @ might be in the netloc? An alternative is to check for an @ in username.

dhirving (Author):

No, because it doesn't work for Ceph multi-tenant buckets ("tenant:bucket"). The part after the colon is parsed as the port number, and urllib throws an exception when you try to access the port if it isn't numeric.
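
To spell out the failure mode with a standalone illustration (the URI below is made up; this is not code from the PR):

from urllib.parse import urlparse

# Ceph multi-tenant bucket names contain a colon: "tenant:bucket".
uri = urlparse("s3://myprofile@tenant:bucket/key")
print(uri.netloc)  # myprofile@tenant:bucket
try:
    uri.port  # urllib tries to treat "bucket" as a port number
except ValueError as exc:
    print(exc)  # Port could not be cast to integer value ...

# Splitting the netloc on "@" keeps the full bucket name intact.
profile, _, bucket = uri.netloc.partition("@")
print(profile, bucket)  # myprofile tenant:bucket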

Member:

Can you add a comment about that to the code so the next person who looks isn't confused?

Member:

After a week of looking at profiles in ResourcePath I'm a bit paranoid, although cached_property wouldn't give us much here because the real gain would come from using lru_cache for the netloc parsing outside of the class (since we likely only use a couple of S3 netloc values at most).

For the record, adding cached_property with this code makes it run 4 times faster than not using cached_property (but it's 115 ns vs 20 ns).
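
A minimal sketch of the module-level lru_cache idea mentioned above (the function name is hypothetical and this is not the PR's code):

from functools import lru_cache


@lru_cache(maxsize=32)
def _split_s3_netloc(netloc: str) -> tuple[str | None, str]:
    # Parse each distinct S3 netloc only once; with only a couple of distinct
    # netlocs in practice, every later call is a dictionary lookup.
    profile, separator, bucket = netloc.partition("@")
    if not separator:
        return None, netloc
    return profile, bucket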

def profile(self) -> str | None:
    return self._uri.username

@property
Member:

Not @cached_property or something? We only need to do the split and check the first time.

dhirving (Author):

Why would we cache this? It does like zero work, and all of its consumers go off and do network I/O immediately after calling this.

Additionally, I can't imagine that the logic in cached_property for doing the caching is much faster than re-doing the split of a 60 byte string.

    return getS3Client(self.profile)

@property
def profile(self) -> str | None:
Member:

Docstring with a quick description (one line is fine). We have to hope that people won't start thinking these two properties are public and rely on them, even though they are scheme-dependent. We could make them public like ParseResult does by calling them hostname and username and then not being specific to this scheme at all. Or what do you think about calling these _profile and _bucket?

dhirving (Author):

I'm fine with _profile and _bucket, I'll change it.

getS3Client()
profiles = set[str | None]()
for path in uris:
    if path.scheme == "s3":
Member:

In theory the base class already enforces this when it calls _mexists.

    if not bucket:
        raise ValueError(f"S3 URI does not include bucket name: '{str(self)}'")

    return bucket

@classmethod
def _mexists(cls, uris: Iterable[ResourcePath]) -> dict[ResourcePath, bool]:
    # Force client to be created before creating threads.
Member:

Suggested change:
- # Force client to be created before creating threads.
+ # Force client to be created for each profile before creating threads.

from unittest.mock import patch

from botocore.exceptions import ClientError
from botocore.handlers import validate_bucket_name
from deprecated.sphinx import deprecated
from urllib3.exceptions import HTTPError, RequestError
from urllib3.util import Url, parse_url
Member:

Are we meant to be using parse_url rather than urlparse these days?

dhirving (Author):

I just used it because we already have a dependency on it and it has more useful semantics for what I was trying to do. The equivalent code for getting the username/password part and reconstructing the URL without it is a lot uglier with urllib.parse because it treats the entire netloc as a single unit.
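
For concreteness, a side-by-side sketch of the difference with a made-up URL (not code from this PR):

from urllib.parse import urlparse

from urllib3.util import parse_url

# urllib3: the auth component is a separate field, so dropping it is one step.
u = parse_url("https://access:secret@s3.example.org:9000/bucket")
print(u.auth)                     # access:secret
print(u._replace(auth=None).url)  # https://s3.example.org:9000/bucket

# urllib.parse: auth lives inside netloc, so it has to be stripped by hand.
p = urlparse("https://access:secret@s3.example.org:9000/bucket")
netloc = p.hostname + (f":{p.port}" if p.port else "")
print(p._replace(netloc=netloc).geturl())  # https://s3.example.org:9000/bucket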

Add _ prefix to bucket and profile properties in S3ResourcePath
dhirving merged commit cda7a56 into main on Feb 2, 2024
16 checks passed
dhirving deleted the tickets/DM-42704 branch on February 2, 2024 at 16:19