-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-42704: Support multiple S3 endpoints #82
Conversation
521ae80
to
5ca8aa9
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #82 +/- ##
==========================================
+ Coverage 86.79% 86.99% +0.20%
==========================================
Files 27 27
Lines 4232 4330 +98
Branches 860 880 +20
==========================================
+ Hits 3673 3767 +94
- Misses 416 418 +2
- Partials 143 145 +2 ☔ View full report in Codecov by Sentry. |
21094e2
to
3dc6344
Compare
Allow S3 URLs in the form "s3://profile@bucket/...", with profiles configured via environment variables LSST_RESOURCES_S3_PROFILE_<profile>. This allows users to access multiple S3 services simultaneously.
Fix an issue where multi-tenant ceph S3 bucket names would not be correctly parsed, since they include a colon.
The new version of Black generates formatting that flake8 disagrees with. Ruff refuses to let you add noqa lines to disable flake8 checks if it feels that the noqa lines are unnecessary. So one of the three tools has to go.
3dc6344
to
82b9fb9
Compare
python/lsst/resources/s3utils.py
Outdated
endpoint = os.environ.get(var_name, None) | ||
if not endpoint: | ||
raise RuntimeError( | ||
f"No configuration found for requested S3 profile '{profile}'." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
boto3 can retrieve profiles from the AWS_SHARED_CREDENTIALS_FILE
if all of them have the same S3_ENDPOINT_URL
. I think we need to allow this capability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you're saying you want support for passing the profile name to boto for looking up the credentials? As-is it will already retrieve the default credentials from there if you specify only the endpoint URL as either LSST_RESOURCES_S3_PROFILE_profile=http://onlytheendpoint-with-no-credentials.com
or S3_ENDPOINT_URL=http://onlytheendpoint.com
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that since we already can have multiple credentials in the AWS_SHARED_CREDENTIALS_FILE
that we should continue to use them, especially if they all share a single S3_ENDPOINT_URL
but even if they do not. So if the LSST_RESOURCES_S3_PROFILE_{profile}
env var doesn't exist, or if it lacks {access}:{secret}
, pass the profile name to boto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, easy enough
Per KT request, also allow S3 profile credentials to be read from AWS credentials files using boto3's built-in lookup logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. My main comment (other than caching) is these public properties that are really private and whether we should either make them completely private or add public username/hostname to all of ResourcePath.
|
||
@property | ||
def bucket(self) -> str: | ||
split = self._uri.netloc.split("@") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You aren't using self._uri.hostname
because you are worried that two @
might be in the netloc? Alternative is to check for an @
in username.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because it doesn't work for ceph multi-tenant buckets "tenant:bucket". The part after the colon is parsed as the port number, and urllib throws an exception when you try to access the port if it isn't numeric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment about that to the code so the next person who looks isn't confused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a week of looking at profiles in ResourcePath
I'm a bit paranoid, although cached_property
wouldn't give us much here because the real gain would come from using lru_cache for the netloc parsing outside of the class (since we likely only use a couple of S3 netloc values at most.
For the record adding cached_property with this code makes it run 4 times faster than not using cached_property (but it's 115ns vs 20ns).
def profile(self) -> str | None: | ||
return self._uri.username | ||
|
||
@property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not @cached_property
or something? We only need to do the split and check the first time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we cache this? It does like zero work, and all of its consumers go off and do network I/O immediately after calling this.
Additionally, I can't imagine that the logic in cached_property for doing the caching is much faster than re-doing the split of a 60 byte string.
python/lsst/resources/s3.py
Outdated
return getS3Client(self.profile) | ||
|
||
@property | ||
def profile(self) -> str | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstring with a quick description (one line is fine). We have to hope that people won't start thinking these two properties are public and rely on them, even though they are scheme-dependent. We could make them public like ParseResult
does by calling them hostname
and username
and then not being specific to this scheme at all. Or what do you think about calling these _profile
and _bucket
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with _profile and _bucket, I'll change it.
getS3Client() | ||
profiles = set[str | None]() | ||
for path in uris: | ||
if path.scheme == "s3": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory the base class already enforces this when it calls _mexists
.
python/lsst/resources/s3.py
Outdated
if not bucket: | ||
raise ValueError(f"S3 URI does not include bucket name: '{str(self)}'") | ||
|
||
return bucket | ||
|
||
@classmethod | ||
def _mexists(cls, uris: Iterable[ResourcePath]) -> dict[ResourcePath, bool]: | ||
# Force client to be created before creating threads. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Force client to be created before creating threads. | |
# Force client to be created for each profile before creating threads. |
from unittest.mock import patch | ||
|
||
from botocore.exceptions import ClientError | ||
from botocore.handlers import validate_bucket_name | ||
from deprecated.sphinx import deprecated | ||
from urllib3.exceptions import HTTPError, RequestError | ||
from urllib3.util import Url, parse_url |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we meant to be using parse_url
rather than urlparse
these days?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just used it because we already have a dependency on it and it has more useful semantics for what I was trying to do... the equivalent code for getting the username/password part and reconstructing the URL without it is a lot uglier with urllib.parse because it treats the entire netloc as a single unit.
Add _ prefix to bucket and profile properties in S3ResourcePath
8bf2de8
to
fc81205
Compare
Checklist
doc/changes