
DM-42704: Support multiple S3 endpoints #82

Merged: 5 commits into main from tickets/DM-42704 on Feb 2, 2024

Conversation

dhirving (Contributor) commented Jan 29, 2024

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes


codecov bot commented Jan 31, 2024

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison: base (fe12363) 86.79% vs. head (fc81205) 86.99%.

Files                         Patch %   Lines
python/lsst/resources/s3.py   86.11%    2 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #82      +/-   ##
==========================================
+ Coverage   86.79%   86.99%   +0.20%     
==========================================
  Files          27       27              
  Lines        4232     4330      +98     
  Branches      860      880      +20     
==========================================
+ Hits         3673     3767      +94     
- Misses        416      418       +2     
- Partials      143      145       +2     


dhirving force-pushed the tickets/DM-42704 branch 4 times, most recently from 21094e2 to 3dc6344, on January 31, 2024 at 22:22
Allow S3 URLs in the form "s3://profile@bucket/...", with profiles configured via environment variables LSST_RESOURCES_S3_PROFILE_<profile>.  This allows users to access multiple S3 services simultaneously.
Fix an issue where multi-tenant Ceph S3 bucket names would not be parsed correctly, since they include a colon.
The new version of Black generates formatting that flake8 disagrees with.  Ruff refuses to let you add noqa lines to disable flake8 checks if it feels that the noqa lines are unnecessary.  So one of the three tools has to go.
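
For illustration, a minimal usage sketch of the profile mechanism described in the first commit message above. The profile name, bucket, and endpoint URL are made up, and the exact format accepted in the environment variable value is an assumption for this example; the ResourcePath calls use the existing lsst.resources API.

import os

from lsst.resources import ResourcePath

# Hypothetical profile named "embargo"; the value here carries only an
# endpoint URL, so credentials are resolved through boto3's normal lookup.
os.environ["LSST_RESOURCES_S3_PROFILE_embargo"] = "https://s3.example.org"

# The profile is selected through the "username" slot of the S3 URI.
path = ResourcePath("s3://embargo@my-bucket/some/key.txt")
print(path.exists())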
dhirving marked this pull request as ready for review on January 31, 2024 at 22:55
endpoint = os.environ.get(var_name, None)
if not endpoint:
    raise RuntimeError(
        f"No configuration found for requested S3 profile '{profile}'."
Contributor:

boto3 can retrieve profiles from the AWS_SHARED_CREDENTIALS_FILE if all of them have the same S3_ENDPOINT_URL. I think we need to allow this capability.

dhirving (Author):

So you're saying you want support for passing the profile name to boto for looking up the credentials? As-is it will already retrieve the default credentials from there if you specify only the endpoint URL as either LSST_RESOURCES_S3_PROFILE_profile=http://onlytheendpoint-with-no-credentials.com or S3_ENDPOINT_URL=http://onlytheendpoint.com.

Contributor:

I was thinking that since we can already have multiple credentials in the AWS_SHARED_CREDENTIALS_FILE, we should continue to use them, especially if they all share a single S3_ENDPOINT_URL but even if they do not. So if the LSST_RESOURCES_S3_PROFILE_{profile} env var doesn't exist, or if it lacks {access}:{secret}, pass the profile name to boto.

dhirving (Author):

OK, easy enough

Per KT request, also allow S3 profile credentials to be read from AWS credentials files using boto3's built-in lookup logic.
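
A rough sketch of that lookup order (an illustration only, not the code merged in this PR; the helper name and the assumed "access:secret@endpoint" value format are invented for the example):

import os

import boto3
from urllib3.util import parse_url


def client_for_profile(profile: str | None):
    # Hypothetical helper illustrating the agreed precedence.
    endpoint_url = os.environ.get("S3_ENDPOINT_URL")
    if profile is None:
        return boto3.client("s3", endpoint_url=endpoint_url)

    env_value = os.environ.get(f"LSST_RESOURCES_S3_PROFILE_{profile}")
    if env_value:
        url = parse_url(env_value)
        if url.auth:
            # Assumed value format: "https://<access>:<secret>@<endpoint>".
            access, _, secret = url.auth.partition(":")
            endpoint = f"{url.scheme}://{url.host}" + (f":{url.port}" if url.port else "")
            return boto3.client(
                "s3",
                endpoint_url=endpoint,
                aws_access_key_id=access,
                aws_secret_access_key=secret,
            )
        # Endpoint only: remember it and let boto3 supply the credentials.
        endpoint_url = env_value

    # Env var missing or lacking credentials: pass the profile name to boto3
    # so it can read the AWS_SHARED_CREDENTIALS_FILE.
    return boto3.Session(profile_name=profile).client("s3", endpoint_url=endpoint_url)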
timj (Member) left a comment:

Looks great. My main comment (other than caching) concerns these public properties that are really private: should we make them completely private, or add public username/hostname to all of ResourcePath?


@property
def bucket(self) -> str:
    split = self._uri.netloc.split("@")
Member:

You aren't using self._uri.hostname because you are worried that two @ might be in the netloc? An alternative is to check for an @ in username.

dhirving (Author):

No, because it doesn't work for Ceph multi-tenant buckets ("tenant:bucket"). The part after the colon is parsed as the port number, and urllib throws an exception when you try to access the port if it isn't numeric.
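
To spell out the failure mode with a standalone illustration (the URI below is made up; this is not code from the PR):

from urllib.parse import urlparse

# Ceph multi-tenant bucket names contain a colon: "tenant:bucket".
uri = urlparse("s3://myprofile@tenant:bucket/key")
print(uri.netloc)  # myprofile@tenant:bucket
try:
    uri.port  # urllib tries to treat "bucket" as a port number
except ValueError as exc:
    print(exc)  # Port could not be cast to integer value ...

# Splitting the netloc on "@" keeps the full bucket name intact.
profile, _, bucket = uri.netloc.partition("@")
print(profile, bucket)  # myprofile tenant:bucket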

Member:

Can you add a comment about that to the code so the next person who looks isn't confused?

Member:

After a week of looking at profiles in ResourcePath I'm a bit paranoid, although cached_property wouldn't give us much here because the real gain would come from using lru_cache for the netloc parsing outside of the class (since we likely only use a couple of S3 netloc values at most).

For the record, adding cached_property with this code makes it run 4 times faster than not using cached_property (but it's 115 ns vs 20 ns).
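
A minimal sketch of the module-level lru_cache idea mentioned above (the function name is hypothetical and this is not the PR's code):

from functools import lru_cache


@lru_cache(maxsize=32)
def _split_s3_netloc(netloc: str) -> tuple[str | None, str]:
    # Parse each distinct S3 netloc only once; with only a couple of distinct
    # netlocs in practice, every later call is a dictionary lookup.
    profile, separator, bucket = netloc.partition("@")
    if not separator:
        return None, netloc
    return profile, bucket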

def profile(self) -> str | None:
    return self._uri.username

@property
Member:

Not @cached_property or something? We only need to do the split and check the first time.

dhirving (Author):

Why would we cache this? It does like zero work, and all of its consumers go off and do network I/O immediately after calling this.

Additionally, I can't imagine that the logic in cached_property for doing the caching is much faster than re-doing the split of a 60 byte string.

    return getS3Client(self.profile)

@property
def profile(self) -> str | None:
Member:

Docstring with a quick description (one line is fine). We have to hope that people won't start thinking these two properties are public and rely on them, even though they are scheme-dependent. We could make them public like ParseResult does by calling them hostname and username and then not being specific to this scheme at all. Or what do you think about calling these _profile and _bucket?

dhirving (Author):

I'm fine with _profile and _bucket, I'll change it.

getS3Client()
profiles = set[str | None]()
for path in uris:
    if path.scheme == "s3":
Member:

In theory the base class already enforces this when it calls _mexists.

    if not bucket:
        raise ValueError(f"S3 URI does not include bucket name: '{str(self)}'")

    return bucket

@classmethod
def _mexists(cls, uris: Iterable[ResourcePath]) -> dict[ResourcePath, bool]:
    # Force client to be created before creating threads.
Member:

Suggested change:
- # Force client to be created before creating threads.
+ # Force client to be created for each profile before creating threads.

from unittest.mock import patch

from botocore.exceptions import ClientError
from botocore.handlers import validate_bucket_name
from deprecated.sphinx import deprecated
from urllib3.exceptions import HTTPError, RequestError
from urllib3.util import Url, parse_url
Member:

Are we meant to be using parse_url rather than urlparse these days?

dhirving (Author):

I just used it because we already have a dependency on it and it has more useful semantics for what I was trying to do. The equivalent code for getting the username/password part and reconstructing the URL without it is a lot uglier with urllib.parse because it treats the entire netloc as a single unit.
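
For concreteness, a side-by-side sketch of the difference with a made-up URL (not code from this PR):

from urllib.parse import urlparse

from urllib3.util import parse_url

# urllib3: the auth component is a separate field, so dropping it is one step.
u = parse_url("https://access:secret@s3.example.org:9000/bucket")
print(u.auth)                     # access:secret
print(u._replace(auth=None).url)  # https://s3.example.org:9000/bucket

# urllib.parse: auth lives inside netloc, so it has to be stripped by hand.
p = urlparse("https://access:secret@s3.example.org:9000/bucket")
netloc = p.hostname + (f":{p.port}" if p.port else "")
print(p._replace(netloc=netloc).geturl())  # https://s3.example.org:9000/bucket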

Add _ prefix to bucket and profile properties in S3ResourcePath
dhirving merged commit cda7a56 into main on Feb 2, 2024
16 checks passed
dhirving deleted the tickets/DM-42704 branch on February 2, 2024 at 16:19