-
Notifications
You must be signed in to change notification settings - Fork 114
POC: Add an inexpensive test to set anon if no creds are found in gcs #985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,6 +16,7 @@ | |
# Patch gcsfs for consistency with s3fs | ||
GCSFileSystem.set_session = GCSFileSystem._set_session | ||
PageQueue = asyncio.Queue[Optional[Iterable[dict[str, Any]]]] | ||
ANONYMOUS_TOKEN = "anon" # noqa: S105 | ||
|
||
|
||
class GCSClient(Client): | ||
|
@@ -24,12 +25,85 @@ | |
protocol = "gs" | ||
|
||
@classmethod | ||
def create_fs(cls, **kwargs) -> GCSFileSystem: | ||
def _try_authenticate(cls, gcreds, method: str) -> bool: | ||
"""Attempt to authenticate using the specified method. | ||
|
||
Args: | ||
gcreds: Google credentials object | ||
method: Authentication method to try | ||
|
||
Returns: | ||
bool: True if authentication succeeded, False otherwise | ||
""" | ||
from google.auth.exceptions import GoogleAuthError | ||
|
||
try: | ||
gcreds.connect(method=method) | ||
return True | ||
except (GoogleAuthError, ValueError): | ||
# Reset credentials if authentication failed (reverts to 'anon' behavior) | ||
gcreds.credentials = None | ||
return False | ||
|
||
@classmethod | ||
def _get_default_credentials(cls, **kwargs) -> dict[str, Any]: | ||
"""Get default GCS credentials using various authentication methods. | ||
|
||
Returns: | ||
dict: Updated kwargs with appropriate token | ||
""" | ||
from gcsfs.core import DEFAULT_PROJECT | ||
from gcsfs.credentials import GoogleCredentials | ||
from google.auth.compute_engine._metadata import is_on_gce | ||
|
||
def request_callback(*args, **kwargs): | ||
from google.auth.exceptions import TransportError | ||
|
||
raise TransportError("Skip metadata check") | ||
|
||
# If credentials provided in env var, use those | ||
if os.environ.get("DATACHAIN_GCP_CREDENTIALS"): | ||
kwargs["token"] = json.loads(os.environ["DATACHAIN_GCP_CREDENTIALS"]) | ||
if kwargs.pop("anon", False): | ||
kwargs["token"] = "anon" # noqa: S105 | ||
return kwargs | ||
|
||
# If token is provided, use it | ||
if "token" in kwargs: | ||
return kwargs | ||
|
||
# If anonymous access requested, force anonymous access | ||
if kwargs.get("anon"): | ||
kwargs["token"] = ANONYMOUS_TOKEN | ||
return kwargs | ||
|
||
# Try various authentication methods | ||
gcreds = GoogleCredentials( | ||
token=ANONYMOUS_TOKEN, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why do we pass anon token here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is to initialize the creds. If anon is not passed, it will try to connect with other approaches. |
||
project=DEFAULT_PROJECT, | ||
access="full_control", | ||
) | ||
|
||
# Define authentication methods to try | ||
auth_methods = ["google_default", "cache"] | ||
if is_on_gce(request_callback): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @amritghimire what is the logic behind the request_callback? will always raise or only in certain cases? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is to disable the connection to metadata server. It raises exception and the is_on_gce will check for existence of a file to determine if certain file exist. You can look into the code of is_on_gce for more context. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why can't it be disabled on the fsspec side? does it mean that metadata server can be ignored, or does it mean that it won't be retrying anymore? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It won't be retrying it anymore. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. so, why can't we suggest the same improvement to the fsspec itself? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Before we go that route, let's see if this works for us or not first internally IMO. This is a hacky fix for now to be sure. @0x2b3bfa0 may have more context on this. If we can skip the metadata check or not. I was wondering is it safe for datachain to skip the metadata check altogether or not? So, users on google compute engine could use token to cloud to force this way of communication. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I got sidetracked by a lot of things and I lack enough context to respond — and also lack time to get that context. Please ping me in a few day if you still need my input. Sorry! 🙈 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
answer should be the same as for fsspec to my mind, or google lib
they could, but probably won't expect the discrepancy There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As I mentioned, this is a hacky fix. We can look for proper way to fix this or look for proper way either to suggest it to fsspec or google lib or use a request which doesn't retry for 5 times when it fails for gce check. We first need to figure out or discuss if we want to implement something like this. @shcheklein cc. @iterative/datachain There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we don't want some hacky fix like this here, I don't think we should spend more time to see how to skip the metadata check here. |
||
auth_methods.append("cloud") | ||
|
||
# Try each authentication method | ||
for method in auth_methods: | ||
if cls._try_authenticate(gcreds, method): | ||
return kwargs | ||
|
||
# If no authentication method worked, use anonymous access | ||
kwargs["token"] = ANONYMOUS_TOKEN | ||
return kwargs | ||
|
||
@classmethod | ||
def create_fs(cls, **kwargs) -> GCSFileSystem: | ||
"""Create a GCS filesystem with appropriate authentication. | ||
|
||
Returns: | ||
GCSFileSystem: Authenticated filesystem object | ||
""" | ||
kwargs = cls._get_default_credentials(**kwargs) | ||
return cast("GCSFileSystem", super().create_fs(**kwargs)) | ||
|
||
def url(self, path: str, expires: int = 3600, **kwargs) -> str: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it a way fsspec internally distinguishes it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes,