Skip to content

POC: Add an inexpensive test to set anon if no creds are found in gcs #985

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
80 changes: 77 additions & 3 deletions src/datachain/client/gcs.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@
# Patch gcsfs for consistency with s3fs
GCSFileSystem.set_session = GCSFileSystem._set_session
PageQueue = asyncio.Queue[Optional[Iterable[dict[str, Any]]]]
ANONYMOUS_TOKEN = "anon" # noqa: S105


class GCSClient(Client):
Expand All @@ -24,12 +25,85 @@
protocol = "gs"

@classmethod
def create_fs(cls, **kwargs) -> GCSFileSystem:
def _try_authenticate(cls, gcreds, method: str) -> bool:
"""Attempt to authenticate using the specified method.

Args:
gcreds: Google credentials object
method: Authentication method to try

Returns:
bool: True if authentication succeeded, False otherwise
"""
from google.auth.exceptions import GoogleAuthError

Check warning on line 38 in src/datachain/client/gcs.py

View check run for this annotation

Codecov / codecov/patch

src/datachain/client/gcs.py#L38

Added line #L38 was not covered by tests

try:
gcreds.connect(method=method)
return True
except (GoogleAuthError, ValueError):

Check warning on line 43 in src/datachain/client/gcs.py

View check run for this annotation

Codecov / codecov/patch

src/datachain/client/gcs.py#L40-L43

Added lines #L40 - L43 were not covered by tests
# Reset credentials if authentication failed (reverts to 'anon' behavior)
gcreds.credentials = None
return False

Check warning on line 46 in src/datachain/client/gcs.py

View check run for this annotation

Codecov / codecov/patch

src/datachain/client/gcs.py#L45-L46

Added lines #L45 - L46 were not covered by tests

@classmethod
def _get_default_credentials(cls, **kwargs) -> dict[str, Any]:
"""Get default GCS credentials using various authentication methods.

Returns:
dict: Updated kwargs with appropriate token
"""
from gcsfs.core import DEFAULT_PROJECT
from gcsfs.credentials import GoogleCredentials
from google.auth.compute_engine._metadata import is_on_gce

def request_callback(*args, **kwargs):
from google.auth.exceptions import TransportError

Check warning on line 60 in src/datachain/client/gcs.py

View check run for this annotation

Codecov / codecov/patch

src/datachain/client/gcs.py#L60

Added line #L60 was not covered by tests

raise TransportError("Skip metadata check")

Check warning on line 62 in src/datachain/client/gcs.py

View check run for this annotation

Codecov / codecov/patch

src/datachain/client/gcs.py#L62

Added line #L62 was not covered by tests

# If credentials provided in env var, use those
if os.environ.get("DATACHAIN_GCP_CREDENTIALS"):
kwargs["token"] = json.loads(os.environ["DATACHAIN_GCP_CREDENTIALS"])
if kwargs.pop("anon", False):
kwargs["token"] = "anon" # noqa: S105
return kwargs

# If token is provided, use it
if "token" in kwargs:
return kwargs

Check warning on line 71 in src/datachain/client/gcs.py

View check run for this annotation

Codecov / codecov/patch

src/datachain/client/gcs.py#L71

Added line #L71 was not covered by tests

# If anonymous access requested, force anonymous access
if kwargs.get("anon"):
kwargs["token"] = ANONYMOUS_TOKEN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it a way fsspec internally distinguishes it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes,

return kwargs

# Try various authentication methods
gcreds = GoogleCredentials(

Check warning on line 79 in src/datachain/client/gcs.py

View check run for this annotation

Codecov / codecov/patch

src/datachain/client/gcs.py#L79

Added line #L79 was not covered by tests
token=ANONYMOUS_TOKEN,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we pass anon token here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to initialize the creds. If anon is not passed, it will try to connect with other approaches.

project=DEFAULT_PROJECT,
access="full_control",
)

# Define authentication methods to try
auth_methods = ["google_default", "cache"]

Check warning on line 86 in src/datachain/client/gcs.py

View check run for this annotation

Codecov / codecov/patch

src/datachain/client/gcs.py#L86

Added line #L86 was not covered by tests
if is_on_gce(request_callback):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amritghimire what is the logic behind the request_callback? will always raise or only in certain cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is to disable the connection to metadata server. It raises exception and the is_on_gce will check for existence of a file to determine if certain file exist. You can look into the code of is_on_gce for more context.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't it be disabled on the fsspec side? does it mean that metadata server can be ignored, or does it mean that it won't be retrying anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't be retrying it anymore.
Also, this is just a proof of concept PR, if we want to go this route or not. Needs refinement if we want to go this way. Created a PR this way so it is easier to view the difference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, why can't we suggest the same improvement to the fsspec itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we go that route, let's see if this works for us or not first internally IMO.

This is a hacky fix for now to be sure.

@0x2b3bfa0 may have more context on this. If we can skip the metadata check or not.

I was wondering is it safe for datachain to skip the metadata check altogether or not?

So, users on google compute engine could use token to cloud to force this way of communication.

Copy link
Member

@0x2b3bfa0 0x2b3bfa0 Mar 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got sidetracked by a lot of things and I lack enough context to respond — and also lack time to get that context. Please ping me in a few day if you still need my input. Sorry! 🙈

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering is it safe for datachain to skip the metadata check altogether or not?

answer should be the same as for fsspec to my mind, or google lib

So, users on google compute engine could use token to cloud to force this way of communication.

they could, but probably won't expect the discrepancy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned, this is a hacky fix. We can look for proper way to fix this or look for proper way either to suggest it to fsspec or google lib or use a request which doesn't retry for 5 times when it fails for gce check.

We first need to figure out or discuss if we want to implement something like this. @shcheklein cc. @iterative/datachain

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't want some hacky fix like this here, I don't think we should spend more time to see how to skip the metadata check here.

auth_methods.append("cloud")

Check warning on line 88 in src/datachain/client/gcs.py

View check run for this annotation

Codecov / codecov/patch

src/datachain/client/gcs.py#L88

Added line #L88 was not covered by tests

# Try each authentication method
for method in auth_methods:
if cls._try_authenticate(gcreds, method):
return kwargs

Check warning on line 93 in src/datachain/client/gcs.py

View check run for this annotation

Codecov / codecov/patch

src/datachain/client/gcs.py#L93

Added line #L93 was not covered by tests

# If no authentication method worked, use anonymous access
kwargs["token"] = ANONYMOUS_TOKEN
return kwargs

Check warning on line 97 in src/datachain/client/gcs.py

View check run for this annotation

Codecov / codecov/patch

src/datachain/client/gcs.py#L96-L97

Added lines #L96 - L97 were not covered by tests

@classmethod
def create_fs(cls, **kwargs) -> GCSFileSystem:
"""Create a GCS filesystem with appropriate authentication.

Returns:
GCSFileSystem: Authenticated filesystem object
"""
kwargs = cls._get_default_credentials(**kwargs)
return cast("GCSFileSystem", super().create_fs(**kwargs))

def url(self, path: str, expires: int = 3600, **kwargs) -> str:
Expand Down
Loading