Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCP missing creds slows down significantly #740

Open
2 tasks
dmpetrov opened this issue Dec 24, 2024 · 7 comments · May be fixed by #985
Open
2 tasks

GCP missing creds slows down significantly #740

dmpetrov opened this issue Dec 24, 2024 · 7 comments · May be fixed by #985
Assignees
Labels
bug Something isn't working

Comments

@dmpetrov
Copy link
Member

dmpetrov commented Dec 24, 2024

Description

See quick start.

This command take 20 seconds (sometimes 30) without any progressbar if I don't have any GCP creds setup. If I put anon=True - it works instantly like it suppose to.

meta = DataChain.from_json("gs://datachain-demo/dogs-and-cats/*json", object_name="meta")

This affects all users since we use GCP in most of our examples and, I'm sure, majority of user do not have GCP creds set up.

What's needed: we need to avoid the delay when GCP creds are not set up. Ideally, without specifying anon=True which adds more code in examples and mental overhead for users.

ToDo:

  • Fix the issue
  • Remove anon=True from Readme and Get Started
@dmpetrov dmpetrov added the bug Something isn't working label Dec 24, 2024
dmpetrov added a commit that referenced this issue Dec 26, 2024
dmpetrov added a commit that referenced this issue Dec 26, 2024
@shcheklein
Copy link
Member

This is happening in SOMA co-working for me as well. The reason is the same issue as here https://github.com/iterative/studio/issues/10235 . Google lib has an exponential backoff on trying to reach out to the metadata service. So, when we don't set credentials explicitly to anon it is trying everything it could and thus hitting that exponential backoff. We need to wait for that PR that @0x2b3bfa0 made to land. And even after that users will have to specify an env variable.

I wonder if we can fix fsspec to disable metadata service if we are not on google (GC fsspec has a way to determine that AFAIU) - wdyt @0x2b3bfa0 ?

@dmpetrov
Copy link
Member Author

This is happening in SOMA co-working for me as well

I'm experiencing the same at my home wifi

@skshetry
Copy link
Member

Related: fsspec/filesystem_spec#1768

@skshetry skshetry changed the title GCP missing creds slows down siginificantly GCP missing creds slows down significantly Jan 14, 2025
@shcheklein
Copy link
Member

@skshetry thanks, I put a comment with the details there. It's a good / relevant discussion indeed.

@0x2b3bfa0
Copy link
Member

0x2b3bfa0 commented Jan 14, 2025

I wonder if we can fix fsspec to disable metadata service if we are not on google (GC fsspec has a way to determine that AFAIU) - wdyt @0x2b3bfa0 ?

I think exactly https://github.com/iterative/studio/issues/10235#issuecomment-2420957866; I've posted a comment at fsspec/filesystem_spec#1768 (comment).

@amritghimire amritghimire self-assigned this Mar 4, 2025
@amritghimire
Copy link
Contributor

amritghimire commented Mar 20, 2025

The issue here is the expensive check to metadata anytime. The patch below will add an check to simple connect using google credentials so that anon will be applied if necessary.

diff --git a/src/datachain/client/gcs.py b/src/datachain/client/gcs.py
index 03e5245..be9a249 100644
--- a/src/datachain/client/gcs.py
+++ b/src/datachain/client/gcs.py
@@ -16,6 +16,7 @@ from .fsspec import DELIMITER, Client, ResultQueue
 # Patch gcsfs for consistency with s3fs
 GCSFileSystem.set_session = GCSFileSystem._set_session
 PageQueue = asyncio.Queue[Optional[Iterable[dict[str, Any]]]]
+ANONYMOUS_TOKEN = "anon"  # noqa: S105
 
 
 class GCSClient(Client):
@@ -24,12 +25,85 @@ class GCSClient(Client):
     protocol = "gs"
 
     @classmethod
-    def create_fs(cls, **kwargs) -> GCSFileSystem:
+    def _try_authenticate(cls, gcreds, method: str) -> bool:
+        """Attempt to authenticate using the specified method.
+
+        Args:
+            gcreds: Google credentials object
+            method: Authentication method to try
+
+        Returns:
+            bool: True if authentication succeeded, False otherwise
+        """
+        from google.auth.exceptions import GoogleAuthError
+
+        try:
+            gcreds.connect(method=method)
+            return True
+        except (GoogleAuthError, ValueError):
+            # Reset credentials if authentication failed (reverts to 'anon' behavior)
+            gcreds.credentials = None
+            return False
+
+    @classmethod
+    def _get_default_credentials(cls, **kwargs) -> dict[str, Any]:
+        """Get default GCS credentials using various authentication methods.
+
+        Returns:
+            dict: Updated kwargs with appropriate token
+        """
+        from gcsfs.core import DEFAULT_PROJECT
+        from gcsfs.credentials import GoogleCredentials
+        from google.auth.compute_engine._metadata import is_on_gce
+
+        def request_callback(*args, **kwargs):
+            from google.auth.exceptions import TransportError
+
+            raise TransportError("Skip metadata check")
+
+        # If credentials provided in env var, use those
         if os.environ.get("DATACHAIN_GCP_CREDENTIALS"):
             kwargs["token"] = json.loads(os.environ["DATACHAIN_GCP_CREDENTIALS"])
-        if kwargs.pop("anon", False):
-            kwargs["token"] = "anon"  # noqa: S105
+            return kwargs
+
+        # If token is provided, use it
+        if "token" in kwargs:
+            return kwargs
+
+        # If anonymous access requested, force anonymous access
+        if kwargs.get("anon"):
+            kwargs["token"] = ANONYMOUS_TOKEN
+            return kwargs
+
+        # Try various authentication methods
+        gcreds = GoogleCredentials(
+            token=ANONYMOUS_TOKEN,
+            project=DEFAULT_PROJECT,
+            access="full_control",
+        )
+
+        # Define authentication methods to try
+        auth_methods = ["google_default", "cache"]
+        if is_on_gce(request_callback):
+            auth_methods.append("cloud")
+
+        # Try each authentication method
+        for method in auth_methods:
+            if cls._try_authenticate(gcreds, method):
+                return kwargs
 
+        # If no authentication method worked, use anonymous access
+        kwargs["token"] = ANONYMOUS_TOKEN
+        return kwargs
+
+    @classmethod
+    def create_fs(cls, **kwargs) -> GCSFileSystem:
+        """Create a GCS filesystem with appropriate authentication.
+
+        Returns:
+            GCSFileSystem: Authenticated filesystem object
+        """
+        kwargs = cls._get_default_credentials(**kwargs)
         return cast("GCSFileSystem", super().create_fs(**kwargs))
 
     def url(self, path: str, expires: int = 3600, **kwargs) -> str:

@amritghimire
Copy link
Contributor

The difference in time with the following script with clean .datachain is as:

import time
from datachain import DataChain


start = time.time()
images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg")
images.show()
end = time.time()
print(f"Time taken: {end - start} seconds")

Previous:

Time taken: 27.73566699028015 seconds

Now:

Time taken: 6.101557016372681 seconds

amritghimire added a commit that referenced this issue Mar 20, 2025
Regarding the discussion on #740, this is a possible approach to specify
anon token in gcs if no creds is found.
This is the similar to the approach we already have s3.

The approach is to try different method of connection before defaulting
back to anon as gcs filesystem does. But we are skipping the metadata
check for the case of google cloud engine.

Testing the improvement with the code:
```py
import time
from datachain import DataChain

start = time.time()
images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg")
images.show()
end = time.time()
print(f"Time taken: {end - start} seconds")
```

The improvement was from:
27.73566699028015 seconds

to:
Time taken: 6.101557016372681 seconds

Closes #740
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants