-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GCP missing creds slows down significantly #740
Comments
This is happening in SOMA co-working for me as well. The reason is the same issue as here https://github.com/iterative/studio/issues/10235 . Google lib has an exponential backoff on trying to reach out to the metadata service. So, when we don't set credentials explicitly to anon it is trying everything it could and thus hitting that exponential backoff. We need to wait for that PR that @0x2b3bfa0 made to land. And even after that users will have to specify an env variable. I wonder if we can fix fsspec to disable metadata service if we are not on google (GC fsspec has a way to determine that AFAIU) - wdyt @0x2b3bfa0 ? |
I'm experiencing the same at my home wifi |
Related: fsspec/filesystem_spec#1768 |
@skshetry thanks, I put a comment with the details there. It's a good / relevant discussion indeed. |
I think exactly https://github.com/iterative/studio/issues/10235#issuecomment-2420957866; I've posted a comment at fsspec/filesystem_spec#1768 (comment). |
The issue here is the expensive check to metadata anytime. The patch below will add an check to simple connect using google credentials so that anon will be applied if necessary. diff --git a/src/datachain/client/gcs.py b/src/datachain/client/gcs.py
index 03e5245..be9a249 100644
--- a/src/datachain/client/gcs.py
+++ b/src/datachain/client/gcs.py
@@ -16,6 +16,7 @@ from .fsspec import DELIMITER, Client, ResultQueue
# Patch gcsfs for consistency with s3fs
GCSFileSystem.set_session = GCSFileSystem._set_session
PageQueue = asyncio.Queue[Optional[Iterable[dict[str, Any]]]]
+ANONYMOUS_TOKEN = "anon" # noqa: S105
class GCSClient(Client):
@@ -24,12 +25,85 @@ class GCSClient(Client):
protocol = "gs"
@classmethod
- def create_fs(cls, **kwargs) -> GCSFileSystem:
+ def _try_authenticate(cls, gcreds, method: str) -> bool:
+ """Attempt to authenticate using the specified method.
+
+ Args:
+ gcreds: Google credentials object
+ method: Authentication method to try
+
+ Returns:
+ bool: True if authentication succeeded, False otherwise
+ """
+ from google.auth.exceptions import GoogleAuthError
+
+ try:
+ gcreds.connect(method=method)
+ return True
+ except (GoogleAuthError, ValueError):
+ # Reset credentials if authentication failed (reverts to 'anon' behavior)
+ gcreds.credentials = None
+ return False
+
+ @classmethod
+ def _get_default_credentials(cls, **kwargs) -> dict[str, Any]:
+ """Get default GCS credentials using various authentication methods.
+
+ Returns:
+ dict: Updated kwargs with appropriate token
+ """
+ from gcsfs.core import DEFAULT_PROJECT
+ from gcsfs.credentials import GoogleCredentials
+ from google.auth.compute_engine._metadata import is_on_gce
+
+ def request_callback(*args, **kwargs):
+ from google.auth.exceptions import TransportError
+
+ raise TransportError("Skip metadata check")
+
+ # If credentials provided in env var, use those
if os.environ.get("DATACHAIN_GCP_CREDENTIALS"):
kwargs["token"] = json.loads(os.environ["DATACHAIN_GCP_CREDENTIALS"])
- if kwargs.pop("anon", False):
- kwargs["token"] = "anon" # noqa: S105
+ return kwargs
+
+ # If token is provided, use it
+ if "token" in kwargs:
+ return kwargs
+
+ # If anonymous access requested, force anonymous access
+ if kwargs.get("anon"):
+ kwargs["token"] = ANONYMOUS_TOKEN
+ return kwargs
+
+ # Try various authentication methods
+ gcreds = GoogleCredentials(
+ token=ANONYMOUS_TOKEN,
+ project=DEFAULT_PROJECT,
+ access="full_control",
+ )
+
+ # Define authentication methods to try
+ auth_methods = ["google_default", "cache"]
+ if is_on_gce(request_callback):
+ auth_methods.append("cloud")
+
+ # Try each authentication method
+ for method in auth_methods:
+ if cls._try_authenticate(gcreds, method):
+ return kwargs
+ # If no authentication method worked, use anonymous access
+ kwargs["token"] = ANONYMOUS_TOKEN
+ return kwargs
+
+ @classmethod
+ def create_fs(cls, **kwargs) -> GCSFileSystem:
+ """Create a GCS filesystem with appropriate authentication.
+
+ Returns:
+ GCSFileSystem: Authenticated filesystem object
+ """
+ kwargs = cls._get_default_credentials(**kwargs)
return cast("GCSFileSystem", super().create_fs(**kwargs))
def url(self, path: str, expires: int = 3600, **kwargs) -> str:
|
The difference in time with the following script with clean .datachain is as: import time
from datachain import DataChain
start = time.time()
images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg")
images.show()
end = time.time()
print(f"Time taken: {end - start} seconds") Previous: Time taken: 27.73566699028015 seconds Now: Time taken: 6.101557016372681 seconds |
Regarding the discussion on #740, this is a possible approach to specify anon token in gcs if no creds is found. This is the similar to the approach we already have s3. The approach is to try different method of connection before defaulting back to anon as gcs filesystem does. But we are skipping the metadata check for the case of google cloud engine. Testing the improvement with the code: ```py import time from datachain import DataChain start = time.time() images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg") images.show() end = time.time() print(f"Time taken: {end - start} seconds") ``` The improvement was from: 27.73566699028015 seconds to: Time taken: 6.101557016372681 seconds Closes #740
Description
See quick start.
This command take 20 seconds (sometimes 30) without any progressbar if I don't have any GCP creds setup. If I put
anon=True
- it works instantly like it suppose to.This affects all users since we use GCP in most of our examples and, I'm sure, majority of user do not have GCP creds set up.
What's needed: we need to avoid the delay when GCP creds are not set up. Ideally, without specifying
anon=True
which adds more code in examples and mental overhead for users.ToDo:
anon=True
from Readme and Get StartedThe text was updated successfully, but these errors were encountered: