Pull dataset from studio if not available locally #901

Merged
amritghimire merged 5 commits into main from amrit/from_dataset
Feb 10, 2025

Conversation

Contributor

@amritghimire amritghimire commented Feb 6, 2025

If all of the following conditions are met, the dataset is pulled from Studio:

  • The user is logged in to Studio.
  • The dataset or the requested version does not exist locally.
  • The user has not passed fallback_to_remote=False to from_dataset.

In that case, the dataset is pulled from Studio before execution continues.
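
The conditions above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the code from this PR: `resolve_dataset`, `DatasetNotFoundError`, the dictionary-based local catalog, and the `pull_from_studio` callable are all hypothetical stand-ins. Only the `fallback_to_remote` flag name and the printed message come from this conversation.

```python
from typing import Callable, Dict, Tuple


class DatasetNotFoundError(Exception):
    """Hypothetical error: the dataset is missing and no fallback is allowed."""


def resolve_dataset(
    name: str,
    version: int,
    local_catalog: Dict[Tuple[str, int], dict],
    pull_from_studio: Callable[[str, int], dict],
    logged_in: bool,
    fallback_to_remote: bool = True,
) -> dict:
    """Return a dataset record, pulling from the remote only when allowed."""
    key = (name, version)

    # The dataset and version exist locally: no remote call is made.
    if key in local_catalog:
        return local_catalog[key]

    # The fallback requires a logged-in user who has not opted out.
    if not (logged_in and fallback_to_remote):
        raise DatasetNotFoundError(f"{name}@v{version}")

    # Tell the user why execution may be delayed, then pull and store locally.
    print("Dataset not found in local catalog, trying to get from studio")
    record = pull_from_studio(name, version)
    local_catalog[key] = record
    return record
```

In this sketch, passing `fallback_to_remote=False` or being logged out turns a missing dataset into an error instead of a remote pull.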

A test is added to check this behavior.

Closes #874


cloudflare-workers-and-pages bot commented Feb 6, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: a479a1b
Status: ✅  Deploy successful!
Preview URL: https://0302fb58.datachain-documentation.pages.dev
Branch Preview URL: https://amrit-from-dataset.datachain-documentation.pages.dev


@amritghimire amritghimire self-assigned this Feb 6, 2025
@amritghimire amritghimire requested review from ilongin, dreadatour and a team February 6, 2025 14:36

codecov bot commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 67.85714% with 9 lines in your changes missing coverage. Please review.

Project coverage is 87.65%. Comparing base (1e8bec1) to head (a479a1b).
Report is 2 commits behind head on main.

Files with missing lines            Patch %   Lines
src/datachain/query/dataset.py      53.84%    6 Missing ⚠️
src/datachain/catalog/catalog.py    76.92%    1 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #901      +/-   ##
==========================================
- Coverage   87.69%   87.65%   -0.05%     
==========================================
  Files         130      130              
  Lines       11665    11690      +25     
  Branches     1586     1590       +4     
==========================================
+ Hits        10230    10247      +17     
- Misses       1038     1043       +5     
- Partials      397      400       +3     
Flag Coverage Δ
datachain 87.57% <67.85%> (-0.05%) ⬇️


@@ -1112,6 +1135,21 @@ def __iter__(self):
    def __or__(self, other):
        return self.union(other)

    def pull_dataset(self, name: str, version: Optional[int] = None) -> "DatasetRecord":
        print("Dataset not found in local catalog, trying to get from studio")
Member

let's use logger here in debug mode? @skshetry what is your take?

Contributor Author

I initially tried using the logger at info level here, but print seemed consistent with other similar messages. Also, we definitely want the user to know that we are trying to get the dataset from Studio, so that they can expect the delay in execution.
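
The trade-off discussed in this thread can be illustrated with the standard library alone. The sketch below is not code from this PR: the logger name `fallback_sketch` and both helper functions are hypothetical. It shows that a debug-level log record is dropped unless the logger is configured to emit it, while `print` always writes the message.

```python
import io
import logging
from contextlib import redirect_stdout

MESSAGE = "Dataset not found in local catalog, trying to get from studio"


def notify_with_print() -> str:
    """Capture what print() writes to stdout."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        print(MESSAGE)
    return buf.getvalue()


def notify_with_logger(level: int) -> str:
    """Capture what a debug-level log call emits at the given logger level."""
    buf = io.StringIO()
    logger = logging.getLogger("fallback_sketch")  # hypothetical logger name
    logger.propagate = False  # keep the record out of the root logger
    handler = logging.StreamHandler(buf)
    logger.addHandler(handler)
    logger.setLevel(level)
    try:
        logger.debug(MESSAGE)
    finally:
        logger.removeHandler(handler)
    return buf.getvalue()
```

With the logger level set to `logging.WARNING`, `notify_with_logger` captures nothing, so a user who has not enabled debug logging would see no explanation for the delay. With `logging.DEBUG`, it captures the same line that `print` produces.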

@amritghimire amritghimire requested a review from ilongin February 10, 2025 09:59
@amritghimire amritghimire merged commit ea9a904 into main Feb 10, 2025
36 of 37 checks passed
@amritghimire amritghimire deleted the amrit/from_dataset branch February 10, 2025 15:03
Successfully merging this pull request may close these issues.

Support Studio datasets in Python
3 participants