Add option to run the datachain query to Studio #579

Merged: 7 commits, Nov 29, 2024

@amritghimire amritghimire commented Nov 9, 2024

The options are:

positional arguments:
  query_file            The query file to run.

options:
  --team TEAM           The team to run a job for. By default, it will use team from config.
  --env-file ENV_FILE   File containing environment variables to set for the job.
  --envs ENVS [ENVS ...]
                        Environment variables to set for the job.
  --workers WORKERS     Number of workers to use for the job.
  --files FILES [FILES ...]
                        Files to include in the job.
  --python-version PYTHON_VERSION
                        Python version to use for the job (e.g. '3.9', '3.10', '3.11').
  --req-file REQ_FILE   File containing Python package requirements.
  --reqs REQS [REQS ...]
                        Python package requirements.

Example run:

Example command:

$ datachain studio run example_query.py --env-file=env_file.txt --envs="ENV_FROM_ARGS=1" --workers=2 --files file.txt --python-version=3.12 --req-file=reqs.txt --reqs="oneliners"

Files:

run/env_file.txt:

ENV_FROM_FILE = 'environments.txt'

run/file.txt:

content from file

run/reqs.txt:

pyjokes

run/example_query.py:

from datachain import DataChain
from os import environ
from oneliners import get_random
import pyjokes

# Define the UDF:
def path_len(path):
    if path.endswith(".json"):
        return (-1,)
    return (len(path),)

if __name__ == "__main__":
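    # Confirm that the environment variables, requirements, and files reached the job.
    print("Environment set from file:", environ["ENV_FROM_FILE"])
    print("Environment set from args:", environ["ENV_FROM_ARGS"])
    print("Oneliners from reqs (args):", get_random())
    print("Joke from pyjokes (from reqs file):", pyjokes.get_joke())

    # Read the uploaded file with a context manager so the handle is closed.
    with open("file.txt") as f:
        print("Content from files (args):", f.read())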

    DataChain.from_storage(
        uri="gs://datachain-demo/dogs-and-cats/",
    ).map(
        path_len,
        params=["file.path"],
        output={"path_len": int},
    ).show()

TODO:

  • Rename the arguments to clearer names
  • Add tests

Companion PR: https://github.com/iterative/studio/pull/10897
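
The option list in the description can be approximated with a small `argparse` parser. This is only a sketch for trying out how the example command line parses: the flag names and help strings are taken from the description, while `build_parser`, the `prog` string, and `type=int` for `--workers` are assumptions and not the code in `src/datachain`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the `datachain studio run` parser."""
    parser = argparse.ArgumentParser(prog="datachain studio run")
    parser.add_argument("query_file", help="The query file to run.")
    parser.add_argument("--team", help="The team to run a job for.")
    parser.add_argument(
        "--env-file",
        help="File containing environment variables to set for the job.",
    )
    parser.add_argument(
        "--envs", nargs="+", help="Environment variables to set for the job."
    )
    # Assumption: the worker count is parsed as an integer.
    parser.add_argument(
        "--workers", type=int, help="Number of workers to use for the job."
    )
    parser.add_argument("--files", nargs="+", help="Files to include in the job.")
    parser.add_argument(
        "--python-version", help="Python version to use for the job."
    )
    parser.add_argument(
        "--req-file", help="File containing Python package requirements."
    )
    parser.add_argument("--reqs", nargs="+", help="Python package requirements.")
    return parser


# The arguments from the example run, as the shell would pass them.
args = build_parser().parse_args(
    [
        "example_query.py",
        "--env-file=env_file.txt",
        "--envs=ENV_FROM_ARGS=1",
        "--workers=2",
        "--files",
        "file.txt",
        "--python-version=3.12",
        "--req-file=reqs.txt",
        "--reqs=oneliners",
    ]
)
print(vars(args))
```

With `nargs="+"`, `--envs`, `--files`, and `--reqs` each produce a list, even when only one value is given.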


cloudflare-workers-and-pages bot commented Nov 9, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 1feefc7
Status: ✅  Deploy successful!
Preview URL: https://16045bd7.datachain-documentation.pages.dev
Branch Preview URL: https://amrit-create-job.datachain-documentation.pages.dev


@amritghimire amritghimire self-assigned this Nov 9, 2024

codecov bot commented Nov 9, 2024

Codecov Report

Attention: Patch coverage is 81.35593% with 11 lines in your changes missing coverage. Please review.

Project coverage is 87.70%. Comparing base (045f3b0) to head (1feefc7).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/datachain/studio.py | 71.05% | 4 Missing and 7 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #579      +/-   ##
==========================================
- Coverage   87.71%   87.70%   -0.01%     
==========================================
  Files         112      112              
  Lines       10694    10753      +59     
  Branches     1439     1448       +9     
==========================================
+ Hits         9380     9431      +51     
- Misses        954      956       +2     
- Partials      360      366       +6     

| Flag | Coverage Δ |
|---|---|
| datachain | 87.64% <81.35%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

amritghimire and others added 3 commits November 26, 2024 11:01

@amritghimire amritghimire marked this pull request as ready for review November 27, 2024 11:12

@dreadatour dreadatour left a comment


Looks good to me.

In follow-up issues/PRs we may want to stream state, logs, and progress from Studio.

Also, how can we check the job execution status? Maybe a first step would be to add a job status command?

And maybe a job logs command before streaming?

studio_run_parser.add_argument(
    "--envs",
    nargs="+",
    help="Environment variables to set for the job.",

Do we want to be more verbose about how to set environment variables via the --envs flag? Also consider renaming this to --env.

Member

Agreed on the naming (and probably --req as well).
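
One way the more verbose help and the suggested rename could look is sketched below. It assumes the `KEY=VALUE` form used by `--envs="ENV_FROM_ARGS=1"` in the example run; the `env_assignment` validator, the `metavar`, and the help wording are hypothetical and not part of this PR.

```python
import argparse


def env_assignment(value: str) -> str:
    """Accept only KEY=VALUE strings (hypothetical validator)."""
    key, sep, _ = value.partition("=")
    if not sep or not key.strip():
        raise argparse.ArgumentTypeError(f"expected KEY=VALUE, got {value!r}")
    return value


parser = argparse.ArgumentParser(prog="datachain studio run")
parser.add_argument(
    "--env",  # name suggested in the review; the PR as described uses --envs
    nargs="+",
    type=env_assignment,
    metavar="KEY=VALUE",
    help="Environment variables to set for the job, e.g. --env FOO=1 BAR=baz.",
)

args = parser.parse_args(["--env", "ENV_FROM_ARGS=1", "DEBUG=true"])
print(args.env)
```

Validating in the `type` callable means a malformed value is reported by `argparse` itself with a usage message, before any job is submitted.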

studio_run_parser.add_argument(
    "--reqs",
    nargs="+",
    help="Python package requirements.",

Same here: more information on how to add requirements via the command-line flag would be nice to have.

    reqs: Optional[str] = None,
    req_file: Optional[str] = None,
):
    query_type = "PYTHON" if query_file.endswith(".py") else "SHELL"

This might be a problem in the future. I think we should let the user choose the file type, either with an additional argument or even with a separate command.
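
A minimal sketch of the "additional argument" variant, assuming an optional explicit type that overrides the extension check. The `"PYTHON"` and `"SHELL"` labels come from the snippet under review; `resolve_query_type` and its `explicit` parameter are hypothetical names.

```python
from typing import Optional

# Labels taken from the reviewed snippet.
VALID_QUERY_TYPES = ("PYTHON", "SHELL")


def resolve_query_type(query_file: str, explicit: Optional[str] = None) -> str:
    """Prefer an explicit type; otherwise fall back to the file extension."""
    if explicit is not None:
        query_type = explicit.upper()
        if query_type not in VALID_QUERY_TYPES:
            raise ValueError(f"unsupported query type: {explicit!r}")
        return query_type
    # Same fallback as the reviewed line: anything not ending in .py is SHELL.
    return "PYTHON" if query_file.endswith(".py") else "SHELL"


print(resolve_query_type("example_query.py"))
print(resolve_query_type("job_script", explicit="python"))
```

Keeping the extension check as the default preserves the behavior of the reviewed line while giving users a way out when a script has no `.py` suffix.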

Comment on lines +169 to +170
with open(query_file) as f:
    query = f.read()

Can we maybe check that this is a valid Python file before sending it?
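
Such a check could be as small as a syntax-only parse with the standard `ast` module. This is a sketch under that assumption: `is_valid_python` is a hypothetical helper, and it only catches syntax errors, not missing imports or runtime failures.

```python
import ast


def is_valid_python(source: str, filename: str = "<query>") -> bool:
    """Return True if `source` parses as Python; the code is never executed."""
    try:
        ast.parse(source, filename=filename)
    except (SyntaxError, ValueError):
        # ValueError covers sources with null bytes on some Python versions.
        return False
    return True


print(is_valid_python("print('ok')\n"))
print(is_valid_python("def broken(:\n"))
```

Because the parse uses the local interpreter, syntax that is valid only in the job's `--python-version` could still be rejected or accepted incorrectly.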

environment = "\n".join(envs) if envs else ""
if env_file:
    with open(env_file) as f:
        environment = f.read() + "\n" + environment

Empty line at the beginning if no envs are defined, but that may be OK.

requirements = "\n".join(reqs) if reqs else ""
if req_file:
    with open(req_file) as f:
        requirements = f.read() + "\n" + requirements

Same: empty line at the beginning if no reqs are defined.
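
If the stray blank lines ever matter, both the environment and the requirements strings could be built by one helper that joins only non-empty lines. The sketch below keeps the file-first ordering of the reviewed snippets; `merge_lines` is a hypothetical name, and dropping blank lines from the file itself is a behavior change relative to the PR.

```python
from typing import Iterable, Optional


def merge_lines(file_text: Optional[str], extra: Optional[Iterable[str]]) -> str:
    """Join file contents and command-line entries, skipping blank lines."""
    lines = []
    if file_text:
        lines.extend(file_text.splitlines())
    if extra:
        lines.extend(extra)
    return "\n".join(line for line in lines if line.strip())


# File contents only, then file contents plus a value from the command line.
print(repr(merge_lines("pyjokes\n", None)))
print(repr(merge_lines("pyjokes\n", ["oneliners"])))
```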

@amritghimire amritghimire merged commit d27ab5d into main Nov 29, 2024
36 of 38 checks passed
@amritghimire amritghimire deleted the amrit/create-job branch November 29, 2024 13:25