Inline script metadata #913

Status: Open (7 commits to merge into base branch main)

Contributor

@ilongin ilongin commented Feb 10, 2025

Adds a class called ScriptMeta that knows how to parse inline script metadata (https://packaging.python.org/en/latest/specifications/inline-script-metadata/#inline-script-metadata).

The data we collect from it:

  • python version
  • dependencies
  • number of workers
  • files
  • params

With this, we can even avoid having special run script settings in Studio, as the user can do everything in the script codebase and can seamlessly copy and paste a script between local and Studio.

Example of script with inline metadata:

# /// script
# requires-python = ">=3.12"
#
# dependencies = [
#   "pandas < 2.1.0",
#   "numpy == 1.26.4"
# ]
#
# [tools.datachain.workers]
# num_workers = 3
#
# [tools.datachain.files]
# image1 = "s3://ldb-public/image1.jpg"
# file1 = "s3://ldb-public/file.pdf"
#
# [tools.datachain.params]
# min_length_sec = 1
# cache = false
#
# ///
import sys
import pandas as pd
print(f"Python version: {sys.version_info}")
print(f"Pandas version: {pd.__version__}")

Example of parsing:

meta = ScriptMeta.parse(script)
print(meta.python_version)
print(meta.get_param("cache"))
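
To make the parsing flow concrete, here is a minimal sketch of how a class like this could work. It is not the PR's implementation: `ScriptMetaSketch`, its field names, and the choice to return the `requires-python` string unchanged as `python_version` are assumptions for illustration. The regex is the reference one from the specification, and the `tools.datachain` table name follows the example script above.

```python
import re
import sys
from dataclasses import dataclass, field
from typing import Any, Optional

if sys.version_info >= (3, 11):
    import tomllib
else:
    # Assumption: the "tomli" backport is installed on Python < 3.11.
    import tomli as tomllib

# Reference regex from the inline script metadata specification.
BLOCK_RE = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)


@dataclass
class ScriptMetaSketch:
    """Hypothetical stand-in for ScriptMeta; all field names are assumptions."""

    python_version: Optional[str] = None
    dependencies: list = field(default_factory=list)
    num_workers: Optional[int] = None
    files: dict = field(default_factory=dict)
    params: dict = field(default_factory=dict)

    def get_param(self, name: str, default: Any = None) -> Any:
        return self.params.get(name, default)

    @classmethod
    def parse(cls, script: str) -> Optional["ScriptMetaSketch"]:
        blocks = [m for m in BLOCK_RE.finditer(script) if m.group("type") == "script"]
        if not blocks:
            return None
        # Strip the leading "# " (or a bare "#") from every comment line.
        toml_text = "".join(
            line[2:] if line.startswith("# ") else line[1:]
            for line in blocks[0].group("content").splitlines(keepends=True)
        )
        data = tomllib.loads(toml_text)
        tool = data.get("tools", {}).get("datachain", {})
        return cls(
            python_version=data.get("requires-python"),
            dependencies=data.get("dependencies", []),
            num_workers=tool.get("workers", {}).get("num_workers"),
            files=tool.get("files", {}),
            params=tool.get("params", {}),
        )
```

In this sketch, `parse` returns `None` when the script has no `script` block, and `get_param` falls back to a default for missing keys.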

@ilongin ilongin linked an issue Feb 10, 2025 that may be closed by this pull request
cloudflare-workers-and-pages bot commented Feb 10, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 10460e8
Status: ✅  Deploy successful!
Preview URL: https://38850594.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-776-inline-script-me.datachain-documentation.pages.dev

codecov bot commented Feb 10, 2025

Codecov Report

Attention: Patch coverage is 91.11111% with 4 lines in your changes missing coverage. Please review.

Project coverage is 87.69%. Comparing base (de171c6) to head (10460e8).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/datachain/script_meta.py | 91.11% | 3 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #913      +/-   ##
==========================================
+ Coverage   87.66%   87.69%   +0.03%     
==========================================
  Files         130      131       +1     
  Lines       11698    11743      +45     
  Branches     1592     1595       +3     
==========================================
+ Hits        10255    10298      +43     
- Misses       1043     1044       +1     
- Partials      400      401       +1     
| Flag | Coverage Δ |
|---|---|
| datachain | 87.61% <91.11%> (+0.03%) ⬆️ |


def read_inline_meta(script: str) -> Optional[dict]:
    """Converts inline script metadata to dict with all found data"""
    regex = (
        r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
Contributor

I know this regex was taken from https://packaging.python.org/en/latest/specifications/inline-script-metadata/, but it is only an example there, and I would like to suggest updating it:

Suggested change
r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
r"(?m)^# \/\/\/ (?P<type>[a-zA-Z0-9-]+)[ \t]*$\s(?P<content>(^#(| .*)$\s)+)^# \/\/\/[ \t]*$"
  1. Escape the / symbol (leaving it unescaped is fine in this case, but in general it should be escaped in a regex)
  2. Allow trailing spaces (and tabs) on the first and last lines (where /// goes)

Also, I am not really sure about using \s as a newline matcher. I know the $\s combination will only catch a newline, not a space or tab, but [\r\n|\r|\n] looks stricter to me:

Suggested change
r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
r"(?m)^# \/\/\/ (?P<type>[a-zA-Z0-9-]+)[ \t]*$[\r\n|\r|\n](?P<content>(^#(| .*)$[\r\n|\r|\n])+)^# \/\/\/[ \t]*$"
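
One detail worth checking before adopting this form: inside square brackets, `|` is a literal character, so `[\r\n|\r|\n]` is a character class that matches exactly one of `\r`, `\n`, or `|`. It is not an alternation between `\r\n`, `\r`, and `\n`. The short sketch below compares it with a grouped alternation, `(?:\r\n|\r|\n)`, which is an illustrative alternative and not part of the suggestion.

```python
import re

# As written in the suggestion: a character class, i.e. exactly one
# character out of "\r", "\n" and "|".
char_class = re.compile(r"[\r\n|\r|\n]")

# Alternation needs a group; "\r\n" goes first so CRLF is consumed as a unit.
alternation = re.compile(r"(?:\r\n|\r|\n)")

print(char_class.fullmatch("|") is not None)     # True: the pipe itself is accepted
print(char_class.fullmatch("\r\n") is not None)  # False: two characters
print(alternation.fullmatch("|") is not None)    # False
print(alternation.fullmatch("\r\n") is not None) # True
```

Since `$` in Python's multiline mode only matches before `\n` or at the end of the string, the character that follows `$` in these patterns is a `\n` in practice, so a plain `\n` would behave the same way there.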

It is also a good idea to use non-capturing groups when no capturing is needed, since that makes the regex faster. For example, in this case I would suggest using (?:| .*) instead of (| .*). One more group can be converted to non-capturing, and the final regex will look like this:

Suggested change
r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
r"(?m)^# \/\/\/ (?P<type>[a-zA-Z0-9-]+)[ \t]*$[\r\n|\r|\n](?P<content>(?:^#(?:| .*)$[\r\n|\r|\n])+)^# \/\/\/[ \t]*$"
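
As a quick sanity check, the sketch below compiles the original pattern and the final suggested one and compares them on a small sample. The sample script is made up for illustration, and no timing is measured here. It only shows the number of capturing groups and the effect of allowing trailing whitespace on the marker lines.

```python
import re

ORIGINAL = r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
SUGGESTED = (
    r"(?m)^# \/\/\/ (?P<type>[a-zA-Z0-9-]+)[ \t]*$[\r\n|\r|\n]"
    r"(?P<content>(?:^#(?:| .*)$[\r\n|\r|\n])+)^# \/\/\/[ \t]*$"
)

original_re = re.compile(ORIGINAL)
suggested_re = re.compile(SUGGESTED)

# Made-up sample script, plus a variant with trailing spaces on the first line.
script = '# /// script\n# dependencies = ["numpy"]\n# ///\nprint("hi")\n'
trailing = script.replace("# /// script\n", "# /// script  \n")

# Capturing groups: 4 in the original, 2 once the helpers are non-capturing.
print(original_re.groups, suggested_re.groups)  # 4 2

# Both patterns extract the same content from the plain sample.
same = original_re.search(script).group("content") == suggested_re.search(script).group("content")
print(same)  # True

# Only the suggested pattern tolerates trailing spaces after the marker.
print(original_re.search(trailing) is None)      # True
print(suggested_re.search(trailing) is not None) # True
```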

This regex is stricter in terms of regex specifications, a bit looser in terms of trailing newlines, and faster in terms of performance.

We may also want to use verbose mode (re.VERBOSE) to make it more readable. Let me know if I can help with this.
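
A verbose-mode version might look like the sketch below. The name `INLINE_META_RE` is hypothetical, and a plain `\n` is used as the line terminator, which is an assumption rather than the pattern from the suggestions above. Note that in `re.VERBOSE` mode an unescaped `#` starts a comment and unescaped spaces are ignored, so the literal `#` is written as `\#` and literal spaces as `[ ]`.

```python
import re

INLINE_META_RE = re.compile(
    r"""
    ^\#[ ]///[ ](?P<type>[a-zA-Z0-9-]+)[ \t]*$\n  # opening marker line with the block type
    (?P<content>
        (?: ^\# (?: | [ ].* ) $\n )+              # comment lines: a bare '#' or '# ' plus text
    )
    ^\#[ ]///[ \t]*$                              # closing marker line
    """,
    re.MULTILINE | re.VERBOSE,
)

# Made-up sample with trailing whitespace on both marker lines.
sample = "# /// script \t\n# x = 1\n#\n# y = 2\n# ///  \nprint()\n"
match = INLINE_META_RE.search(sample)
print(match.group("type"))
print(repr(match.group("content")))
```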

Member

Is there a regular parser (not regexp-based) that we could use? Complex regexes can be a can of worms (not saying that this is the case here, but in general they can be tricky and sometimes lead to very bad performance).

Contributor

> Is there a regular parser (not regexp-based) that we could use? Complex regexes can be a can of worms (not saying that this is the case here, but in general they can be tricky and sometimes lead to very bad performance).

I don't think so. The docs suggest using a regex, but we need to take a look 👀

Contributor Author

@dreadatour thanks for the suggestions, I've updated the regex as you mentioned.
@shcheklein I haven't seen anything else either, but I don't think it should be a problem: the regex only extracts the TOML content from the comment, and the parsing itself is done with a TOML library afterwards, so the regex will not change when we add more things to the TOML.
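
On the question of a non-regex approach, the extraction step is small enough that a line-by-line scan is one possible alternative. The sketch below is only an illustration: `extract_script_block` is a hypothetical helper, not something from the PR or from a library. It is also simpler than the reference regex, since it stops at the first closing `# ///` line and only looks for the `script` block type.

```python
from typing import Optional


def extract_script_block(script: str) -> Optional[str]:
    """Return the TOML text of the first '# /// script' block, or None."""
    collected = None  # stays None until the opening marker is seen
    for line in script.splitlines():
        stripped = line.rstrip()
        if collected is None:
            if stripped == "# /// script":
                collected = []
            continue
        if stripped == "# ///":
            # Closing marker: an empty block is treated as "no metadata".
            return "\n".join(collected) + "\n" if collected else None
        if stripped == "#":
            collected.append("")
        elif line.startswith("# "):
            collected.append(line[2:])
        else:
            # A non-comment line before the closing marker: discard the block.
            collected = None
    return None


sample = (
    '# /// script\n'
    '# requires-python = ">=3.12"\n'
    '#\n'
    '# [tools.datachain.params]\n'
    '# cache = false\n'
    '# ///\n'
    'import sys\n'
)
print(extract_script_block(sample))
```

The returned text can then be handed to a TOML parser, which keeps the same two-stage split described above.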

Contributor

@dreadatour dreadatour left a comment

Looks good to me! Only one comment above about regex 👀

@dreadatour
Contributor

Oh, and the tests fail because of:

ModuleNotFoundError: No module named 'tomllib'

@ilongin
Contributor Author

ilongin commented Feb 13, 2025

> Oh, and the tests fail because of:
>
> ModuleNotFoundError: No module named 'tomllib'

Yeah, tomllib was added in Python 3.11, so this fails on older versions. I'm now using tomli instead, which has the same API and works the same.
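
For reference, a common pattern for supporting both cases is a version-conditional import. This is a sketch of that general pattern, not necessarily what the PR does, and it assumes the third-party `tomli` package is installed on interpreters older than 3.11.

```python
import sys

if sys.version_info >= (3, 11):
    import tomllib  # standard library since Python 3.11
else:
    # Assumption: the "tomli" backport is installed on Python < 3.11.
    import tomli as tomllib

data = tomllib.loads("[tools.datachain.workers]\nnum_workers = 3\n")
print(data["tools"]["datachain"]["workers"]["num_workers"])  # 3
```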

@ilongin ilongin requested a review from shcheklein February 13, 2025 08:30


@dataclass
class ScriptMeta:
Member

@ilongin what is the long term plan for this class?

can we put this information into an existing Query class / table?

Contributor Author

The long-term plan is to use it to parse environment data from the script itself, which would let us get rid of the settings UI panel and make it easier for users to copy and paste code between local and Studio.
I'm not sure what you meant by putting the information into an existing Query class. Did you mean DatasetQuery? I would keep it like this and use it in Catalog when we run the scripts.

@ilongin ilongin requested a review from shcheklein February 14, 2025 10:27
Development

Successfully merging this pull request may close these issues.

Inline project meta
3 participants