Inline script metadata #913

ilongin · 2025-02-10T23:13:34Z

This PR adds a ScriptMeta class that parses inline script metadata (https://packaging.python.org/en/latest/specifications/inline-script-metadata/#inline-script-metadata).

The data we collect from it:

python version
dependencies
number of workers
files
params

With this, we can even avoid having special run script settings in Studio, since users can configure everything in the script itself and seamlessly copy and paste scripts between local and Studio.

Example of script with inline metadata:

# /// script
# requires-python = ">=3.12"
#
# dependencies = [
#   "pandas < 2.1.0",
#   "numpy == 1.26.4"
# ]
#
# [tools.datachain.workers]
# num_workers = 3
#
# [tools.datachain.files]
# image1 = "s3://ldb-public/image1.jpg"
# file1 = "s3://ldb-public/file.pdf"
#
# [tools.datachain.params]
# min_length_sec = 1
# cache = false
#
# ///
import sys
import pandas as pd
print(f"Python version: {sys.version_info}")
print(f"Pandas version: {pd.__version__}")

Example of parsing:

meta = ScriptMeta.parse(script)
print(meta.python_version)
print(meta.get_param("cache"))
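For readers unfamiliar with the format, here is a minimal sketch of how such a block can be extracted and parsed. It follows the reference approach from the inline script metadata specification (a regex to find the comment block, then a TOML parser) and is not the actual ScriptMeta implementation. The function name read_inline_meta is borrowed from the diff in this PR, but its body here is an assumption, and the tomli fallback assumes that package is installed on Python versions older than 3.11.

```python
import re
from typing import Optional

try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # backport with the same API (assumed installed)

# Reference regex from the inline script metadata specification.
REGEX = r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"


def read_inline_meta(script: str) -> Optional[dict]:
    """Return the parsed TOML of the `script` block, or None if there is none."""
    matches = [m for m in re.finditer(REGEX, script) if m.group("type") == "script"]
    if len(matches) > 1:
        raise ValueError("Multiple script blocks found")
    if not matches:
        return None
    # Strip the leading "# " (or bare "#") from every comment line.
    content = "".join(
        line[2:] if line.startswith("# ") else line[1:]
        for line in matches[0].group("content").splitlines(keepends=True)
    )
    return tomllib.loads(content)


example = """\
# /// script
# requires-python = ">=3.12"
#
# [tools.datachain.params]
# cache = false
# ///
print("hello")
"""

meta = read_inline_meta(example)
print(meta["requires-python"])
print(meta["tools"]["datachain"]["params"]["cache"])
```

Because the regex only isolates the comment block, adding new tables or keys to the TOML does not require touching it.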

cloudflare-workers-and-pages · 2025-02-10T23:14:37Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`10460e8`
Status:	✅ Deploy successful!
Preview URL:	https://38850594.datachain-documentation.pages.dev
Branch Preview URL:	https://ilongin-776-inline-script-me.datachain-documentation.pages.dev

codecov · 2025-02-10T23:20:26Z

Codecov Report

Attention: Patch coverage is 91.11111% with 4 lines in your changes missing coverage. Please review.

Project coverage is 87.69%. Comparing base (de171c6) to head (10460e8).

Files with missing lines	Patch %	Lines
src/datachain/script_meta.py	91.11%	3 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #913      +/-   ##
==========================================
+ Coverage   87.66%   87.69%   +0.03%     
==========================================
  Files         130      131       +1     
  Lines       11698    11743      +45     
  Branches     1592     1595       +3     
==========================================
+ Hits        10255    10298      +43     
- Misses       1043     1044       +1     
- Partials      400      401       +1

Flag	Coverage Δ
datachain	`87.61% <91.11%> (+0.03%)`	⬆️

dreadatour · 2025-02-12T02:16:12Z

src/datachain/script_meta.py

+    def read_inline_meta(script: str) -> Optional[dict]:
+        """Converts inline script metadata to dict with all found data"""
+        regex = (
+            r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"


I know this regex was taken from https://packaging.python.org/en/latest/specifications/inline-script-metadata/, but it is only an example there, and I would like to suggest updating it:

Suggested change

r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"

r"(?m)^# \/\/\/ (?P<type>[a-zA-Z0-9-]+)[ \t]*$\s(?P<content>(^#(| .*)$\s)+)^# \/\/\/[ \t]*$"

Escape the / symbol (it is ok in this case, but in general in regex it should be escaped)

Allow trailing spaces (and tabs) in the first and last lines (where /// goes)

Also, I am not really sure about using \s as a newline matcher. I know the $\s combination will only catch a newline, not a space or tab, but [\r\n|\r|\n] looks stricter to me:

Suggested change

r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"

r"(?m)^# \/\/\/ (?P<type>[a-zA-Z0-9-]+)[ \t]*$[\r\n|\r|\n](?P<content>(^#(| .*)$[\r\n|\r|\n])+)^# \/\/\/[ \t]*$"
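One detail worth checking before adopting this form: inside square brackets, | is a literal character rather than an alternation operator, so [\r\n|\r|\n] is a character class that matches exactly one of \r, \n, or |. The short sketch below (standard library only) contrasts it with a non-capturing alternation, (?:\r\n|\r|\n), which is one way to express "CRLF, CR, or LF". Both pattern names are illustrative and not taken from the PR.

```python
import re

# Character class: matches a single character out of \r, \n, or a literal "|".
char_class = re.compile(r"[\r\n|\r|\n]")
# Alternation: matches the two-character CRLF sequence, or a lone CR or LF.
alternation = re.compile(r"(?:\r\n|\r|\n)")

results = {
    "class matches '|'": char_class.fullmatch("|") is not None,
    "class matches CRLF": char_class.fullmatch("\r\n") is not None,
    "alternation matches '|'": alternation.fullmatch("|") is not None,
    "alternation matches CRLF": alternation.fullmatch("\r\n") is not None,
}
for label, value in results.items():
    print(f"{label}: {value}")
```

In the suggested regex the class directly follows $, which in multiline mode only succeeds before \n or at the end of the string, so in practice the class can only ever consume \n there.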

Also, it is a good idea to use non-capturing groups when no capturing is needed, as it will make the regex much faster. For example, in this case I would suggest using (?:| .*) instead of (| .*). One more group can be converted to non-capturing, and the final regex will look like this:

Suggested change

r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"

r"(?m)^# \/\/\/ (?P<type>[a-zA-Z0-9-]+)[ \t]*$[\r\n|\r|\n](?P<content>(?:^#(?:| .*)$[\r\n|\r|\n])+)^# \/\/\/[ \t]*$"

This regex is more strict in terms of regex specifications, a bit more loose in terms of trailing newlines and faster in terms of performance.
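To make the effect of the non-capturing groups concrete, here is a small sketch that compares the specification's regex with a variant that differs only in using (?:...) for the two unnamed groups. It does not measure speed; it only shows that the number of capturing groups drops from four to two while the named results stay the same. The constant names are illustrative.

```python
import re

ORIGINAL = r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
NON_CAPTURING = (
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(?:^#(?:| .*)$\s)+)^# ///$"
)

script = '# /// script\n# requires-python = ">=3.12"\n# ///\n'

original_match = re.compile(ORIGINAL).search(script)
non_capturing_match = re.compile(NON_CAPTURING).search(script)

# .groups on a compiled pattern is the number of capturing groups it defines.
print(re.compile(ORIGINAL).groups, original_match.groupdict())
print(re.compile(NON_CAPTURING).groups, non_capturing_match.groupdict())
```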

Also we may want to use verbose mode (re.VERBOSE) to make it more readable, let me know if I can help with this.
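As an illustration of the re.VERBOSE idea, the sketch below spells out one possible verbose form. It is an assumption about how such a rewrite could look, not code from the PR, and the name INLINE_META is hypothetical. Two things change in verbose mode: literal spaces and # must be escaped or placed in a character class, and since $ in multiline mode only matches before \n (or at the end of the string), the line break is written here as a plain \n.

```python
import re

INLINE_META = re.compile(
    r"""
    ^\#[ ]///[ ]               # opening marker, followed by one space
    (?P<type>[a-zA-Z0-9-]+)    # block type, for example "script"
    [ \t]*$                    # optional trailing spaces or tabs
    \n                         # line break after the opening marker
    (?P<content>
        (?:
            ^\#(?:|[ ].*)$     # a bare comment sign, or one followed by a space and text
            \n
        )+
    )
    ^\#[ ]///[ \t]*$           # closing marker, trailing spaces or tabs allowed
    """,
    re.MULTILINE | re.VERBOSE,
)

script = '# /// script  \n# requires-python = ">=3.12"\n#\n# ///\t\nprint("hi")\n'
match = INLINE_META.search(script)
print(match.group("type"))
print(repr(match.group("content")))
```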

is there a regular parser (not regexp based) that we could use? complex regexes can be a can of worms (not saying that this is the case here, but in general it can be tricky and sometimes leads to very bad performance)

I don't think so; the docs suggest using a regex, but we need to take a look 👀

@dreadatour thanks for the suggestions, I've updated the regex as you mentioned.
@shcheklein I haven't seen anything else either, but I don't think it should be a problem: the regex is only used to extract the TOML content from the comment, and the parsing itself is done with a TOML library afterwards, so the regex will not change when we add more things to the TOML.

dreadatour

Looks good to me! Only one comment above about regex 👀

dreadatour · 2025-02-12T02:25:19Z

Oh, and tests fail because of:

ModuleNotFoundError: No module named 'tomllib'

ilongin · 2025-02-13T08:29:57Z

Oh, and tests fail because of:
ModuleNotFoundError: No module named 'tomllib'

Yes, tomllib was added in Python 3.11, so this fails on older versions. I'm now using tomli instead, which has the same API and works the same way.
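A common way to support both cases is a guarded import, sketched below. This is a general pattern and an assumption about what could be done, not necessarily what the PR does (the commit message only says tomli was added instead of tomllib). It assumes tomli is installed wherever tomllib is missing.

```python
try:
    import tomllib  # part of the standard library since Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport exposing the same loads()/load() API

# Both modules parse a TOML string into plain Python dicts.
config = tomllib.loads("[workers]\nnum_workers = 3\n")
print(config["workers"]["num_workers"])
```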

shcheklein · 2025-02-13T17:41:23Z

src/datachain/script_meta.py

+
+
+@dataclass
+class ScriptMeta:


@ilongin what is the long term plan for this class?

can we put this information into an existing Query class / table?

The long-term plan is to use it to parse environment data from the script itself, so that we can get rid of the settings UI panel and users can copy and paste code more easily between local and Studio.
I'm not sure what you meant by putting the information into the existing Query class. Did you mean DatasetQuery? I would keep it like this and use it in Catalog when we run the scripts.

ilongin added 2 commits February 11, 2025 00:04

added script meta class to parse script meta information from comment

8950762

added comments

54e528c

ilongin linked an issue Feb 10, 2025 that may be closed by this pull request

Inline project meta #776

Open

ilongin added 2 commits February 11, 2025 11:21

adding unit tests

c0c9d0f

fixing lint

aefd129

ilongin requested review from dreadatour, amritghimire and skshetry February 11, 2025 10:27

dreadatour reviewed Feb 12, 2025

dreadatour approved these changes Feb 12, 2025

ilongin added 3 commits February 13, 2025 08:38

Merge branch 'main' into ilongin/776-inline-script-metadata

7116fe3

updated regex

8f8cbba

added tomli instead of using tomllib

10460e8

ilongin requested a review from shcheklein February 13, 2025 08:30

shcheklein reviewed Feb 13, 2025

ilongin requested a review from shcheklein February 14, 2025 10:27
