Inline script metadata #913
Deploying datachain-documentation
Latest commit: 10460e8
Status: ✅ Deploy successful!
Preview URL: https://38850594.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-776-inline-script-me.datachain-documentation.pages.dev
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #913      +/-   ##
==========================================
+ Coverage   87.66%   87.69%   +0.03%
==========================================
  Files         130      131       +1
  Lines       11698    11743      +45
  Branches     1592     1595       +3
==========================================
+ Hits        10255    10298      +43
- Misses       1043     1044       +1
- Partials      400      401       +1
src/datachain/script_meta.py
def read_inline_meta(script: str) -> Optional[dict]:
    """Converts inline script metadata to dict with all found data"""
    regex = (
        r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
I know this regex was taken from https://packaging.python.org/en/latest/specifications/inline-script-metadata/, but it is only an example there, and I would like to suggest updating it:
Current: r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
Suggested: r"(?m)^# \/\/\/ (?P<type>[a-zA-Z0-9-]+)[ \t]*$\s(?P<content>(^#(| .*)$\s)+)^# \/\/\/[ \t]*$"
- Escape the / symbol (it is OK unescaped in this case, but in general it should be escaped in a regex).
- Allow trailing spaces (and tabs) on the first and last lines (where /// goes).
Also, I am not really sure about using \s as a newline matcher. I know the $\s combination will only match a newline, not a space or tab, but [\r\n|\r|\n] looks stricter to me:
Current: r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
Suggested: r"(?m)^# \/\/\/ (?P<type>[a-zA-Z0-9-]+)[ \t]*$[\r\n|\r|\n](?P<content>(^#(| .*)$[\r\n|\r|\n])+)^# \/\/\/[ \t]*$"
It is also a good idea to use non-capturing groups where no capturing is needed, as it will make the regex much faster. For example, in this case I would suggest using (?:| .*) instead of (| .*). One more group can be converted to non-capturing, and the final regex will look like this:
Current: r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
Suggested: r"(?m)^# \/\/\/ (?P<type>[a-zA-Z0-9-]+)[ \t]*$[\r\n|\r|\n](?P<content>(?:^#(?:| .*)$[\r\n|\r|\n])+)^# \/\/\/[ \t]*$"
This regex is stricter in terms of regex specifications, a bit looser in terms of trailing whitespace, and faster in terms of performance.
We may also want to use verbose mode (re.VERBOSE) to make it more readable. Let me know if I can help with this.
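If verbose mode is adopted, the suggested pattern might be laid out roughly as in the sketch below. The name INLINE_META_RE is hypothetical. The sketch also assumes a plain \n for the line break instead of [\r\n|\r|\n]: inside square brackets | is a literal character, so that class would also accept a pipe, and in Python the multiline $ only matches before \n anyway.

```python
import re

# Verbose-mode sketch of the suggested pattern (not the PR's actual code).
# In re.VERBOSE, a literal "#" must be escaped and a literal space must be
# written as "[ ]", otherwise they are treated as comment start / ignored.
INLINE_META_RE = re.compile(
    r"""
    ^\#[ ]///[ ](?P<type>[a-zA-Z0-9-]+)[ \t]*$\n   # opening "# /// <type>" line
    (?P<content>(?:^\#(?:|[ ].*)$\n)+)             # one or more "#" / "# ..." lines
    ^\#[ ]///[ \t]*$                               # closing "# ///" line
    """,
    re.MULTILINE | re.VERBOSE,
)

sample = '# /// script\n# requires-python = ">=3.9"\n# ///\nprint("hi")\n'
match = INLINE_META_RE.search(sample)
```

Here match.group("type") holds the block type and match.group("content") holds the still-commented TOML lines.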
Is there a regular parser (not regex-based) that we could use? Complex regexes can be a can of worms (not saying that this is the case here, but in general they can be tricky and sometimes lead to very bad performance).
> Is there a regular parser (not regex-based) that we could use? Complex regexes can be a can of worms (not saying that this is the case here, but in general they can be tricky and sometimes lead to very bad performance).
I don't think so. The docs suggest using a regex, but we need to take a look 👀
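For comparison, a regex-free approach could be a plain line scanner. The sketch below is only an illustration of that idea, not something taken from the specification or from this PR: find_metadata_block is a hypothetical name, it follows roughly the same block rules as the regex, and it does not check for duplicate blocks.

```python
from typing import Optional


def find_metadata_block(source: str, block_type: str = "script") -> Optional[str]:
    """Return the uncommented text of the first `# /// <block_type>` block, or None."""
    lines = source.splitlines()
    for start, line in enumerate(lines):
        if line.rstrip(" \t") != f"# /// {block_type}":
            continue
        # Collect the run of comment lines ("#" or "# ...") after the opener.
        run = []
        for candidate in lines[start + 1:]:
            if candidate != "#" and not candidate.startswith("# "):
                break
            run.append(candidate)
        # The block closes at the last "# ///" line in that run,
        # and must contain at least one content line before it.
        closers = [i for i, text in enumerate(run) if text.rstrip(" \t") == "# ///"]
        if closers and closers[-1] > 0:
            body = run[: closers[-1]]
            # Drop the leading "# " (a bare "#" becomes an empty line).
            return "\n".join(text[2:] for text in body)
    return None
```

Each line is visited a bounded number of times, so there is no backtracking to worry about, at the cost of a few more lines of code than the one-line pattern.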
@dreadatour thanks for the suggestions, I've updated the regex as you mentioned.
@shcheklein I haven't seen anything else either, but I don't think it should be a problem: the regex is only used to extract the TOML content from the comment, and the parsing is then done with the toml library, so the regex will not change when we add more things to the TOML.
Looks good to me! Only one comment above about regex 👀
Oh, and the tests fail because of:
Yea,
@dataclass
class ScriptMeta:
@ilongin what is the long-term plan for this class?
Can we put this information into an existing Query class / table?
The long-term plan is to use it for parsing env data from the script itself. With that we could get rid of the settings UI panel, and users could copy and paste code more easily between local and Studio.
I'm not sure what you meant by putting the information into an existing Query class. Did you mean DatasetQuery? I would keep it like this and use it in Catalog when we run the scripts.
Adding a class called ScriptMeta that knows how to parse script inline metadata (https://packaging.python.org/en/latest/specifications/inline-script-metadata/#inline-script-metadata). The data that we collect from it are:
With this, we can even avoid having special run script settings in Studio, as the user can do everything in the script codebase and can seamlessly copy and paste the script between local and Studio.
Example of script with inline metadata:
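As an illustration only, a script that follows the linked specification might look like the sketch below. The field values and the main function are assumptions made for this example, and any datachain-specific settings the PR may read are not shown.

```python
# /// script
# requires-python = ">=3.9"
# dependencies = [
#   "datachain",
# ]
# ///

# The block above is ordinary comments to the Python interpreter; tools that
# understand inline script metadata read its content as TOML.


def main() -> str:
    # Hypothetical script body; it runs the same with or without the block.
    return "script body runs as usual"


if __name__ == "__main__":
    print(main())
```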
Example of parsing: