-
Notifications
You must be signed in to change notification settings - Fork 12.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Type Inference for Regular Expressions #60249
base: main
Are you sure you want to change the base?
Conversation
The TypeScript team hasn't accepted the linked issue #50452. If you can get it accepted, this PR will have a better chance of being reviewed. |
1 similar comment
The TypeScript team hasn't accepted the linked issue #50452. If you can get it accepted, this PR will have a better chance of being reviewed. |
OMG it worked great on macOS but one of the new test case file I wrote hit the time constraints on Ubuntu and Windows |
Looks like you're introducing a change to the public API surface area. If this includes breaking changes, please document them on our wiki's API Breaking Changes page. Also, please make sure @DanielRosenwasser and @RyanCavanaugh are aware of the changes, just as a heads up. |
… since the performance issue is not able to be fixed for now.
After fixing the self check, I got a lot of "Type 'RegExp' is not generic" during the build step. |
…uments` for the build step to pass
No, I forgot that this doesn't affect the builder |
…oTypeArguments` for the build step to pass" This reverts commit 99355c0.
…due to self check fix
OK, so all the checks are fixed and have passed, and it signifies the PR is now ready for review. Thank you everyone for the long wait and I apologise I did not mark the PR as draft initially. |
I'll run tests on this but I highly suspect this will be super bad for breaks and perf @typescript-bot test it |
@jakebailey Here are the results of running the user tests with tsc comparing Something interesting changed - please have a look. Details
|
Hey @jakebailey, the results of running the DT tests are ready. There were interesting changes: Branch only errors:Package: dotfile-regex
Package: markdownlint-rule-helpers
Package: node/v16
Package: file-loader
Package: hexo-util
Package: regenerate
Package: snowflake-regex
Package: ember__utils/v3
Package: node/v18
Package: node/v20
Package: node
Package: badwords
Package: camelize
Package: dirname-regex
Package: es6-shim
Package: lapo__asn1js
Package: k6
Package: koa__router
Package: react-mentions
Package: amazon-cognito-auth-js
Package: get-intrinsic
Package: hashtag-regex
Package: hookrouter
Package: hubot-slack
Package: path-regex
Package: dotdir-regex
Package: hubot
Package: re-template-tag
Package: to-regex
Package: sanitize-html
Package: time-limited-regular-expressions
Package: axon
Package: hex-color-regex
Package: package-name-regex
Package: doi-regex
Package: filename-reserved-regex
Package: package-json-validator
Package: string.prototype.matchall
Package: eslint-utils
Package: github-url-from-git
Package: is-regex
Package: oojs-ui
Package: postman-collection
Package: twitter_cldr
Package: underscore
Package: es-abstract
Package: named-regexp-groups
Package: unc-path-regex
Package: regex-not
Package: riot-route
Package: string-replace-loader
Package: uswds__uswds
Package: restify/v4
Package: dasherize
Package: french-badwords-list
Package: simplecrawler
Package: restify/v5
Package: ember/v3
|
@jakebailey Here they are:
tscComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
Developer Information: |
@jakebailey Here are the results of running the top 400 repos with tsc comparing Something interesting changed - please have a look. Details
|
@jakebailey Here are some more interesting changes from running the top 400 repos suite Details
|
@jakebailey Here are some more interesting changes from running the top 400 repos suite Details
|
@jakebailey Here are some more interesting changes from running the top 400 repos suite Details
|
@jakebailey Here are some more interesting changes from running the top 400 repos suite Details
|
@jakebailey Here are some more interesting changes from running the top 400 repos suite Details
|
Very breaky but it’s within my expectation 😞 |
"'\\k' is only available outside character class.": { | ||
"category": "Error", | ||
"code": 1545 | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm, a new message can’t take the place of an old, unused one right?
(TS1513 never actually appears as TS1161 would have been emitted at the same position if a RegExp is unterminated, so replacing the message shouldn’t cause any problems.)
Another possible way to improve performance is to keep all template literals unexpand and try to establish a way to compute type relations of literals with interpolated strings, but it's probably out of the scope of this PR. |
This PR adds type inference support to regular expressions for accurate and fine type checking.
For example, the type of the following regex:
is inferred as (prettified manually):
To back this,
RegExp
is made a generic type and library typings related to it are largely modified. It now automatically takes care of which type doesString#match
return,RegExpExecArray<CapturingGroups, NamedCapturingGroups, Flags>
orRegExpMatchArray<CapturingGroups>
:If you try to access a nonexistent group, an error is now emitted:
On the contrary,
error TS2532: Object is possibly 'undefined'.
is not emitted on thegroups
property since there are named capturing groups inpathRegex
.When the global flag is set on a regex,
RegExpMatchArray
is inferred instead:If you try to call
matchAll
withpathRegex
, an error occurs (partially truncated):This is desired as JS throws a
TypeError
when the argument provided is a non-global regex.Regarding the `ignoreCase` (i) flag
If the
ignoreCase
flag is set on the regex, all of the combinations of uppercase and lowercase characters will be computed if the cross product isn't too large. For example, the type ofis inferred as (prettified manually):
This PR can be considered the follow up of #55600.
Implementation
This implementation is tailored for TypeScript. It doesn't create any additional syntactic nodes at all. Instead, it feeds strings into
RegularExpressionPattern
s (actually arrays) andRegularExpressionPatternUnion
(actuallySet
s) 1 (temporarily named, to be bikeshedded) during scanning inscanRegularExpressionWorker
, stores them by capturing groups and passes them tocheckRegularExpressionLiteral
in the checker.I could have moved the whole
scanRegularExpressionWorker
to the checker, but I chose to keep it in the scanner for easier reviewing and becausescanEscapeSequence
is referenced in the worker function and moving it out creates duplicate code.This is probably not the intended code structure, however I don't think I am the right person to alter the codebase structure largely, which is one of the reason why I keep the changes minimal. (It's still a significant number of lines though)
This PR is a breaking change. In the worst case, a new flag in
tsconfig.json
might be necessary if it's really too breaky.There are currently a few tricky workaround that doesn’t seem acceptable in the TypeScript codebase. For example, some overloads in
es5.d.ts
are redeclared ines2015.symbol.wellknown.d.ts
for them to be prioritized. Nevertheless, things do behave as intended. Besides, there are some underscore types ines5.d.ts
due to #2225 – although they can be eliminated by modularizing the file just like what we have done inesnext.iterator.d.ts
, I chose not to do so for the time being to await for feedback from the TS team.In addition, I have only undertaken a limited number of memory and performance optimizations given my lack of experience in this area. I am therefore seeking assistance from the TS team in this regard.
Blockers
Besides #2225 mentioned above,
#51751 and#45972 are also blockers of this PR.Due to #51751,CapturingGroupsArray
can't be typed[string, ...(string | undefined)[]]
, which causescapturingGroups_0: string
to be missing beforecapturingGroups: (string | undefined)[]
in the type ofStringReplaceCallbackSignature
(with default type parameters).Edit: The workaround to #2225 actually avoided the need to type
CapturingGroupsArray
as{ 0: string; } & (string | undefined)[]
.#45972 causes assignability issues between existing functions and
StringReplaceCallbackSignature
with default type parameters, which happens when RegExp is constructed from other sources, e.g. by its constructor with a string.Fixes
(Is it necessary to separate it out to another PR?)
\k
in character classes whenhasNamedCapturingGroups
istrue
Known issues
\k<foo>
is inferred as the empty string instead of"foo"
in/((?<foo>foo))\k<foo>/
TS1515 is not emitted in/((?<foo>))(?<foo>)/
(essentially the same issue as the above)Unhandled cases & other known issues
They are probably unfixable or are only fixable with high performance and memory cost.
undefined
/(foo)?bar\1/
is inferred asRegExp<["bar" | "barfoo" | "foobar" | "foobarfoo"], ...>
, but in reality"barfoo"
and"foobar"
never matchnew RegExp(/a/, "i")
andnew RegExp(/a/i, "")
does not cause a rescan and alter the type inference between"a"
and"a" | "A"
Type-safe workaround (Proof-of-concept)
"abc"
is not ruled out in/[\q{abc}--\q{abc}]/i
(which should be inferred asRegExp<[never], ...>
) but is ruled out in/[\q{abc}&&\q{abc}]/i
(SeescanClassSetSubExpression
)ignoreCase
flag modification (e.g./(a)(?i:\1)/
and/(a)(?-i:\1)/i
) does not alter the type of backreferences (between"a"
and"a" | "A"
)TODOs
I am not sure if I could accomplish them by myself due to my limit time and knowledge, it's the best if readers/the TypeScript Team could help!
const regexes = [dateTimeRegex];
Points to bikeshed
The names of all variables
The maximum cross product size for template literals (currently 10,000; for non-RegExp it's currently 100,000)
How accurate should the lib be typed, e.g. should the following be included for
RegExp
constructors?Fixes #32098
Closes #50452
Footnotes
I didn't name it
RegularExpressionDisjunction
orRegularExpressionAlternatives
as it's also used as a container for character classes. ↩