Fix case-insensitive set operations #104

JLHwung · 2024-10-15T19:42:27Z

Fixes #103.

This PR applies scf() to each class set operand before performing any set operation.

We also maintain config.modifiersData even when we are not transforming modifiers, as more and more features will depend on this context data.

In this PR we add Unicode case equivalents before computing the intersection / subtraction. The spec applies MaybeSimpleCaseFolding to every character (loosely like toLowerCase), whereas we add the Unicode case equivalents of every character. I think there is an equivalence relationship between the canonical form (e.g. k) and the set of its case equivalents (e.g. {k, K, \u212A}), so the regex behaviour should not change.
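
To illustrate the equivalence argument, here is a minimal sketch. The `EQUIVALENTS` table and all helper names are made up for this example (they are not this package's data or API), and only one equivalence class is modelled.

```javascript
// Hypothetical, hand-written data for a single equivalence class.
// The first entry of each list is treated as the canonical form.
const EQUIVALENTS = new Map([
	['k', ['k', 'K', '\u212A']],
	['K', ['k', 'K', '\u212A']],
	['\u212A', ['k', 'K', '\u212A']],
]);

// Spec-like approach: map a character to its canonical form.
const scf = (ch) => (EQUIVALENTS.has(ch) ? EQUIVALENTS.get(ch)[0] : ch);

// Alternative approach: add every case equivalent of every member.
const addCaseEquivalents = (set) => {
	const result = new Set();
	for (const ch of set) {
		for (const eq of EQUIVALENTS.get(ch) || [ch]) result.add(eq);
	}
	return result;
};

const intersect = (a, b) => new Set([...a].filter((ch) => b.has(ch)));

// Operands of an intersection such as [[k]&&[K]] under the i modifier.
const left = new Set(['k']);
const right = new Set(['K']);

// Intersecting the raw operands loses the match.
const naive = intersect(left, right);
// Canonicalize both operands first.
const viaFolding = intersect(
	new Set([...left].map(scf)),
	new Set([...right].map(scf))
);
// Expand both operands first.
const viaEquivalents = intersect(addCaseEquivalents(left), addCaseEquivalents(right));

// A case-insensitive matcher built from either result accepts the same input.
const matchesIgnoreCase = (set, ch) => [...set].some((member) => scf(member) === scf(ch));
const sameBehaviour = ['k', 'K', '\u212A', 'x'].every(
	(ch) => matchesIgnoreCase(viaFolding, ch) === matchesIgnoreCase(viaEquivalents, ch)
);

console.log(naive.size); // 0
console.log(sameBehaviour); // true
```

The two results differ in size (one member versus three), which is the output-length trade-off discussed in this thread, but they accept the same characters once the match itself ignores case.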

~~The cons is that we are generating longer than necessary output when the matcher set contains both uppercase / lowercase letters within the i modifier.~~

nicolo-ribaudo · 2024-10-30T12:26:05Z

> The cons is that we are generating longer than necessary output when the matcher set contains both uppercase / lowercase letters within the i modifier.

Can we do some sort of minification? That is, if the regexp has the i modifier, we could deduplicate characters that share the same canonical form.
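
A rough sketch of what such a deduplication pass could look like. The `ONE_WAY` table and the function names are hypothetical and only cover a few characters; nothing here is existing regexpu-core code.

```javascript
// Hypothetical one-way folding table (non-canonical form -> canonical form).
const ONE_WAY = new Map([
	['K', 'k'],
	['\u212A', 'k'],
	['A', 'a'],
]);

const canonicalize = (ch) => (ONE_WAY.has(ch) ? ONE_WAY.get(ch) : ch);

// Keep one representative (the canonical form) per equivalence class,
// preserving the order in which classes first appear.
const dedupeByCanonicalForm = (chars) => {
	const seen = new Set();
	const kept = [];
	for (const ch of chars) {
		const canonical = canonicalize(ch);
		if (!seen.has(canonical)) {
			seen.add(canonical);
			kept.push(canonical);
		}
	}
	return kept;
};

console.log(dedupeByCanonicalForm(['k', 'K', '\u212A', 'a', 'A', '1']));
// [ 'k', 'a', '1' ]
```

This only works if we know which member of each class is the canonical one, so it needs a directed mapping rather than an undirected list of equivalents.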

JLHwung · 2024-10-30T16:07:40Z

> Can we do some sort of minification?

To do the minification for all Unicode code points we would need the oneWayMappings defined here:

regexpu-core/scripts/case-mappings.js

Lines 100 to 113 in f745fbe

    const commonMappings = require('@unicode/unicode-16.0.0/Case_Folding/C/code-points.js');
    const simpleMappings = require('@unicode/unicode-16.0.0/Case_Folding/S/code-points.js');
    // We want the `C` mappings in both directions (i.e. `A` should fold to `a`
    // and `a` to `A`), and the `S` mappings in both directions (i.e. `ẞ` should
    // fold to `ß` and `ß` to `ẞ`). Let’s start with the simple case folding (in
    // one direction) first, then filter the set, and then deal with the inverse.
    const oneWayMappings = new Map();
    for (const [from, to] of commonMappings) {
    	oneWayMappings.set(from, to);
    }
    for (const [from, to] of simpleMappings) {
    	oneWayMappings.set(from, to);
    }

which maps uppercase to lowercase (loosely speaking). This mapping is not yet exposed as data. Currently we only have a map from each character to its Unicode case-insensitive equivalents, so we don't know which one is "lowercase" and should therefore be the minification target.

If we expose this mapping, then I can just implement MaybeSimpleCaseFolding, and hopefully the generated regex will always be minimal so we can get rid of the extra minification pass. That will of course increase the code size of this package, as we will have two different sets of mapping data derived from the same Unicode data source. What do you think about this approach?
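
To make the difference between the two data shapes concrete, here is a small sketch. The `[from, to]` pairs are a hand-picked excerpt written inline for the example (the real script reads them from the `@unicode` package), and `simpleCaseFold` and `equivalents` are illustrative names only.

```javascript
// Hand-picked excerpt of [from, to] code point pairs (assumption: same shape
// as the entries iterated over in scripts/case-mappings.js).
const commonMappings = [[0x4B, 0x6B], [0x212A, 0x6B]]; // K -> k, KELVIN SIGN -> k
const simpleMappings = [[0x1E9E, 0xDF]]; // ẞ -> ß

// Directed data: every non-canonical code point points at its canonical form.
const oneWayMappings = new Map([...commonMappings, ...simpleMappings]);

// With directed data, a simple case folding is a plain lookup.
const simpleCaseFold = (cp) => (oneWayMappings.has(cp) ? oneWayMappings.get(cp) : cp);

// Undirected data: every code point lists the other members of its class.
const classes = new Map(); // canonical form -> Set of all members
for (const [from, to] of oneWayMappings) {
	if (!classes.has(to)) classes.set(to, new Set([to]));
	classes.get(to).add(from);
}
const equivalents = new Map();
for (const members of classes.values()) {
	for (const cp of members) {
		equivalents.set(cp, [...members].filter((other) => other !== cp));
	}
}

console.log(simpleCaseFold(0x212A).toString(16)); // 6b
console.log(equivalents.get(0x6B)); // [ 75, 8490 ]
console.log(equivalents.get(0x4B)); // [ 107, 8490 ]
```

Every member of a class looks the same in `equivalents`: nothing in that map marks `0x6B` as the canonical form, which is why the directed mapping would have to be shipped as well.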

In the spec, caseFold refers to mapping an uppercase letter to its lowercase form. Here we are actually adding case equivalents to a given set of characters, i.e. all the characters that map to the same character via scf(). To avoid confusion, rename caseFold to caseEquivalents.

tests/fixtures/unicode-set.js

JLHwung · 2024-10-31T19:00:19Z

This PR is ready for review. I plan to cut a new minor release (because of #98) after this PR gets merged.

JLHwung · 2024-10-31T19:17:48Z

scripts/character-class-escape-sets.js

+
+	ESCAPE_CHARS_UNICODESET_IGNORE_CASE[upper] = {
+		toCode() {
+			return 'UNICODE_IV_SET.clone().remove(' + ESCAPE_CHARS_UNICODESET_IGNORE_CASE[lower].toCode() + ')';


Here we override the toCode prototype method for a much smaller output.
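
The idea can be sketched in isolation. `CodePointSet`, `UNIVERSE` and `without` below are stand-ins invented for this example, not the set library used by the build script, and the universe is limited to 128 code points to keep the output readable.

```javascript
// Hypothetical stand-in for a code point set that can serialize itself to source code.
class CodePointSet {
	constructor(codePoints) {
		this.codePoints = [...codePoints];
	}
	without(other) {
		const excluded = new Set(other.codePoints);
		return new CodePointSet(this.codePoints.filter((cp) => !excluded.has(cp)));
	}
	// Default serialization: spell out every member.
	toCode() {
		return 'new CodePointSet([' + this.codePoints.join(',') + '])';
	}
}

// Pretend universe of 128 code points.
const UNIVERSE = new CodePointSet(Array.from({ length: 128 }, (_, i) => i));
const digits = new CodePointSet([48, 49, 50, 51, 52, 53, 54, 55, 56, 57]);

// Complement serialized with the default method: lists all 118 members.
const verbose = UNIVERSE.without(digits);

// Same set, but toCode is overridden on the instance so that the emitted source
// recomputes the complement from the (small) positive set at load time.
const compact = UNIVERSE.without(digits);
compact.toCode = () => 'UNIVERSE.without(' + digits.toCode() + ')';

console.log(compact.toCode());
console.log(verbose.toCode().length > compact.toCode().length); // true
```

The trade-off is a little work at load time in exchange for emitted data that grows with the small set rather than with its complement.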


mathiasbynens · 2024-11-04T18:07:50Z

Thanks!

JLHwung requested review from nicolo-ribaudo and mathiasbynens October 15, 2024 19:50

JLHwung force-pushed the fix-103 branch from 0931347 to 67c67cd Compare October 30, 2024 15:47

JLHwung added 9 commits October 30, 2024 12:32

fix: expand case foldings before intersection/subtraction

f573e3e

fix: maintain config.modifiersData when we don't transform modifiers

249d867

fix: pass through caseFoldFlags to computeClassStrings

50b21eb

add more test cases

ff95ab9

fix: update the anchor/dot when modifiers are transformed

e9f05a4

add more test cases

e7762a7

build: emit one way mappings to iu-foldings

376a0c0

polish: apply scf() to the class set operand

1df54d4

JLHwung force-pushed the fix-103 branch from a5805c2 to 1df54d4 Compare October 31, 2024 15:50

test: add more test cases

865557f

JLHwung commented Oct 31, 2024

tests/fixtures/unicode-set.js (outdated, resolved)

JLHwung added 2 commits October 31, 2024 14:07

perf: apply scf only in intersection/subtraction

de424f2

fix: apply SCF on unicode escape and wW

e34e50b

fix: generate \D and \S from UNICODE_IV_SET

8111c03

JLHwung commented Oct 31, 2024

JLHwung added 2 commits October 31, 2024 17:15

fix: call scf on character class range and pass through shouldApplySCF to nested class

b65597c

test: remove matches tests for node 6 compat

e12505c

The matches are already tested in unicode-set.js

mathiasbynens reviewed Nov 4, 2024

data/character-class-escape-sets.js (outdated, resolved)

Update data/character-class-escape-sets.js

b246c3c

mathiasbynens reviewed Nov 4, 2024

scripts/case-mappings.js (outdated, resolved)

Update scripts/case-mappings.js

2a197f1

mathiasbynens reviewed Nov 4, 2024

scripts/character-class-escape-sets.js (outdated, resolved)

Update scripts/character-class-escape-sets.js

ebff51e

mathiasbynens approved these changes Nov 4, 2024

JLHwung merged commit 924446a into mathiasbynens:main Nov 21, 2024
4 checks passed

JLHwung deleted the fix-103 branch November 21, 2024 20:12
