[for reference] all work done which is not in original repo #338

Open
wants to merge 2,108 commits into base: master
Conversation

GerHobbelt
Contributor

@GerHobbelt commented Dec 15, 2016

Since a lot has been done and several of these features are tough to 'extract cleanly' into 'simple' patches (they wouldn't be simple anyway), here is the list of differences (features and fixes) in the derived repo:

[to be completed]

Main features

  • full Unicode support (okay, astral codepoints are hairy and only partly supported) in lexer and parser

    • lexer can handle XRegExp \pXXX unicode regex atoms, e.g. \p{Alphabetic}

      • jison auto-expands and re-combines these when used inside regex set expressions in macros, e.g.

        ALPHA                                   [{UNICODE_LETTER}a-zA-Z_]
        

        will be reduced to the equivalent of

        ALPHA                                   [{UNICODE_LETTER}_]
        

        hence you don't need to worry that your regexes will include duplicate characters in regex [...] set expressions.

    • parser rule names can be Unicode identifiers (you're not limited to US ASCII there).
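
      For instance (a hypothetical grammar fragment; rule and token names are invented purely for illustration):

        %token ZAHL EINHEIT

        %%

        größe
            : ZAHL EINHEIT
                { $$ = { wert: $1, einheit: $2 }; }
            ;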

  • lexer macros can be used inside regex set expressions (in other macros and/or lexer rules); the lexer will barf a hairball (i.e. throw an informative error) when a macro cannot be expanded to represent a character set without causing counter-intuitive results, e.g. this is a legal series of lexer macros now:

    ASCII_LETTER                            [a-zA-Z]
    UNICODE_LETTER                          [\p{Alphabetic}{ASCII_LETTER}]
    ALPHA                                   [{UNICODE_LETTER}_]
    DIGIT                                   [\p{Number}]
    WHITESPACE                              [\s\r\n\p{Separator}]
    ALNUM                                   [{ALPHA}{DIGIT}]
    
    NAME                                    [{ALPHA}](?:[{ALNUM}-]*{ALNUM})?
    ID                                      [{ALPHA}]{ALNUM}*
    
  • the parser generator produces optimized parse kernels: any feature you do not use in your grammar (e.g. error rule driven error recovery or @elem location info tracking) is rigorously stripped from the generated parser kernel, producing the fastest possible parser engine.

  • you can define a custom written lexer in the grammar definition file's %lex ... /lex section in case you find the standard lexer too slow for your liking or otherwise insufficient. (This is done by specifying a no-rules lexer with the custom lexer placed in the lexer's trailing action code block; see the sketch below.)
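
    A rough sketch of that setup (hedged: the minimal lexer below is hypothetical and assumes the parser only calls `setInput()`/`lex()` and reads `yytext`; that a `lexer` variable defined in the trailing code block is picked up as the custom lexer is likewise assumed from the description above):

      %lex

      %%

      %%

      // trailing lexer action code block: define the custom lexer by hand.
      var lexer = {
          EOF: 1,
          setInput: function (input, yy) {
              this.yy = yy || {};
              this._input = input || '';
              this.yytext = '';
              return this;
          },
          lex: function () {
              // hypothetical behaviour: emit one CHAR token per input character
              if (!this._input.length) return this.EOF;
              this.yytext = this._input[0];
              this._input = this._input.slice(1);
              return 'CHAR';
          }
      };

      /lex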

  • you can %include action code chunks from external files, in case you find that the action code blurbs obscure the grammar's / lexer's definition. Use this when you have complicated/extensive action code for rules or a large amount of 'trailing code', i.e. code following the %% end-of-ruleset marker.
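
    For example (a rough sketch; the quoting style and file paths are illustrative only, and the included files are hypothetical):

      %%

      expression
          : expression '+' term
              %include 'actions/expression-add.js'
          | term
          ;

      %%

      %include 'support/helper-functions.js'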

  • CLI: -c 2 -- you now have the choice between two different table compression algorithms:

    • mode 2 creates the smallest tables,
    • mode 1 is the one available in 'vanilla jison' and
    • mode 0 is 'no compression whatsoever'
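
    For example (illustrative invocations; the grammar file name is hypothetical):

      jison -c 2 mygrammar.jison     # mode 2: smallest tables
      jison -c 1 mygrammar.jison     # mode 1: 'vanilla jison' compression
      jison -c 0 mygrammar.jison     # mode 0: no table compression at all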

Minor 'Selling Points'

Where is this thing heading?

  • using recast et al. to analyze rule action code so that both parser and lexer can be code-stripped to produce fast parse/lex runs. Currently only the parser gets analyzed (a tad roughly) to strip costly operations from the parser run-time and make it fast/efficient.
  • also note the migration towards using a monorepo: ES6/rollup/etc. is a horror otherwise (GerHobbelt/jison#16: moving towards a babel-style monorepo). This work has now been completed (Oct-Nov 2017: jison-gho releases 0.6.1-200+).

… in the console.log() statements in there: console.log() adds a newline automatically, while the original C code `printf()` does not.
…ter code stripping. Adjusted stripper regexes to fix this.
… the preceding commits: `action === 0` is the error parse state and that one, when it is discovered during error **recovery** in the inner slow parse loop, is handed back to the outer loop to prevent undue code duplication. Handing back means the outer loop will have to process that state, not exit on it immediately!
…reset/cleanup the `recoveringErrorInfo` object as one may invoke `yyerrok` while still inside the error recovery phase of the parser, thus *potentially* causing trouble down the lane for subsequent parse states. (This is another edge case that's hard to produce: better-safe-than-sorry coding style applies.)
…amples/Makefile. Tweak `make superclean` to ensure that we can bootstrap once you've run `make prep` by reverting the jison/dist/ directory after 'supercleaning'.
…about a piece of action code which "does not compile": lexer and parser line tracking yylloc info starts counting at line ONE(1) instead of ZERO(0) hence we do NOT need to compensate when bumping down the action code before parsing/validating it in here.
…mpare the full set of examples' output vs. a given reference. This is basically a 'system test' / 'acceptance test' **test level** that co-exists with the unit tests and integration tests in the tests/ directory: those tests are already partly leaning towards a 'system test' level and that is "polluting" the implied simplicity of unit tests...
…ch is included with every generated parser: this makes those reports easier to understand at a glance.
…ippets and other code blocks. We don't want to do them all, so there's #26
…liver a cleaner info set when custom lexers are involved AND not exhibit side effects such as modifying the provided lexer spec when it comes in native format, i.e. doesn't have to be parsed or JSON.parse()d anymore: we should strive for an overall cleaner interface behaviour, even if that makes some internals a tad more hairy.
… it should always have produced an 'expected set of tokens' in the info hash, whether you're running in an error recovery enabled grammar or a simple (non-error-recovering) grammar.
- DO NOT clean up the old one before we start the new error info track: the old one will *linger* on the error stack and stay alive until we invoke the parser's cleanup API!
- `recoveringErrorInfo` is also part of the `__error_recovery_infos` array, hence has been destroyed already: no need to do that *twice*.
…llback set a la jison parser run-time:

- `fastLex()`: return next match that has a token. Identical to the `lex()` API but does not invoke any of the `pre_lex()` nor any of the `post_lex()` callbacks.
- `canIUse()`: return info about the lexer state that can help a parser or other lexer API user to use the most efficient means available. This API is provided to aid run-time performance for larger systems which employ this lexer.
- now executes all `pre_lex()` and `post_lex()` callbacks provided as
  + member function, i.e. `lexer.pre_lex()` and `lexer.post_lex()`
  + member of the 'shared state' `yy` as passed to the lexer via the `setInput()` API, i.e. `lexer.yy.pre_lex()` and `lexer.yy.post_lex()`
  + member of the lexer options, i.e. `lexer.options.pre_lex()` and `lexer.options.post_lex()`
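
A minimal usage sketch of these APIs (hedged: `lexer` stands for a jison-gho generated lexer instance; the exact hook signatures are assumed from the description above):

```
lexer.setInput('2 + 3', {
    // hooks may also live on the lexer itself or on lexer.options, per the list above:
    pre_lex:  function () { /* runs before each token match */ },
    post_lex: function (token) { return token; }   // may inspect/replace the produced token
});

var t1   = lexer.lex();       // runs the pre_lex()/post_lex() callbacks
var t2   = lexer.fastLex();   // same matching work, but skips all pre_lex()/post_lex() callbacks
var info = lexer.canIUse();   // capability info to help pick the most efficient code path
```
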
…lon rule (which has no location info); add/introduce the `lexer::deriveLocationInfo()` API to help construct a more-or-less useful/sane location info object from the context surrounding it when the requested location info itself is not available.
…comparison` as it will compare more than just the generated codegen parsers' sources...
…e used to reconstruct missing/epsilon location infos. This helps fix crashes observed when reporting some errors that are triggered while parsing epsilon rules, but will also serve other purposes. The important bit here is that it helps prevent crashes inside the lexer's `prettyPrintRange()` API when no or faulty location info object(s) have been passed as parameters: more robust lexer APIs.
…ed according to the internal action + parse kernel analysis. NOTE: the fact that the error reporting/recovery logic checks the **lexer.yylineno** lexer attribute does not count, as that code won't need/touch the internal `yylineno` variable in any way.
# Conflicts:
#	lib/jison-parser-kernel.js
GerHobbelt and others added 30 commits November 17, 2019 20:51
…dn't work as the `parseError` would not propagate into the parser kernel due to the way `shallow_copy_noclobber` worked. This is quite hairy as we depend on its behaviour of NOT overwriting members so that we can use it for yylloc propagation code inside the kernel. With this fix, that functionality should remain unchanged while now anything set in `parser.yy` should make it into the parser kernel *properly* once again.
…ernel into the main source file (see previous commit)
…de a more robust lexer interface:

        // 1) make sure any outside interference is detected ASAP:
        //    these attributes are to be treated as 'const' values
        //    once the lexer has produced them with the token (return value `r`).
        // 2) make sure any subsequent `lex()` API invocation CANNOT
        //    edit the `yytext`, etc. token attributes for the *current*
        //    token, i.e. provide a degree of 'closure safety' so that
        //    code like this:
        //
        //        t1 = lexer.lex();
        //        v = lexer.yytext;
        //        l = lexer.yylloc;
        //        t2 = lexer.lex();
        //        assert(lexer.yytext !== v);
        //        assert(lexer.yylloc !== l);
        //
        //    succeeds. Older (pre-v0.6.5) jison versions did not *guarantee*
        //    these conditions.
        this.yytext = Object.freeze(this.yytext);
        this.matches = Object.freeze(this.matches);
        this.yylloc.range = Object.freeze(this.yylloc.range);
        this.yylloc = Object.freeze(this.yylloc);
# Conflicts:
#	lib/jison.js
#	package-lock.json
#	package.json
#	packages/jison-lex/regexp-lexer.js
#	packages/jison2json/tests/tests.js
# Conflicts:
#	README.md
#	lib/cli.js
#	package.json
…'re going to take a different route towards parsing jison action code as the current approach is a maintenance nightmare. recast is again playing up and I'm getting sick of it all and that never was the goal of this.
added js-sequence-diagrams to demo projects list
…code to (temporarily) turn the jison generated source code into 'regular JavaScript' so we can pull it through standard babel or similar tools. (The previous attempt was to enhance the babel tokenizer and have the jison identifiers processed that way, but given the structure of babel, it meant tracking a slew of large packages, which turned out way too costly.) So we revert to this 'Unicode hack', which employs the JavaScript specification about which Unicode characters are *legal in a JavaScript identifier*.

TODO: Should write a blog/article about this.

Here's the comments from the horse's mouth:

---

Determine which Unicode NonAsciiIdentifierStart characters
are unused in the given sourcecode and provide a mapping array
from given (JISON) start/end identifier character-sequences
to these.

The purpose of this routine is to deliver a reversible
transform from JISON to plain JavaScript for any action
code chunks.

This is the basic building block which helps us convert
jison variables such as `$id`, `$3`, `$-1` ('negative index' reference),
`@id`, `#id`, `#TOK#` to variable names which can be
parsed by a regular JavaScript parser such as esprima or babylon.

```
function generateMapper4JisonGrammarIdentifiers(input) { ... }
```

IMPORTANT: we only want the single-char (BMP) Unicode characters in here
so we can do this transformation at the 'Char' (single UTF-16 code unit) level rather than at the codepoint level.

```
const IdentifierStart = unicode4IdStart.filter((e) => e.codePointAt(0) < 0xFFFF);
```

As we will be 'encoding' the Jison Special characters @ and # into the IDStart Unicode
range to make JavaScript parsers *not* barf a hairball on Jison action code chunks, we
must consider a few things while doing that:

We CAN use an escape system where we replace a single character with multiple characters,
as JavaScript DOES NOT discern between single characters and multi-character strings: anything
between quotes is a string and there's no such thing as C/C++/C#'s `'c'` vs `"c"` which is
*character* 'c' vs *string* 'c'.

As we can safely escape characters, all we need to do is find a character (or set of characters)
which is in the ID_Start range, is expected to be rarely used, and is clearly identifiable
by humans, for ease of debugging the escaped intermediate values.

The escape scheme is simple and borrowed from ancient serial communication protocols and
the JavaScript string spec alike:

- assume the escape character is A
- then if the original input stream includes an A, we output AA
- if the original input includes a character #, which must be escaped, it is encoded/output as A

This is the same as the way the backslash escape in JavaScript strings works and has a minor issue:
sequences of AAA with an odd number of A's CAN occur in the output, which might be a little hard to read.
Those are, however, easily machine-decodable and that's what's most important here.

To help with that AAA... issue AND because we need to escape multiple Jison markers, we choose
a slightly tweaked approach: we are going to use a set of 2-char wide escape codes, where the
first character is fixed and the second character is chosen such that the escape code
DOES NOT occur in the original input -- unless someone has intentionally fed nasty input
to the encoder, as we pick the 2 characters in the escape from 2 utterly different *human languages*
(a small JavaScript sketch of this encode/decode scheme follows the list below):

- the first character is ဩ which is highly visible and allows us to quickly search through a
  source to see if and where there are *any* Jison escapes.
- the second character is taken from the Unicode CANADIAN SYLLABICS range (0x1400-0x1670) as far as
  those are part of ID_Start (0x1401-0x166C or thereabouts) and, unless an attack is attempted against jison,
  we can be pretty sure that this 2-character sequence won't ever occur in real life: even when one
  writes such an escape in the comments to document this system, e.g. 'ဩᐅ', there are still plenty of
  alternatives for the second character left.
- the second character represents the escape type: $-n, $#, #n, @n, #ID#, etc. and each type will
  pick a different base shape from that CANADIAN SYLLABICS charset.
- note that the trailing '#' in Jison's '#TOKEN#' escape will be escaped as a different code to
  signal '#' as a token terminator there.
- meanwhile, only the initial character in the escape needs to be escaped if encountered in the
  original text: ဩ -> ဩဩ, as the 2nd and 3rd characters are only there to *augment* the escape.
  Any CANADIAN SYLLABICS in the original input don't need escaping, as these only have special meaning
  when prefixed with ဩ.
- if the ဩ character is used often in the text, the alternative ℹ இ ண ஐ Ϟ ല ઊ characters MAY be considered
  for the initial escape code, hence we start with analyzing the entire source input to see which
  escapes we'll come up with this time.
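
A small JavaScript sketch of the encode/decode idea described above (hedged: the real implementation picks its escape characters per-source via `generateMapper4JisonGrammarIdentifiers()`; the marker and type character below are just examples):

```
// first (fixed) escape character and an example type character from the CANADIAN SYLLABICS range:
const ESC = 'ဩ';
const TYPE_SINGLE_HASH = 'ᐅ';    // example: the escape code standing in for a single '#'

function encodeMarker(source, marker, typeChar) {
    let out = '';
    for (const ch of source) {
        if (ch === ESC) out += ESC + ESC;              // double any pre-existing escape chars
        else if (ch === marker) out += ESC + typeChar; // replace the Jison marker with a 2-char escape
        else out += ch;
    }
    return out;
}

function decodeMarker(encoded, marker, typeChar) {
    let out = '';
    for (let i = 0; i < encoded.length; i++) {
        if (encoded[i] === ESC && encoded[i + 1] === ESC) { out += ESC; i++; }
        else if (encoded[i] === ESC && encoded[i + 1] === typeChar) { out += marker; i++; }
        else out += encoded[i];
    }
    return out;
}

// round trip: decodeMarker(encodeMarker(src, '#', TYPE_SINGLE_HASH), '#', TYPE_SINGLE_HASH) === src
```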

The basic shapes are:

- 1401-141B:  ᐁ             1
- 142F-1448:  ᐯ             2
- 144C-1465:  ᑌ             3
- 146B-1482:  ᑫ             4
- 1489-14A0:  ᒉ             5
- 14A3-14BA:  ᒣ             6
- 14C0-14CF:  ᓀ
- 14D3-14E9:  ᓓ             7
- 14ED-1504:  ᓭ             8
- 1510-1524:  ᔐ             9
- 1526-153D:  ᔦ
- 1542-154F:  ᕂ
- 1553-155C:  ᕓ
- 155E-1569:  ᕞ
- 15B8-15C3:  ᖸ
- 15DC-15ED:  ᗜ            10
- 15F5-1600:  ᗵ
- 1614-1621:  ᘔ
- 1622-162D:  ᘢ

## JISON identifier formats ##

- direct symbol references, e.g. `#NUMBER#` when there's a `%token NUMBER` for your grammar.
  These represent the token ID number.

  -> (1+2) start-# + end-#

- alias/token value references, e.g. `$token`, `$2`

  -> $ is an accepted starter, so no encoding required

- alias/token location reference, e.g. `@token`, `@2`

  -> (6) single-@

- alias/token id numbers, e.g. `#token`, `#2`

  -> (3) single-#

- alias/token stack indexes, e.g. `##token`, `##2`

  -> (4) double-#

- result value reference `$$`

  -> $ is an accepted starter, so no encoding required

- result location reference `@$`

  -> (6) single-@

- rule id number `#$`

  -> (3) single-#

- result stack index `##$`

  -> (4) double-#

- 'negative index' value references, e.g. `$-2`

  -> (8) single-negative-$

- 'negative index' location reference, e.g. `@-2`

  -> (7) single-negative-@

- 'negative index' stack indexes, e.g. `##-2`

  -> (5) double-negative-#
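
An illustrative action code chunk using several of the marker forms listed above (hedged: the rule and the `yy.*` helper names are hypothetical; this is merely the kind of input the mapper has to make digestible for a plain JavaScript parser):

```
/* hypothetical rule:  expr : expr '+' term ; */
$$ = $expr + $term;            // value references: $name / $2 / $$
@$ = @expr;                    // location references: @name / @$
yy.lastSymbolId  = #term;      // alias/token id number: #name
yy.numberTokenId = #NUMBER#;   // direct symbol reference: #NUMBER# (token ID of `%token NUMBER`)
yy.termStackSlot = ##term;     // stack index: ##name
```
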
# Conflicts:
#	ports/csharp/Jison/Jison/csharp.js
#	ports/php/php.js
#	ports/php/template.php
…a second argument (`options`): cleaning up calling code which assumed as much.

9 participants