Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable EQUS expansion for raw symbols (by lexing them as strings) #1649

Closed
wants to merge 1 commit into from

Conversation

Rangi42
Copy link
Contributor

@Rangi42 Rangi42 commented Feb 5, 2025

Fixes #1605

The goal here is to make #raw string symbols get treated like string literals/expressions instead of getting EQUS-expanded. The long-term goal is to implement that in a way compatible with removing EQUS expansion for all symbols (#905).

This approach works by just adding a new SYM_ID token type, and lexing raw identifiers which would otherwise have been EQUS-expanded as SYM_ID tokens instead. Then the parser has to treat this token type as a string in the right contexts. So now, in addition to the relocexpr_no_str distinction, it needs a scoped_anon_id_no_str distinction.

This PR needs further comparison with the alternative parser-based approach implemented in #1648.

@Rangi42 Rangi42 changed the title Disable EQUS expansion for raw symbols ("the lexer approach") Disable EQUS expansion for raw symbols (by lexing them as strings) Feb 5, 2025
@Rangi42 Rangi42 added enhancement Typically new features; lesser priority than bugs rgbasm This affects RGBASM labels Feb 5, 2025
@Rangi42 Rangi42 added this to the 0.9.2 milestone Feb 5, 2025
@Rangi42 Rangi42 force-pushed the no-raw-lexer branch 7 times, most recently from 5104a5e to fa6c463 Compare February 7, 2025 16:03
@Rangi42
Copy link
Contributor Author

Rangi42 commented Feb 7, 2025

What feels natural to me about this:

  • Lack of repetition. The lexer does one check for whether a symbol is defined as a string or not, and returns different tokens depending on that.
  • Fewer lines of code than the alternative.

The downsides:

  • Reminiscent of the "lexer hack" in C, where an identifier is lexed differently depending on its category from previous context.
  • The parser has a redundant error check for sym && sym->type == SYM_EQUS, even though the lexer whould never return a STR_SYMBOL if that condition were false.
    • This Discord comment has my reasoning why alternatives are worse, although I could be persuaded that assume(condition); do stuff; is better than if (condition) { do stuff; } else { error; }.
  • Affects error messages, which now refer to "string symbol"s separately from "symbol"s.

@Rangi42 Rangi42 requested a review from ISSOtm February 7, 2025 16:13
@Rangi42 Rangi42 marked this pull request as draft February 8, 2025 22:22
@Rangi42
Copy link
Contributor Author

Rangi42 commented Feb 12, 2025

Re: comparisons to the C lexer hack, Wikipedia claims:

Clang's lexer does not attempt to differentiate between type names and variable names: it simply reports the current token as an identifier. The parser then uses Clang's semantic analysis library to determine the nature of the identifier. This allows a simpler and more maintainable architecture than The Lexer Hack. This is also the approach used in most other modern languages, which do not distinguish different classes of identifiers in the lexical grammar, but instead defer them to the parsing or semantic analysis phase, when sufficient information is available.

This "modern" approach sounds similar to the one used in the other PR, #1648.

Copy link
Member

@ISSOtm ISSOtm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spirit of this PR looks good to me, I think I only have bikeshedding-level changes to request. Thanks!

Comment on lines +1108 to 1110
.Dq raw symbol
prefixed by a hash
.Sq # .
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be more explicit that “raw symbol” means “prefixed with a hash”.

I'm also realising that “raw identifier” is a better name, since it might not be referring to a symbol yet? (And identifiers are names for symbols. I'd be fine with calling those “raw names”, if that sounds better than “identifier”.)

Comment on lines +1392 to +1393
Expansion is also disabled for raw string constant symbols (string constant symbols prefixed by a hash
.Sq # ) .
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the vein of my previous comment:

Suggested change
Expansion is also disabled for raw string constant symbols (string constant symbols prefixed by a hash
.Sq # ) .
Also, expansion is
.Sy not
performed if the string symbol's name is preceded by a hash
.Sq # .

@@ -1,7 +1,7 @@
error: def-scoped.asm(10):
syntax error, unexpected local label, expecting symbol
syntax error, unexpected local label, expecting symbol or string symbol
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... this sounds confusing. “Why is it expecting a symbol or a symbol?”

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is one reason why I prefer #1648 as well.

@Rangi42 Rangi42 closed this Feb 12, 2025
@Rangi42 Rangi42 deleted the no-raw-lexer branch February 12, 2025 20:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Typically new features; lesser priority than bugs rgbasm This affects RGBASM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature request] Raw #symbols would not go through EQUS expansion
2 participants