Higher-level parser APIs #132

Merged · 12 commits · Aug 6, 2023
4 changes: 2 additions & 2 deletions .github/workflows/ci.yml
@@ -23,11 +23,11 @@ jobs:
- uses: actions/checkout@v3
with:
fetch-depth: 0 # Shallow clones should be disabled for a better relevancy of SonarCloud analysis
- name: Set up JDK 11 # Needed for sonarscanner
- name: Set up JDK 17 # Needed for sonarscanner
uses: actions/setup-java@v3
with:
distribution: 'microsoft'
java-version: '11'
java-version: '17'
- name: Set up .NET SDK from global.json
uses: actions/setup-dotnet@v3
- name: Restore .NET local tools
46 changes: 27 additions & 19 deletions designs/7.0/parser-api.md
@@ -164,7 +164,7 @@ public static class ParserCompletionStateExtensions

public interface IParser<TChar, T> : IServiceProvider
{
void Run(ref ParserInputReader<TChar> inputReader, ref ParserCompletionState<T> completionState);
void Run(ref ParserInputReader<TChar> input, ref ParserCompletionState<T> completionState);
}
```

@@ -230,7 +230,7 @@ public abstract class ParserStateContext<TChar, T> : ParserStateContext<TChar>
// These members can be overridden by user code.
// Performs any additional resetting logic.
protected virtual void OnReset() {}
protected abstract void Run(ref ParserInputReader<TChar> inputReader, ref ParserCompletionState<T?> completionState);
protected abstract void Run(ref ParserInputReader<TChar> input, ref ParserCompletionState<T?> completionState);
}

public static class ParserStateContext
@@ -271,16 +271,16 @@ public static ParserResult<T> Parse<T>(IParser<char, T> parser, TextReader reade
Semantic analysis (called _post-processing_ in earlier versions of Farkle) is the process of converting a parse tree to an object meaningful for the application. This behavior is controlled by the `ISemanticProvider` interfaces.

```csharp
namespace Farkle.Parser.SemanticAnalysis;
namespace Farkle.Parser.Semantics;

public interface ITransformer<TChar>
public interface ITokenSemanticProvider<TChar>
{
object? Transform(ref ParserState state, TokenSymbolHandle terminal, ReadOnlySpan<TChar> data);
object? Transform(ref ParserState state, TokenSymbolHandle symbol, ReadOnlySpan<TChar> characters);
}

public interface IFuser
public interface IProductionSemanticProvider
{
object? Fuse(ref ParserState state, ProductionHandle production, Span<object?> children);
object? Fuse(ref ParserState state, ProductionHandle production, Span<object?> members);
}

// The legacy 7.x F# codebase used a base IPostProcessor interface
@@ -300,30 +300,33 @@ The following things changed since Farkle 6:
3. `Fuse` accepts a reference to a `ParserState`, allowing stateful fusers; why not?
4. `Fuse` accepts a read-write span of the production's member values, instead of the read-only span of previous versions of Farkle. This allows the fuser to use the span as a temporary buffer; it easily enables certain scenarios, and the buffer is discarded afterwards either way.

> In earlier versions of this document the `ITokenSemanticProvider` and `IProductionSemanticProvider` interfaces were called `ITransformer` and `IFuser` respectively. They were eventually renamed, since historically transformers and fusers in Farkle process tokens and productions of specific kinds, while these interfaces _multiplex_ over the available transformers and fusers.
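
As a concrete illustration of the division of labor between the two interfaces, here is a minimal sketch of a provider that builds a crude parse tree. `ParseTreeProvider` is a hypothetical name, it only implements the two interfaces shown above, and how they relate to the `ISemanticProvider<TChar, T>` interface accepted elsewhere is not shown in this excerpt:

```csharp
// Sketch only: every terminal becomes its lexeme, every production becomes an
// array of its members' semantic values.
public sealed class ParseTreeProvider : ITokenSemanticProvider<char>, IProductionSemanticProvider
{
    public object? Transform(ref ParserState state, TokenSymbolHandle symbol, ReadOnlySpan<char> characters)
    {
        // A real provider would typically branch on the symbol; here we just
        // keep the matched characters.
        return characters.ToString();
    }

    public object? Fuse(ref ParserState state, ProductionHandle production, Span<object?> members)
    {
        // The span may be reused as a scratch buffer afterwards, so copy it.
        return members.ToArray();
    }
}
```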

### Predefined services

The initial release of Farkle 7 will provide the following parser services. All are optional.

#### Getting the grammar of a parser

The `IGrammarProvider` service interface allows getting the grammar of a parser, if the parser is backed by one (it doesn't have to). For simplicity the `Grammar` object also implements that interface.
We define the `IGrammarProvider` interface as an abstraction over the `Farkle.Grammars.Grammar` type. It has methods to get the concrete grammar and to look up a symbol by its special name (allowing the lookup, in the future, to be performed without reading the entire grammar binary blob).

Besides some overall utility, this interface can be returned as a service by parsers that are backed by a Farkle grammar (they don't have to be). For simplicity the `Grammar` object also implements that interface.

```csharp
namespace Farkle.Grammars;

public interface IGrammarProvider
{
Grammar GetGrammar();

EntityHandle GetSymbolFromSpecialName(string specialName, bool throwIfNotFound = false);
}

public abstract class Grammar : IGrammarProvider
{
public Grammar GetGrammar() => this;
Grammar IGrammarProvider.GetGrammar() => this;
}

public static class GrammarProviderExtensions
{
public static Grammar? GetGrammar(this IServiceProvider serviceProvider);
// GetSymbolFromSpecialName is implemented implicitly.
}
```
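
For orientation, the following is a rough sketch of how the `GetGrammar` extension method could behave; this is an assumption for illustration, not the actual implementation:

```csharp
// Hypothetical sketch. IParser<TChar, T> implements IServiceProvider, so a
// parser instance can be passed here directly.
public static class GrammarProviderExtensionsSketch
{
    public static Grammar? GetGrammar(this IServiceProvider serviceProvider)
    {
        // GetService returns null when the parser is not backed by a Farkle grammar.
        var provider = (IGrammarProvider?)serviceProvider.GetService(typeof(IGrammarProvider));
        return provider?.GetGrammar();
    }
}
```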

@@ -384,7 +387,7 @@ TODO: What would happen if we inject a token in the middle of tokenizing? It mig
Farkle 7 defines the following API for tokenizers and the tokens they produce:

```csharp
namespace Farkle.Parser.LexicalAnalysis;
namespace Farkle.Parser.Tokenizers;

// Represents the result of a tokenizer invocation.
public readonly struct TokenizerResult
@@ -407,7 +410,7 @@ public abstract class Tokenizer<TChar>
{
protected Tokenizer();

public abstract bool TryGetNextToken(ref ParserInputReader<TChar> inputReader, ITransformer<TChar> transformer, out TokenizerResult result);
public abstract bool TryGetNextToken(ref ParserInputReader<TChar> input, ITokenSemanticProvider<TChar> semanticProvider, out TokenizerResult result);
}
```
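
To show the shape of a custom tokenizer under the new API, here is a minimal sketch of a subclass that never produces a token; `NoOpTokenizer` is a hypothetical name, and the exact way a successful `TokenizerResult` is constructed is not shown in this excerpt:

```csharp
// Sketch only: demonstrates the override signature, nothing more.
public sealed class NoOpTokenizer : Tokenizer<char>
{
    public override bool TryGetNextToken(ref ParserInputReader<char> input,
        ITokenSemanticProvider<char> semanticProvider, out TokenizerResult result)
    {
        // Returning false with a default result signals that no token could be
        // produced from the characters currently available (assumption).
        result = default;
        return false;
    }
}
```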

@@ -449,15 +452,20 @@ public abstract class CharParser<T> : IParser<char, T>
// information about the failure by parsing an empty string.
// If a grammar has a defective DFA but a valid LR(1) table,
// it is possible to fix the parser by changing its tokenizer.
public abstract bool IsFailing { get; }
public bool IsFailing { get; }

public abstract Grammar GetGrammar();
public Grammar GetGrammar();

// In Farkle 6 these methods were called Change***. The With suffix
// communicates better that the existing instance is not mutated.
public abstract CharParser<TNew> WithSemanticProvider<TNew>(ISemanticProvider<char, TNew> semanticProvider);
public CharParser<TNew> WithSemanticProvider<TNew>(ISemanticProvider<char, TNew> semanticProvider);

// Some semantic providers (such as the generic AST) need access to
// the grammar and there is no other way to provide it. Putting it
// in ParserState becomes tricky.
public CharParser<TNew> WithSemanticProvider<TNew>(Func<IGrammarProvider, ISemanticProvider<char, TNew>> semanticProviderFactory);

public abstract CharParser<T> WithTokenizer(Tokenizer<char> tokenizer);
public CharParser<T> WithTokenizer(Tokenizer<char> tokenizer);
}
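
// Hypothetical usage of the factory overload above; MyAstProvider, MyAst and
// baseParser are illustrative names, not part of this design:
//
//     CharParser<MyAst> astParser = baseParser.WithSemanticProvider(
//         grammarProvider => new MyAstProvider(grammarProvider.GetGrammar()));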

public static class CharParser
54 changes: 21 additions & 33 deletions designs/7.0/tokenizer-api.md
@@ -25,62 +25,50 @@ This arrangement of tokenizers imposes some constraints -like tokenizers not bei
To be created, a tokenizer needs the grammar of the language it will operate on. Farkle 6 had the `TokenizerFactory` class with an abstract `Create` method that accepted a grammar and returned a tokenizer. For Farkle 7 we want to be slightly more flexible, as well as support building chained tokenizers:

```csharp
namespace Farkle.Parser.LexicalAnalysis;

public readonly struct TokenizerFactoryContext
{
public TokenizerFactoryContext(Grammar? grammar);

// These two methods will fail if the grammar in the constructor
// or the ChainedTokenizerBuilder.Build method is null. They will
// never fail when creating a tokenizer for a CharParser since it
// guarantees that a grammar exists.
public Grammar GetGrammar();
public EntityHandle GetSymbolFromSpecialName(string name, bool throwIfNotFound = false);
}
namespace Farkle.Parser.Tokenizers;

public sealed class ChainedTokenizerBuilder<TChar>
{
// A placeholder for the existing tokenizer of a CharParser.
public static ChainedTokenizerBuilder<TChar> Default { get; }

public static ChainedTokenizerBuilder<TChar> Create(Tokenizer<TChar> tokenizer);

public static ChainedTokenizerBuilder<TChar> Create(
Func<TokenizerFactoryContext, Tokenizer<TChar>> tokenizerFactory);
Func<IGrammarProvider, Tokenizer<TChar>> tokenizerFactory);

// Starts the chain with the existing tokenizer of a CharParser.
public static ChainedTokenizerBuilder<TChar> CreateDefault();

// The Append methods are immutable and return a new builder with the new tokenizer appended.
public ChainedTokenizerBuilder<TChar> Append(Tokenizer<TChar> tokenizer);

public ChainedTokenizerBuilder<TChar> Append(
Func<TokenizerFactoryContext, Tokenizer<TChar>> tokenizerFactory);
Func<IGrammarProvider, Tokenizer<TChar>> tokenizerFactory);

public ChainedTokenizerBuilder<TChar> Append(ChainedTokenizerBuilder<TChar> builder);

public ChainedTokenizerBuilder<TChar> AppendDefault();

// If grammar is null, TokenizerFactoryContext.GetGrammar will throw in the tokenizer factories.
// If grammar is null, IGrammarProvider.GetGrammar will throw in the tokenizer factories.
// If defaultTokenizer is null, using ChainedTokenizerBuilder.Default in the chain will throw.
public Tokenizer<TChar> Build(Grammar? grammar = null, Tokenizer<TChar>? defaultTokenizer = null);
public Tokenizer<TChar> Build(IGrammarProvider? grammar = null, Tokenizer<TChar>? defaultTokenizer = null);
}

namespace Farkle;

public abstract partial class CharParser<T>
{
// Already defined at parser-api.md:
// public abstract CharParser<T> WithTokenizer(Tokenizer<T> tokenizer);
// public CharParser<T> WithTokenizer(Tokenizer<T> tokenizer);

public abstract CharParser<T> WithTokenizer(
Func<TokenizerFactoryContext, Tokenizer<T>> tokenizerFactory);
public CharParser<T> WithTokenizer(
Func<IGrammarProvider, Tokenizer<T>> tokenizerFactory);

public abstract CharParser<T> WithTokenizer(ChainedTokenizerBuilder<T> builder);
public CharParser<T> WithTokenizer(ChainedTokenizerBuilder<T> builder);
}
```

Besides simple tokenizer objects, the tokenizer of a `CharParser` can be changed by providing a _tokenizer factory_ or a _chained tokenizer builder_.

A tokenizer factory is a delegate that accepts a `TokenizerFactoryContext` and returns a tokenizer. We use `TokenizerFactoryContext` instead of just `Grammar` to allow in the future looking up the special names without depending on the entire grammar API.
A tokenizer factory is a delegate that accepts an `IGrammarProvider` and returns a tokenizer. We use `IGrammarProvider` instead of just `Grammar` to allow looking up the special names in the future without depending on the entire grammar API.

A chained tokenizer builder builds a chain of tokenizers from the start to the end and can be either passed to a `CharParser` or used standalone. Each component of a chained tokenizer builder can be a tokenizer, a tokenizer factory or another chained tokenizer builder. The `Default` property of `ChainedTokenizerBuilder` is a builder that starts with the existing tokenizer of a `CharParser` as its only component. The `AppendDefault` method appends that default tokenizer to the chain.
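
For illustration, a chain consisting of a custom tokenizer followed by the parser's existing one could be assembled as sketched below; `CustomCommentTokenizer`, `myParser`, `MyResult`, `someGrammar` and `someDefaultTokenizer` are placeholder names, not part of this design:

```csharp
// Start the chain with a tokenizer built from the grammar, then append the
// "default" slot that a CharParser will fill with its existing tokenizer.
ChainedTokenizerBuilder<char> chain = ChainedTokenizerBuilder<char>
    .Create(grammar => new CustomCommentTokenizer(grammar.GetGrammar()))
    .AppendDefault();

// Either hand the builder to a parser…
CharParser<MyResult> chained = myParser.WithTokenizer(chain);

// …or build a standalone tokenizer, supplying the missing pieces explicitly.
Tokenizer<char> standalone = chain.Build(someGrammar, someDefaultTokenizer);
```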

@@ -89,18 +77,18 @@ A chained tokenizer builder builds a chain of tokenizers from the start to the e
We will provide the following APIs to support suspending the tokenization process:

```csharp
namespace Farkle.Parser.LexicalAnalysis;
namespace Farkle.Parser.Tokenizers;

public interface ITokenizerResumptionPoint<TChar, in TArg>
{
bool TryGetNextToken(ref ParserInputReader<TChar> inputReader, ITransformer<TChar> transformer, TArg arg, out TokenizerResult token);
bool TryGetNextToken(ref ParserInputReader<TChar> input, ITransformer<TChar> transformer, TArg arg, out TokenizerResult token);
}

public static class TokenizerExtensions
{
public static void SuspendTokenizer<TChar>(this ref ParserInputReader<TChar> inputReader,
public static void SuspendTokenizer<TChar>(this ref ParserInputReader<TChar> input,
Tokenizer<TChar> tokenizer);
public static void SuspendTokenizer<TChar, TArg>(this ref ParserInputReader<TChar> inputReader,
public static void SuspendTokenizer<TChar, TArg>(this ref ParserInputReader<TChar> input,
ITokenizerResumptionPoint<TChar, TArg> suspensionPoint, TArg argument);
}
```
@@ -115,7 +103,7 @@ The arguments to the `SuspendTokenizer` methods determine where the chain will c
public class MyTokenizer : Tokenizer<char>, ITokenizerResumptionPoint<char, MyTokenizer.Case1Args>,
ITokenizerResumptionPoint<char, MyTokenizer.Case2Args>
{
public override bool TryGetNextToken(ref ParserInputReader<char> inputReader,
public override bool TryGetNextToken(ref ParserInputReader<char> input,
ITransformer<char> transformer, out TokenizerResult token)
{
if (/* case 1 */)
Expand All @@ -136,14 +124,14 @@ public class MyTokenizer : Tokenizer<char>, ITokenizerResumptionPoint<char, MyTo
}
}

bool ITokenizerResumptionPoint<char, Case1Args>.TryGetNextToken(ref ParserInputReader<char> inputReader,
bool ITokenizerResumptionPoint<char, Case1Args>.TryGetNextToken(ref ParserInputReader<char> input,
ITransformer<char> transformer, Case1Args arg, out TokenizerResult token)
{
// Case 1 resumes here with more characters.
// …
}

bool ITokenizerResumptionPoint<char, Case2Args>.TryGetNextToken(ref ParserInputReader<char> inputReader,
bool ITokenizerResumptionPoint<char, Case2Args>.TryGetNextToken(ref ParserInputReader<char> input,
ITransformer<char> transformer, Case2Args arg, out TokenizerResult token)
{
// Case 2 resumes here with more characters.
@@ -168,7 +156,7 @@ Another way to avoid the indirection is to add the following API to `ParserInput
```csharp
public static class TokenizerExtensions
{
public static void ProcessSuspendedTokenizer<TChar>(this ref ParserInputReader<TChar> inputReader);
public static void ProcessSuspendedTokenizer<TChar>(this ref ParserInputReader<TChar> input);
}
```
