Commit 84ec339

leebyron and andimarek authored
RFC: Allow full unicode range (#849)
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)

Co-authored-by: Andreas Marek <[email protected]>
1 parent 00b88f0 commit 84ec339

5 files changed, +164 -66 lines changed


Diff for: build.sh

+2 -2
@@ -7,13 +7,13 @@ GITTAG=$(git tag --points-at HEAD)
 # Build the specification draft document
 echo "Building spec draft"
 mkdir -p public/draft
-spec-md --githubSource "https://github.com/graphql/graphql-spec/blame/main/" spec/GraphQL.md > public/draft/index.html
+spec-md --metadata spec/metadata.json --githubSource "https://github.com/graphql/graphql-spec/blame/main/" spec/GraphQL.md > public/draft/index.html

 # If this is a tagged commit, also build the release document
 if [ -n "$GITTAG" ]; then
   echo "Building spec release $GITTAG"
   mkdir -p "public/$GITTAG"
-  spec-md --githubSource "https://github.com/graphql/graphql-spec/blame/$GITTAG/" spec/GraphQL.md > "public/$GITTAG/index.html"
+  spec-md --metadata spec/metadata.json --githubSource "https://github.com/graphql/graphql-spec/blame/$GITTAG/" spec/GraphQL.md > "public/$GITTAG/index.html"
 fi

 # Create the index file

Diff for: package.json

+1 -1
@@ -14,7 +14,7 @@
   },
   "scripts": {
     "test": "npm run test:build && npm run test:spellcheck",
-    "test:build": "spec-md spec/GraphQL.md > /dev/null",
+    "test:build": "spec-md --metadata spec/metadata.json spec/GraphQL.md > /dev/null",
     "test:spellcheck": "cspell 'spec/**/*.md' README.md",
     "format": "prettier --write '**/*.{md,yml,yaml,json}'",
     "format:check": "prettier --check '**/*.{md,yml,yaml,json}'",

Diff for: spec/Appendix B -- Grammar Summary.md

+11 -7
@@ -2,12 +2,7 @@

 ## Source Text

-SourceCharacter ::
-
-- "U+0009"
-- "U+000A"
-- "U+000D"
-- "U+0020–U+FFFF"
+SourceCharacter :: "Any Unicode scalar value"

 ## Ignored Tokens

@@ -113,7 +108,16 @@ StringCharacter ::
 - `\u` EscapedUnicode
 - `\` EscapedCharacter

-EscapedUnicode :: /[0-9A-Fa-f]{4}/
+EscapedUnicode ::
+
+- `{` HexDigit+ `}`
+- HexDigit HexDigit HexDigit HexDigit
+
+HexDigit :: one of
+
+- `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+- `A` `B` `C` `D` `E` `F`
+- `a` `b` `c` `d` `e` `f`

 EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`
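The revised EscapedUnicode grammar above has two alternatives: a braced, variable-width run of hex digits and the original fixed-width four digits. A minimal Python sketch of recognizing either form right after a `\u` prefix might look like the following. The function name `read_escaped_unicode` and its `(value, end)` return shape are assumptions made for illustration and are not taken from the spec or from any GraphQL library. The sketch covers the grammar only; range checks belong to the static semantics in Section 2.

```python
from typing import Tuple

# HexDigit :: one of 0-9, A-F, a-f (ASCII only, as in the grammar).
HEX_DIGITS = frozenset("0123456789ABCDEFabcdef")


def read_escaped_unicode(source: str, start: int) -> Tuple[int, int]:
    """Read an EscapedUnicode that begins at source[start].

    `start` is assumed to point just past the `\\u` prefix. Returns the
    encoded number and the index of the first character after the escape.
    Raises ValueError when neither grammar alternative matches.
    """
    if source.startswith("{", start):
        # Variable-width alternative: `{` HexDigit+ `}`
        end = start + 1
        while end < len(source) and source[end] in HEX_DIGITS:
            end += 1
        if end == start + 1 or not source.startswith("}", end):
            raise ValueError(f"Invalid Unicode escape at index {start}")
        return int(source[start + 1:end], 16), end + 1

    # Fixed-width alternative: HexDigit HexDigit HexDigit HexDigit
    digits = source[start:start + 4]
    if len(digits) != 4 or any(ch not in HEX_DIGITS for ch in digits):
        raise ValueError(f"Invalid Unicode escape at index {start}")
    return int(digits, 16), start + 4
```

With this sketch, `read_escaped_unicode("{1F37A}", 0)` yields the number 0x1F37A and the index 7, while an empty `{}` or a non-hex digit raises `ValueError`.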

Diff for: spec/Section 2 -- Language.md

+135 -56
@@ -45,32 +45,22 @@ match, however some lookahead restrictions include additional constraints.

 ## Source Text

-SourceCharacter ::
+SourceCharacter :: "Any Unicode scalar value"

-- "U+0009"
-- "U+000A"
-- "U+000D"
-- "U+0020–U+FFFF"
+GraphQL documents are interpreted from a source text, which is a sequence of
+{SourceCharacter}, each {SourceCharacter} being a _Unicode scalar value_ which
+may be any Unicode code point from U+0000 to U+D7FF or U+E000 to U+10FFFF
+(informally referred to as _"characters"_ through most of this specification).

-GraphQL documents are expressed as a sequence of
-[Unicode](https://unicode.org/standard/standard.html) code points (informally
-referred to as _"characters"_ through most of this specification). However, with
-few exceptions, most of GraphQL is expressed only in the original non-control
-ASCII range so as to be as widely compatible with as many existing tools,
-languages, and serialization formats as possible and avoid display issues in
-text editors and source control.
+A GraphQL document may be expressed only in the ASCII range to be as widely
+compatible with as many existing tools, languages, and serialization formats as
+possible and avoid display issues in text editors and source control. Non-ASCII
+Unicode scalar values may appear within {StringValue} and {Comment}.

-Note: Non-ASCII Unicode characters may appear freely within {StringValue} and
-{Comment} portions of GraphQL.
-
-### Unicode
-
-UnicodeBOM :: "Byte Order Mark (U+FEFF)"
-
-The "Byte Order Mark" is a special Unicode character which may appear at the
-beginning of a file containing Unicode which programs may use to determine the
-fact that the text stream is Unicode, what endianness the text stream is in, and
-which of several Unicode encodings to interpret.
+Note: An implementation which uses _UTF-16_ to represent GraphQL documents in
+memory (for example, JavaScript or Java) may encounter a _surrogate pair_. This
+encodes one _supplementary code point_ and is a single valid source character,
+however an unpaired _surrogate code point_ is not a valid source character.

 ### White Space

@@ -115,10 +105,9 @@ CommentChar :: SourceCharacter but not LineTerminator
 GraphQL source documents may contain single-line comments, starting with the
 {`#`} marker.

-A comment can contain any Unicode code point in {SourceCharacter} except
-{LineTerminator} so a comment always consists of all code points starting with
-the {`#`} character up to but not including the {LineTerminator} (or end of the
-source).
+A comment may contain any {SourceCharacter} except {LineTerminator} so a comment
+always consists of all {SourceCharacter} starting with the {`#`} character up to
+but not including the {LineTerminator} (or end of the source).

 Comments are {Ignored} like white space and may appear after any token, or
 before a {LineTerminator}, and have no significance to the semantic meaning of a
@@ -175,6 +164,16 @@ significant way, for example a {StringValue} may contain white space characters.
 No {Ignored} may appear _within_ a {Token}, for example no white space
 characters are permitted between the characters defining a {FloatValue}.

+**Byte order mark**
+
+UnicodeBOM :: "Byte Order Mark (U+FEFF)"
+
+The _Byte Order Mark_ is a special Unicode code point which may appear at the
+beginning of a file which programs may use to determine the fact that the text
+stream is Unicode, and what specific encoding has been used. As files are often
+concatenated, a _Byte Order Mark_ may appear before or after any lexical token
+and is {Ignored}.
+
 ### Punctuators

 Punctuator :: one of ! $ & ( ) ... : = @ [ ] { | }
@@ -812,7 +811,16 @@ StringCharacter ::
 - `\u` EscapedUnicode
 - `\` EscapedCharacter

-EscapedUnicode :: /[0-9A-Fa-f]{4}/
+EscapedUnicode ::
+
+- `{` HexDigit+ `}`
+- HexDigit HexDigit HexDigit HexDigit
+
+HexDigit :: one of
+
+- `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+- `A` `B` `C` `D` `E` `F`
+- `a` `b` `c` `d` `e` `f`

 EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`

@@ -821,19 +829,57 @@ BlockStringCharacter ::
 - SourceCharacter but not `"""` or `\"""`
 - `\"""`

-Strings are sequences of characters wrapped in quotation marks (U+0022). (ex.
-{`"Hello World"`}). White space and other otherwise-ignored characters are
-significant within a string value.
+A {StringValue} is evaluated to a _Unicode text_ value, a sequence of _Unicode
+scalar value_, by interpreting all escape sequences using the static semantics
+defined below. White space and other characters ignored between lexical tokens
+are significant within a string value.

 The empty string {`""`} must not be followed by another {`"`} otherwise it would
 be interpreted as the beginning of a block string. As an example, the source
 {`""""""`} can only be interpreted as a single empty block string and not three
 empty strings.

-Non-ASCII Unicode characters are allowed within single-quoted strings. Since
-{SourceCharacter} must not contain some ASCII control characters, escape
-sequences must be used to represent these characters. The {`\`}, {`"`}
-characters also must be escaped. All other escape sequences are optional.
+**Escape Sequences**
+
+In a single-quoted {StringValue}, any _Unicode scalar value_ may be expressed
+using an escape sequence. GraphQL strings allow both C-style escape sequences
+(for example `\n`) and two forms of Unicode escape sequences: one with a
+fixed-width of 4 hexadecimal digits (for example `\u000A`) and one with a
+variable-width most useful for representing a _supplementary character_ such as
+an Emoji (for example `\u{1F4A9}`).
+
+The hexadecimal number encoded by a Unicode escape sequence must describe a
+_Unicode scalar value_, otherwise must result in a parse error. For example both
+sources `"\uDEAD"` and `"\u{110000}"` should not be considered valid
+{StringValue}.
+
+Escape sequences are only meaningful within a single-quoted string. Within a
+block string, they are simply that sequence of characters (for example
+`"""\n"""` represents the _Unicode text_ [U+005C, U+006E]). Within a comment an
+escape sequence is not a significant sequence of characters. They may not appear
+elsewhere in a GraphQL document.
+
+Since {StringCharacter} must not contain some code points directly (for example,
+a {LineTerminator}), escape sequences must be used to represent them. All other
+escape sequences are optional and unescaped non-ASCII Unicode characters are
+allowed within strings. If using GraphQL within a system which only supports
+ASCII, then escape sequences may be used to represent all Unicode characters
+outside of the ASCII range.
+
+For legacy reasons, a _supplementary character_ may be escaped by two
+fixed-width unicode escape sequences forming a _surrogate pair_. For example the
+input `"\uD83D\uDCA9"` is a valid {StringValue} which represents the same
+_Unicode text_ as `"\u{1F4A9}"`. While this legacy form is allowed, it should be
+avoided as a variable-width unicode escape sequence is a clearer way to encode
+such code points.
+
+When producing a {StringValue}, implementations should use escape sequences to
+represent non-printable control characters (U+0000 to U+001F and U+007F to
+U+009F). Other escape sequences are not necessary, however an implementation may
+use escape sequences to represent any other range of code points (for example,
+when producing ASCII-only output). If an implementation chooses to escape a
+_supplementary character_, it should only use a variable-width unicode escape
+sequence.

 **Block Strings**
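The last added paragraph in the hunk above gives guidance for producing a {StringValue}: escape control characters, optionally escape other ranges, and use only the variable-width form for supplementary characters. A rough Python sketch of a printer following that guidance is shown here. The function name `print_string_value` and the `ascii_only` flag are hypothetical, and the input is assumed to already be a sequence of Unicode scalar values; this is not the behaviour of any particular GraphQL implementation.

```python
# Characters with a dedicated C-style escape (a subset of the EscapedCharacter
# table; `/` never needs escaping on output).
SIMPLE_OUTPUT_ESCAPES = {
    '"': '\\"',
    "\\": "\\\\",
    "\b": "\\b",
    "\f": "\\f",
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
}


def print_string_value(text: str, ascii_only: bool = False) -> str:
    """Serialize text as a single-quoted GraphQL StringValue (illustrative)."""
    parts = ['"']
    for ch in text:
        code_point = ord(ch)
        is_control = code_point <= 0x1F or 0x7F <= code_point <= 0x9F
        if ch in SIMPLE_OUTPUT_ESCAPES:
            parts.append(SIMPLE_OUTPUT_ESCAPES[ch])
        elif is_control or (ascii_only and code_point > 0x7F):
            if code_point > 0xFFFF:
                # Supplementary character: variable-width form only,
                # never a surrogate pair of fixed-width escapes.
                parts.append("\\u{%X}" % code_point)
            else:
                parts.append("\\u%04X" % code_point)
        else:
            parts.append(ch)
    parts.append('"')
    return "".join(parts)
```

By default only control characters, the quote, and the backslash are escaped; passing `ascii_only=True` additionally escapes every code point above U+007F, which corresponds to the "ASCII-only output" case mentioned in the text.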

@@ -889,51 +935,84 @@ Note: If non-printable ASCII characters are needed in a string value, a standard
 quoted string with appropriate escape sequences must be used instead of a block
 string.

-**Semantics**
+**Static Semantics**
+
+:: A {StringValue} describes a _Unicode text_ value, which is a sequence of
+_Unicode scalar value_.
+
+These semantics describe how to apply the {StringValue} grammar to a source text
+to evaluate a _Unicode text_. Errors encountered during this evaluation are
+considered a failure to apply the {StringValue} grammar to a source and must
+result in a parsing error.

 StringValue :: `""`

 - Return an empty sequence.

 StringValue :: `"` StringCharacter+ `"`

-- Return the sequence of all {StringCharacter} code points.
+- Return the _Unicode text_ by concatenating the evaluation of all
+  {StringCharacter}.

 StringCharacter :: SourceCharacter but not `"` or `\` or LineTerminator

-- Return the code point {SourceCharacter}.
+- Return the _Unicode scalar value_ {SourceCharacter}.

 StringCharacter :: `\u` EscapedUnicode

-- Let {value} be the 16-bit hexadecimal value represented by the sequence of
-  hexadecimal digits within {EscapedUnicode}.
-- Return the code point {value}.
+- Let {value} be the hexadecimal value represented by the sequence of {HexDigit}
+  within {EscapedUnicode}.
+- Assert {value} is a within the _Unicode scalar value_ range (>= 0x0000 and <=
+  0xD7FF or >= 0xE000 and <= 0x10FFFF).
+- Return the _Unicode scalar value_ {value}.
+
+StringCharacter :: `\u` HexDigit HexDigit HexDigit HexDigit `\u` HexDigit
+HexDigit HexDigit HexDigit
+
+- Let {leadingValue} be the hexadecimal value represented by the first sequence
+  of {HexDigit}.
+- Let {trailingValue} be the hexadecimal value represented by the second
+  sequence of {HexDigit}.
+- If {leadingValue} is >= 0xD800 and <= 0xDBFF (a _Leading Surrogate_):
+  - Assert {trailingValue} is >= 0xDC00 and <= 0xDFFF (a _Trailing Surrogate_).
+  - Return ({leadingValue} - 0xD800) × 0x400 + ({trailingValue} - 0xDC00) +
+    0x10000.
+- Otherwise:
+  - Assert {leadingValue} is within the _Unicode scalar value_ range.
+  - Assert {trailingValue} is within the _Unicode scalar value_ range.
+  - Return the sequence of the _Unicode scalar value_ {leadingValue} followed by
+    the _Unicode scalar value_ {trailingValue}.
+
+Note: If both escape sequences encode a _Unicode scalar value_, then this
+semantic is identical to applying the prior semantic on each fixed-width escape
+sequence. A variable-width escape sequence must only encode a _Unicode scalar
+value_.

 StringCharacter :: `\` EscapedCharacter

-- Return the code point represented by {EscapedCharacter} according to the table
-  below.
+- Return the _Unicode scalar value_ represented by {EscapedCharacter} according
+  to the table below.

-| Escaped Character | Code Point | Character Name               |
-| ----------------- | ---------- | ---------------------------- |
-| {`"`}             | U+0022     | double quote                 |
-| {`\`}             | U+005C     | reverse solidus (back slash) |
-| {`/`}             | U+002F     | solidus (forward slash)      |
-| {`b`}             | U+0008     | backspace                    |
-| {`f`}             | U+000C     | form feed                    |
-| {`n`}             | U+000A     | line feed (new line)         |
-| {`r`}             | U+000D     | carriage return              |
-| {`t`}             | U+0009     | horizontal tab               |
+| Escaped Character | Scalar Value | Character Name               |
+| ----------------- | ------------ | ---------------------------- |
+| {`"`}             | U+0022       | double quote                 |
+| {`\`}             | U+005C       | reverse solidus (back slash) |
+| {`/`}             | U+002F       | solidus (forward slash)      |
+| {`b`}             | U+0008       | backspace                    |
+| {`f`}             | U+000C       | form feed                    |
+| {`n`}             | U+000A       | line feed (new line)         |
+| {`r`}             | U+000D       | carriage return              |
+| {`t`}             | U+0009       | horizontal tab               |

 StringValue :: `"""` BlockStringCharacter\* `"""`

-- Let {rawValue} be the Unicode character sequence of all {BlockStringCharacter}
-  Unicode character values (which may be an empty sequence).
+- Let {rawValue} be the _Unicode text_ by concatenating the evaluation of all
+  {BlockStringCharacter} (which may be an empty sequence).
 - Return the result of {BlockStringValue(rawValue)}.

 BlockStringCharacter :: SourceCharacter but not `"""` or `\"""`

-- Return the character value of {SourceCharacter}.
+- Return the _Unicode scalar value_ {SourceCharacter}.

 BlockStringCharacter :: `\"""`
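The static semantics added in this file can be sketched in Python as below, assuming the input is the text between the quotes of a single-quoted {StringValue}. The names `is_scalar_value` and `evaluate_string_body` are hypothetical, and the sketch uses Python's `re` module for brevity rather than a hand-written lexer, so it should be read as an illustration of the rules rather than a reference implementation. For the legacy pair `\uD83D\uDCA9`, the surrogate arithmetic gives (0xD83D - 0xD800) × 0x400 + (0xDCA9 - 0xDC00) + 0x10000 = 0xF400 + 0xA9 + 0x10000 = 0x1F4A9.

```python
import re

# Values from the EscapedCharacter table.
SIMPLE_ESCAPES = {
    '"': '"', "\\": "\\", "/": "/", "b": "\b",
    "f": "\f", "n": "\n", "r": "\r", "t": "\t",
}
FIXED = re.compile(r"\\u([0-9A-Fa-f]{4})")        # \uXXXX
VARIABLE = re.compile(r"\\u\{([0-9A-Fa-f]+)\}")   # \u{X...}


def is_scalar_value(value: int) -> bool:
    """True for U+0000..U+D7FF and U+E000..U+10FFFF."""
    return 0x0000 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF


def evaluate_string_body(body: str) -> str:
    """Evaluate escape sequences; raise ValueError where the spec asserts."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            # SourceCharacter but not `"` or `\` or LineTerminator.
            if ch in '"\n\r' or not is_scalar_value(ord(ch)):
                raise ValueError(f"Invalid character at index {i}")
            out.append(ch)
            i += 1
            continue
        variable = VARIABLE.match(body, i)
        fixed = FIXED.match(body, i)
        if variable:
            value, end = int(variable.group(1), 16), variable.end()
        elif fixed:
            value, end = int(fixed.group(1), 16), fixed.end()
            if 0xD800 <= value <= 0xDBFF:
                # Leading surrogate: a trailing surrogate escape must follow.
                trailing = FIXED.match(body, end)
                trailing_value = int(trailing.group(1), 16) if trailing else -1
                if not 0xDC00 <= trailing_value <= 0xDFFF:
                    raise ValueError(f"Unpaired surrogate at index {i}")
                value = (value - 0xD800) * 0x400 + (trailing_value - 0xDC00) + 0x10000
                end = trailing.end()
        elif i + 1 < len(body) and body[i + 1] in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[body[i + 1]])
            i += 2
            continue
        else:
            raise ValueError(f"Invalid escape sequence at index {i}")
        if not is_scalar_value(value):
            raise ValueError(f"Escape at index {i} is not a Unicode scalar value")
        out.append(chr(value))
        i = end
    return "".join(out)
```

Under these assumptions, `\uD83D\uDCA9` and `\u{1F4A9}` evaluate to the same single character, while `\uDEAD`, `\u{110000}`, and a braced surrogate pair such as `\u{D83D}\u{DCA9}` are rejected, matching the examples given in the spec text.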

Diff for: spec/metadata.json

+15
@@ -0,0 +1,15 @@
+{
+  "biblio": {
+    "https://www.unicode.org/glossary": {
+      "byte-order-mark": "#byte_order_mark",
+      "leading-surrogate": "#leading_surrogate",
+      "trailing-surrogate": "#trailing_surrogate",
+      "supplementary-character": "#supplementary_character",
+      "supplementary-code-point": "#supplementary_code_point",
+      "surrogate-code-point": "#surrogate_code_point",
+      "surrogate-pair": "#surrogate_pair",
+      "unicode-scalar-value": "#unicode_scalar_value",
+      "utf-16": "#UTF_16"
+    }
+  }
+}
