@@ -45,32 +45,22 @@ match, however some lookahead restrictions include additional constraints.

 ## Source Text

-SourceCharacter ::
+SourceCharacter :: "Any Unicode scalar value"

-- "U+0009"
-- "U+000A"
-- "U+000D"
-- "U+0020–U+FFFF"
+GraphQL documents are interpreted from a source text, which is a sequence of
+{SourceCharacter}, each {SourceCharacter} being a _Unicode scalar value_ which
+may be any Unicode code point from U+0000 to U+D7FF or U+E000 to U+10FFFF
+(informally referred to as _"characters"_ through most of this specification).

-GraphQL documents are expressed as a sequence of
-[Unicode](https://unicode.org/standard/standard.html) code points (informally
-referred to as _"characters"_ through most of this specification). However, with
-few exceptions, most of GraphQL is expressed only in the original non-control
-ASCII range so as to be as widely compatible with as many existing tools,
-languages, and serialization formats as possible and avoid display issues in
-text editors and source control.
+A GraphQL document may be expressed only in the ASCII range to be as widely
+compatible with as many existing tools, languages, and serialization formats as
+possible and avoid display issues in text editors and source control. Non-ASCII
+Unicode scalar values may appear within {StringValue} and {Comment}.

-Note: Non-ASCII Unicode characters may appear freely within {StringValue} and
-{Comment} portions of GraphQL.
-
-### Unicode
-
-UnicodeBOM :: "Byte Order Mark (U+FEFF)"
-
-The "Byte Order Mark" is a special Unicode character which may appear at the
-beginning of a file containing Unicode which programs may use to determine the
-fact that the text stream is Unicode, what endianness the text stream is in, and
-which of several Unicode encodings to interpret.
+Note: An implementation which uses _UTF-16_ to represent GraphQL documents in
+memory (for example, JavaScript or Java) may encounter a _surrogate pair_. This
+encodes one _supplementary code point_ and is a single valid source character,
+however an unpaired _surrogate code point_ is not a valid source character.

 ### White Space

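To illustrate the {SourceCharacter} definition in the hunk above, here is a minimal Python sketch of the scalar value range check. The function names `is_unicode_scalar_value` and `first_invalid_source_index` are hypothetical and do not come from the patch. It assumes the document is held as a Python `str`, which stores code points rather than UTF-16 units, so a surrogate can only appear unpaired.

```python
def is_unicode_scalar_value(code_point: int) -> bool:
    """True for U+0000..U+D7FF and U+E000..U+10FFFF (surrogates excluded)."""
    return 0x0000 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF


def first_invalid_source_index(source: str) -> int:
    """Return the index of the first non-scalar-value character, or -1 if none."""
    for index, char in enumerate(source):
        if not is_unicode_scalar_value(ord(char)):
            return index
    return -1
```

An implementation that stores UTF-16 code units would instead need to pair each leading surrogate with the trailing surrogate that follows it before applying this check.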
@@ -115,10 +105,9 @@ CommentChar :: SourceCharacter but not LineTerminator
 GraphQL source documents may contain single-line comments, starting with the
 {`#`} marker.

-A comment can contain any Unicode code point in {SourceCharacter} except
-{LineTerminator} so a comment always consists of all code points starting with
-the {`#`} character up to but not including the {LineTerminator} (or end of the
-source).
+A comment may contain any {SourceCharacter} except {LineTerminator} so a comment
+always consists of all {SourceCharacter} starting with the {`#`} character up to
+but not including the {LineTerminator} (or end of the source).

 Comments are {Ignored} like white space and may appear after any token, or
 before a {LineTerminator}, and have no significance to the semantic meaning of a
@@ -175,6 +164,16 @@ significant way, for example a {StringValue} may contain white space characters.
 No {Ignored} may appear _within_ a {Token}, for example no white space
 characters are permitted between the characters defining a {FloatValue}.

+**Byte order mark**
+
+UnicodeBOM :: "Byte Order Mark (U+FEFF)"
+
+The _Byte Order Mark_ is a special Unicode code point which may appear at the
+beginning of a file which programs may use to determine the fact that the text
+stream is Unicode, and what specific encoding has been used. As files are often
+concatenated, a _Byte Order Mark_ may appear before or after any lexical token
+and is {Ignored}.
+
 ### Punctuators

 Punctuator :: one of ! $ & ( ) ... : = @ [ ] { | }
@@ -812,7 +811,16 @@ StringCharacter ::
 - `\u` EscapedUnicode
 - `\` EscapedCharacter

-EscapedUnicode :: /[0-9A-Fa-f]{4}/
+EscapedUnicode ::
+
+- `{` HexDigit+ `}`
+- HexDigit HexDigit HexDigit HexDigit
+
+HexDigit :: one of
+
+- `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+- `A` `B` `C` `D` `E` `F`
+- `a` `b` `c` `d` `e` `f`

 EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`

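The two alternatives of {EscapedUnicode} added above can be sketched as one regular expression. This is an illustrative Python sketch only. The names `ESCAPED_UNICODE` and `matches_escaped_unicode` are hypothetical, and the pattern covers the lexical grammar alone, not the later check that the value is a Unicode scalar value.

```python
import re

# Either `{` HexDigit+ `}` (variable width) or exactly four HexDigit (fixed width).
# This mirrors the grammar only; range validation is a separate semantic step.
ESCAPED_UNICODE = re.compile(r"\{[0-9A-Fa-f]+\}|[0-9A-Fa-f]{4}")


def matches_escaped_unicode(text: str) -> bool:
    """True if the whole of `text` (the part after a backslash-u) fits the grammar."""
    return ESCAPED_UNICODE.fullmatch(text) is not None
```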
@@ -821,19 +829,57 @@ BlockStringCharacter ::
 - SourceCharacter but not `"""` or `\"""`
 - `\"""`

-Strings are sequences of characters wrapped in quotation marks (U+0022). (ex.
-{`"Hello World"`}). White space and other otherwise-ignored characters are
-significant within a string value.
+A {StringValue} is evaluated to a _Unicode text_ value, a sequence of _Unicode
+scalar value_, by interpreting all escape sequences using the static semantics
+defined below. White space and other characters ignored between lexical tokens
+are significant within a string value.

 The empty string {`""`} must not be followed by another {`"`} otherwise it would
 be interpreted as the beginning of a block string. As an example, the source
 {`""""""`} can only be interpreted as a single empty block string and not three
 empty strings.

-Non-ASCII Unicode characters are allowed within single-quoted strings. Since
-{SourceCharacter} must not contain some ASCII control characters, escape
-sequences must be used to represent these characters. The {`\`}, {`"`}
-characters also must be escaped. All other escape sequences are optional.
+**Escape Sequences**
+
+In a single-quoted {StringValue}, any _Unicode scalar value_ may be expressed
+using an escape sequence. GraphQL strings allow both C-style escape sequences
+(for example `\n`) and two forms of Unicode escape sequences: one with a
+fixed-width of 4 hexadecimal digits (for example `\u000A`) and one with a
+variable-width most useful for representing a _supplementary character_ such as
+an Emoji (for example `\u{1F4A9}`).
+
+The hexadecimal number encoded by a Unicode escape sequence must describe a
+_Unicode scalar value_, otherwise must result in a parse error. For example both
+sources `"\uDEAD"` and `"\u{110000}"` should not be considered valid
+{StringValue}.
+
+Escape sequences are only meaningful within a single-quoted string. Within a
+block string, they are simply that sequence of characters (for example
+`"""\n"""` represents the _Unicode text_ [U+005C, U+006E]). Within a comment an
+escape sequence is not a significant sequence of characters. They may not appear
+elsewhere in a GraphQL document.
+
+Since {StringCharacter} must not contain some code points directly (for example,
+a {LineTerminator}), escape sequences must be used to represent them. All other
+escape sequences are optional and unescaped non-ASCII Unicode characters are
+allowed within strings. If using GraphQL within a system which only supports
+ASCII, then escape sequences may be used to represent all Unicode characters
+outside of the ASCII range.
+
+For legacy reasons, a _supplementary character_ may be escaped by two
+fixed-width unicode escape sequences forming a _surrogate pair_. For example the
+input `"\uD83D\uDCA9"` is a valid {StringValue} which represents the same
+_Unicode text_ as `"\u{1F4A9}"`. While this legacy form is allowed, it should be
+avoided as a variable-width unicode escape sequence is a clearer way to encode
+such code points.
+
+When producing a {StringValue}, implementations should use escape sequences to
+represent non-printable control characters (U+0000 to U+001F and U+007F to
+U+009F). Other escape sequences are not necessary, however an implementation may
+use escape sequences to represent any other range of code points (for example,
+when producing ASCII-only output). If an implementation chooses to escape a
+_supplementary character_, it should only use a variable-width unicode escape
+sequence.

 **Block Strings**

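The guidance on producing a {StringValue} in the hunk above could be sketched as follows. This is a hedged Python illustration, not a reference printer. The function name `print_string_value` and its `ascii_only` flag are hypothetical, and the input is assumed to already be valid _Unicode text_.

```python
# C-style escapes; the quote and backslash must always be escaped.
_SHORT_ESCAPES = {
    '"': '\\"', "\\": "\\\\", "\b": "\\b", "\f": "\\f",
    "\n": "\\n", "\r": "\\r", "\t": "\\t",
}


def print_string_value(text: str, ascii_only: bool = False) -> str:
    """Render `text` as a single-quoted StringValue source string."""
    out = ['"']
    for char in text:
        code_point = ord(char)
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError("surrogate code points are not Unicode scalar values")
        if char in _SHORT_ESCAPES:
            out.append(_SHORT_ESCAPES[char])
        elif code_point <= 0x1F or 0x7F <= code_point <= 0x9F:
            # Non-printable control characters: fixed-width escape.
            out.append(f"\\u{code_point:04X}")
        elif ascii_only and code_point > 0xFFFF:
            # Supplementary characters: variable-width form only.
            out.append(f"\\u{{{code_point:X}}}")
        elif ascii_only and code_point > 0x7F:
            out.append(f"\\u{code_point:04X}")
        else:
            out.append(char)
    out.append('"')
    return "".join(out)
```

The sketch never emits the legacy surrogate pair form, which matches the recommendation that a supplementary character be escaped only with the variable-width sequence.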
@@ -889,51 +935,84 @@ Note: If non-printable ASCII characters are needed in a string value, a standard
 quoted string with appropriate escape sequences must be used instead of a block
 string.

-**Semantics**
+**Static Semantics**
+
+:: A {StringValue} describes a _Unicode text_ value, which is a sequence of
+_Unicode scalar value_.
+
+These semantics describe how to apply the {StringValue} grammar to a source text
+to evaluate a _Unicode text_. Errors encountered during this evaluation are
+considered a failure to apply the {StringValue} grammar to a source and must
+result in a parsing error.

 StringValue :: `""`

 - Return an empty sequence.

 StringValue :: `"` StringCharacter+ `"`

-- Return the sequence of all {StringCharacter} code points.
+- Return the _Unicode text_ by concatenating the evaluation of all
+  {StringCharacter}.

 StringCharacter :: SourceCharacter but not `"` or `\` or LineTerminator

-- Return the code point {SourceCharacter}.
+- Return the _Unicode scalar value_ {SourceCharacter}.

 StringCharacter :: `\u` EscapedUnicode

-- Let {value} be the 16-bit hexadecimal value represented by the sequence of
-  hexadecimal digits within {EscapedUnicode}.
-- Return the code point {value}.
+- Let {value} be the hexadecimal value represented by the sequence of {HexDigit}
+  within {EscapedUnicode}.
+- Assert {value} is within the _Unicode scalar value_ range (>= 0x0000 and <=
+  0xD7FF or >= 0xE000 and <= 0x10FFFF).
+- Return the _Unicode scalar value_ {value}.
+
+StringCharacter :: `\u` HexDigit HexDigit HexDigit HexDigit `\u` HexDigit
+HexDigit HexDigit HexDigit
+
+- Let {leadingValue} be the hexadecimal value represented by the first sequence
+  of {HexDigit}.
+- Let {trailingValue} be the hexadecimal value represented by the second
+  sequence of {HexDigit}.
+- If {leadingValue} is >= 0xD800 and <= 0xDBFF (a _Leading Surrogate_):
+  - Assert {trailingValue} is >= 0xDC00 and <= 0xDFFF (a _Trailing Surrogate_).
+  - Return ({leadingValue} - 0xD800) × 0x400 + ({trailingValue} - 0xDC00) +
+    0x10000.
+- Otherwise:
+  - Assert {leadingValue} is within the _Unicode scalar value_ range.
+  - Assert {trailingValue} is within the _Unicode scalar value_ range.
+  - Return the sequence of the _Unicode scalar value_ {leadingValue} followed by
+    the _Unicode scalar value_ {trailingValue}.
+
+Note: If both escape sequences encode a _Unicode scalar value_, then this
+semantic is identical to applying the prior semantic on each fixed-width escape
+sequence. A variable-width escape sequence must only encode a _Unicode scalar
+value_.

 StringCharacter :: `\` EscapedCharacter

-- Return the code point represented by {EscapedCharacter} according to the table
-  below.
+- Return the _Unicode scalar value_ represented by {EscapedCharacter} according
+  to the table below.

-| Escaped Character | Code Point | Character Name               |
-| ----------------- | ---------- | ---------------------------- |
-| {`"`}             | U+0022     | double quote                 |
-| {`\`}             | U+005C     | reverse solidus (back slash) |
-| {`/`}             | U+002F     | solidus (forward slash)      |
-| {`b`}             | U+0008     | backspace                    |
-| {`f`}             | U+000C     | form feed                    |
-| {`n`}             | U+000A     | line feed (new line)         |
-| {`r`}             | U+000D     | carriage return              |
-| {`t`}             | U+0009     | horizontal tab               |
+| Escaped Character | Scalar Value | Character Name               |
+| ----------------- | ------------ | ---------------------------- |
+| {`"`}             | U+0022       | double quote                 |
+| {`\`}             | U+005C       | reverse solidus (back slash) |
+| {`/`}             | U+002F       | solidus (forward slash)      |
+| {`b`}             | U+0008       | backspace                    |
+| {`f`}             | U+000C       | form feed                    |
+| {`n`}             | U+000A       | line feed (new line)         |
+| {`r`}             | U+000D       | carriage return              |
+| {`t`}             | U+0009       | horizontal tab               |

 StringValue :: `"""` BlockStringCharacter\* `"""`

-- Let {rawValue} be the Unicode character sequence of all {BlockStringCharacter}
-  Unicode character values (which may be an empty sequence).
+- Let {rawValue} be the _Unicode text_ by concatenating the evaluation of all
+  {BlockStringCharacter} (which may be an empty sequence).
 - Return the result of {BlockStringValue(rawValue)}.

 BlockStringCharacter :: SourceCharacter but not `"""` or `\"""`

-- Return the character value of {SourceCharacter}.
+- Return the _Unicode scalar value_ {SourceCharacter}.

 BlockStringCharacter :: `\"""`

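The static semantics for single-quoted strings in this hunk, including the legacy surrogate pair rule, might be implemented roughly as below. This is a sketch under stated assumptions, not the reference implementation. The function name `evaluate_string_body` is hypothetical, it receives only the characters between the quotes, and it reports every failure as a `ValueError` standing in for a parse error.

```python
import re

_HEX4 = re.compile(r"[0-9A-Fa-f]{4}")
_BRACED = re.compile(r"\{([0-9A-Fa-f]+)\}")
_SIMPLE = {'"': '"', "\\": "\\", "/": "/", "b": "\b",
           "f": "\f", "n": "\n", "r": "\r", "t": "\t"}


def _is_scalar(value: int) -> bool:
    return 0x0000 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF


def evaluate_string_body(body: str) -> str:
    """Evaluate the text between the quotes of a single-quoted StringValue."""
    out = []
    i = 0
    while i < len(body):
        char = body[i]
        if char in '"\n\r':
            raise ValueError(f"unescaped character at index {i}")
        if char != "\\":
            out.append(char)
            i += 1
            continue
        if body.startswith("u", i + 1):
            braced = _BRACED.match(body, i + 2)
            if braced:  # variable-width form: \u{...}
                value = int(braced.group(1), 16)
                if not _is_scalar(value):
                    raise ValueError("escape is not a Unicode scalar value")
                out.append(chr(value))
                i = braced.end()
                continue
            lead = _HEX4.match(body, i + 2)
            if not lead:
                raise ValueError(f"malformed unicode escape at index {i}")
            value = int(lead.group(), 16)
            i = lead.end()
            if 0xD800 <= value <= 0xDBFF:  # leading surrogate needs a trailer
                trail = _HEX4.match(body, i + 2) if body.startswith("\\u", i) else None
                trail_value = int(trail.group(), 16) if trail else -1
                if not 0xDC00 <= trail_value <= 0xDFFF:
                    raise ValueError("leading surrogate without trailing surrogate")
                value = (value - 0xD800) * 0x400 + (trail_value - 0xDC00) + 0x10000
                i = trail.end()
            elif not _is_scalar(value):
                raise ValueError("escape is not a Unicode scalar value")
            out.append(chr(value))
            continue
        simple = _SIMPLE.get(body[i + 1:i + 2])
        if simple is None:
            raise ValueError(f"unknown escape sequence at index {i}")
        out.append(simple)
        i += 2
    return "".join(out)
```

For the patch's own example, the pair `\uD83D\uDCA9` works out as (0xD83D - 0xD800) × 0x400 + (0xDCA9 - 0xDC00) + 0x10000 = 0x1F4A9, the same scalar value as `\u{1F4A9}`.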