types-grammar/ch1.md
+74-10
@@ -226,7 +226,7 @@ JS does not distinguish a single character as a different type as some languages
226
226
227
227
Strings can be delimited by double-quotes (`"`), single-quotes (`'`), or back-ticks (`` ` ``). The ending delimiter must always match the starting delimiter.
228
228
229
-
Strings have an intrinsic length which corresponds to how many code-points -- actually, code-units, more on that in a moment -- they contain.
229
+
Strings have an intrinsic length which corresponds to how many code-points -- actually, code-units, more on that in a bit -- they contain.
230
230
231
231
```js
232
232
myName = "Kyle";
@@ -236,21 +236,33 @@ myName.length; // 4
236
236
237
237
This does not necessarily correspond to the number of visible characters present between the start and end delimiters (aka, the string literal). It can sometimes be a little confusing to keep straight the difference between a string literal and the underlying string value, so pay close attention.
238
238
239
+
| NOTE: |
240
+
| :--- |
241
+
| We'll cover length computation of strings in detail, in Chapter 2. |
242
+
239
243
### JS Character Encodings
240
244
241
245
What type of character encoding does JS use for string characters?
242
246
243
-
One might assume UTF-8 (8-bit) or UTF-16 (16-bit). It's actually more complicated, because you also need to consider UCS-2 (2-byte Universal Character Set), which is similar to UTF-16, but not quite the same. [^UTFUCS]
247
+
You've probably heard of "Unicode" and perhaps even "UTF-8" (8-bit) or "UTF-16" (16-bit). If you're like me (before doing the research it took to write this text), you might have just hand-waved and decided that's all you need to know about character encodings in JS strings.
248
+
249
+
But... it's not. Not even close.
244
250
245
-
The first group of 65,535 code points in Unicode is called the BMP (Basic Multilingual Plane). All the rest of the code points are grouped into 16 so called "supplemental planes" or "astral planes". When representing Unicode characters from the BMP, it's pretty straightforward, as they can *fit* neatly into single JS characters.
251
+
It turns out, you need to understand how a variety of aspects of Unicode work, and even to consider concepts from UCS-2 (2-byte Universal Character Set), which is similar to UTF-16, but not quite the same. [^UTFUCS]
246
252
247
-
But when representing extended characters outside the BMP, JS actually represents these characters code-points as a pairing of two separate code units, called *surrogate halves*.
253
+
Unicode defines all the "characters" we can represent universally in computer programs, by assigning each a specific number, called a code-point. These numbers range from `0` all the way up to a maximum of `1114111` (`10FFFF` in hexadecimal).
248
254
249
-
For example, the Unicode code point `127878` (hexadecimal `1F386`) is `🎆` (fireworks symbol). JS stores this in the string value as two surrogate-halve code units: `U+D83C` and `U+DF86`.
255
+
The standard notation for Unicode characters is `U+` followed by 4 to 6 hexadecimal characters. For example, the `❤` (heart symbol) is code-point `10084` (`2764` in hexadecimal), and is thus annotated with `U+2764`.
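As a rough sketch of that notation, the snippet below formats a code-point number as a `U+` label. The `toUPlus(..)` helper is a hypothetical name made up for this illustration; it's not a built-in.

```js
// Hypothetical helper (not part of JS): format a code-point number in
// U+ notation, left-padded with zeros to at least 4 hexadecimal digits.
function toUPlus(codePoint) {
    return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

toUPlus(10084);                     // "U+2764"
toUPlus("\u2764".codePointAt(0));   // "U+2764"
toUPlus(128169);                    // "U+1F4A9"
```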
256
+
257
+
The first group of 65,536 code points in Unicode (`0` through `65535`) is called the BMP (Basic Multilingual Plane). These can all be represented with 16 bits (2 bytes). When representing Unicode characters from the BMP, it's fairly straightforward, as they can *fit* neatly into single UTF-16 JS characters.
258
+
259
+
All the rest of the code points are grouped into 16 so-called "supplementary planes" or "astral planes". These code-points require more than 16 bits to represent -- up to 21 bits, to be exact -- so when representing extended/supplementary characters above the BMP, JS actually stores these code-points as a pairing of two adjacent 16-bit code units, called *surrogate halves*.
260
+
261
+
For example, the Unicode code point `127878` (hexadecimal `1F386`) is `🎆` (fireworks symbol). JS stores this in a string value as two surrogate-half code units: `U+D83C` and `U+DF86`.
250
262
251
263
This has implications on the length of strings, because a single visible character like the `🎆` fireworks symbol, when in a JS string, is counted as 2 characters for the purposes of the string length!
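To make the surrogate pairing a bit more concrete, here's a minimal sketch of the UTF-16 arithmetic that splits a supplementary code-point into its two halves. The `toSurrogatePair(..)` helper is a hypothetical name used only for illustration; JS does this for you internally.

```js
// Hypothetical helper: split a code-point above 0xFFFF into its high and
// low surrogate code-units, following the UTF-16 encoding algorithm.
function toSurrogatePair(codePoint) {
    const offset = codePoint - 0x10000;
    const high = 0xD800 + (offset >> 10);    // top 10 bits of the offset
    const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits of the offset
    return [ high, low ];
}

const fireworks = String.fromCodePoint(127878);
const [ high, low ] = toSurrogatePair(127878);

high.toString(16);                              // "d83c"
low.toString(16);                               // "df86"
fireworks.length;                               // 2
fireworks === String.fromCharCode(high, low);   // true
```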
252
264
253
-
We'll revisit Unicode characters shortly.
265
+
We'll revisit Unicode characters in a bit, and then cover more accurately computing string length in Chapter 2.
254
266
255
267
### Escape Sequences
256
268
@@ -305,13 +317,15 @@ For any normal character that can be typed on a keyboard, such as `"a"`, it's us
305
317
"a"==="\x61"; // true
306
318
```
307
319
308
-
#### Unicode
320
+
#### Unicode In Strings
321
+
322
+
Unicode escape sequences alone can encode any of the characters from the Unicode BMP. They look like `\u` followed by exactly four hexadecimal characters.
When any character-escape sequence (regardless of length) is recognized, the single character it represents is inserted into the string, rather than the original separate characters. So, in the string `"\u263A"`, there's only one (smiley) character, not six individual characters.
313
327
314
-
Unicode code-points can go well above `65535` (`FFFF` in hexadecimal), up to a maximum of `1114111` (`10FFFF` in hexadecimal). For example, `1F4A9` (or `1f4a9`)is decimal code-point `128169`, which corresponds to the funny `💩` (pile of poo) symbol.
328
+
But as explained earlier, many Unicode code-points are well above `65535`. For example, `1F4A9` (or `1f4a9`) is decimal code-point `128169`, which corresponds to the funny `💩` (pile-of-poo) symbol.
315
329
316
330
But `\u1F4A9` wouldn't work to include this character in a string, since it would be parsed as the Unicode escape sequence `\u1F4A`, followed by a literal `9` character. To address this limitation, a variation of Unicode escape sequences was introduced to allow an arbitrary number of hexadecimal characters after the `\u`, by surrounding them with `{ .. }` curly braces:
317
331
@@ -322,7 +336,7 @@ console.log(myReaction);
322
336
// 💩
323
337
```
324
338
325
-
Recall the earlier discussion of extended (non-BMP) Unicode characters and *surrogate halves*? The same `💩` could also be defined with the two explicit code-units:
339
+
Recall the earlier discussion of extended (non-BMP) Unicode characters and *surrogate halves*? The same `💩` could also be defined with two explicit code-units that form a surrogate pair:
326
340
327
341
```js
328
342
myReaction = "\uD83D\uDCA9";
@@ -340,6 +354,56 @@ All three representations of this same character are stored internally by JS ide
340
354
341
355
Even though JS doesn't care which way such a character is represented in your program, consider the readability differences carefully when authoring your code.
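If you want to convince yourself that the representations really are interchangeable, a quick check like this one should do it (the variable names here are just for illustration):

```js
const viaLiteral = "💩";
const viaCodePoint = "\u{1F4A9}";
const viaSurrogates = "\uD83D\uDCA9";

viaLiteral === viaCodePoint;       // true
viaCodePoint === viaSurrogates;    // true

viaSurrogates.length;              // 2
viaSurrogates.codePointAt(0);      // 128169
```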
342
356
357
+
| NOTE: |
358
+
| :--- |
359
+
| Even though `💩` looks like a single character, its internal representation affects things like the length computation of a string with that character in it. We'll cover length computation of strings in Chapter 2. |
360
+
361
+
##### Unicode Normalization
362
+
363
+
A further wrinkle in Unicode string handling is that even certain single BMP characters can be represented in different ways.
364
+
365
+
For example, the `"é"` character can either be represented as itself (code-point `233`, aka `\xe9` or `\u00e9` or `\u{e9}`), or as the combination of two code-points: the `"e"` character (code-point `101`, aka `\x65`, `\u0065`, `\u{65}`) and the *combining acute accent* (code-point `769`, aka `\u0301`, `\u{301}`).
366
+
367
+
Consider:
368
+
369
+
```js
370
+
eTilde1 ="é";
371
+
eTilde2 = "\u00e9";
372
+
eTilde3 = "\u0065\u0301";
373
+
374
+
console.log(eTilde1); // é
375
+
console.log(eTilde2); // é
376
+
console.log(eTilde3); // é
377
+
```
378
+
379
+
However, the way the `"é"` character is internally stored affects things like `length` computation of the containing string, as well as equality comparison (here, the `eTilde1` literal was typed in its decomposed, two code-point form):
380
+
381
+
```js
382
+
eTilde1.length; // 2
383
+
eTilde2.length; // 1
384
+
eTilde3.length; // 2
385
+
386
+
eTilde1 === eTilde2; // false
387
+
eTilde1 === eTilde3; // true
388
+
```
389
+
390
+
This internal representation difference can be quite challenging if not carefully planned for. Fortunately, JS provides a `normalize(..)` utility method on strings to help:
391
+
392
+
```js
393
+
eTilde1 = "é";
394
+
eTilde2 = "\u{e9}";
395
+
eTilde3 = "\u{65}\u{301}";
396
+
397
+
eTilde1.normalize("NFC") === eTilde2;   // true
398
+
eTilde2.normalize("NFD") === eTilde3;   // true
399
+
```
400
+
401
+
The `"NFC"` normalization mode combines adjacent code-points into the *composed* code-point (if possible), whereas the `"NFD"` normalization mode splits a single code-point into its *decomposed* code-points (if possible).
402
+
403
+
And there can actually be more than two individual *decomposed* code-points that make up a single *composed* code-point; some characters used in non-English languages (for example, the Vietnamese `ệ`) are *composed* of three or even four code-points layered together!
404
+
405
+
When dealing with Unicode strings that will be compared, sorted, or measured for length, it's very important to keep Unicode normalization in mind, and use it where necessary.
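One way to apply that advice is to normalize both sides before comparing. The following is only a sketch; `sameText(..)` is a hypothetical helper name, and choosing `"NFC"` (rather than `"NFD"`) is an assumption made for the illustration.

```js
// Hypothetical helper: compare two strings after normalizing both to the
// same form (NFC here), so composed and decomposed spellings match.
function sameText(a, b) {
    return a.normalize("NFC") === b.normalize("NFC");
}

const composed = "\u00e9";
const decomposed = "\u0065\u0301";

composed === decomposed;               // false
sameText(composed, decomposed);        // true
decomposed.normalize("NFC").length;    // 1
```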
406
+
343
407
### Line Continuation
344
408
345
409
The `\` followed by an actual new-line character (not just literal `n`) is a special case, and it creates what's called a line-continuation:
String values have a number of specific behaviors that every JS developer should be aware of.
134
134
135
-
As previously mentioned, string values have a `length` property that automatically exposes the number of characters (actually, code units). This property can only be accessed; attempts to set it are silently ignored.
135
+
### Length Computation
136
+
137
+
As mentioned in Chapter 1, string values have a `length` property that automatically exposes the length of the string. This property can only be read; attempts to set it are silently ignored (or, in strict mode, throw a `TypeError`).
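Here's a small sketch of that read-only behavior. The `trySetLength(..)` helper is a hypothetical name; it opts into strict mode so the failed assignment is reported as an exception rather than silently dropped.

```js
const word = "Hello";

// wrap the primitive in an object to inspect the property's descriptor
const desc = Object.getOwnPropertyDescriptor(Object(word), "length");
desc.writable;          // false

// Hypothetical helper: attempt the assignment in strict mode
function trySetLength(str) {
    "use strict";
    try {
        str.length = 1;
        return "ignored";
    }
    catch (err) {
        return err.name;
    }
}

trySetLength(word);     // "TypeError"
word.length;            // 5
```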
138
+
139
+
The reported `length` value somewhat corresponds to the number of characters in the string (actually, code-units), but as we saw in Chapter 1, it's more complex when Unicode characters are involved.
140
+
141
+
Most people visually distinguish symbols as separate characters; this notion of an independent visual symbol is referred to as a *grapheme*. So when counting the "length" of a string, we typically mean that we're counting the number of graphemes.
142
+
143
+
But that's not how the computer deals with characters.
144
+
145
+
In JS, each *character* is a code-unit (16 bits), with a code-point value at or below `65535`. The `length` property of a string always counts the number of code-units in the string value, not code-points. A code-unit might represent a single character by itself, or it may be part of a surrogate pair, or it may be combined with an adjacent *combining* symbol. As such, `length` doesn't match the typical notion of counting graphemes.
146
+
147
+
To obtain a *grapheme length* for a string that matches typical expectations, the string value first needs to be normalized with `normalize("NFC")` (see "Unicode Normalization" in Chapter 1) to produce *composed* code-units, in case any characters in it were originally stored *decomposed* as separate code-units.
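As a hedged sketch of where this is headed (not necessarily the approach the finished text will settle on), counting the code-points of the normalized string gets closer to a grapheme count. Where the `Intl.Segmenter` API is available, it can split on grapheme boundaries directly. Both helper names below are hypothetical.

```js
// Hypothetical helper: count code-points after NFC normalization
function codePointLength(str) {
    return [ ...str.normalize("NFC") ].length;
}

// Hypothetical helper: count graphemes with Intl.Segmenter where it
// exists, otherwise fall back to the code-point approximation above
function graphemeLength(str) {
    if (typeof Intl == "undefined" || typeof Intl.Segmenter != "function") {
        return codePointLength(str);
    }
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    return [ ...segmenter.segment(str) ].length;
}

const sample = "e\u0301\u{1F386}";   // decomposed é, then 🎆

sample.length;              // 4
codePointLength(sample);    // 2
graphemeLength(sample);     // 2
```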
148
+
149
+
// TODO
136
150
137
151
### String Character Access
138
152
139
-
Though strings are not actually arrays, JS allows `[ .. ]` array-style access of its character at a numeric (`0`-based) index:
153
+
Though strings are not actually arrays, JS allows `[ .. ]` array-style access of a character at a numeric (`0`-based) index:
140
154
141
155
```js
142
156
greeting = "Hello!";
@@ -170,6 +184,43 @@ The `+` operator will act as a string concatenation if either of the two operand
170
184
171
185
If one operand is a string and the other is not, the one that's not a string will be coerced to its string representation for the purposes of the concatenation.
172
186
187
+
### Character Iteration
188
+
189
+
Strings are not arrays, but they certainly mimic arrays closely in many ways. One such behavior is that, like arrays, strings are iterables. This means that the characters of a string can be iterated individually (by code-point, so a surrogate pair is yielded as a single character):
190
+
191
+
```js
192
+
myName = "Kyle";
193
+
194
+
for (let char of myName) {
195
+
console.log(char);
196
+
}
197
+
// K
198
+
// y
199
+
// l
200
+
// e
201
+
202
+
chars = [ ...myName ];
203
+
chars;
204
+
// [ "K", "y", "l", "e" ]
205
+
```
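One detail worth seeing in action: the built-in string iterator steps through code-points, whereas `length` and `[ .. ]` index access work on code-units. A small sketch, with a made-up variable name:

```js
const party = "a\u{1F386}b";          // "a", 🎆, "b"

party.length;                         // 4 (code-units)
[ ...party ].length;                  // 3 (code-points)

party[1] === "\uD83C";                // true (only a surrogate half)
[ ...party ][1] === "\u{1F386}";      // true (the whole 🎆)
```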
206
+
207
+
Values, such as strings and arrays, are iterables (via `...`, `for..of`, and `Array.from(..)`), if they expose an iterator-producing method at the special symbol property location `Symbol.iterator` (see "Well-Known Symbols" in Chapter 1):
208
+
209
+
```js
210
+
myName = "Kyle";
211
+
it = myName[Symbol.iterator]();
212
+
213
+
it.next(); // { value: "K", done: false }
214
+
it.next(); // { value: "y", done: false }
215
+
it.next(); // { value: "l", done: false }
216
+
it.next(); // { value: "e", done: false }
217
+
it.next(); // { value: undefined, done: true }
218
+
```
219
+
220
+
| NOTE: |
221
+
| :--- |
222
+
| The specifics of the iterator protocol, including the fact that the `{ value: "e" .. }` result still shows `done: false`, are covered in detail in the "Sync & Async" title of this series. |
223
+
173
224
### String Methods
174
225
175
226
Strings provide a whole slew of additional string-specific methods (as properties):
@@ -212,8 +263,14 @@ Strings provide a whole slew of additional string-specific methods (as propertie
212
263
213
264
* `replace(..)`: returns a new string with one or more occurrences of the specified regular-expression match in the original string replaced
214
265
266
+
* `normalize(..)`: produces a new string with Unicode normalization (see "Unicode Normalization" in Chapter 1) having been performed on the contents
267
+
215
268
* `big()`, `blink()`, `bold()`, `fixed()`, `fontcolor()`, `fontsize()`, `italics()`, `link()`, `small()`, `strike()`, `sub()`, and `sup()`: historically, these were useful in generating HTML string snippets; they're now deprecated and should be avoided
216
269
270
+
| WARNING: |
271
+
| :--- |
272
+
| Many of the methods described above rely on position indices. As mentioned earlier in the "Length Computation" section, these positions are dependent on the internal contents of the string value, which means that if an extended Unicode character is present and takes up two code-unit slots, that will count as two index positions instead of one. Failing to account for Unicode surrogate pairs is a common source of bugs in JS string handling, especially when dealing with non-English internationalized language characters. |
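
To illustrate the kind of bug this warning is about, here's a short sketch using `indexOf(..)` and `slice(..)`. The variable name is made up, and spreading into an array of code-points is just one possible workaround, not the only one.

```js
const reaction = "\u{1F4A9}!";           // 💩 followed by "!"

reaction.indexOf("!");                   // 2, not 1
reaction.slice(0, 1) === "\uD83D";       // true (a lone surrogate half)
reaction.slice(0, 2) === "\u{1F4A9}";    // true

// one possible workaround: slice an array of code-points instead
[ ...reaction ].slice(0, 1).join("") === "\u{1F4A9}";   // true
```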
273
+
217
274
### Static String Helpers
218
275
219
276
The following string utility functions are provided directly on the `String` object, rather than as methods on individual string values: