[editorial] Rephrase encoding note to make the implications clearer. #804

tabatkins · 2023-11-06T20:30:35Z

I didn't immediately realize the implications of this note and was very confused about how to handle the URL properly when doing some CSS work. I think this wording would have led me to the correct solution faster, so hopefully it'll help others.

(Note: I used a 100 character wrap, as that's what the rest of the spec appears to use. I also used Bikeshed'd [=foo=] syntax, but the rest of the spec seems to prefer explicit <a>s. I can switch if you want.)

Preview | Diff

annevk · 2025-03-24T08:58:37Z

url.bs

-<a lt="percent-encoded byte">percent-encoded bytes</a> by the <a>URL parser</a>.
+<p class=note>For historical reasons, rather than storing codepoints and [=byte/percent-encoding=]
+to ASCII for serialization, URLs instead store their value as ASCII internally, eagerly converting
+code points greater than U+007F DELETE to [=percent-encoded bytes=] during [=URL parser|parsing=].


Apologies for not responding to this more quickly, but I think I never ended up merging it because I'm not sure this is correct. I suspect one could convert at serialization time instead. It's just not how the specification is written.

Sure, the spec could be written another way (potentially), but it's currently not written that way, and the specifics of how the data is encoded/represented at this point in the spec are important, so I know that the URL structure only includes ASCII code points. If we changed to a "convert at serialization" model, that would also be important to note, so it was clear that the URL structure includes non-ASCII code points.

As I said, the nature of this note actively confused me - the spec talks about "URL code points" including non-ASCII codepoints, but URLs themselves do not contain these code points, and that wasn't clear to me from how the note was written.

Hmm. 1) It's not for "historical reasons". 2) This section is really about writing URLs, it isn't really about their internal representation at all. That's section 4.1 and that already makes it clear most components are ASCII strings.

The "for historical reasons" was me assuming and editorializing. (It seemed like a weird thing to do! It's not usually good practice to encode into the byte format immediately; usually you hold it in the good data model and only encode at the edges, when you have to hit the wire.) If that's not true, and it really is just a quirk of the model, I can rephrase that bit.

And this section is about writing URLs, sure, but there was already a note about how those codepoints you write will be encoded. I was just rewriting the note for (imo) better clarity. If there's a better place to make this note, I can move it there, but this section does seem relatively germane to what the note is saying (since URLs can "contain" high codepoints, but the actual internal representation is ASCII-only).

tabatkins added 2 commits November 6, 2023 12:21

Rephrase encoding note to make the implications clearer.

f6e66f2

Fix linking error

a81fcda

annevk reviewed Mar 24, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[editorial] Rephrase encoding note to make the implications clearer. #804

[editorial] Rephrase encoding note to make the implications clearer. #804

tabatkins commented Nov 6, 2023 •

edited by pr-preview bot

Loading

annevk Mar 24, 2025

tabatkins Mar 24, 2025

annevk Mar 25, 2025

tabatkins Mar 25, 2025

[editorial] Rephrase encoding note to make the implications clearer. #804

Are you sure you want to change the base?

[editorial] Rephrase encoding note to make the implications clearer. #804

Conversation

tabatkins commented Nov 6, 2023 • edited by pr-preview bot Loading

annevk Mar 24, 2025

Choose a reason for hiding this comment

tabatkins Mar 24, 2025

Choose a reason for hiding this comment

annevk Mar 25, 2025

Choose a reason for hiding this comment

tabatkins Mar 25, 2025

Choose a reason for hiding this comment

tabatkins commented Nov 6, 2023 •

edited by pr-preview bot

Loading