-
Notifications
You must be signed in to change notification settings - Fork 146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[editorial] Rephrase encoding note to make the implications clearer. #804
base: main
Are you sure you want to change the base?
Conversation
<a lt="percent-encoded byte">percent-encoded bytes</a> by the <a>URL parser</a>. | ||
<p class=note>For historical reasons, rather than storing codepoints and [=byte/percent-encoding=] | ||
to ASCII for serialization, URLs instead store their value as ASCII internally, eagerly converting | ||
code points greater than U+007F DELETE to [=percent-encoded bytes=] during [=URL parser|parsing=]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for not responding to this more quickly, but I think I never ended up merging it because I'm not sure this is correct. I suspect one could convert at serialization time instead. It's just not how the specification is written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, the spec could be written another way (potentially), but it's currently not written that way, and the specifics of how the data is encoded/represented at this point in the spec are important, so I know that the URL structure only includes ASCII code points. If we changed to a "convert at serialization" model, that would also be important to note, so it was clear that the URL structure includes non-ASCII code points.
As I said, the nature of this note actively confused me - the spec talks about "URL code points" including non-ASCII codepoints, but URLs themselves do not contain these code points, and that wasn't clear to me from how the note was written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. 1) It's not for "historical reasons". 2) This section is really about writing URLs, it isn't really about their internal representation at all. That's section 4.1 and that already makes it clear most components are ASCII strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "for historical reasons" was me assuming and editorializing. (It seemed like a weird thing to do! It's not usually good practice to encode into the byte format immediately; usually you hold it in the good data model and only encode at the edges, when you have to hit the wire.) If that's not true, and it really is just a quirk of the model, I can rephrase that bit.
And this section is about writing URLs, sure, but there was already a note about how those codepoints you write will be encoded. I was just rewriting the note for (imo) better clarity. If there's a better place to make this note, I can move it there, but this section does seem relatively germane to what the note is saying (since URLs can "contain" high codepoints, but the actual internal representation is ASCII-only).
I didn't immediately realize the implications of this note and was very confused about how to handle the URL properly when doing some CSS work. I think this wording would have led me to the correct solution faster, so hopefully it'll help others.
(Note: I used a 100 character wrap, as that's what the rest of the spec appears to use. I also used Bikeshed'd
[=foo=]
syntax, but the rest of the spec seems to prefer explicit<a>
s. I can switch if you want.)Preview | Diff