Translate non-UTF-8 byte sequences into replacement characters #60

Closed
wants to merge 7 commits

Conversation

sunfishcode
Contributor

This is a work in progress; I'm posting it now to provide context for #58.

This patch series fixes all the cases I could find where vte would silently discard bytes in the presence of invalid UTF-8 byte sequences, and makes it emit replacement characters in a manner compatible with Rust's String::from_utf8_lossy and the resolution to https://hsivonen.fi/broken-utf-8/, and fixes #38.
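
As a point of reference for the intended behavior, this is what `String::from_utf8_lossy` does with an invalid byte (a minimal illustration, not code from this series):

```rust
fn main() {
    // 0xFF can never appear in well-formed UTF-8, so it becomes one U+FFFD
    // instead of being silently discarded.
    let lossy = String::from_utf8_lossy(b"abc\xFFdef");
    assert_eq!(lossy, "abc\u{FFFD}def");
}
```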

There are other ways I could achieve my goals here. Building this patch series was useful for me to understand all the different aspects of this. But if taking a different approach would be better, I'm open to ideas.

The UTF-8 parser knows how to handle invalid byte sequences, so don't
preprocess the input in the main parser; just hand any non-ASCII byte to
the UTF-8 parser to handle.

This includes what were previously interpreted as 8-bit C1 control codes; they are now interpreted as UTF-8 continuation bytes.

Follow the Unicode recommendation, which is also consistent with Rust's `String::from_utf8_lossy`, concerning when and where to emit replacement characters.

See https://hsivonen.fi/broken-utf-8/ for a detailed description of
the issue.
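
Concretely, the recommendation is one U+FFFD per maximal invalid subsequence, not one per invalid byte. A minimal sketch of the target behavior, using `String::from_utf8_lossy` as the reference:

```rust
fn main() {
    // 0xE2 opens a 3-byte sequence, but 0x28 ('(') cannot continue it, so
    // the lone 0xE2 becomes one U+FFFD; the stray continuation byte 0xA1
    // becomes another.
    assert_eq!(String::from_utf8_lossy(b"\xE2\x28\xA1"), "\u{FFFD}(\u{FFFD}");

    // A truncated but so-far-valid 4-byte sequence (F0 90 80) collapses to
    // a single U+FFFD, not three.
    assert_eq!(String::from_utf8_lossy(b"\xF0\x90\x80X"), "\u{FFFD}X");
}
```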
In settings where an input stream has a logical end, allow it to be
reported to the parser so that it can correctly process an incomplete
UTF-8 sequence.
This way, if a transition function detects an invalid UTF-8 sequence, it
can reset the main state to Ground.
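
The standard library draws the same incomplete-versus-invalid distinction at end of input; a minimal sketch using `std::str::from_utf8` (an analogy for the behavior described above, not this PR's API):

```rust
use std::str;

fn main() {
    // E2 82 is a valid prefix of a 3-byte sequence ('€' is E2 82 AC).
    // error_len() == None means "could become valid with more input": a
    // streaming parser should hold these bytes, and only emit one U+FFFD
    // once the stream's logical end is reported.
    let err = str::from_utf8(&[0x41, 0xE2, 0x82]).unwrap_err();
    assert_eq!(err.valid_up_to(), 1);
    assert_eq!(err.error_len(), None);

    // E2 28 is invalid no matter what follows, so the error is definite.
    let err = str::from_utf8(&[0xE2, 0x28]).unwrap_err();
    assert_eq!(err.error_len(), Some(1));
}
```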
DCS/SOS/PM/APC are terminated with an 8-bit ST, which isn't compatible
with UTF-8, so add a flag to allow users to optionally disable
recognition of these sequences.

To make room in the `Action` list, remove `Ignore` as it's effectively
the same as `None`.
@sunfishcode
Contributor Author

I've now added a way to disable recognition of DCS, SOS, PM, and APC sequences, since they are all terminated by 8-bit ST, which isn't valid UTF-8. They are still recognized by default, but the patch adds a flag for optionally disabling them.

This is now ready for review feedback. There are several changes here, though they are split up into self-contained commits. I'm curious whether these kinds of changes are in scope for vte, and whether the approach I took here makes sense.

@kchibisov
Member

Have you run benchmarks against your changes? You can do that with `cargo bench --features=nightly` and compare them to master.

@chrisduerr
Member

chrisduerr commented Jun 16, 2020

I've now added a way to disable recognition of DCS, SOS, PM, and APC sequences, since they are all terminated by 8-bit ST, which isn't valid UTF-8.

I don't quite understand what you mean and having a flag to disable these doesn't make much sense to me for a vt parser. Correctly parsing escapes is the one thing we're supposed to do.

First of all, just because an escape is terminated with ST doesn't mean it has to be the C1 ST. The C0 one should work perfectly fine. Even if ST is listed, BEL often works just as well.

All C1 control characters are also valid UTF-8. So I have absolutely no idea what you mean by this.

@sunfishcode
Contributor Author

@kchibisov

Have you run benchmarks against your changes? You can do that with `cargo bench --features=nightly` and compare them to master.

There's currently about a 20% slowdown in that benchmark as a result of this PR. That said, that test contains quite a lot of byte sequences which are not valid UTF-8, and with this PR, those sequences are translated into replacement characters rather than being ignored, so it's doing more work. How important is performance on binary or otherwise non-UTF-8 input for vte?

On test files containing valid UTF-8, it ranges from having no slowdown to about a 5% slowdown, depending on the content. I'll take a look to see if there are any opportunities to optimize.

@chrisduerr

I don't quite understand what you mean and having a flag to disable these doesn't make much sense to me for a vt parser. Correctly parsing escapes is the one thing we're supposed to do.

Some users will want to parse DCS, SOS, PM, and APC. But many terminals these days don't do anything with DCS, and I don't know of any that do anything with SOS, PM, or APC. So if these codes interfere with things that other users want, it doesn't seem unreasonable to have a way to disable them.

First of all, just because an escape is terminated with ST doesn't mean it has to be the C1 ST. The C0 one should work perfectly fine. Even if ST is listed, BEL often works just as well.

All C1 control characters are also valid UTF-8. So I have absolutely no idea what you mean by this.

In a C0 ST, the ESC cancels the sequence before the \ is seen, so it doesn't work the same way.

When they extended OSC to recognize BEL instead of ST, I don't know why they didn't do the same for DCS, SOS, PM, or APC. But it seems they didn't; I can't find anything that recognizes BEL for those.

vte doesn't recognize the UTF-8 C1 encodings as control codes. In the SosPmApcString state, 0xC2 defaults to being ignored. A subsequent 0x9C looks like an 8-bit C1, which is what we're trying to avoid.
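
A byte-level sketch of the three encodings under discussion (illustrative, not code from vte):

```rust
fn main() {
    // Three ways a string terminator can appear on the wire:
    let _c1_st: &[u8] = &[0x9C];        // 8-bit C1 ST: a bare byte, not valid UTF-8 on its own
    let _c0_st: &[u8] = &[0x1B, b'\\']; // 7-bit ST: ESC \, plain ASCII
    let utf8_st = "\u{9C}".as_bytes();  // U+009C encoded as UTF-8

    // The UTF-8 encoding of U+009C is C2 9C, so a raw 0x9C byte is only
    // ever legal as the trailing byte of a two-byte sequence.
    assert_eq!(utf8_st, &[0xC2, 0x9C]);
}
```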

@chrisduerr
Member

chrisduerr commented Jun 16, 2020

How important is performance on binary or otherwise non-UTF-8 input for vte?

Performance is very important.

On test files containing valid UTF-8, it ranges from having no slowdown to about a 5% slowdown

A slowdown of 5% for input that isn't affected at all would probably block this PR. Though it depends a bit on how the performance is in Alacritty. But that's quite a big slowdown.

In a C0 ST, the ESC cancels the sequence before the \ is seen, so it doesn't work the same way.

It doesn't work the same way, but it still works. So why would we remove support for it when there's no benefit to doing so? If the goal is only to remove C1 support, we could still support it with C0 just fine?

@sunfishcode deleted the utf8 branch June 16, 2020 21:33
@sunfishcode
Contributor Author

How important is performance on binary or otherwise non-UTF-8 input for vte?

Performance is very important.

My question was about binary or otherwise non-UTF-8 data specifically. If that's important, then it's unlikely that any amount of optimization will fix it, because the point of this PR is to do more work in those cases.

A slowdown of 5% for input that isn't affected at all would probably block this PR. Though it depends a bit on how the performance is in Alacritty. But that's quite a big slowdown.

I haven't spent much time on optimization yet, so it's possible there are opportunities to improve performance. However, it sounds like you're not interested in this PR, so I'll look into other options.

In a C0 ST, the ESC cancels the sequence before the \ is seen, so it doesn't work the same way.

It doesn't work the same way, but it still works. So why would we remove support for it when there's no benefit to doing so? If the goal is only to remove C1 support, we could still support it with C0 just fine?

It isn't just about whether there exists a byte sequence which has the desired effect; it's also about what applications are emitting, and if they make a change, how that change would work in other terminals they work in. From what I can tell, C0 ST causes DCS strings to be cancelled in other terminals, so if existing applications using DCS with C1 ST are going to make any change at all, it'd make more sense for them to stop using DCS altogether. Since DCS is obscure and not widely supported these days anyway, moving away from it in general seems to make more sense.

@chrisduerr
Member

My question was about binary or otherwise non-UTF-8 data specifically. If that's important, then it's unlikely that any amount of optimization will fix it, because the point of this PR is to do more work in those cases.

Of course the performance of that matters too, but if there's no effect on other cases a small hit is fine there. It's not necessary to have absolutely perfect performance when it comes to parsing invalid input.

I haven't spent much time on optimization yet, so it's possible there are opportunities to improve performance. However, it sounds like you're not interested in this PR, so I'll look into other options.

I haven't looked at the PR, I'm mostly curious about your motivation and methodology. Just to be absolutely clear that my statements were purely about what you said, not the code itself.

It isn't just about whether there exists a byte sequence which has the desired effect; it's also about what applications are emitting, and if they make a change, how that change would work in other terminals they work in. From what I can tell, C0 ST causes DCS strings to be cancelled in other terminals, so if existing applications using DCS with C1 ST are going to make any change at all, it'd make more sense for them to stop using DCS altogether. Since DCS is obscure and not widely supported these days anyway, moving away from it in general seems to make more sense.

I'm not entirely sure what you're trying to say here. Are you saying that since C0 ST isn't supported by other terminals, applications that use the C1 ST have no reason to make any changes? I don't see why applications would make any change in the first place, as a parser we should support C1 ST just fine if the terminal chooses to use it. I don't see many reasons other than performance to drop it and based on my testing there is no performance benefit to dropping C1 support completely. Though we don't have C1 ST support for OSC either, which was necessary to fix a bug.

@sunfishcode
Contributor Author

I'm not entirely sure what you're trying to say here. Are you saying that since C0 ST isn't supported by other terminals, applications that use the C1 ST have no reason to make any changes? I don't see why applications would make any change in the first place, as a parser we should support C1 ST just fine if the terminal chooses to use it. I don't see many reasons other than performance to drop it and based on my testing there is no performance benefit to dropping C1 support completely. Though we don't have C1 ST support for OSC either, which was necessary to fix a bug.

For my use cases, I'm looking for a parser in which every invalid UTF-8 byte sequence can be replaced with U+FFFD. If the input contains a 0x9C anywhere that isn't in a UTF-8 continuation position, I'm going to replace it, so I won't recognize it as ST.

To me, 8-bit C1 codes mean increased surface area and weaker invariants. Being able to say that every non-UTF-8 byte sequence gets translated to U+FFFD means I can categorically rule out a class of potential bugs, such as the one you mentioned. One of the properties of UTF-8 is that it's self-synchronizing, so if I ever have to start decoding in the middle of a stream, I can always at least know where the code-point boundaries are, but that property doesn't completely hold in "UTF-8 with 8-bit C1 codes". And, some terminal implementations today do UTF-8 decoding as an independent step before escape-sequence decoding, so even if I'm using a library that doesn't do that today, I want the option of doing so in the future without having to worry about backwards compatibility.
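
A minimal sketch of the self-synchronization argument, assuming plain UTF-8 (the helper below is hypothetical, not part of vte):

```rust
/// Skip forward to the next UTF-8 code-point boundary. Continuation bytes
/// are exactly those matching 0b10xx_xxxx, so scanning for the first
/// non-continuation byte resynchronizes the decoder. Raw 8-bit C1 bytes
/// such as 0x9C masquerade as continuation bytes and would break this.
fn next_boundary(bytes: &[u8], mut i: usize) -> usize {
    while i < bytes.len() && bytes[i] & 0b1100_0000 == 0b1000_0000 {
        i += 1;
    }
    i
}

fn main() {
    let bytes = "€x".as_bytes(); // '€' encodes as E2 82 AC
    // Starting mid-character at index 1 (a continuation byte), we recover
    // at index 3, the start of 'x'.
    assert_eq!(next_boundary(bytes, 1), 3);
}
```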

Some of these risks are arguably minor, but the 8-bit C1 codes are sufficiently obscure and insufficiently useful that, for my purposes at least, they are not worth it.

@chrisduerr
Member

Being able to say that every non-UTF-8 byte sequence gets translated to U+FFFD means I can categorically rule out a class of potential bugs

I don't see how the user of VTE would be able to rule out any bugs with the help of that. It's just dropped vs replaced for them.

Some of these risks are arguably minor, but the 8-bit C1 codes are sufficiently obscure and insufficiently useful that, for my purposes at least, they are not worth it.

The thing is that vte is a library, so the goal is to try to accommodate a wide range of use cases, not just the one specific to Alacritty, for example. We personally have removed all support for C1 escapes too, but I'm not sure if I want to do that to vte.

Maybe the current downstream users of vte should also be evaluated. If it really causes so much overhead and nobody actually uses it, maybe giving up on C1 is the better option. There's currently not really proper C1 support in vte, so maybe we're just making things worse for everyone by clinging to some halfway state that doesn't help anyone who wants C1 support but holds everyone else back.

I know that other terminals print replacement characters; are you aware of how these handle C1 escapes?

@sunfishcode
Contributor Author

I don't see how the user of VTE would be able to rule out any bugs with the help of that. It's just dropped vs replaced for them.

It would be simpler if there were a spec to follow. While the VT100 state machine diagram is a great starting point, it doesn't incorporate UTF-8. VTE is currently making its own choices about when and where to support 8-bit C1 controls. Will these choices be compatible with the choices my users are making? If not, is it my (or VTE's) bug or theirs? If I switch to a different parser, will I have subtle incompatibilities?

Maybe the current downstream users of vte should also be evaluated. If it really causes so much overhead and nobody actually uses it, maybe giving up on C1 is the better option. There's currently not really proper C1 support in vte, so maybe we're just making things worse for everyone by clinging to some halfway state that doesn't help anyone who wants C1 support but holds everyone else back.

I agree, this sounds like a good evaluation to do.

I know that other terminals print replacement characters; are you aware of how these handle C1 escapes?

According to Wikipedia, the 8-bit C1 code 0x85 (NEL) might still be in use. I checked 9 different virtual terminals, plus screen and tmux, and the behavior ranges from silently ignoring it, to printing a replacement or other placeholder character, to one implementation which actually supports it: xterm, and only in its non-UTF-8 mode.

@chrisduerr
Member

According to Wikipedia, the 8-bit C1 code 0x85 (NEL) might still be in use.

Based on my testing, there's enough that don't support it that you cannot rely on it. Which seems in line with what you've seen. IIRC by default neither URxvt, VTE, Kitty nor XTerm support it.

So I'd say there are two questions mainly:

  • If we decide that it should be possible to support C1 escapes with VTE, should it be possible at runtime or only at compile time?
  • Should we support C1 escapes at all?

Since you've looked into this already, you can probably best evaluate what overhead a runtime configuration would have. Do you think that it would be any slower than a compile-time option?

And is there any performance advantage to completely dropping C1 support, or does the compile-time option only add code complexity without affecting performance?

@sunfishcode
Contributor Author

Right now, it looks like VTE's state machine is a fairly literal translation of the DEC state machine diagram, with a separate state machine for UTF-8. As it is now, removing 8-bit C1 support probably wouldn't help performance much.

There are several optimizations possible. For example, when in the Utf8 state and a 0x80 arrives, it should be able to dispatch directly into the code for handling UTF-8 continuation bytes. Or when in the Ground state and a 0x80 arrives, it should be possible to dispatch directly into the code for handling an invalid UTF-8 encoding. These kinds of optimizations would be either less effective or more complex, or both, with 8-bit C1 support.
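
A hypothetical sketch of such a Ground-state fast path; the enum and byte ranges below are illustrative, not vte's actual state machine:

```rust
/// Illustrative only: what a Ground-state fast path might dispatch on.
enum GroundAction {
    Execute,     // C0 control (and DEL)
    Print,       // printable ASCII
    InvalidUtf8, // byte can never begin a UTF-8 sequence: emit U+FFFD
    BeginUtf8,   // plausible UTF-8 lead byte: enter the Utf8 state
}

fn classify_ground(byte: u8) -> GroundAction {
    match byte {
        0x00..=0x1F | 0x7F => GroundAction::Execute,
        0x20..=0x7E => GroundAction::Print,
        // Continuation bytes plus the never-valid leads C0, C1, F5..FF.
        0x80..=0xC1 | 0xF5..=0xFF => GroundAction::InvalidUtf8,
        0xC2..=0xF4 => GroundAction::BeginUtf8,
    }
}

fn main() {
    assert!(matches!(classify_ground(0x80), GroundAction::InvalidUtf8));
    assert!(matches!(classify_ground(0xC2), GroundAction::BeginUtf8));
}
```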

The runtime option I experimented with for disabling DCS/SOS/PM/APC looks to have caused about a 5% slowdown. It seems likely that a runtime option to disable/enable 8-bit C1 support would be similar.

Should we support C1 escapes at all?

Of course, this is just my opinion, but for what it's worth, I suggest "no" here, unless you know of a specific user that needs it. UTF-8 is so important in so many contexts these days, it should take a really strong justification to support exceptions to UTF-8 in a protocol which is otherwise UTF-8.

As an update to a comment I made earlier, it appears XTerm in UTF-8 mode (the default mode in many OSes today) does recognize ESC \ as a string terminator for DCS/SOS/PM/APC, and doesn't recognize 0x9C. With that, it's possible to support the full feature set including DCS etc. within a valid UTF-8 byte stream, and it's compatible with XTerm. So for what it's worth, that's the path that I think makes the most sense.
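
For illustration, a DCS string written that way stays within valid UTF-8 from introducer to terminator (the payload below is an arbitrary placeholder):

```rust
fn main() {
    // ESC P <payload> ESC \ : introducer, arbitrary ASCII payload, 7-bit ST.
    let dcs = b"\x1bPpayload\x1b\\";
    assert!(std::str::from_utf8(dcs).is_ok());

    // The same string terminated with the 8-bit C1 ST is no longer valid UTF-8.
    let dcs_c1 = b"\x1bPpayload\x9c";
    assert!(std::str::from_utf8(dcs_c1).is_err());
}
```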

@chrisduerr
Member

There are several optimizations possible. For example, when in the Utf8 state and a 0x80 arrives, it should be able to dispatch directly into the code for handling UTF-8 continuation bytes. Or when in the Ground state and a 0x80 arrives, it should be possible to dispatch directly into the code for handling an invalid UTF-8 encoding. These kinds of optimizations would be either less effective or more complex, or both, with 8-bit C1 support.

That's a good point and I've considered this myself in the past. It seems like there would be some significant changes necessary, but with a tiny lib like vte that shouldn't be a massive problem either.

Of course, this is just my opinion, but for what it's worth, I suggest "no" here, unless you know of a specific user that needs it. UTF-8 is so important in so many contexts these days, it should take a really strong justification to support exceptions to UTF-8 in a protocol which is otherwise UTF-8.

I think I mostly agree at this point, especially with the primary user of VTE not using C1 escapes at all. Looking at the Redox and strip-ansi-escapes usages of VTE, it seems like they do the same. So that raises the question of why we should bother with C1 escapes at all, when we can make our library better for all currently known users by dropping them.

As an update to a comment I made earlier, it appears XTerm in UTF-8 mode (the default mode in many OSes today) does recognize ESC \ as a string terminator for DCS/SOS/PM/APC, and doesn't recognize 0x9C. With that, it's possible to support the full feature set including DCS etc. within a valid UTF-8 byte stream, and it's compatible with XTerm. So for what it's worth, that's the path that I think makes the most sense.

That does seem to be the most reasonable solution, yes.
