Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process all non-ASCII bytes with the UTF-8 parser. #58

Closed
wants to merge 1 commit into from

Conversation

sunfishcode
Copy link
Contributor

The UTF-8 parser knows how to handle invalid byte sequences, so don't
preprocess the input in the main parser; just hand any non-ASCII byte to
the UTF-8 parser to handle.

This includes what were previously interpreted as 8-bit C1 control codes;
they are now interpreted as UTF-8 continuation characters.

This is the first in a series of patches which together fix #38 and a few related issues, in total enabling vte to process non-UTF-8 bytes in a manner consistent with Rust's String::from_utf8_lossy.

Removing support for the 8-bit C1 codes isn't necessary to achieve this, and I can revise this patch to restore support if desired, but removing them does simplify the table and the API. And the 8-bit C1 codes are rarely used and not well supported anyway. vte already doesn't appear to recognize 8-bit DCS, CSI, OSC, PM, or APC, Alacritty doesn't appear to intepret 8-bit NEL, SS2, or SS3, and the remaining 8-bit C1 codes are rarely used.

The UTF-8 parser knows how to handle invalid byte sequences, so don't
preprocess the input in the main parser; just hand any non-ASCII byte to
the UTF-8 parser to handle.

This includes what were previously interpreted as 8-bit C1 control codes;
they are now interpreted as UTF-8 continuation characters.
Copy link
Member

@chrisduerr chrisduerr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why exactly would we do this? We want execute to be called.

This is the first in a series of patches which together fix #38 and a few related issues, in total enabling vte to process non-UTF-8 bytes in a manner consistent with Rust's String::from_utf8_lossy.

If your goal is to implement something, could you please send the entire pull request to add this so the approach you've taken can be properly evaluated as a whole?

Unless the patch itself provides a direct value, there's really no reason to split them up. As far as I can tell, this patch just breaks things.

@sunfishcode
Copy link
Contributor Author

I've now posted #60 with the full series this is part of.

The motivation for this is that from a UTF-8 perspective, 8-bit C1 control codes are continuation bytes, and if they appear alone, they form invalid UTF-8 sequences. Passing them to execute allows users to emit replacement characters themselves, but it's a little awkward if vte translates some invalid sequences to replacement characters and not others. It also turns out to be tricky to emit the expected number of replacement characters from execute because vte is processing some of the bytes and not others. Since vte is emitting replacement characters for some invalid sequences, and has most of the logic needed to determine how many replacement characters to emit, the simplest approach is just to have vte emit the replacement characters.

@chrisduerr
Copy link
Member

Closing in favor of #60, I'll take a look once that is in a reviewable state.

@chrisduerr chrisduerr closed this Jun 15, 2020
@sunfishcode sunfishcode deleted the non-ascii-bytes branch June 15, 2020 19:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Handling invalid UTF-8 bytes
2 participants