source-mysql: Fix replication of text columns with non-Unicode character sets #1951

Closed
willdonnelly opened this issue Sep 16, 2024 · 0 comments · Fixed by #1979
Labels
change:planned This is a planned change

willdonnelly commented Sep 16, 2024

When a text column uses a non-Unicode charset (such as the latin1 default used by MySQL 5.7 and below), values are written into the binlog using that encoding and it's the responsibility of the reader to decode them appropriately. Currently our replication value decoding doesn't do that and we just assume the bytes are a valid UTF-8 string.

(For what it's worth, backfill queries appear to handle text with non-Unicode character sets just fine. The bug here is solely in how we handle text values in binlog events.)

We get away with this most of the time because the default character set in modern MySQL / MariaDB versions is a multi-byte UTF-8 encoding, and because in practice most text is ASCII. But when both of those assumptions are violated at the same time, even in a relatively central case like a MySQL 5.7 database with the most basic of accented Latinate characters (as in French/Spanish/etc text), we're pretty blatantly mangling the data.

Note that the capture of these columns isn't exactly "incorrect" in the usual way we use the word, as the text we capture is still a consistent representation of the original source data. The strings will just have all non-ASCII characters replaced with the U+FFFD REPLACEMENT CHARACTER and the ASCII code points captured correctly. The problem here is that this is basically never what the user actually wants and we have all the information we need to faithfully translate the non-ASCII code points to the appropriate Unicode equivalents.
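To make the failure mode concrete, here is a minimal sketch of what happens to latin1 bytes when they are treated as UTF-8. `naiveDecode` is a hypothetical stand-in, not the connector's actual code path, and the issue doesn't say exactly where the replacement happens in the connector. The rune conversion here just reproduces the observable result.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// naiveDecode is a hypothetical helper that treats raw bytes as UTF-8.
// Converting through []rune turns each byte that is not part of a valid
// UTF-8 sequence into U+FFFD, which mirrors the mangling described above.
func naiveDecode(raw []byte) string {
	return string([]rune(string(raw)))
}

func main() {
	// "café" as stored in a latin1 column: é is the single byte 0xE9,
	// which is not valid UTF-8 on its own.
	raw := []byte{0x63, 0x61, 0x66, 0xE9}

	fmt.Println("valid UTF-8:", utf8.Valid(raw))
	fmt.Printf("captured as: %q\n", naiveDecode(raw))
}
```

The ASCII bytes survive untouched and only the accented character is lost, which is why the problem goes unnoticed on mostly-ASCII data.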

We should improve on this by:

  • Obtaining the column character set or collation name for each text column during discovery.
  • Using this information to do a charset-aware []byte -> string conversion at the point where we currently just assume they're valid UTF-8 and cast to a string.
  • So long as we don't support the complete set of all MySQL charset / collation names, we should throw an error somewhere along the line when trying to capture a table with an unsupported collation rather than mistranslating the text.
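A rough sketch of what the last two points could look like, assuming discovery has already recorded a charset name per column (MySQL exposes `CHARACTER_SET_NAME` and `COLLATION_NAME` in `information_schema.COLUMNS`). The function name `decodeText` and the set of supported charsets are assumptions for illustration, not the connector's real API. Note that MySQL's `latin1` is based on cp1252 rather than strict ISO 8859-1, and the two differ in the 0x80 to 0x9F range, so this stdlib-only sketch rejects those bytes instead of guessing. A real implementation would want a proper cp1252 table.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// decodeText (hypothetical) converts raw binlog bytes into a Go string
// based on the column's character set name. Unknown charsets are an error
// rather than a silent mistranslation.
func decodeText(raw []byte, charset string) (string, error) {
	switch strings.ToLower(charset) {
	case "utf8", "utf8mb3", "utf8mb4":
		if !utf8.Valid(raw) {
			return "", fmt.Errorf("invalid UTF-8 in %s column", charset)
		}
		return string(raw), nil
	case "latin1":
		var sb strings.Builder
		for _, b := range raw {
			// cp1252 and ISO 8859-1 disagree on 0x80..0x9F, so this
			// sketch refuses those bytes instead of mistranslating.
			if b >= 0x80 && b <= 0x9F {
				return "", fmt.Errorf("latin1 byte 0x%02X not handled by this sketch", b)
			}
			// For all other bytes the code point equals the byte value.
			sb.WriteRune(rune(b))
		}
		return sb.String(), nil
	default:
		return "", fmt.Errorf("unsupported character set %q", charset)
	}
}

func main() {
	// "café" encoded as latin1.
	s, err := decodeText([]byte{0x63, 0x61, 0x66, 0xE9}, "latin1")
	fmt.Println(s, err)

	// A charset the sketch does not know about fails loudly.
	_, err = decodeText([]byte("abc"), "sjis")
	fmt.Println(err)
}
```

In practice the unsupported-charset check would probably be better done once per table at discovery or capture startup, so the failure shows up before any rows are emitted.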