When a text column uses a non-Unicode charset (such as the `latin1` default in MySQL 5.7 and below), values are written into the binlog in that encoding, and it is the reader's responsibility to decode them appropriately. Our replication value decoding currently doesn't do that: it just assumes the bytes are a valid UTF-8 string.

(Backfill queries appear to handle text with non-Unicode character sets just fine, for what it's worth; the bug here is solely in how we handle text values in binlog events.)
We get away with this most of the time because the default character set in modern MySQL / MariaDB versions is a multi-byte UTF-8 encoding, and because in practice most text is ASCII. But when both of those assumptions are violated at once, even in a fairly common case such as a MySQL 5.7 database containing basic accented Latin characters (as in French, Spanish, etc.), we're blatantly mangling the data.
Note that the capture of these columns isn't exactly "incorrect" in the usual sense of the word, since the text we capture is still a consistent representation of the original source data: ASCII code points come through correctly, and every non-ASCII character is replaced with the U+FFFD REPLACEMENT CHARACTER. The problem is that this is almost never what the user actually wants, and we have all the information we need to faithfully translate the non-ASCII code points to their Unicode equivalents.
We should improve on this by:

1. Obtaining the column character set or collation name for each text column during discovery.
2. Using this information to do a charset-aware `[]byte -> string` conversion at the point where we currently assume the bytes are valid UTF-8 and cast to a string.
3. Until we support the complete set of MySQL charset / collation names, throwing an error somewhere along the line when trying to capture a table with an unsupported collation, rather than mistranslating its text.
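The shape of the fix might look like the following sketch (the function name and supported-charset list are hypothetical; a production version would use a charset library such as `golang.org/x/text/encoding/charmap` and cover MySQL's full charset list). The per-column charset would come from discovery, e.g. `SELECT column_name, character_set_name, collation_name FROM information_schema.columns WHERE table_schema = ? AND table_name = ?`.

```go
package main

import "fmt"

// decodeColumnText converts raw binlog bytes to a string according to the
// column's character set, erroring on charsets we don't (yet) support.
func decodeColumnText(charset string, b []byte) (string, error) {
	switch charset {
	case "ascii", "utf8", "utf8mb3", "utf8mb4":
		return string(b), nil // bytes are already valid UTF-8
	case "latin1":
		// ISO-8859-1 maps bytes 1:1 to code points. (MySQL's latin1 is
		// really cp1252; a charset library would handle 0x80-0x9F too.)
		runes := make([]rune, len(b))
		for i, c := range b {
			runes[i] = rune(c)
		}
		return string(runes), nil
	default:
		// Fail loudly rather than silently mistranslate the text.
		return "", fmt.Errorf("unsupported column charset %q", charset)
	}
}

func main() {
	s, err := decodeColumnText("latin1", []byte{0x63, 0x61, 0x66, 0xE9})
	fmt.Println(s, err) // café <nil>

	_, err = decodeColumnText("sjis", []byte{0x83, 0x4A})
	fmt.Println(err) // unsupported column charset "sjis"
}
```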