MorphText

A class for processing and converting text between the character encodings found in video games and elsewhere.

Encoding Identifiers

These are used to tell certain functions how to process input and output strings.

PRIMARY: Performs the operation using the instance's previously set primary encoding
UTF8: UTF-8
UTF16LE: UTF-16 in Little Endian
UTF16BE: UTF-16 in Big Endian
UTF32LE: UTF-32 in Little Endian
UTF32BE: UTF-32 in Big Endian
ASCII: ASCII
ISO_8859_1: ISO-8859-1. Available aliases: LATIN1
ISO_8859_2: ISO-8859-2. Available aliases: LATIN2
ISO_8859_3: ISO-8859-3. Available aliases: LATIN3
ISO_8859_4: ISO-8859-4. Available aliases: LATIN4
ISO_8859_5: ISO-8859-5. Available aliases: CYRILLIC
ISO_8859_6: ISO-8859-6. Available aliases: ARABIC
ISO_8859_7: ISO-8859-7. Available aliases: GREEK
ISO_8859_8: ISO-8859-8. Available aliases: HEBREW
ISO_8859_9: ISO-8859-9. Available aliases: TURKISH, LATIN5
ISO_8859_10: ISO-8859-10. Available aliases: NORDIC, LATIN6
ISO_8859_11: ISO-8859-11. Available aliases: THAI
ISO_8859_13: ISO-8859-13. Available aliases: BALTIC, LATIN7
ISO_8859_14: ISO-8859-14. Available aliases: CELTIC, LATIN8
ISO_8859_15: ISO-8859-15. Available aliases: WEST_EUROPEAN, LATIN9
ISO_8859_16: ISO-8859-16. Available aliases: SOUTHEAST_EUROPEAN, LATIN10
SHIFTJIS_CP932: Shift JIS Code Page 932. Available aliases: CP932, SHIFT_JIS_CP932, SJIS932, MS932
JIS_X_0201_FULLWIDTH: JIS X 0201 in full-width katakana
JIS_X_0201_HALFWIDTH: JIS X 0201 in half-width katakana
KS_X_1001: KS X 1001. Available aliases: EUC_KR, KS_C_5601
POKEMON_GEN1_ENGLISH: Pokémon Gen I English
POKEMON_GEN1_FRENCH_GERMAN: Pokémon Gen I French & German
POKEMON_GEN1_ITALIAN_SPANISH: Pokémon Gen I Italian & Spanish
POKEMON_GEN1_JAPANESE: Pokémon Gen I Japanese
POKEMON_GEN2_ENGLISH: Pokémon Gen II English
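
The LE and BE variants of an identifier differ only in the byte order of each code unit. The following is a minimal sketch of that difference for UTF-16, not code taken from MorphText; the helper names `toBytesLE` and `toBytesBE` are hypothetical.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helpers, not part of MorphText: split one UTF-16 code unit
// into its two bytes in little-endian or big-endian order.
std::array<std::uint8_t, 2> toBytesLE(char16_t unit)
{
    // Low byte first, then high byte.
    return { static_cast<std::uint8_t>(unit & 0xFF),
             static_cast<std::uint8_t>(unit >> 8) };
}

std::array<std::uint8_t, 2> toBytesBE(char16_t unit)
{
    // High byte first, then low byte.
    return { static_cast<std::uint8_t>(unit >> 8),
             static_cast<std::uint8_t>(unit & 0xFF) };
}
```

For the code unit 0x0047 ('G'), `toBytesLE` yields the bytes 47 00 and `toBytesBE` yields 00 47. Reading the big-endian bytes on a little-endian machine without swapping gives 0x4700, which is the kind of value used for UTF-16BE input in the C# usage examples.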

Supported String Types

UTF16LE, UTF16BE relate to std::wstring, const wchar_t*, and wchar_t* types.

UTF32LE, UTF32BE relate to std::u32string, const char32_t*, and char32_t* types.

All others relate to std::string, const char*, and char* types.
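
As an illustration of how the three string families differ, the sketch below holds the same two-character text in each of them and relies only on the standard library. The variable names are hypothetical. Note that `wchar_t` is 16 bits wide on Windows but usually 32 bits on other platforms, so treating `std::wstring` as UTF-16 is an assumption that holds for Windows builds.

```cpp
#include <string>

// The same text "Öl" (U+00D6, U+006C) in the three string families.
// The UTF-8 bytes are written as escapes so the example does not depend
// on the encoding of the source file.
const std::string    utf8Text  = "\xC3\x96" "l"; // 3 code units (bytes)
const std::wstring   wideText  = L"\u00D6l";     // 2 code units
const std::u32string utf32Text = U"\u00D6l";     // 2 code units
```

The length of a `std::string` counts bytes rather than characters, which is why the UTF-8 version is one unit longer than the other two.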

Constructors

`MorphText()`

Creates an empty instance.

`MorphText(<String Type> str, const int encoding)`

str: input string of any supported type
encoding: Encoding identifier of the input string

`MorphText(MorphText& other)`

Creates a copy of another instance.

other: source instance

Operators

=

The left-hand instance becomes a copy of the right-hand one.

Conversions

`outT Convert<inT, outT>(inT input, const int inputEncoding, const int outputEncoding)`

Converts an input string of one encoding type to another.

Template Parameters:
- inT: The type of the input string.
- outT: The type of the output string.
Parameters
- input: The input string of type inT
- inputEncoding: Character encoding identifier of the input string
- outputEncoding: Character encoding identifier of the output string
Returns: A string of type outT, encoded as outputEncoding.

Example:

std::wstring utf16le = MorphText::Convert<const char*, std::wstring>("an example", UTF8, UTF16LE);

Note: To convert the string assigned to a MorphText instance, simply retrieve it with the GetString() function and the desired output encoding.
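
To make clearer what a UTF8 to UTF16LE conversion involves, here is a standalone sketch that decodes UTF-8 into code points and re-encodes them as UTF-16 code units. It is not MorphText's implementation; the function name `utf8ToUtf16` is hypothetical, and the sketch assumes well-formed input and does no error handling.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of MorphText.
// Decodes well-formed UTF-8 and re-encodes it as UTF-16 code units.
// Simplified sketch: malformed input is not detected.
std::u16string utf8ToUtf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const auto lead = static_cast<std::uint8_t>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 1;

        // The lead byte determines the sequence length and its payload bits.
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }

        // Each continuation byte contributes six payload bits.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<std::uint8_t>(in[i + k]) & 0x3F);
        i += len;

        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Code points above U+FFFF become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The result holds code units in the host's native byte order; producing a specific byte order such as UTF16BE would need an additional byte swap on a little-endian machine.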

`inT ToLower(inT input, const int encoding)`

Creates an all-lowercase copy of the input string.

Data types:
- inT: Any supported string type
Parameters
- input: The input string of type inT
- encoding: Character encoding identifier of the input string
Returns: An all-lowercase string of type inT in the same encoding as the input

Example:

std::string utf8 = MorphText::ToLower("Make Lowercase", UTF8);
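
The ToDo list mentions that case conversion does not yet consider characters such as umlauts. The sketch below shows what an ASCII-only lowercasing looks like and why non-ASCII characters pass through unchanged. It is an illustration under that assumption, not MorphText's code, and `toLowerAscii` is a hypothetical name.

```cpp
#include <string>

// Hypothetical helper, not part of MorphText.
// ASCII-only lowercasing: bytes outside 'A'..'Z' are copied unchanged,
// so multi-byte UTF-8 sequences such as "Ö" (0xC3 0x96) are not folded.
std::string toLowerAscii(std::string text)
{
    for (char& c : text)
    {
        if (c >= 'A' && c <= 'Z')
            c = static_cast<char>(c - 'A' + 'a');
    }
    return text;
}
```

With this approach "Make Lowercase" becomes "make lowercase", while the UTF-8 bytes of "Ö" stay as they are.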

`inT ToUpper(inT input, const int encoding)`

Creates an all-uppercase copy of the input string.

Data types:
- inT: Any supported string type
Parameters
- input: The input string of type inT
- encoding: Character encoding identifier of the input string
Returns: An all-uppercase string of type inT in the same encoding as the input

Example:

std::string utf8 = MorphText::ToUpper("make uppercase", UTF8);

`inT ToSarcasm(inT input, const int encoding)`

Creates a sarcastic copy of the input string.

Data types:
- inT: Any supported string type
Parameters
- input: The input string of type inT
- encoding: Character encoding identifier of the input string
Returns: A string of type inT in the same encoding as the input, with sarcastic energy

Example:

std::string utf8 = MorphText::ToSarcasm("you shouldn't be using camelcase for your projects", UTF8);

`bool Compare(inT lhs, inT rhs, const bool caseSensitive, const int encoding)`

Compares two strings for equality.

Data types:
- inT: Any supported string type
Parameters
- lhs: Left-hand side string
- rhs: Right-hand side string
- caseSensitive: whether to consider case sensitivity
- encoding: Encoding identifier of the input strings
Returns: true if both strings are identical, otherwise false.

Note

Comparing C-style strings might be faster.

Example:

bool match = MorphText::Compare("test", "test", true, UTF8);
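
The effect of the caseSensitive flag can be sketched with the standard library alone. This is a minimal illustration for ASCII text, not MorphText's implementation; `equalsAscii` is a hypothetical name.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper, not part of MorphText.
// Equality check for ASCII text with an optional case-insensitive mode.
bool equalsAscii(const std::string& lhs, const std::string& rhs, bool caseSensitive)
{
    if (caseSensitive)
        return lhs == rhs;

    // Strings of different length can never match.
    return lhs.size() == rhs.size() &&
           std::equal(lhs.begin(), lhs.end(), rhs.begin(),
                      [](unsigned char a, unsigned char b)
                      { return std::tolower(a) == std::tolower(b); });
}
```

Here "Test" and "test" are equal only when caseSensitive is false.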

`bool Compare(inT rhs, const bool caseSensitive, const int encoding)`

Compares the instance against another string for equality.

Data types:
- inT: Any supported string type
Parameters
- rhs: Right-hand side string
- caseSensitive: whether to consider case sensitivity
- encoding: Encoding identifier of the input string
Returns: true if both strings are identical, otherwise false.

Note

Comparing C-style strings might be faster.

Example:

MorphText text("test", ASCII);
bool match = text.Compare(std::string("Test"), false, ASCII);

`int Find(inT superset, inT subset, const bool caseSensitive, const int encoding)`

Finds the occurrence of a subset string within a superset string.

Data types:
- inT: Any supported string type
Parameters
- superset: String that may contain the substring
- subset: Substring that may appear within the superset string
- caseSensitive: whether to consider case sensitivity
- encoding: Encoding identifier of the input strings
Returns: The position at which the subset occurs within the superset, -1 if the subset does not occur, or 0 if the subset is empty.

Note

Finding C-style strings might be faster.

Example:

int pos = MorphText::Find("where banana?", "banana", true, ASCII);
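
The return contract described above (position, -1 when absent, 0 for an empty subset) can be sketched for ASCII text as follows. This is an illustration built on `std::string::find`, not MorphText's implementation; `findAscii` is a hypothetical name.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of MorphText.
// Returns the position of subset within superset, or -1 if it does not occur.
// An empty subset yields 0 because std::string::find matches it at the start.
int findAscii(std::string superset, std::string subset, bool caseSensitive)
{
    if (!caseSensitive)
    {
        // Fold both strings to lowercase so the search ignores case.
        auto lower = [](unsigned char c) { return static_cast<char>(std::tolower(c)); };
        std::transform(superset.begin(), superset.end(), superset.begin(), lower);
        std::transform(subset.begin(), subset.end(), subset.begin(), lower);
    }
    const std::size_t pos = superset.find(subset);
    return pos == std::string::npos ? -1 : static_cast<int>(pos);
}
```

For the input "where banana?" and the subset "banana" this sketch returns 6.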

`int Find(inT subset, const bool caseSensitive, const int encoding)`

Finds the occurrence of a subset string within the instance.

Data types:
- inT: Any supported string type
Parameters
- subset: Substring that may appear within the instance
- caseSensitive: whether to consider case sensitivity
- encoding: Encoding identifier of the input string
Returns: The position at which the subset occurs within the instance's string, -1 if the subset does not occur, or 0 if the subset is empty.

Note

Finding C-style strings might be faster.

Example:

MorphText text("where banana?", ASCII);
int pos = text.Find("banana", false, ASCII);

`T GetString(const int encoding)`

Returns the instance's string converted to the desired encoding.

Template Parameter
- T: string type
Parameter
- encoding: Encoding identifier of the output strings
Returns: The instance's string in the desired encoding and string type

Example:

MorphText test("ニコニコ二ー", UTF8);
std::string sjis = test.GetString<std::string>(SHIFTJIS_CP932);

`SetString<T>(T input, const int encoding)`

Sets the instance's string from an input string in the given encoding.

Datatype
- T: string type
Parameter
- input: The input string of type T
- encoding: Encoding identifier of the input string
Returns: The instance's string in the desired encoding and string type

Example:

MorphText test;
test.SetString("ニコニコ二ー", UTF8);

`SetPrimaryEncoding(const int encoding)`

Sets the instance's primary encoding, which is the encoding used when PRIMARY is passed as the encoding identifier.

Parameter
- encoding: Encoding identifier to use as the primary encoding

`Print()`

A test function that prints all class members. Only available in debug mode.

`Test()`

A test function that runs all functions. Only available in debug mode.

Using the DLL

C#

Required namespace: System.Runtime.InteropServices

The following shows how to declare all conversion functions in your C# class:

// char* to char*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Ansi)]
private static extern IntPtr ConvertCharStringToCharStringUnsafe(byte[] input, int inputEncoding, int outputEncoding);

// char* to wchar_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Ansi)]
private static extern IntPtr ConvertCharStringToWcharStringUnsafe(byte[] input, int inputEncoding, int outputEncoding);

// char* to char32_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Ansi)]
private static extern IntPtr ConvertCharStringToWU32charStringUnsafe(byte[] input, int inputEncoding, int outputEncoding);

// wchar_t* to char*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
private static extern IntPtr ConvertWcharStringToCharStringUnsafe(char[] input, int inputEncoding, int outputEncoding);

// wchar_t* to wchar_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
private static extern IntPtr ConvertWcharStringToWcharStringUnsafe(char[] input, int inputEncoding, int outputEncoding);

// wchar_t* to char32_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
private static extern IntPtr ConvertWcharStringToU32charStringUnsafe(char[] input, int inputEncoding, int outputEncoding);

// char32_t* to char*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern IntPtr ConvertU32charStringToCharStringUnsafe(UInt32[] input, int inputEncoding, int outputEncoding);

// char32_t* to wchar_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Auto)]
private static extern IntPtr ConvertU32charStringToWcharStringUnsafe(UInt32[] input, int inputEncoding, int outputEncoding);

// char32_t* to char32_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern IntPtr ConvertU32charStringToU32charStringUnsafe(UInt32[] input, int inputEncoding, int outputEncoding);

The following functions must be used to free the memory allocated for the converted (output) strings:

//free C++ char*/C# byte[] string
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern void FreeMemoryCharPtr(IntPtr ptr);

//free C++ wchar_t*/C# string
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern void FreeMemoryWcharPtr(IntPtr ptr);

//free C++ char32_t*/C# UInt32[] string
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern void FreeMemoryU32charPtr(IntPtr ptr);

Usage examples:

// single-byte code units (UTF-8) to C# string
byte[] utf8 = new byte[11] { 0x4d, 0x45, 0x4f, 0xc3, 0x96, 0xc3, 0x9c, 0xc3, 0x84, 0x57, 0x00 };
IntPtr resultPtr = ConvertCharStringToWcharStringUnsafe(utf8, 1 /*utf8*/, 2 /*utf16 little endian*/);
string utf16 = Marshal.PtrToStringUni(resultPtr); // copy into a managed string before freeing
FreeMemoryWcharPtr(resultPtr);

// C# string to C# string
char[] utf16BE = new char[4] { (char)0x4700, (char)0x5300, (char)0x3300, (char)0 }; // null-terminated like the other inputs
IntPtr resultPtr = ConvertWcharStringToWcharStringUnsafe(utf16BE, 3 /*utf16 big endian*/, 2 /*utf16 little endian*/);
string utf16 = Marshal.PtrToStringUni(resultPtr); // copy into a managed string before freeing
FreeMemoryWcharPtr(resultPtr);

// four-byte code units (UTF-32) to C# string
UInt32[] utf32BE = new UInt32[4] { 0x30d00000, 0xdf000000, 0x45f40100, 0 };
// 5 is assumed to be UTF32BE, following the order of the identifier list (1 = UTF8, 2 = UTF16LE, 3 = UTF16BE)
IntPtr resultPtr = ConvertU32charStringToWcharStringUnsafe(utf32BE, 5 /*utf32 big endian*/, 2 /*utf16 little endian*/);
string utf16 = Marshal.PtrToStringUni(resultPtr); // copy into a managed string before freeing
FreeMemoryWcharPtr(resultPtr);
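
The 32-bit values in the last example look unusual because they are big-endian code units as seen by a little-endian host. The sketch below shows the byte swap that recovers the code points. It is a standalone illustration, not part of MorphText or its DLL interface; `swapBytes32` is a hypothetical name.

```cpp
#include <cstdint>

// Hypothetical helper, not part of MorphText.
// Reverses the byte order of one 32-bit code unit.
constexpr std::uint32_t swapBytes32(std::uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// The three array values from the usage example, swapped back to code points.
static_assert(swapBytes32(0x30d00000u) == 0x0000D030u, "U+D030");
static_assert(swapBytes32(0xdf000000u) == 0x000000DFu, "U+00DF");
static_assert(swapBytes32(0x45f40100u) == 0x0001F445u, "U+1F445");
```

Applying the swap twice returns the original value, so the same function converts in both directions.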

ToDo

check if double-byte characters of Shift-Jis are stored in LE on LE machines
check if double-byte characters of KS X 1001 are stored in BE on BE machines and in LE on LE machines
public static C-String type conversion specialization (convertToUTF8, convertFromUTF8)
fix convertToUTF8(), convertFromUTF8(), Convert() to be able to use references of std::string, std::wstring, and std::u32string
Pokémon character encodings (Gen II and later + spin-offs)
add Shift-Jis CP10001/2000, Shift-Jis CP10001/2016
improve ToLower, ToUpper, ToSarcasm functions by specializing them and considering characters like umlauts, full-width letters, etc
improve comparisons by specializing them for each encoding
make member comparison overloads for c-style input string work
specialize the findRaw() function for encodings other than ASCII and the UTF types so that case-insensitive searches consider umlauts, full-width letters, etc.
test on a big-endian system
- add necessary endianness checks to UTF-16 and UTF-32 operations

Credits

Lawn Meower: Idea, Code
sozysozbot: original KS X 1001 table
Bulbapedia wiki at Bulbagarden: Documenting the Pokémon character encodings
