CosmoCortney/MorphText

MorphText

A class to process and convert different kinds of texts/character encodings appearing in video games and elsewhere.

Encoding Identifiers

These are used to tell certain functions how to process input and output strings.

Supported String Types

UTF16LE, UTF16BE relate to std::wstring, const wchar_t*, and wchar_t* types.

UTF32LE, UTF32BE relate to std::u32string, const char32_t*, and char32_t* types.

All others relate to std::string, const char*, and char* types.

Constructors

MorphText()

Creates an empty instance.

MorphText(<String Type> str, const int encoding)

  • str: input string of any supported type
  • encoding: Encoding identifier of the input string

MorphText(MorphText& other)

Creates a copy of another instance.

  • other: source instance

Operators

=

The left-hand instance becomes a copy of the right-hand one.

Conversions

outT Convert<inT, outT>(inT input, const int inputEncoding, const int outputEncoding)

Converts an input string of one encoding type to another.

  • Template Parameters:
    • inT: The type of the input string.
    • outT: The type of the output string.
  • Parameters
    • input: The input string of type inT
    • inputEncoding: Character encoding identifier of the input string
    • outputEncoding: Character encoding identifier of the output string
  • Returns: A string of type outT, encoded as outputEncoding.

Example:

std::wstring utf16le = MorphText::Convert<const char*, std::wstring>("an example", UTF8, UTF16LE);

Note: To convert the string assigned to a MorphText instance, simply retrieve it in the desired encoding with the GetString() function.
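To make the idea of a conversion concrete, here is a minimal standalone sketch of one direction, UTF-8 to UTF-16. It is not MorphText's implementation; the function name utf8ToUtf16 is hypothetical, the output is in host byte order, and malformed input is not validated.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of MorphText): decodes well-formed UTF-8
// into UTF-16 code units in host byte order. No validation is performed.
std::u16string utf8ToUtf16(const std::string& in)
{
    std::u16string out;
    for (std::size_t i = 0; i < in.size();)
    {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        int extra = 0;       // number of continuation bytes that follow
        char32_t cp = lead;  // code point being assembled
        if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
        else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (int k = 0; k < extra && i < in.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);

        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

A real converter additionally has to reject overlong sequences, stray continuation bytes, and truncated input, which this sketch skips for brevity.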

inT ToLower(inT input, const int encoding)

Creates an all-lowercase copy of the input string.

  • Data types:
    • inT: Any supported string type
  • Parameters
    • input: The input string of type inT
    • encoding: Character encoding identifier of the input string
  • Returns: An all-lowercase string of type inT in the same encoding as the input

Example:

std::string utf8 = MorphText::ToLower("Make Lowercase", UTF8);

inT ToUpper(inT input, const int encoding)

Creates an all-uppercase copy of the input string.

  • Data types:
    • inT: Any supported string type
  • Parameters
    • input: The input string of type inT
    • encoding: Character encoding identifier of the input string
  • Returns: An all-uppercase string of type inT in the same encoding as the input

Example:

std::string utf8 = MorphText::ToUpper("make uppercase", UTF8);

inT ToSarcasm(inT input, const int encoding)

Creates a sarcastic copy of the input string.

  • Data types:
    • inT: Any supported string type
  • Parameters
    • input: The input string of type inT
    • encoding: Character encoding identifier of the input string
  • Returns: A string of type inT in the same encoding as the input, with sarcastic energy

Example:

std::string utf8 = MorphText::ToSarcasm("you shouldn't be using camelcase for your projects", UTF8);
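The three case functions above can be pictured with a small standalone sketch. It handles ASCII letters only and is not MorphText's code; the function names are hypothetical, and the alternating lower/upper pattern used for the "sarcastic" variant is an assumption about what that style means.

```cpp
#include <string>

// Hypothetical ASCII-only helpers (not part of MorphText).
static bool isUpperAscii(char c) { return c >= 'A' && c <= 'Z'; }
static bool isLowerAscii(char c) { return c >= 'a' && c <= 'z'; }

std::string toLowerAscii(std::string s)
{
    for (char& c : s)
        if (isUpperAscii(c))
            c = static_cast<char>(c - 'A' + 'a');
    return s;
}

std::string toUpperAscii(std::string s)
{
    for (char& c : s)
        if (isLowerAscii(c))
            c = static_cast<char>(c - 'a' + 'A');
    return s;
}

// Assumed meaning of "sarcastic": letters alternate between lowercase and
// uppercase, starting with lowercase; non-letters do not advance the pattern.
std::string toSarcasmAscii(std::string s)
{
    bool upper = false;
    for (char& c : s)
    {
        if (!isUpperAscii(c) && !isLowerAscii(c))
            continue;
        if (upper && isLowerAscii(c))
            c = static_cast<char>(c - 'a' + 'A');
        else if (!upper && isUpperAscii(c))
            c = static_cast<char>(c - 'A' + 'a');
        upper = !upper;
    }
    return s;
}
```

Characters outside A-Z and a-z, such as umlauts or full-width letters, pass through unchanged here, which is the limitation the ToDo list mentions for the case functions.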

bool Compare(inT lhs, inT rhs, const bool caseSensitive, const int encoding)

Compares two strings for equality.

  • Data types:
    • inT: Any supported string type
  • Parameters
    • lhs: Left-hand side string
    • rhs: Right-hand side string
    • caseSensitive: whether to consider case sensitivity
    • encoding: Encoding identifier of the input strings
  • Returns: true if both strings are identical, otherwise false.

Note

Comparing C-style strings might be faster.

Example:

bool match = MorphText::Compare("test", "test", true, UTF8);
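The role of the caseSensitive flag can be sketched for plain ASCII input as follows. This is an illustration under that assumption, not the library's implementation, and compareAscii is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical ASCII-only equality check (not part of MorphText).
bool compareAscii(const std::string& lhs, const std::string& rhs, bool caseSensitive)
{
    if (lhs.size() != rhs.size())
        return false;

    // Folds A-Z to a-z; every other byte is compared as-is.
    auto fold = [](char c) {
        return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
    };

    for (std::size_t i = 0; i < lhs.size(); ++i)
    {
        const char a = caseSensitive ? lhs[i] : fold(lhs[i]);
        const char b = caseSensitive ? rhs[i] : fold(rhs[i]);
        if (a != b)
            return false;
    }
    return true;
}
```

For multi-byte encodings a byte-wise fold like this is not sufficient, since case pairs such as Ö/ö span several bytes in UTF-8.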

bool Compare(inT rhs, const bool caseSensitive, const int encoding)

Compares the instance against another string for equality.

  • Data types:
    • inT: Any supported string type
  • Parameters
    • rhs: Right-hand side string
    • caseSensitive: whether to consider case sensitivity
    • encoding: Encoding identifier of the input strings
  • Returns: true if both strings are identical, otherwise false.

Note

Comparing C-style strings might be faster.

Example:

MorphText test("test", ASCII);
bool match = test.Compare(std::string("Test"), false, ASCII);

int Find(inT superset, inT subset, const bool caseSensitive, const int encoding)

Finds the first occurrence of a subset string within a superset string.

  • Data types:
    • inT: Any supported string type
  • Parameters
    • superset: String that may contain the substring
    • subset: Substring that may appear within the superset string
    • caseSensitive: whether to consider case sensitivity
    • encoding: Encoding identifier of the input strings
  • Returns: The position at which the subset appears within the superset. Returns -1 if the subset does not occur, and 0 if the subset is empty.

Note

Finding C-style strings might be faster.

Example:

int pos = MorphText::Find("where banana?", "banana", true, ASCII);
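The return-value rules described above (-1 for no match, 0 for an empty subset) can be sketched for ASCII strings like this. The names findAscii and lowerAscii are hypothetical and the code is not taken from MorphText.

```cpp
#include <string>

// Lowercases A-Z only; all other bytes are left untouched.
static std::string lowerAscii(std::string s)
{
    for (char& c : s)
        if (c >= 'A' && c <= 'Z')
            c = static_cast<char>(c - 'A' + 'a');
    return s;
}

// Hypothetical ASCII-only search (not part of MorphText).
// Returns the index of the first match, -1 if there is none,
// and 0 if subset is empty.
int findAscii(const std::string& superset, const std::string& subset, bool caseSensitive)
{
    if (subset.empty())
        return 0;

    const std::string hay = caseSensitive ? superset : lowerAscii(superset);
    const std::string needle = caseSensitive ? subset : lowerAscii(subset);

    const std::string::size_type pos = hay.find(needle);
    return pos == std::string::npos ? -1 : static_cast<int>(pos);
}
```

With the inputs from the example, findAscii("where banana?", "banana", true) yields 6, the zero-based index of the first "b".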

int Find(inT subset, const bool caseSensitive, const int encoding)

Finds the first occurrence of a subset string within the instance.

  • Data types:
    • inT: Any supported string type
  • Parameters
    • subset: Substring that may appear within the instance
    • caseSensitive: whether to consider case sensitivity
    • encoding: Encoding identifier of the input strings
  • Returns: The position at which the subset appears within the instance's string. Returns -1 if the subset does not occur, and 0 if the subset is empty.

Note

Finding C-style strings might be faster.

Example:

MorphText text("where banana?", ASCII);
int pos = text.Find("banana", false, ASCII);

T GetString(const int encoding)

Returns the instance's string by the desired encoding identifier.

  • Template Parameter
    • T: string type
  • Parameter
    • encoding: Encoding identifier of the output strings
  • Returns: The instance's string in the desired encoding and string type

Example:

MorphText test("ニコニコ二ー", UTF8);
test.GetString<std::string>(SHIFT_JIS);

SetString<T>(T input, const int encoding)

Sets the instance's string from an input string in the given encoding.

  • Datatype
    • T: string type
  • Parameters
    • input: The input string of type T
    • encoding: Encoding identifier of the input string
  • Returns: The instance's string in the desired encoding and string type

Example:

MorphText test;
test.SetString("ニコニコ二ー", UTF8);

SetPrimaryEncoding(const int encoding)

Sets the instance's primary encoding.

  • Parameter
    • encoding: Encoding identifier to use as the instance's primary encoding

Print()

A test function that prints all class members. Only available in debug mode.

Test()

A test function that runs all functions. Only available in debug mode.

Using the DLL

C#

Required namespace: System.Runtime.InteropServices

The following shows how to define all conversion functions in your C# class:

//char* to char*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Ansi)]
private static extern IntPtr ConvertCharStringToCharStringUnsafe(byte[] input, int inputEncoding, int outputEncoding);

// char* to wchar_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Ansi)]
private static extern IntPtr ConvertCharStringToWcharStringUnsafe(byte[] input, int inputEncoding, int outputEncoding);

// char* to char32_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Ansi)]
private static extern IntPtr ConvertCharStringToWU32charStringUnsafe(byte[] input, int inputEncoding, int outputEncoding);

// wchar_t* to char*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
private static extern IntPtr ConvertWcharStringToCharStringUnsafe(char[] input, int inputEncoding, int outputEncoding);

// wchar_t* to wchar_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
private static extern IntPtr ConvertWcharStringToWcharStringUnsafe(char[] input, int inputEncoding, int outputEncoding);

// wchar_t* to char32_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
private static extern IntPtr ConvertWcharStringToU32charStringUnsafe(char[] input, int inputEncoding, int outputEncoding);

// char32_t* to char*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern IntPtr ConvertU32charStringToCharStringUnsafe(UInt32[] input, int inputEncoding, int outputEncoding);

// char32_t* to wchar_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Auto)]
private static extern IntPtr ConvertU32charStringToWcharStringUnsafe(UInt32[] input, int inputEncoding, int outputEncoding);

// char32_t* to char32_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern IntPtr ConvertU32charStringToU32charStringUnsafe(UInt32[] input, int inputEncoding, int outputEncoding);

The following functions must be used to free the memory allocated for the converted (output) strings:

//free C++ char*/C# byte[] string
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern void FreeMemoryCharPtr(IntPtr ptr);

//free C++ wchar_t*/C# string
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern void FreeMemoryWcharPtr(IntPtr ptr);

//free C++ char32_t*/C# UInt32[] string
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern void FreeMemoryU32charPtr(IntPtr ptr);
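The reason for dedicated free functions is that a buffer allocated inside the native DLL is invisible to the .NET garbage collector and has to be released by the same runtime that allocated it. A minimal sketch of such an allocate/free pair on the C++ side is shown below. The names DuplicateCharStringUnsafe and FreeCharString are hypothetical, and the use of new[]/delete[] is an assumption about how an export of this kind could be written, not a description of MorphText's internals.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical export: returns a heap copy of the input that the caller owns.
extern "C" char* DuplicateCharStringUnsafe(const char* input)
{
    const std::size_t len = std::strlen(input) + 1; // include the terminator
    char* out = new char[len];
    std::memcpy(out, input, len);
    return out;
}

// Hypothetical export: releases a buffer returned by the function above.
// It must match the allocation (new[] pairs with delete[]).
extern "C" void FreeCharString(char* ptr)
{
    delete[] ptr;
}
```

On the C# side the returned IntPtr is only read (for example with Marshal.PtrToStringUni) and then handed back to the matching free function, as the usage examples do.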

Usage examples:

//single-byte characters to C# string
Byte[] utf8 = new Byte[11] { 0x4d, 0x45, 0x4f, 0xc3, 0x96, 0xc3, 0x9c, 0xc3, 0x84, 0x57, 0x00 };
IntPtr resultPtr = ConvertCharStringToWcharStringUnsafe(utf8, 1 /*utf8*/, 2 /*utf16 little endian*/);
//copy the native string into a managed string before freeing the native buffer
string utf16 = Marshal.PtrToStringUni(resultPtr);
FreeMemoryWcharPtr(resultPtr);

//C# string to C# string
//null-terminated, byte-swapped (big endian) code units
char[] utf16BE = new char[4] { (char)0x4700, (char)0x5300, (char)0x3300, (char)0 };
IntPtr resultPtr = ConvertWcharStringToWcharStringUnsafe(utf16BE, 3 /*utf16 big endian*/, 2 /*utf16 little endian*/);
//copy the native string into a managed string before freeing the native buffer
string utf16 = Marshal.PtrToStringUni(resultPtr);
FreeMemoryWcharPtr(resultPtr);
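What the big-endian to little-endian step amounts to for UTF-16 can be sketched in C++ as a per-code-unit byte swap. This is an illustration with hypothetical function names, assuming a little-endian host; it is not MorphText's code.

```cpp
#include <string>

// Swaps the two bytes of a single UTF-16 code unit.
char16_t swapBytes16(char16_t u)
{
    return static_cast<char16_t>((u << 8) | (u >> 8));
}

// Hypothetical helper: flips the endianness of every code unit in the string.
std::u16string swapUtf16Endianness(std::u16string s)
{
    for (char16_t& u : s)
        u = swapBytes16(u);
    return s;
}
```

Applied to the code units 0x4700, 0x5300, 0x3300 from the example above, the swap gives 0x0047, 0x0053, 0x0033, which is the text "GS3".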

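//quadruple-byte characters (UTF-32 big endian) to C# string
UInt32[] utf32BE = new UInt32[4] { 0x30d00000, 0xdf000000, 0x45f40100, 0 };
//UTF32BE: placeholder constant for the UTF-32 big-endian encoding identifier (its numeric value is not listed here)
IntPtr resultPtr = ConvertU32charStringToWcharStringUnsafe(utf32BE, UTF32BE /*utf32 big endian*/, 2 /*utf16 little endian*/);
//copy the native string into a managed string before freeing the native buffer
string utf16 = Marshal.PtrToStringUni(resultPtr);
FreeMemoryWcharPtr(resultPtr);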
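The values in the last example are byte-swapped code points, so a converter first has to restore the byte order and then split code points above U+FFFF into surrogate pairs. A standalone sketch of both steps follows. The function names are hypothetical, a little-endian host is assumed, and the code is not taken from MorphText.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Reverses the four bytes of a 32-bit value.
std::uint32_t swapBytes32(std::uint32_t v)
{
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v >> 8) & 0x0000FF00u) | (v >> 24);
}

// Hypothetical helper: takes UTF-32 values stored in the opposite byte order
// and returns UTF-16 code units in host byte order. No validation is done.
std::u16string swappedUtf32ToUtf16(const std::vector<std::uint32_t>& in)
{
    std::u16string out;
    for (std::uint32_t raw : in)
    {
        std::uint32_t cp = swapBytes32(raw);
        if (cp < 0x10000u)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Encode as high and low surrogate.
            cp -= 0x10000u;
            out.push_back(static_cast<char16_t>(0xD800u + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00u + (cp & 0x3FFu)));
        }
    }
    return out;
}
```

For the three values above this produces U+D030, U+00DF, and the surrogate pair 0xD83D 0xDC45 for U+1F445.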

ToDo

  • check if double-byte characters of Shift-Jis are stored in LE on LE machines
  • check if double-byte characters of KS X 1001 are stored in BE on BE machines and in LE on LE machines
  • public static C-String type conversion specialization (convertToUTF8, convertFromUTF8)
  • fix convertToUTF8(), convertFromUTF8(), Convert() to be able to use references of std::string, std::wstring, and std::u32string
  • Pokémon character encodings (Gen II and later + spin-offs)
  • add Shift-Jis CP10001/2000, Shift-Jis CP10001/2016
  • improve ToLower, ToUpper, ToSarcasm functions by specializing them and considering characters like umlauts, full-width letters, etc
  • improve comparisons by specializing them for each encoding
  • make member comparison overloads for c-style input string work
  • specialize findRaw() function for any other encoding than ASCII or any other UTF type to consider umlauts, fullwidth letters, etc for case insensitivity
  • test on a big-endian system
    • add necessary endianness checks to UTF-16 and UTF-32 operations

Credits