ICU stuff #115

Merged
merged 56 commits into from
Jun 30, 2021
3cda4e7
start making a normalization tool
mr-martian May 22, 2021
c9170d0
move normalization to a different repo, simpler ICU check
mr-martian May 26, 2021
dda9502
missed a line
mr-martian May 26, 2021
7a38b4e
makefile cleanup
mr-martian May 26, 2021
ad15367
the long march part 1
mr-martian May 27, 2021
44ef72a
the long march part 2
mr-martian Jun 1, 2021
b99672a
the long march part 3 (compiles, but tests fail)
mr-martian Jun 1, 2021
c7208ea
lt-comp seems to be working now
mr-martian Jun 1, 2021
2e9fac4
lt-print seems to be working
mr-martian Jun 1, 2021
0624071
lt-proc (unweighted) working
mr-martian Jun 2, 2021
3a6ab4d
all tests now pass
mr-martian Jun 2, 2021
ab3ecd0
add a non-BMP test
mr-martian Jun 2, 2021
1679bb5
add the file used in the test :p
mr-martian Jun 2, 2021
ac7867f
use utf-32 sometimes and some type cleanup
mr-martian Jun 3, 2021
b5d6e07
utf-32 in monodix and some type cleanup
mr-martian Jun 3, 2021
0eb748f
cleverness is to be avoided (investigating #85)
mr-martian Jun 3, 2021
3e4bc43
yet more type cleanup
mr-martian Jun 3, 2021
257d33c
finish eliminating wchar and make more use of helper functions
mr-martian Jun 3, 2021
eee5c37
no more need for windows compatibility header
mr-martian Jun 3, 2021
4bd6719
get python bindings to compile
mr-martian Jun 3, 2021
3f8cbb3
see if we can get the Travis tests working
mr-martian Jun 3, 2021
db59366
eliminate use of wide streams
mr-martian Jun 4, 2021
3a293af
extracting string constants
mr-martian Jun 4, 2021
d6c16f9
don't need the whole converter for 1 codepoint
mr-martian Jun 4, 2021
397a7f2
drop unused helpers, add copywrite headers, use _unlocked everywhere
mr-martian Jun 4, 2021
6ee59d5
typo
mr-martian Jun 4, 2021
8fca95e
blah
mr-martian Jun 4, 2021
e14b45b
ok fine, I'll put the tab back
mr-martian Jun 4, 2021
feb2f39
missed some bad casts
mr-martian Jun 4, 2021
2d2abd8
typo
mr-martian Jun 4, 2021
ce3eb90
my continuing battle with indentation and yaml
mr-martian Jun 4, 2021
8a7a0fc
.gitignore cleanup and darn it yaml!
mr-martian Jun 4, 2021
e317a78
assorted nits
mr-martian Jun 4, 2021
dd11193
helper functions for use in apertium
mr-martian Jun 6, 2021
43e225b
more helper stuff
mr-martian Jun 7, 2021
51b0651
add << UChar for newer g++ and switch << back to std::ostream
mr-martian Jun 10, 2021
d49b84b
another helper (not symmetric - should probably fix that)
mr-martian Jun 11, 2021
f7d6a4b
unbundle utfcpp
mr-martian Jun 11, 2021
e30103c
fix tests?
mr-martian Jun 11, 2021
76d0bd1
typo in package name
mr-martian Jun 11, 2021
7d9c359
try again
mr-martian Jun 11, 2021
5360b40
it helps to edit the right test file
mr-martian Jun 11, 2021
210c645
another helper (rather than define this in every repo)
mr-martian Jun 11, 2021
62bb1df
move string_utils into lttoolbox and casehandle better
mr-martian Jun 14, 2021
8d1b620
add caseless compare helper to replace (tolower(a) == tolower(b))
mr-martian Jun 14, 2021
3c17f4c
move xml iterator to lttoolbox
mr-martian Jun 14, 2021
b342dcb
another helper
mr-martian Jun 14, 2021
b86d375
Merge branch 'master' into icu
mr-martian Jun 15, 2021
129e45b
typo in merge
mr-martian Jun 15, 2021
04d5a87
incorporate optimizations from #114
mr-martian Jun 15, 2021
d423e9a
small bugs
mr-martian Jun 15, 2021
9b075d6
move constant initializers to header and make more use of helpers
mr-martian Jun 16, 2021
2cc8be3
make to_ustring() use unsigned chars
mr-martian Jun 17, 2021
a2acea4
version bump
mr-martian Jun 17, 2021
5bd42f2
final elimination of wide strings
mr-martian Jun 17, 2021
96bab35
InputFile block reading should respect null flush
mr-martian Jun 18, 2021
2 changes: 2 additions & 0 deletions .gitignore
@@ -80,3 +80,5 @@
*.egg-info/
*.egg
**/.mypy_cache/

*~
3 changes: 2 additions & 1 deletion configure.ac
@@ -38,7 +38,8 @@ AC_ARG_ENABLE(profile,
[CXXFLAGS="-pg -g -Wall"; CFLAGS="-pg -g -Wall"; LDFLAGS="-pg"])


PKG_CHECK_MODULES(LTTOOLBOX, [libxml-2.0 >= 2.6.17])
PKG_CHECK_MODULES(LIBXML, [libxml-2.0 >= 2.6.17])
PKG_CHECK_MODULES(ICU, [icu-i18n, icu-io, icu-uc])

# Check for wide strings
AC_DEFUN([AC_CXX_WSTRING],[
28 changes: 6 additions & 22 deletions lttoolbox/Makefile.am
@@ -3,12 +3,13 @@ h_sources = alphabet.h att_compiler.h buffer.h compiler.h compression.h \
deserialiser.h entry_token.h expander.h fst_processor.h lt_locale.h \
ltstr.h match_exe.h match_node.h match_state.h my_stdio.h node.h \
pattern_list.h regexp_compiler.h serialiser.h sorted_vector.h state.h \
string_utils.h \
transducer.h trans_exe.h xml_parse_util.h exception.h tmx_compiler.h \
string_to_wostream.h
cc_sources = alphabet.cc att_compiler.cc compiler.cc compression.cc entry_token.cc \
expander.cc fst_processor.cc lt_locale.cc match_exe.cc \
match_node.cc match_state.cc node.cc pattern_list.cc \
regexp_compiler.cc sorted_vector.cc state.cc transducer.cc \
regexp_compiler.cc sorted_vector.cc state.cc string_utils.cc transducer.cc \
trans_exe.cc xml_parse_util.cc tmx_compiler.cc

library_includedir = $(includedir)/$(PACKAGE_NAME)-$(VERSION_API)/$(PACKAGE_NAME)
@@ -27,33 +28,16 @@ lttoolboxlib = $(prefix)/lib

lttoolbox_DATA = dix.dtd dix.rng dix.rnc acx.rng xsd/dix.xsd xsd/acx.xsd

lt_print_SOURCES = lt_print.cc
lt_print_LDADD = liblttoolbox$(VERSION_MAJOR).la
lt_print_LDFLAGS = -llttoolbox$(VERSION_MAJOR) $(LTTOOLBOX_LIBS)
LDADD = liblttoolbox$(VERSION_MAJOR).la
AM_LDFLAGS = -llttoolbox$(VERSION_MAJOR) $(LIBXML_LIBS) $(ICU_LIBS)

lt_print_SOURCES = lt_print.cc
lt_trim_SOURCES = lt_trim.cc
lt_trim_LDADD = liblttoolbox$(VERSION_MAJOR).la
lt_trim_LDFLAGS = -llttoolbox$(VERSION_MAJOR) $(LTTOOLBOX_LIBS)

lt_comp_SOURCES = lt_comp.cc
lt_comp_LDADD = liblttoolbox$(VERSION_MAJOR).la
lt_comp_LDFLAGS = -llttoolbox$(VERSION_MAJOR) $(LTTOOLBOX_LIBS)

lt_proc_SOURCES = lt_proc.cc
lt_proc_LDADD = liblttoolbox$(VERSION_MAJOR).la
lt_proc_LDFLAGS = -llttoolbox$(VERSION_MAJOR) $(LTTOOLBOX_LIBS)

lt_expand_SOURCES = lt_expand.cc
lt_expand_LDADD = liblttoolbox$(VERSION_MAJOR).la
lt_expand_LDFLAGS = -llttoolbox$(VERSION_MAJOR) $(LTTOOLBOX_LIBS)

lt_tmxcomp_SOURCES = lt_tmxcomp.cc
lt_tmxcomp_LDADD = liblttoolbox$(VERSION_MAJOR).la
lt_tmxcomp_LDFLAGS = -llttoolbox$(VERSION_MAJOR) $(LTTOOLBOX_LIBS)

lt_tmxproc_SOURCES = lt_tmxproc.cc
lt_tmxproc_LDADD = liblttoolbox$(VERSION_MAJOR).la
lt_tmxproc_LDFLAGS = -llttoolbox$(VERSION_MAJOR) $(LTTOOLBOX_LIBS)

#lt-validate-dictionary: Makefile.am validate-header.sh
# @echo "Creating lt-validate-dictionary script"
@@ -67,7 +51,7 @@ lt_tmxproc_LDFLAGS = -llttoolbox$(VERSION_MAJOR) $(LTTOOLBOX_LIBS)

man_MANS = lt-comp.1 lt-expand.1 lt-proc.1 lt-tmxcomp.1 lt-tmxproc.1 lt-print.1 lt-trim.1

INCLUDES = -I$(top_srcdir) $(LTTOOLBOX_CFLAGS)
INCLUDES = -I$(top_srcdir) $(LIBXML_CFLAGS) $(ICU_CFLAGS)
if WINDOWS
INCLUDES += -I$(top_srcdir)/utf8
endif
37 changes: 18 additions & 19 deletions lttoolbox/alphabet.cc
@@ -26,11 +26,10 @@
#include <cwchar>
#include <cwctype>

#if defined(_WIN32) && !defined(_MSC_VER)
#include <utf8_fwrap.h>
#endif
#include "string_utils.h"

using namespace std;
using namespace icu;

Alphabet::Alphabet()
{
@@ -74,7 +73,7 @@ Alphabet::copy(Alphabet const &a)
}

void
Alphabet::includeSymbol(wstring const &s)
Alphabet::includeSymbol(UnicodeString const &s)
{
if(slexic.find(s) == slexic.end())
{
@@ -99,13 +98,13 @@ Alphabet::operator()(int const c1, int const c2)
}

int
Alphabet::operator()(wstring const &s)
Alphabet::operator()(UnicodeString const &s)
{
return slexic[s];
}

int
Alphabet::operator()(wstring const &s) const
Alphabet::operator()(UnicodeString const &s) const
{
auto it = slexic.find(s);
if (it == slexic.end()) {
@@ -115,7 +114,7 @@ Alphabet::operator()(wstring const &s) const
}

bool
Alphabet::isSymbolDefined(wstring const &s)
Alphabet::isSymbolDefined(UnicodeString const &s)
{
return slexic.find(s) != slexic.end();
}
@@ -133,7 +132,7 @@ Alphabet::write(FILE *output)
Compression::multibyte_write(slexicinv.size(), output); // taglist size
for(unsigned int i = 0, limit = slexicinv.size(); i < limit; i++)
{
Compression::wstring_write(slexicinv[i].substr(1, slexicinv[i].size()-2), output);
Compression::string_write(slexicinv[i].tempSubString(1, slexicinv[i].length()-2), output);
}

// Then we write the list of pairs
@@ -160,7 +159,7 @@ Alphabet::read(FILE *input)
while(tam > 0)
{
tam--;
wstring mytag = L"<" + Compression::wstring_read(input) + L">";
UnicodeString mytag = "<" + Compression::string_read(input) + ">";
a_new.slexicinv.push_back(mytag);
a_new.slexic[mytag]= -a_new.slexicinv.size(); // ToDo: This does not turn the result negative due to unsigned semantics
}
@@ -185,7 +184,7 @@ Alphabet::read(FILE *input)
void
Alphabet::serialise(std::ostream &serialised) const
{
Serialiser<const vector<wstring> >::serialise(slexicinv, serialised);
Serialiser<const vector<UnicodeString> >::serialise(slexicinv, serialised);
Serialiser<vector<pair<int, int> > >::serialise(spairinv, serialised);
}

@@ -196,7 +195,7 @@ Alphabet::deserialise(std::istream &serialised)
slexic.clear();
spairinv.clear();
spair.clear();
slexicinv = Deserialiser<vector<wstring> >::deserialise(serialised);
slexicinv = Deserialiser<vector<UnicodeString> >::deserialise(serialised);
for (size_t i = 0; i < slexicinv.size(); i++) {
slexic[slexicinv[i]] = -i - 1; // ToDo: This does not turn the result negative due to unsigned semantics
}
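
A note on the two ToDo comments above (`-a_new.slexicinv.size()` and `-i - 1`): both operands are unsigned, so the negation wraps around in unsigned arithmetic and a negative value only appears when the result is converted to `int` on assignment. The sketch below is not code from this PR. `tag_code`, `tag_index` and `negated_size_wraps` are hypothetical helper names, and it assumes the number of tags fits in an `int`. It shows one way to make the conversion explicit before negating.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of lttoolbox): code for the tag stored at
// 0-based index i. Converting to int first keeps the arithmetic signed.
inline int tag_code(std::size_t i) {
  return -static_cast<int>(i) - 1;
}

// Hypothetical inverse: index into the tag vector for a negative code.
inline std::size_t tag_index(int code) {
  assert(code < 0);
  return static_cast<std::size_t>(-code - 1);
}

// Unary minus on an unsigned value wraps around instead of going negative,
// so the "negated" size compares greater than the size itself.
inline bool negated_size_wraps(const std::vector<std::string>& tags) {
  return !tags.empty() && -tags.size() > tags.size();
}
```

With helpers like these, the loop in `deserialise` could read `slexic[slexicinv[i]] = tag_code(i);`, which states the intended sign without relying on the implicit unsigned-to-int conversion.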
@@ -207,20 +206,20 @@ Alphabet::deserialise(std::istream &serialised)
}

void
Alphabet::writeSymbol(int const symbol, FILE *output) const
Alphabet::writeSymbol(int const symbol, UFILE *output) const
{
if(symbol < 0)
{
fputws_unlocked(slexicinv[-symbol-1].c_str(), output);
u_fputs(slexicinv[-symbol-1], output);
}
else
{
fputwc_unlocked(static_cast<wchar_t>(symbol), output);
u_fputc(static_cast<UChar>(symbol), output);
}
}

void
Alphabet::getSymbol(wstring &result, int const symbol, bool uppercase) const
Alphabet::getSymbol(UnicodeString &result, int const symbol, bool uppercase) const
{
if(symbol == 0)
{
@@ -231,7 +230,7 @@ Alphabet::getSymbol(wstring &result, int const symbol, bool uppercase) const
{
if(symbol >= 0)
{
result += static_cast<wchar_t>(symbol);
result += static_cast<UChar>(symbol);
}
else
{
@@ -240,7 +239,7 @@ Alphabet::getSymbol(wstring &result, int const symbol, bool uppercase) const
}
else if(symbol >= 0)
{
result += static_cast<wchar_t>(towupper(static_cast<wint_t>(symbol)));
result += static_cast<UChar>(toupper(static_cast<wint_t>(symbol)));
}
else
{
@@ -261,7 +260,7 @@ Alphabet::decode(int const code) const
}

set<int>
Alphabet::symbolsWhereLeftIs(wchar_t l) const {
Alphabet::symbolsWhereLeftIs(UChar l) const {
set<int> eps;
for(const auto& sp: spair) { // [(l, r) : tag]
if(sp.first.first == l) {
@@ -271,7 +270,7 @@ Alphabet::symbolsWhereLeftIs(wchar_t l) const {
return eps;
}

void Alphabet::setSymbol(int symbol, wstring newSymbolString) {
void Alphabet::setSymbol(int symbol, UnicodeString newSymbolString) {
//Should be a special character!
if (symbol < 0) slexicinv[-symbol-1] = newSymbolString;
}
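
A note on the `static_cast<UChar>(symbol)` calls above: `UChar` is a 16-bit UTF-16 code unit, so if `symbol` can ever hold a code point above U+FFFF, the cast keeps only the low 16 bits. The sketch below uses only the standard library, with `char16_t` and `std::u16string` standing in for `UChar` and `UnicodeString`. `append_code_point` is a hypothetical name, not a function from this PR or from ICU. It shows the surrogate-pair encoding that a UTF-32-aware append has to perform, assuming the input is a valid code point.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: append one Unicode code point to a UTF-16 string.
// Code points in the BMP take one code unit; anything above U+FFFF is
// split into a high and a low surrogate.
inline void append_code_point(std::u16string& out, char32_t cp) {
  if (cp < 0x10000) {
    out += static_cast<char16_t>(cp);
  } else {
    cp -= 0x10000;
    out += static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
    out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
  }
}

// What a plain narrowing cast does instead: the upper bits are discarded.
inline char16_t truncating_cast(char32_t cp) {
  return static_cast<char16_t>(cp);
}
```

With ICU itself, `UnicodeString::append(UChar32)` performs the same encoding, so `result.append(static_cast<UChar32>(symbol))` may be a simpler alternative to the casts, again assuming `symbol` holds a valid code point.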
25 changes: 13 additions & 12 deletions lttoolbox/alphabet.h
@@ -22,10 +22,11 @@
#include <map>
#include <set>
#include <vector>

#include <lttoolbox/ltstr.h>
#include <unicode/unistr.h>
#include <unicode/ustdio.h>

using namespace std;
using namespace icu;

/**
* Alphabet class.
@@ -38,13 +39,13 @@ class Alphabet
* Symbol-identifier relationship. Only contains <tags>.
* @see slexicinv
*/
map<wstring, int, Ltstr> slexic;
map<UnicodeString, int> slexic;

/**
* Identifier-symbol relationship. Only contains <tags>.
* @see slexic
*/
vector<wstring> slexicinv;
vector<UnicodeString> slexicinv;


/**
@@ -89,7 +90,7 @@ class Alphabet
/**
* Include a symbol into the alphabet.
*/
void includeSymbol(wstring const &s);
void includeSymbol(UnicodeString const &s);

/**
* Get an unique code for every symbol pair. This flavour is for
@@ -99,22 +100,22 @@ class Alphabet
* @return code for (c1, c2).
*/
int operator()(int const c1, int const c2);
int operator()(wstring const &s) const;
int operator()(UnicodeString const &s) const;

/**
* Gets the individual symbol identifier. Assumes it already exists!
* @see isSymbolDefined to check if it exists first.
* @param s symbol to be identified.
* @return symbol identifier.
*/
int operator()(wstring const &s);
int operator()(UnicodeString const &s);

/**
* Check wether the symbol is defined in the alphabet.
* @param s symbol
* @return true if defined
*/
bool isSymbolDefined(wstring const &s);
bool isSymbolDefined(UnicodeString const &s);

/**
* Returns the size of the alphabet (number of symbols).
@@ -142,15 +143,15 @@ class Alphabet
* @param symbol symbol code.
* @param output output stream.
*/
void writeSymbol(int const symbol, FILE *output) const;
void writeSymbol(int const symbol, UFILE *output) const;

/**
* Concat a symbol in the string that is passed by reference.
* @param result string where the symbol should be concatenated
* @param symbol code of the symbol
* @param uppercase true if we want an uppercase symbol
*/
void getSymbol(wstring &result, int const symbol,
void getSymbol(UnicodeString &result, int const symbol,
bool uppercase = false) const;

/**
@@ -165,7 +166,7 @@ class Alphabet
* @param symbol the code of the symbol to set
* @param newSymbolString the new string for this symbol
*/
void setSymbol(int symbol, wstring newSymbolString);
void setSymbol(int symbol, UnicodeString newSymbolString);

/**
* Note: both the symbol int and int-pair are specific to this alphabet instance.
@@ -178,7 +179,7 @@ class Alphabet
/**
* Get all symbols where the left-hand side of the symbol-pair is l.
*/
set<int> symbolsWhereLeftIs(wchar_t l) const;
set<int> symbolsWhereLeftIs(UChar l) const;

enum Side
{