ICU stuff #115

Merged
merged 56 commits into from
Jun 30, 2021


mr-martian
Contributor

@mr-martian mr-martian commented May 22, 2021

ICU changes (closes #81)

  • replace all instances of std::wstring with UString (= std::basic_string<UChar>)
  • create InputFile wrapper to handle UTF-8 streams with nulls
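
To illustrate what "UTF-8 streams with nulls" involves, here is a minimal sketch of a decoder that treats a null byte as ordinary data and reports end-of-input separately. It is not the InputFile class from this PR: `read_code_point`, `decode_all`, `kEndOfInput`, and `kReplacement` are hypothetical names, and the error handling (U+FFFD on malformed input) is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sentinels, not lttoolbox names. End-of-input is kept
// distinct from U+0000 so a null in the stream is ordinary data.
constexpr int32_t kEndOfInput = -1;
constexpr int32_t kReplacement = 0xFFFD;

// Decode one code point from a UTF-8 byte stream. Malformed sequences
// yield U+FFFD; overlong forms and surrogates are not rejected here.
int32_t read_code_point(std::istream& in) {
  const int first = in.get();
  if (first == std::char_traits<char>::eof()) return kEndOfInput;
  if (first < 0x80) return first;  // ASCII, including the null byte
  int extra = 0;
  int32_t cp = 0;
  if ((first & 0xE0) == 0xC0) { extra = 1; cp = first & 0x1F; }
  else if ((first & 0xF0) == 0xE0) { extra = 2; cp = first & 0x0F; }
  else if ((first & 0xF8) == 0xF0) { extra = 3; cp = first & 0x07; }
  else return kReplacement;  // stray continuation byte or invalid lead
  for (int i = 0; i < extra; ++i) {
    const int next = in.peek();
    if (next == std::char_traits<char>::eof() || (next & 0xC0) != 0x80) {
      return kReplacement;  // truncated sequence; offending byte stays unread
    }
    in.get();
    cp = (cp << 6) | (next & 0x3F);
  }
  return cp <= 0x10FFFF ? cp : kReplacement;
}

// Decode a whole byte string, keeping embedded nulls.
std::vector<int32_t> decode_all(const std::string& bytes) {
  std::istringstream in(bytes);
  std::vector<int32_t> out;
  for (int32_t cp = read_code_point(in); cp != kEndOfInput;
       cp = read_code_point(in)) {
    assert(cp >= 0);  // only the end-of-input sentinel is negative
    out.push_back(cp);
  }
  return out;
}
```

With this shape, `decode_all(std::string("a\0b", 3))` yields three code points rather than stopping at the null.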

efficiency, readability, and code style changes

  • eliminate Ltstr and string_to_wostream
  • simplify Makefile
  • make transducer symbols int32_t rather than int
  • make common symbols static attributes of Transducer
  • extract some other string constants
  • prefer std::vector to std::list
  • prefer .clear() and .empty() to = "" and == ""
  • prefer range-for loops
  • remove old lsx code
  • have regex_compiler iterate over the input string rather than modifying it
  • lift a static computation out of a loop in Transducer::determinize()
  • move constant initializers to class header
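
Several of the style points above (std::vector over std::list, .clear() and .empty(), range-for, iterating over input instead of modifying it) can be seen together in a small sketch. `split_fields` is a hypothetical helper written for illustration, not code from lttoolbox.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: split on a separator, skipping empty fields.
// The input is only read, never erased from or otherwise modified.
std::vector<std::u16string> split_fields(const std::u16string& input,
                                         char16_t sep) {
  std::vector<std::u16string> fields;  // vector rather than list
  std::u16string current;
  for (char16_t c : input) {           // range-for, no index bookkeeping
    if (c == sep) {
      if (!current.empty()) {          // rather than current != u""
        fields.push_back(current);
      }
      current.clear();                 // rather than current = u""
    } else {
      current.push_back(c);
    }
  }
  if (!current.empty()) {
    fields.push_back(current);
  }
  return fields;
}

// Usage example with the results spelled out.
void usage_split_fields() {
  const std::vector<std::u16string> got = split_fields(u"a,,bc,", u',');
  assert(got.size() == 2);
  assert(got[0] == u"a");
  assert(got[1] == u"bc");
}
```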

helper function and dependency changes

  • move StringUtils here from apertium
  • depend on external utfcpp rather than bundling it
  • make XMLParseUtil functions more specific to their typical use cases
  • add xml_walk_util.h for cleanly iterating over children of xmlNode*

@mr-martian mr-martian requested a review from TinoDidriksen May 22, 2021 23:47
@mr-martian
Contributor Author

Do we actually need that whole m4 script? Can we just ask pkg-config about icu-io directly?

@TinoDidriksen
Member

> Do we actually need that whole m4 script? Can we just ask pkg-config about icu-io directly?

Agreed. Just do it like https://github.com/apertium/lexd/blob/master/configure.ac#L19-L20

And I see /home/daniel/lttoolbox/lttoolbox/nft.nrm in that diff.

@TinoDidriksen
Member

Also, I don't think a normalization tool belongs in lttoolbox - that's something we probably want to adjust separately, so a repo of its own would be nice.

@TinoDidriksen
Member

So far it looks like UnicodeString is suitable, but it will be interesting to see benchmarks. In CG-3 I use typedef std::basic_string<UChar> UString; for most strings, because it has a nicer interface and is movable.
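
As a rough sketch of that typedef and of what "movable" buys: the block below uses char16_t as a stand-in for UChar so that it compiles without ICU headers (an assumption; it matches ICU builds where UChar is char16_t). `add_word` is a hypothetical function for illustration only.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Stand-in for ICU's UChar so the sketch needs no ICU headers.
using UChar = char16_t;
typedef std::basic_string<UChar> UString;

// Hypothetical function: takes ownership of the string's buffer
// instead of copying it, which std::basic_string supports via moves.
void add_word(std::vector<UString>& words, UString&& word) {
  words.push_back(std::move(word));
}

// Usage example: UString offers the familiar std::string interface.
void usage_ustring() {
  std::vector<UString> words;
  UString w = u"lttoolbox";
  w += u'!';
  add_word(words, std::move(w));
  assert(words.size() == 1);
  assert(words[0] == u"lttoolbox!");
  assert(words[0].find(u'b') == 6);
}
```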

@@ -113,20 +100,20 @@ class AttCompiler

Alphabet alphabet;
/** All non-multicharacter symbols. */
-set<wchar_t> letters;
+set<UChar> letters;
Member


Future optimization: flat_set or sorted_vector
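
For reference, a sorted-vector set can be sketched in a few lines: elements live in one contiguous, sorted std::vector and lookups use binary search. `SortedVectorSet` is a hypothetical class, not something this PR adds, and char16_t stands in for UChar.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sorted-vector set of UTF-16 code units.
class SortedVectorSet {
public:
  // Insert c if absent; returns true when the set changed.
  bool insert(char16_t c) {
    auto it = std::lower_bound(items_.begin(), items_.end(), c);
    if (it != items_.end() && *it == c) {
      return false;  // already present
    }
    items_.insert(it, c);  // keeps the vector sorted
    return true;
  }

  // O(log n) lookup over contiguous memory.
  bool contains(char16_t c) const {
    return std::binary_search(items_.begin(), items_.end(), c);
  }

  std::size_t size() const { return items_.size(); }

private:
  std::vector<char16_t> items_;
};

// Usage example.
void usage_sorted_vector_set() {
  SortedVectorSet letters;
  assert(letters.insert(u'b'));
  assert(letters.insert(u'a'));
  assert(!letters.insert(u'b'));  // duplicate is rejected
  assert(letters.size() == 2);
  assert(letters.contains(u'a'));
  assert(!letters.contains(u'c'));
}
```

The trade-off is that insertion is linear in the worst case, so this layout suits sets that are built once and queried often.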

lttoolbox/expander.cc (outdated, resolved)
lttoolbox/ustring.cc (outdated, resolved)
@TinoDidriksen
Member

LGTM

lttoolbox/ustring.h (outdated, resolved)
@@ -179,7 +179,7 @@ Compiler::procAlphabet()
bool space = true;
for(unsigned int i = 0; i < letters.length(); i++)
{
-if(!isspace(letters.at(i)))
+if(!u_isspace(letters.at(i)))
Member


Future work: None of our codebases should have .at().
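
One way such a loop could drop .at() is a range-for, which visits each element without an index and so needs no per-access bounds check. The sketch below is an assumption about how that might look, not the eventual change: `all_space` is hypothetical, and `is_space_stub` is an ASCII-only stand-in for ICU's u_isspace so the block compiles without ICU.

```cpp
#include <cassert>
#include <string>

// ASCII-only stand-in for ICU's u_isspace (assumption, for illustration).
bool is_space_stub(char16_t c) {
  return c == u' ' || c == u'\t' || c == u'\n' || c == u'\r';
}

// True when every code unit is whitespace (vacuously true when empty).
bool all_space(const std::u16string& letters) {
  for (char16_t c : letters) {  // no index, so no .at() or operator[]
    if (!is_space_stub(c)) {
      return false;
    }
  }
  return true;
}

// Usage example.
void usage_all_space() {
  assert(all_space(u" \t\n"));
  assert(!all_space(u" a "));
  assert(all_space(u""));
}
```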

@mr-martian mr-martian marked this pull request as ready for review June 17, 2021 18:27
(really this is an error state either way, but I think this is
slightly more correct)
@mr-martian mr-martian merged commit 81de698 into master Jun 30, 2021
Development

Successfully merging this pull request may close these issues.

Use ICU
2 participants