handling both combined and non-combined characters equivalently #2
Comments
Hi!
Thank you so much for your suggestions. This is becoming challenging. To be
honest, my technical skills are limited and I don't really know what I can
do.
Best,
Fatouma
…On Wed, Apr 14, 2021 at 2:54 PM Kevin Brubeck Unhammer < ***@***.***> wrote:
$ echo "kũuni kũuni" | apertium -d . mos-morph
^kũuni/kũuni<n><sg>$ ^kũuni/*kũuni$
The first one is a single codepoint: x169 LATIN SMALL LETTER U WITH TILDE,
the second is two codepoints: x75 LATIN SMALL LETTER U composed with x303
COMBINING TILDE.
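
To make the difference concrete, here is a small Python sketch (not part of the original report) that lists the codepoints of both spellings. The variable names are made up for illustration; only the standard `unicodedata` module is used.

```python
import unicodedata

# Two spellings that look identical on screen (names are illustrative).
precomposed = "k\u0169uni"   # ũ stored as the single codepoint U+0169
decomposed = "ku\u0303uni"   # u (U+0075) followed by U+0303 COMBINING TILDE


def describe(text):
    """Return (codepoint, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for label, word in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label, len(word), describe(word))

# A plain string comparison treats the two spellings as different words,
# which is why a dictionary entry for one form does not match the other.
print(precomposed == decomposed)
```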
The .dix file has an entry for the single-codepoint version, so we get an
analysis for only that one.
.acx doesn't help here since it's two codepoints.
Possible solutions:
- use a pardef for every single tilde-entry in the .dix file – simple,
but very ugly: <i>k</i><par n="ũ"/><i>un</i><par n="kũun/i__n"/>
- use some hfst-trickery to do basically the same thing on compile –
slightly more complicated, but a one-time job for someone who knows how
- change lttoolbox to treat them equivalently – big job, but everyone
wins
@fatkab @ftyers thoughts?
You can use hfst-substitute to replace pre-composed characters with an automaton containing the disjunction, but it's a lot of hacking around.
The easiest thing is to use a spellrelax-type script, e.g. this one for Basaa.
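
As a rough illustration of what such a relaxation has to accept, the Python sketch below enumerates every spelling of a word that mixes precomposed and decomposed characters. It is not the Basaa script and not how spellrelax is implemented; `spelling_variants` is a hypothetical helper written only for this example.

```python
import itertools
import unicodedata


def spelling_variants(word):
    """Return every spelling of word that mixes precomposed and decomposed forms."""
    options = []
    for ch in unicodedata.normalize("NFC", word):
        # For plain letters the decomposed form equals the letter itself,
        # so the set collapses to a single option.
        options.append({ch, unicodedata.normalize("NFD", ch)})
    return {"".join(combo) for combo in itertools.product(*options)}


variants = spelling_variants("k\u0169uni")
for v in sorted(variants, key=len):
    print(len(v), [f"U+{ord(c):04X}" for c in v])
```

The number of variants doubles with every accented letter in the word, which is one reason to express the alternation once in the transducer rather than listing spellings by hand.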
As a stop-gap measure, I've added a normalizing morph mode.
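
The thread does not show how that mode is implemented. As a sketch of the general idea only, the Python snippet below normalizes text to NFC so that `u` + COMBINING TILDE becomes the single codepoint used in the .dix entry. `normalize_text` is a hypothetical name, and the approach assumes the dictionary itself is stored in precomposed form.

```python
import unicodedata


def normalize_text(text):
    """Compose base letter + combining mark sequences into single codepoints (NFC)."""
    return unicodedata.normalize("NFC", text)


# Input mixing both spellings: the first word is precomposed, the second decomposed.
raw = "k\u0169uni ku\u0303uni"
clean = normalize_text(raw)

first, second = clean.split(" ")
print(first == second)
print([f"U+{ord(c):04X}" for c in second])
```

Applied to the input before it reaches the analyser, a step like this would let both spellings hit the same dictionary entry, though it does nothing for characters that have no precomposed form.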