handling both combined and non-combined characters equivalently #2

unhammer · 2021-04-14T12:53:03Z

$ echo "kũuni kũuni" | apertium -d . mos-morph
^kũuni/kũuni<n><sg>$ ^kũuni/*kũuni$

The first one is a single codepoint: U+0169 LATIN SMALL LETTER U WITH TILDE,
the second is two codepoints: U+0075 LATIN SMALL LETTER U followed by U+0303 COMBINING TILDE.
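
The two spellings look the same on screen, so it can help to inspect them directly. The snippet below is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is made up for illustration and is not part of any Apertium tool.

```python
import unicodedata

def describe(text):
    """Return (codepoint, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# Written with escapes so the two forms cannot be silently converted
# into each other by an editor or a web page.
composed = "\u0169"      # one codepoint
decomposed = "u\u0303"   # base letter followed by a combining mark

print(describe(composed))
print(describe(decomposed))

# The raw strings differ, which is why the dictionary lookup fails ...
print(composed == decomposed)
# ... but they are canonically equivalent, so NFC maps one onto the other.
print(unicodedata.normalize("NFC", decomposed) == composed)
```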

The .dix file has an entry for the single-codepoint version, so we get an analysis for only that one.

.acx doesn't help here since it's two codepoints.

Possible solutions:

- use a pardef for every single tilde-entry in the .dix file – simple, but very ugly: k<par n="ũ"/>un<par n="kũun/i__n"/>
- use some hfst-trickery to do basically the same thing on compile – slightly more complicated, but a one-time job for someone who knows how
- change lttoolbox to treat them equivalently – big job, but everyone wins

@fatkab @ftyers thoughts?


fatkab · 2021-04-14T13:06:41Z

Hi! Thank you so much for your suggestions. This is becoming challenging. To be sincere, my technical skills are limited and I don't really know what I can do. Best, Fatouma


flammie · 2021-04-14T13:11:37Z

> use some hfst-trickery to do basically the same thing on compile – slightly more complicated, but a one-time job for someone who knows how

You can use hfst-substitute to replace the pre-composed characters with an automaton containing the disjunction, but it's a lot of hacking around

> change lttoolbox to treat them equivalently – big job, but everyone wins

apertium/organisation#24

ftyers · 2021-04-14T14:19:15Z

The easiest thing is to use a spellrelax-type script, e.g. this one for Basaa.
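
To illustrate what "relaxing" the spelling means here, the sketch below lists the surface forms that would all need to be accepted for one dictionary entry. It is not the Basaa script mentioned above (that script is not reproduced in this thread), and `spelling_variants` is a hypothetical helper. It also assumes only two forms per character, fully precomposed or fully decomposed, and ignores partially composed sequences for letters carrying several marks.

```python
import itertools
import unicodedata

def spelling_variants(word):
    """Return every mix of precomposed and decomposed spellings of word."""
    options = []
    # Start from NFC so that both input spellings yield the same result.
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        # A character with no decomposition contributes a single option.
        options.append({ch, decomposed})
    return {"".join(parts) for parts in itertools.product(*options)}

# Two variants for this entry: one precomposed, one with a combining tilde.
for variant in sorted(spelling_variants("k\u0169uni"), key=len):
    print([f"U+{ord(c):04X}" for c in variant])
```

The number of variants doubles with every decomposable letter in a word, which is one reason to handle this in the transducer or by normalizing the input rather than by listing forms in the .dix file.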

mr-martian · 2021-08-14T16:10:58Z

As a stop-gap measure, I've added a normalizing morph mode mos-nmorph in 73cc4b4 that uses uconv -x any-nfc.
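
For anyone without ICU installed, the effect of that normalization step can be approximated in Python. This is only a sketch and not the code behind the mos-nmorph mode: `to_nfc` is a hypothetical helper, and `unicodedata.normalize` from the standard library stands in for `uconv -x any-nfc`.

```python
import unicodedata

def to_nfc(text):
    """Compose base letters with following combining marks wherever
    Unicode defines a precomposed form."""
    return unicodedata.normalize("NFC", text)

# Input mixing both spellings of the same word, written with escapes so
# the difference survives copy and paste.
raw = "k\u0169uni ku\u0303uni"
tokens = to_nfc(raw).split()

# After normalization both tokens match the single-codepoint spelling
# that the .dix entry uses.
print(tokens[0] == tokens[1])  # True
```

Normalizing before analysis only helps as long as the dictionary itself is consistently written in NFC.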
