Should we fallback to lowercased analysis on MAX_COMBINATIONS exceeded? #194

unhammer · 2025-02-17T13:03:26Z

Related to #167
lt-proc -we analysis currently gives up after a certain number of states. But if we're doing case-insensitive matching, could it not fall back to trying a match on the lowercased word when MAX_COMBINATIONS is exceeded? That seems like it might catch the 90% case:

$ echo HJERTERYTMEOVERVÅKNING |lt-proc -we nob.automorf.bin
Warning: matching case-sensitively since processor state size >= 65536
Warning: compoundAnalysis's MAX_COMBINATIONS exceeded for 'HJERTERYTMEOVERVÅKNING'
         gave up at char 15 'V'.
^HJERTERYTMEOVERVÅKNING/*HJERTERYTMEOVERVÅKNING$

$ echo hjerterytmeovervåkning |lt-proc -we nob.automorf.bin
^hjerterytmeovervåkning/hjerterytmeovervåkning<n><m><sg><ind>/hjerterytmeovervåkning<n><f><sg><ind>$

$ echo HJERTEOVERVÅKNING |lt-proc -we nob.automorf.bin
Warning: matching case-sensitively since processor state size >= 65536
Warning: compoundAnalysis's MAX_COMBINATIONS exceeded for 'HJERTEOVERVÅKNING'
         gave up at char 15 'N'.
^HJERTEOVERVÅKNING/*HJERTEOVERVÅKNING$

$ echo hjerteovervåkning |lt-proc -we nob.automorf.bin
^hjerteovervåkning/hjerte<n><nt><sg><ind><cmp>+overvåkning<n><m><sg><ind>/hjerte<n><nt><sg><ind><cmp>+overvåkning<n><f><sg><ind>$

It might lead to incomplete analyses: if e.g. «HJERTE» were in the dictionary as an <np>, we would miss out on /HJERTE<np><cmp>+overvåkning<n><m><sg><ind>. But I don't think it would lead to (otherwise) wrong analyses.
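A rough sketch of the control flow, in Python rather than lttoolbox's C++ and purely for illustration. The names `analyse_with_lowercase_fallback` and `toy_analyse` are hypothetical and not part of lttoolbox. The analyser is assumed to return `None` when it gives up (for instance on MAX_COMBINATIONS exceeded), and a toy lexicon stands in for nob.automorf.bin:

```python
from typing import Callable, List, Optional

# Hypothetical analyser type: returns a list of analyses, or None when it
# found nothing or gave up (e.g. on MAX_COMBINATIONS exceeded).
Analyser = Callable[[str], Optional[List[str]]]


def analyse_with_lowercase_fallback(word: str, analyse: Analyser) -> List[str]:
    """Try the surface form first; if that fails, retry on the lowercased form."""
    result = analyse(word)
    if result:
        return result
    lowered = word.lower()
    if lowered != word:  # no point retrying an already-lowercase word
        result = analyse(lowered)
        if result:
            return result
    # Mark as unknown, mirroring the '*' in the lt-proc output above.
    return ["*" + word]


def toy_analyse(word: str) -> Optional[List[str]]:
    """Toy stand-in for the real analyser: exact, case-sensitive lookup only."""
    lexicon = {
        "hjerteovervåkning": [
            "hjerte<n><nt><sg><ind><cmp>+overvåkning<n><m><sg><ind>",
            "hjerte<n><nt><sg><ind><cmp>+overvåkning<n><f><sg><ind>",
        ],
    }
    return lexicon.get(word)


analyses = analyse_with_lowercase_fallback("HJERTEOVERVÅKNING", toy_analyse)
unknown = analyse_with_lowercase_fallback("XYZZY", toy_analyse)
```

In this sketch the retry only ever sees the lowercased form, which is exactly why an uppercase-only entry such as a hypothetical «HJERTE» <np> would be missed. Restoring the original casing on the output is left out.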

ftyers · 2025-02-17T13:36:08Z

I could see a possible issue with getting compounds of proper names that we wouldn't want. But it might be a good idea to implement it, apply it to a corpus and see what happens.
