Should we fall back to lowercased analysis when MAX_COMBINATIONS is exceeded? #194

Open
unhammer opened this issue Feb 17, 2025 · 1 comment

unhammer commented Feb 17, 2025

Related to #167
lt-proc -we analysis currently gives up after a certain number of states. But if we're doing case-insensitive matching, could it not fall back to trying a match on the lowercased word when MAX_COMBINATIONS is exceeded? That seems like it might catch the 90% case:

$ echo HJERTERYTMEOVERVÅKNING |lt-proc -we nob.automorf.bin
Warning: matching case-sensitively since processor state size >= 65536
Warning: compoundAnalysis's MAX_COMBINATIONS exceeded for 'HJERTERYTMEOVERVÅKNING'
         gave up at char 15 'V'.
^HJERTERYTMEOVERVÅKNING/*HJERTERYTMEOVERVÅKNING$

$ echo hjerterytmeovervåkning |lt-proc -we nob.automorf.bin
^hjerterytmeovervåkning/hjerterytmeovervåkning<n><m><sg><ind>/hjerterytmeovervåkning<n><f><sg><ind>$

$ echo HJERTEOVERVÅKNING |lt-proc -we nob.automorf.bin
Warning: matching case-sensitively since processor state size >= 65536
Warning: compoundAnalysis's MAX_COMBINATIONS exceeded for 'HJERTEOVERVÅKNING'
         gave up at char 15 'N'.
^HJERTEOVERVÅKNING/*HJERTEOVERVÅKNING$

$ echo hjerteovervåkning |lt-proc -we nob.automorf.bin
^hjerteovervåkning/hjerte<n><nt><sg><ind><cmp>+overvåkning<n><m><sg><ind>/hjerte<n><nt><sg><ind><cmp>+overvåkning<n><f><sg><ind>$

It might lead to incomplete analyses: if e.g. «HJERTE» were in the dictionary as an <np>, we would miss out on /HJERTE<np><cmp>+overvåkning<n><m><sg><ind>. But I don't think it would lead to (otherwise) wrong analyses.
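
To make the proposal concrete, here is a minimal Python sketch of the control flow only. It is not lttoolbox code: `analyse_with_fallback`, the `Analyser` type and `toy_analyse` are hypothetical names, and the toy analyser just imitates the behaviour in the transcripts above by "giving up" on long all-caps input.

```python
from typing import Callable, List, Optional

# Hypothetical analyser type: returns the analyses for a surface form,
# an empty list if the word is unknown, or None if the search was
# abandoned (e.g. because a combination limit was exceeded).
Analyser = Callable[[str], Optional[List[str]]]


def analyse_with_fallback(word: str, analyse: Analyser) -> List[str]:
    """Analyse the word as written; if the analyser gave up, retry lowercased."""
    result = analyse(word)
    if result is not None:
        return result  # normal case: analysed or genuinely unknown
    lowered = word.lower()
    if lowered == word:
        return []  # already lowercase, nothing new to try
    retry = analyse(lowered)
    return retry if retry is not None else []


def toy_analyse(word: str) -> Optional[List[str]]:
    """Stand-in for the real analyser, for illustration only."""
    table = {
        "hjerteovervåkning": [
            "hjerte<n><nt><sg><ind><cmp>+overvåkning<n><m><sg><ind>",
            "hjerte<n><nt><sg><ind><cmp>+overvåkning<n><f><sg><ind>",
        ]
    }
    if word.isupper() and len(word) > 10:
        return None  # imitate "MAX_COMBINATIONS exceeded"
    return table.get(word, [])


print(analyse_with_fallback("HJERTEOVERVÅKNING", toy_analyse))
```

The fallback only fires when the first attempt was abandoned, so words that are analysed or rejected normally are unaffected. As noted above, any reading that depends on the original casing (such as an <np> first element) would be lost in the retry.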

ftyers commented Feb 17, 2025

I could see perhaps an issue getting compounds of proper names that we wouldn't want. But it might be a good idea to implement it, apply it to a corpus and see what happens.
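
One way to "see what happens" on a corpus would be to run it through the analyser with and without the fallback and list the words that change from unknown to analysed. The following is a rough sketch under several assumptions: `parse` and `newly_analysed` are hypothetical helper names, both runs tokenise the text identically, the patched output keeps the original surface form, and no lexical unit contains an escaped `^`, `$` or `/`.

```python
import re
from typing import List, Tuple

# Simplified pattern for one lexical unit: ^surface/analysis1/analysis2$
# Assumption: no escaped ^, $ or / inside the unit.
LU = re.compile(r"\^([^/$]*)/([^$]*)\$")


def parse(stream: str) -> List[Tuple[str, List[str]]]:
    """Split analyser output into (surface form, list of analyses) pairs."""
    return [(m.group(1), m.group(2).split("/")) for m in LU.finditer(stream)]


def newly_analysed(before: str, after: str) -> List[Tuple[str, List[str]]]:
    """Return units that were unknown ('*') before and are analysed after."""
    changed = []
    for (surface, old), (_, new) in zip(parse(before), parse(after)):
        if old[0].startswith("*") and not new[0].startswith("*"):
            changed.append((surface, new))
    return changed


before = "^HJERTEOVERVÅKNING/*HJERTEOVERVÅKNING$"
# Hypothetical output of a patched analyser, for illustration only.
after = "^HJERTEOVERVÅKNING/hjerte<n><nt><sg><ind><cmp>+overvåkning<n><m><sg><ind>$"
for surface, analyses in newly_analysed(before, after):
    print(surface, analyses)
```

Reviewing that list by hand, especially the entries that look like proper names, would show whether unwanted compound readings turn up in practice.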
