Enhance DictionaryCompoundWordTokenFilter #14278
base: main
… in order to prevent matches on sub-words
Looks good to me. I wonder about the name of the parameter: maybe "greedy" would be a more intuitive way to describe what it is doing?
CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE,
CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE,
true,
true);
Could we add a case with:
longestMatch = false;
consumeChars = true;
If the combination doesn't make sense, let's just throw an IllegalArgumentException in the constructor and have the test expectThrows() that?
I'm not saying "consumeChars" is a good name; I'm happy to change it. But "greedy", as we know it from regex, is something else: we are not capturing as much as possible, we are skipping forward on a match.
I don't have a strong opinion on it; I was just brainstorming, because I had to look at the source code to figure out what the parameter was doing. And I agree, it is surprising behavior for …
As for changing defaults, my go-to would be to do that in a follow-up PR, for a major release. We can expose this parameter in a minor release without hurting anyone, but if we change the default, it could force some users to reindex.
I would argue that, at least in German, nothing but longestMatch=true combined with skipping forward makes any sense. Without skipping forward, the filter extracts a lot of nonsense and is, in my opinion, unusable, at least in German. I've seen this filter dropped from projects because of that unexpected behavior, and at first I thought it was a bug.
Yes, I'm just suggesting we split it. We can add this new parameter here and backport it to minor release 10.2.0 with no breaking changes. Separately, we can default it to …
Adds an option to consume characters once a matching word is found, so that they are not used for any further potential matches. E.g. if the word "schwein" is extracted, the sub-word "wein" is not extracted anymore.
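
To make the described behavior concrete, here is a minimal standalone sketch in plain Java. It is not the Lucene implementation and not the code in this PR: the class `CompoundSketch`, the method `decompose`, and its parameters are hypothetical, and the flag name `consumeChars` is only borrowed from the discussion above. It assumes a simple left-to-right scan over start offsets that keeps the longest dictionary hit per offset and, when `consumeChars` is set, skips past the characters of that hit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; names and structure are assumptions, not Lucene's API.
final class CompoundSketch {

  static List<String> decompose(String term, Set<String> dict, int minLen, int maxLen,
                                boolean longestMatch, boolean consumeChars) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i <= term.length() - minLen; i++) {
      String longest = null;
      for (int j = minLen; j <= maxLen && i + j <= term.length(); j++) {
        String candidate = term.substring(i, i + j);
        if (!dict.contains(candidate)) {
          continue;
        }
        if (longestMatch) {
          longest = candidate; // j only grows, so the last hit at this offset is the longest
        } else {
          out.add(candidate); // emit every dictionary hit at this offset
        }
      }
      if (longest != null) {
        out.add(longest);
        if (consumeChars) {
          // Skip the matched characters; the loop's i++ then lands just past the match.
          i += longest.length() - 1;
        }
      }
    }
    return out;
  }
}
```

With the dictionary {"schwein", "wein", "fleisch"} and the input "schweinefleisch", this sketch returns [schwein, wein, fleisch] when `consumeChars` is false and [schwein, fleisch] when it is true. Note that the sketch silently ignores `consumeChars` when `longestMatch` is false; whether that combination should instead throw an IllegalArgumentException is the open question raised in the review comment above.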