Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance DictionaryCompoundWordTokenFilter #14278

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

renatoh
Copy link

@renatoh renatoh commented Feb 22, 2025

Adding option to consume characters if a matching word is found, and not used for further potential matches anymore. E.g. if the word "schwein" is extracted, the sub-word "wein" is not extracted anymore.

@rmuir
Copy link
Member

rmuir commented Feb 24, 2025

looks good to me. I wonder about the name of the parameter, maybe "greedy" would be more intuitive as a way to describe what it is doing?

CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE,
CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE,
true,
true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add a case with:

longestMatch = false;
consumeChars = true;

If the combination doesn't make sense, lets just throw an IllegalArgumentException in the constructor and have the test expectThrows() that?

@renatoh
Copy link
Author

renatoh commented Feb 24, 2025

looks good to me. I wonder about the name of the parameter, maybe "greedy" would be more intuitive as a way to describe what it is doing?

not saying "consumeChars" is a good name, happy to change it. but "greedy", as we know it from Regex, is something else, we are not capturing as much as possible, we are skipping forward on a match
what do you think of "reuseChars" with a default of true?

@rmuir
Copy link
Member

rmuir commented Feb 24, 2025

I'm not really opinionated on it, was just brainstorming because I had to look at the source code to figure out what the parameter was doing.

And I agree, it is surprising behavior for longestMatch = true not to do this. But we can also add the parameter and separately decide about defaults? Because changing defaults could surprise existing users if we are not careful.

@rmuir
Copy link
Member

rmuir commented Feb 24, 2025

for changing defaults, my goto would be, if we could do that as a followup PR, for a major release.

We can expose this parameter in a minor release without hurting anyone, but if we change the default it could cause some reindexing for users.

@renatoh
Copy link
Author

renatoh commented Feb 24, 2025

I would argue, at least in German, nothing but longestMatch=true and skipping forward does make any sense. Without skipping forward the filter extracts a lot of nonsense and in my opinion is unusable, at least in German. I've seen this Filter being dropped from projects because of that unexpected behavior and at first I thought this is a bug.
But with an additional parameter "reuseChars" with a default of false we would not change the default behavior.

@rmuir
Copy link
Member

rmuir commented Feb 24, 2025

Yes, I'm just suggesting to split it. We can add this new parameter here, backport to minor release 10.2.0, no breaking changes. Separately we can default it to true for 11.0?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants