THREESCALE-10930: Searchd proxy_rule index: Exclude URL delimiters from mapped chars #4021

Open
wants to merge 7 commits into
base: master

Contributor

@jlledom jlledom commented Feb 19, 2025

What this PR does / why we need it:

Some queries to manticore, for the proxy_rule index, are inconsistent and don't return all the matching results. In particular, the Jira issue includes an example with the keyword fotos.

I tried locally and reproduced the bug, for both sphinx and manticore.

I've been able to solve this by removing some allowed characters from the index. I'm not sure how this affects manticore internals but the truth is having the character / as allowed causes the bug.

As I understand it, charset_table is the list of characters that are considered valid inside a search keyword. Characters not in charset_table are treated as "delimiters" and are ignored in queries.

Since mapping rules are URLs, I think it's correct to accept only URL characters, and only those which are not reserved delimiters according to RFC 3986.

According to this SO answer, the valid URL characters are:

A-Z, a-z, 0-9, -, ., _, ~, :, /, ?, #, [, ], @, !, $, &, ', (, ), *, +, ,, ;, %, =

But from those, some are reserved delimiters, so I allowed these ones in our manticore index:

A-Z, a-z, 0-9, -, ., _, ~, %

This solves the issue, though I don't entirely understand why it failed before. I mean, / was allowed as a character to be considered in searches, so a query like /fotos should work if it actually matches the pattern. So what was the problem with that? It looks correct to me.
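A minimal sketch of the tokenization model described above, assuming a token is simply a maximal run of allowed characters. `NEW_TOKEN` follows the character list in this PR; `OLD_TOKEN` is a hypothetical stand-in for master (the same set plus `/`), not the actual master configuration, and `tokenize` is an illustrative helper, not a Manticore API.

```ruby
# Allowed characters in this branch: A-Z, a-z, 0-9, -, ., _, ~, %
NEW_TOKEN = /[A-Za-z0-9\-._~%]+/
# Hypothetical stand-in for master: the same set plus "/" (assumption).
OLD_TOKEN = %r{[A-Za-z0-9\-._~%/]+}

# Split text into tokens, i.e. maximal runs of allowed characters.
# Every other character acts as a delimiter and is dropped.
def tokenize(text, token_pattern)
  text.scan(token_pattern)
end

pattern = "/biometria-servico/services/eleitoral/fotos"
p tokenize(pattern, OLD_TOKEN) # one long token that still contains the slashes
p tokenize(pattern, NEW_TOKEN) # ["biometria-servico", "services", "eleitoral", "fotos"]
```

Under this model, "fotos" only becomes a keyword of its own once `/` is a delimiter. The sketch does not explain why master still found some of the patterns.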

Which issue(s) this PR fixes

https://issues.redhat.com/browse/THREESCALE-10930

Verification steps

  1. Add the following mapping rules:
/biometria-servico/services/eleitoral/fotos
/biometria-servico/services/fotos/eleitoral
/bar/v1/foto/fotos
/bar/v1/something-else/fotos
/bar/v1/something/else/fotos
/bar/v1/someting/else/this/is/even/larger/than/the/one/failing/fotos
/biometria-servico/services/eleitora/fotos
/biometria-servico/services/eleitor/fotos
/biometria-servico/services/eleitoral/a/fotos
/biometria-servico/services/eleitorala/fotos
/biometria-servico/services/eleitorb/fotos
/biometria-servico/services/eleitorbl/fotos
/biometria-servico/services/ases/codObjetoAse/fotos-miniatura
/biometria-servico/services/ases/{codObjetoAse}/fotos-miniatura
/biometria-servico/services/ases/{codObjetoAse}/miniatura-fotos
/biometria-servico/services/eleitoral/biometria/v1/ase/{codObjetoAse}/miniatura-fotos
  2. Search for fotos
  3. On master, it returns 10 results; on this branch, it returns all 16.
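As a rough cross-check of the expected count, the sketch below tokenizes the 16 patterns with the character set from this PR and counts how many contain "fotos" as a whole token versus as part of a token. It is a plain-Ruby model, not Manticore; `exact_matches` and `infix_matches` are hypothetical helpers.

```ruby
# Tokens are runs of the characters allowed by this PR (assumed model).
TOKEN = /[A-Za-z0-9\-._~%]+/

PATTERNS = %w[
  /biometria-servico/services/eleitoral/fotos
  /biometria-servico/services/fotos/eleitoral
  /bar/v1/foto/fotos
  /bar/v1/something-else/fotos
  /bar/v1/something/else/fotos
  /bar/v1/someting/else/this/is/even/larger/than/the/one/failing/fotos
  /biometria-servico/services/eleitora/fotos
  /biometria-servico/services/eleitor/fotos
  /biometria-servico/services/eleitoral/a/fotos
  /biometria-servico/services/eleitorala/fotos
  /biometria-servico/services/eleitorb/fotos
  /biometria-servico/services/eleitorbl/fotos
  /biometria-servico/services/ases/codObjetoAse/fotos-miniatura
  /biometria-servico/services/ases/{codObjetoAse}/fotos-miniatura
  /biometria-servico/services/ases/{codObjetoAse}/miniatura-fotos
  /biometria-servico/services/eleitoral/biometria/v1/ase/{codObjetoAse}/miniatura-fotos
].freeze

# Patterns that contain `word` as a complete token.
def exact_matches(patterns, word)
  patterns.select { |pattern| pattern.scan(TOKEN).include?(word) }
end

# Patterns that contain `word` anywhere inside a token (infix match).
def infix_matches(patterns, word)
  patterns.select { |pattern| pattern.scan(TOKEN).any? { |token| token.include?(word) } }
end

puts exact_matches(PATTERNS, "fotos").size # 12
puts infix_matches(PATTERNS, "fotos").size # 16
```

Since `-` stays an allowed character, the last four patterns carry tokens such as fotos-miniatura, so in this model they are only reachable through infix matching.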

@jlledom jlledom changed the title Exclude URL delimiters from mapped chars THREESCALE-10930: Searchd proxy_rule index: Exclude URL delimiters from mapped chars Feb 19, 2025
@jlledom jlledom self-assigned this Feb 19, 2025
Rails validation forbids it, so no need to index it.
@mayorova
Contributor

Nice @jlledom !

But I also don't quite understand the logic...
I mean, OK, if / was considered a "valid character", probably, indeed searching for "fotos" would not return the desired results, because "fotos" would not be an indexed token, and could only be searched as part of a longer query.
But what I don't understand is - why would it still work for some paths then!??

On the other hand, this character set was previously added explicitly, in order to fix some other failing searches, I think: https://github.com/3scale/porta/pull/2138/files

Also, we need to take care of characters such as {}, because, while they are not valid in standard URLs, they are common in mapping rules.

@mayorova
Contributor

For reference, these are the patterns that are currently found in master:

| Pattern | Found on master |
|---|---|
| /biometria-servico/services/eleitoral/fotos | |
| /biometria-servico/services/fotos/eleitoral | X |
| /bar/v1/foto/fotos | X |
| /bar/v1/something-else/fotos | X |
| /bar/v1/something/else/fotos | X |
| /bar/v1/someting/else/this/is/even/larger/than/the/one/failing/fotos | |
| /biometria-servico/services/eleitora/fotos | X |
| /biometria-servico/services/eleitor/fotos | X |
| /biometria-servico/services/eleitoral/a/fotos | |
| /biometria-servico/services/eleitorala/fotos | |
| /biometria-servico/services/eleitorb/fotos | X |
| /biometria-servico/services/eleitorbl/fotos | |
| /biometria-servico/services/ases/codObjetoAse/fotos-miniatura | |
| /biometria-servico/services/ases/{codObjetoAse}/fotos-miniatura | X |
| /biometria-servico/services/ases/{codObjetoAse}/miniatura-fotos | X |
| /biometria-servico/services/eleitoral/biometria/v1/ase/{codObjetoAse}/miniatura-fotos | X |

I mean... why is /biometria-servico/services/eleitorb/fotos being found, but not /biometria-servico/services/eleitorala/fotos ??? How are they different in terms of tokenization?...

@akostadinov
Contributor

This is very interesting. Awesome table @mayorova . Is it possible that with / some words become too long and thus are not fully tokenized?

@jlledom
Contributor Author

jlledom commented Feb 19, 2025

> Nice @jlledom !
>
> But I also don't quite understand the logic... I mean, OK, if / was considered a "valid character", probably, indeed searching for "fotos" would not return the desired results, because "fotos" would not be an indexed token, and could only be searched as part of a longer query. But what I don't understand is - why would it still work for some paths then!??

Yeah, I'm confused too...

> On the other hand, this character set was previously added explicitly, in order to fix some other failing searches, I think: https://github.com/3scale/porta/pull/2138/files

I didn't know about that PR, but apparently the problem it solves is also solved in this branch. I tried searching for:

{codObjetoAse}
/biometria-servico/services/eleitoral/biometria/v1/ase/{codObjetoAse}/miniatura-fotos
/fotos-miniatura
/bar/v1/

And everything seems to work as expected. The only maybe wrong case I found is searching for:

/biometria-servico/services/eleito

Which returns the 8 results that start with exactly that prefix, but also this additional result:

/biometria-servico/services/fotos/eleitoral

Which is not exact but I think it's pretty good anyway.

> Also, we need to take care of characters such as {}, because, while they are not valid in standard URLs, they are common in mapping rules.

Yeah, but removing them from the index doesn't matter, I think. For instance, we have a pattern including {codObjetoAse}. Then, if you search for exactly the same string, {codObjetoAse}, it will return the correct results because the brackets are ignored.

@jlledom
Contributor Author

jlledom commented Feb 19, 2025

> Is it possible that with / some words become too long and thus are not fully tokenized?

The problem could be the token size.

This one is not returned by master:

/biometria-servico/services/eleitoral/fotos

But this one, just one character shorter, is returned:

/biometria-servico/services/eleitora/fotos

On the other hand, this one, much larger, is returned:

/biometria-servico/services/eleitoral/biometria/v1/ase/{codObjetoAse}/miniatura-fotos

But that might be because the brackets are not indexed in master, so that larger string probably generates several tokens.

However, this portion:

/biometria-servico/services/eleitoral/biometria/v1/ase/

should be a token itself since all its characters are indexed, but it is longer than some of the results not returned.

Soooo.... ¯\_(ツ)_/¯

@akostadinov
Contributor

@jlledom, I mean that the pattern appears after the maximum word length. If the word appears before, then the overall length is not an issue.

@jlledom
Contributor Author

jlledom commented Feb 19, 2025

> @jlledom, I mean that the pattern appears after the maximum word length. If the word appears before, then the overall length is not an issue.

That makes sense. Maybe that's it?

@mayorova
Contributor

> @jlledom, I mean that the pattern appears after the maximum word length. If the word appears before, then the overall length is not an issue.

Can you give an example please @akostadinov ?

@jlledom by the way, while we are at it, how about we bump min_infix_len? I think this is the only index that has it set to 1, and I believe that value is not even supported.
https://manual.manticoresearch.com/Creating_a_table/NLP_and_tokenization/Wildcard_searching_settings#min_infix_len

The minimum allowed non-zero value is 2.

The other indexes have it set to 3.
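To make the effect of min_infix_len concrete, here is a small conceptual sketch that lists the substrings of a token that would be searchable as infixes for a given minimum length. It only models the idea; `infixes` is a hypothetical helper and says nothing about how Manticore stores its index.

```ruby
# All distinct substrings of `token` that are at least `min_len` characters long.
def infixes(token, min_len)
  (min_len..token.length).flat_map do |len|
    (0..token.length - len).map { |start| token[start, len] }
  end.uniq
end

p infixes("fotos", 3) # ["fot", "oto", "tos", "foto", "otos", "fotos"]
p infixes("v1", 3)    # [] because the token is shorter than the minimum
```

In this model, a short path segment such as v1 is not reachable through a partial match when the minimum is 3, although it should still be searchable as a whole keyword.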

@akostadinov
Contributor

> Can you give an example please @akostadinov ?

I just noticed from your examples that nearly identical patterns that differ by one character or more don't show up. In Joan's message there is an example with two values. I spent half an hour searching but didn't find any configuration option that would limit the word length. I see nothing in the docs.

I think increasing min_infix_len makes sense. Maybe the thinking was that in APIs we often have /p/whatever or /xz/whatever. Maybe that's why they set it to 1. But still, 1 character doesn't make sense, especially if we allow /.

Update the list of indexed chars and search for proper tokens
@jlledom
Contributor Author

jlledom commented Feb 20, 2025

> @jlledom by the way, as we are at it, how about we bump min_infix_len ?
>
> The other indexes have it set to 3.

I increased it to 3: f45c029

I also updated the test suite: b74882b

It was checking for some allowed characters that I removed. I also made small changes in what the tests index and search for, because I think the old way was not compatible with min_infix_len=3 and in general they make more sense now IMO.

Now the tests index something like /path/to/prefix_suffix and then search for prefix_suffix which should be a token if _ is an allowed character.

@akostadinov
Contributor

@jlledom but still / is not used. Will the exact search like in #2138 work now?

@jlledom
Contributor Author

jlledom commented Feb 20, 2025

> @jlledom but still / is not used. Will the exact search like in #2138 work now?

I mentioned that in this comment. I think it'll find the exact match, and maybe some other additional partial matches.

It will use / as a token separator. For instance, if we index /path/to/resource, it will ignore the slashes and tokenize path, to and resource.

Then you search for /path/to/resource and it does the same: it ignores the slashes and looks for patterns containing path, to and resource. It will find the exact match /path/to/resource because it's the best match, but it might add some other results like /path/resource/to. See my example in the comment I linked above.
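The matching behaviour described above can be sketched as follows, assuming a match simply means that every query token appears among the pattern's tokens, in any order. This ignores ranking and wildcard expansion; `tokens` and `matches?` are hypothetical helpers, not part of Manticore or Thinking Sphinx.

```ruby
# Tokens are runs of the characters allowed by this PR; "/" and "{}" are delimiters.
TOKEN = /[A-Za-z0-9\-._~%]+/

def tokens(text)
  text.scan(TOKEN)
end

# True when every token of the query also occurs in the pattern, in any order.
def matches?(pattern, query)
  (tokens(query) - tokens(pattern)).empty?
end

indexed = ["/path/to/resource", "/path/resource/to", "/path/to/other"]
p indexed.select { |pattern| matches?(pattern, "/path/to/resource") }
# ["/path/to/resource", "/path/resource/to"]
```

The reordered pattern matches too, which is why the exact match has to win through ranking rather than filtering. Braces are dropped the same way, so a query for {codObjetoAse} reduces to the single token codObjetoAse.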

@akostadinov
Contributor

So my question was whether it in fact does work like this. Can we have a test searching for something like "api/path1/path2/"?

@jlledom
Contributor Author

jlledom commented Feb 20, 2025

> So my question was whether it in fact does work like this. Can we have a test searching for something like "api/path1/path2/"?

Don't quite understand the test you suggest. What would the test index, search for, and expect? Tell me and I'll write it.

Edit: I think I know what you mean. I'll write it.

@akostadinov
Contributor

Take a few from your example. e.g.

/bar/v1/something-else/fotos
/bar/v1/something/else/fotos
/bar/v1/someting/else/this/is/even/larger/than/the/one/failing/fotos
/biometria-servico/services/eleitora/fotos
/biometria-servico/services/eleitor/fotos

then search for /one/failing/fotos

@jlledom
Contributor Author

jlledom commented Feb 20, 2025

@akostadinov I added the test you requested and, in the process, I found that it in fact didn't work as I expected, so I had to make a few changes: aeb4d51

I removed the star: true option because it transformed the query into something like this:

SELECT * FROM `proxy_rule_core` WHERE MATCH('\\/*bar*\\/*v1*\\/*foto*\\/*fotos*')

So it added the stars to every keyword and, I don't know why, that always returned empty results.

What I did was enable the expand_keywords option, which allows searching without stars and has Manticore add them on the fly. Now the query looks like this:

SELECT * FROM `proxy_rule_core` WHERE MATCH('\\/bar\\/v1\\/foto\\/fotos')
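For illustration, the difference between the two generated queries can be approximated in plain Ruby. `wrap_with_stars` is a hypothetical helper that mimics the observed effect of star: true on an already escaped query; it is not Thinking Sphinx's actual implementation.

```ruby
# Surround every run of word characters with "*" and leave everything else,
# such as the escaped slashes, untouched (an approximation, not the real code).
def wrap_with_stars(escaped_query)
  escaped_query.gsub(/[A-Za-z0-9_-]+/) { |word| "*#{word}*" }
end

# The user query "/bar/v1/foto/fotos" with each slash escaped by a backslash.
escaped = '\/bar\/v1\/foto\/fotos'

puts wrap_with_stars(escaped) # \/*bar*\/*v1*\/*foto*\/*fotos*
puts escaped                  # \/bar\/v1\/foto\/fotos (sent as-is without star: true)
```

The doubled backslashes in the SQL statements above presumably come from escaping the string once more for the SQL literal.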

Now I wonder if I should enable index_exact_words (see this section in the docs). WDYT? IMO the results already seem pretty solid.

On the other hand, I also made some modifications in the test suite: 2b275a4

I added the suggested test, but I wanted to verify that the exact match was being returned first. Unfortunately, we were using ProxyRuleQuery to get the results, which alters the order returned by Sphinx. Since this test suite is for the index itself, I thought we'd better bypass ProxyRuleQuery and query Sphinx directly from the tests; this way we can verify the order of the results given by Sphinx. We already have another test suite for ProxyRuleQuery anyway.

In order not to duplicate code, I extracted the search options in ProxyRuleQuery into a constant, which I read from the tests.

@jlledom
Contributor Author

jlledom commented Feb 20, 2025

> Take a few from your example. e.g.
>
> /bar/v1/something-else/fotos
> /bar/v1/something/else/fotos
> /bar/v1/someting/else/this/is/even/larger/than/the/one/failing/fotos
> /biometria-servico/services/eleitora/fotos
> /biometria-servico/services/eleitor/fotos
>
> then search for /one/failing/fotos

Yeah, I wrote a test like that. With the new code, this is what your example returns:

[image: search results for the example above]

@akostadinov
Contributor

Yeah, I think exact word matching would be nice. Do we have a test for the simple fotos use case, for example? Does it still match partially?

While on it, I don't understand what you mean with "add the stars on the fly".

@akostadinov
Contributor

Also can we avoid reordering by activerecord? But maybe not in this PR, just asking.

@jlledom
Contributor Author

jlledom commented Feb 20, 2025

> Yeah, I think exact word matching would be nice. Do we have a test for the simple fotos use case, for example? Does it still match partially?

I'll add a test for that.

> While on it, I don't understand what you mean with "add the stars on the fly".

If you searched for fotos, it was transformed to *fotos* in the query. After the change, the query is no longer transformed: it is still fotos, but Manticore will internally interpret it as (fotos | *fotos* | =fotos). See the docs.
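A tiny sketch of the expansion quoted above, written as a string transformation. `expand_keyword` is a hypothetical helper; the real expansion happens inside Manticore and is not exposed like this.

```ruby
# Build the three alternatives described above for a single keyword:
# the plain keyword, the infix-wildcard form, and the exact form.
def expand_keyword(keyword)
  "(#{keyword} | *#{keyword}* | =#{keyword})"
end

puts expand_keyword("fotos") # (fotos | *fotos* | =fotos)
```

Because the alternatives are joined by OR, a document matching any of the three forms is returned.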

@jlledom
Contributor Author

jlledom commented Feb 20, 2025

> Also can we avoid reordering by activerecord? But maybe not in this PR, just asking.

It's not trivial I think, check the code:

def search_for(query, scope = ProxyRule.all)
  scope = scope.order_by(@sort, @direction).includes(:metric)
  return scope if query.blank?

  options = DEFAULT_SEARCH_OPTIONS.merge(with: { owner_type: @owner_type, owner_id: @owner_id })
  ids = ProxyRule.search(ThinkingSphinx::Query.escape(query), options)
  scope.where(id: ids, owner_type: @owner_type, owner_id: @owner_id)
end

Notice the call to order_by. No matter which scope you set, it will always add the ORDER BY clause to the query. Even if @sort and @direction are nil, there are some default values hardcoded in the model:

self.allowed_sort_columns = %w[proxy_rules.http_method proxy_rules.pattern proxy_rules.last proxy_rules.position metrics.friendly_name]
self.default_sort_column = :position
self.default_sort_direction = :asc

So it requires a bit of work to remove without breaking anything.

@akostadinov
Contributor

Maybe we can add reorder(...) whenever full text search was used, see https://stackoverflow.com/a/396771/520567

But for now it's fine.

@jlledom
Contributor Author

jlledom commented Feb 24, 2025

@akostadinov I added the test you requested: e4c3e0a

> Maybe we can add reorder(...) whenever full text search was used, see https://stackoverflow.com/a/396771/520567

I guess you mean FIELD(), right? I don't like it because it's probably a MySQL-only function.

The UI allows reordering by columns, but if no column is selected, it orders by position ASC. That's a bit weird because the UI doesn't indicate that the results are ordered by position, so it lets you click on the position column... which doesn't change anything because they were already ordered that way. IMO, if users search for a pattern, they expect to get the best matching results first, so we should order by best match by default, and let them reorder by columns if they want.

I would do that in another PR anyway.
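One database-independent way to keep the search engine's ranking would be to sort the loaded records in memory by the position of their id in the result list. This is only a sketch under that assumption: `Rule` is a stand-in Struct, `in_search_order` is a hypothetical helper, and it only suits result sets small enough to sort in Ruby.

```ruby
# Stand-in for an ActiveRecord model (assumption for this sketch).
Rule = Struct.new(:id, :pattern)

# Sort records by the position of their id in `ranked_ids`, best match first.
# Records whose id is missing from the ranking are placed last.
def in_search_order(records, ranked_ids)
  rank = ranked_ids.each_with_index.to_h
  records.sort_by { |record| rank.fetch(record.id, ranked_ids.length) }
end

records = [Rule.new(1, "/a"), Rule.new(2, "/b"), Rule.new(3, "/c")]
p in_search_order(records, [3, 1, 2]).map(&:id) # [3, 1, 2]
```

This avoids database-specific functions, at the cost of giving up database-side ordering and pagination.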

@akostadinov
Contributor

Thanks. Will check it out!
just FYI, perhaps this is an ugly but more portable approach https://stackoverflow.com/a/9475755/520567

@jlledom
Contributor Author

jlledom commented Feb 24, 2025

> Thanks. Will check it out! just FYI, perhaps this is an ugly but more portable approach https://stackoverflow.com/a/9475755/520567

Definitely ugly, hehe. If we wanted to order by best match, I'm sure Sphinx has some SQL syntax to do that, so there's no need for this tricky thing, I guess.

@@ -1,6 +1,10 @@
# frozen_string_literal: true

class ProxyRuleQuery
DEFAULT_SEARCH_OPTIONS = {
ids_only: true, per_page: ThreeScale::Search::Helpers::SPHINX_PAGE_SIZE_INFINITE, ignore_scopes: true
Contributor

I understand that star: true was removed, and replaced with set_property expand_keywords: 1. But how is the behavior different between these two options?

Contributor Author

@jlledom jlledom Feb 25, 2025

On Manticore internals I have more questions than answers, but this is my guess:

Before, we searched for *fotos*, and now we search for (fotos | *fotos* | =fotos) (docs), so it will return more results than before because there are three clauses joined by OR.

Now I wonder: why is fotos not equal to *fotos*? The only possible reason is because fotos means "exact match" and *fotos* means "partial match"; but then, why is fotos not equal to =fotos?

Mysteries of life.

Contributor

Yeah, but in a practical sense... if we keep star: true and remove the index's setting expand_keywords: 1 - how does this affect the search results for our examples?
