-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
THREESCALE-10930: Searchd proxy_rule
index: Exclude URL delimiters from mapped chars
#4021
base: master
Are you sure you want to change the base?
Conversation
proxy_rule
index: Exclude URL delimiters from mapped chars
Rails validation forbids it, so no need to index it.
Nice @jlledom ! But I also don't quite understand the logic... On the other hand, this character set was previously added explicitly, in order to fix some other failing searches, I think: https://github.com/3scale/porta/pull/2138/files Also, we need to also take care about such characters as |
For reference, these are the patterns that are currently found in
I mean... why is |
This is very interesting. Awesome table @mayorova . Is it possible that with |
Yeah, I'm confused too...
I didn't know about that PR, but apparently the problem it solves is also solved in this branch. I tried searching for:
And everything seems to work as expected. The only maybe wrong case I found is searching for:
Which returns the 8 results that are exactly equal but also this additional result:
Which is not exact but I think it's pretty good anyway.
Yeah but removing them from index is not important I think. For instance, we have a pattern including |
The problem could be the token size. This one is not returned by master:
But this one, just one character shorter, is returned:
On the other hand, this one, much larger, is returned:
But that might be because the brackets are not indexed in However, this portion:
should be a token itself since all it's characters are indexed, but is larger than some of the results not returned. Soooo.... |
@jlledom , I mean that the pattern appears after the maximum word length. If word appears before, then the overall length not an issue. |
That makes sense. Maybe that's it? |
Can you give an example please @akostadinov ? @jlledom by the way, as we are at it, how about we bump
The other indexes have it set to 3. |
I just noticed from your examples that almost the same examples with 1 character or longer difference don't show up. In Joan's message there is an example with two values. I spent half an hour searching but didn't find any configuration option that would limit the word length. I see nothing in docs. I think increasing |
I increased it to 3: f45c029 I also updated the test suite: b74882b It was checking for some allowed characters that I removed. I also made small changes in what the tests index and search for, because I think the old way was not compatible with Now the tests index something like |
I mentioned that in this comment. I think it'll find the exact match, and maybe some other additional partial matches. It will use Then you search for |
So my question was whether it in fact does work like this. Can we have a test searching for something like "api/path1/path2/"? |
Don't quite understand the test you suggest. What would the test index, search for, and expect? Tell me and I'll write it. Edit: I think I know what you mean. I'll write it. |
Take a few from your example. e.g.
then search for |
@akostadinov I added the test you request, and in the process, I found that in fact it didn't work as I expected, so I had to make a few changes: aeb4d51 I remove the
So it added the stars to every keyword, and I don't know why, that always returned empty results. What I did is to enable the expand_keywords option which allows to search without stars and then add the stars on the fly. Now the query looks like this:
Now I wonder if I should enable On the other hand, I also made some modifications in the test suite: 2b275a4 I added the suggested test, but I wanted to verify the exact match was being return first. Unfortunately, we were using In order to not duplicate code, I extract the search options in |
Yeah, I wrote a test like that. With the new code, this is what your example returns: |
Yeah, I think exact words matching should be nice. Do we have a test for the simeple While on it, I don't understand what you mean with "add the stars on the fly". |
Also can we avoid reordering by activerecord? But maybe not in this PR, just asking. |
I'll add a test for that
If you search for |
It's not trivial I think, check the code: porta/app/queries/proxy_rule_query.rb Lines 15 to 22 in aeb4d51
Notice the call to porta/app/models/proxy_rule.rb Lines 23 to 25 in f9589a1
So it requires a bit of work to remove without breaking anything. |
Maybe we can add But for now it's fine. |
@akostadinov I added the test you requested: e4c3e0a
I guess you mean The UI allows to reorder by columns, but if not column has selected, then it orders by position ASC. That's a bit weird because the UI doesn't indicate the results are ordered by position, so it allows you to click on the position column... which doesn't change anything because it was already ordered. IMO, if the user searches for a pattern, it expects to get the best matching results first, so we should order by best match by default, and let the user reorder by columns if they want. I would do that in another PR anyway. |
Thanks. Will check it out! |
Definitely ugly hehe. If we wanted to order by best match, I'm sure Sphinx has some SQL syntax to do that, no need to do these tricky thing I guess. |
@@ -1,6 +1,10 @@ | |||
# frozen_string_literal: true | |||
|
|||
class ProxyRuleQuery | |||
DEFAULT_SEARCH_OPTIONS = { | |||
ids_only: true, per_page: ThreeScale::Search::Helpers::SPHINX_PAGE_SIZE_INFINITE, ignore_scopes: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that star: true
was removed, and replaced with set_property expand_keywords: 1
. But how is the behavior different between these two options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On manticore internals I got more questions than answers, but this is my guess:
Before, we searched for *fotos*
, and now we search for (fotos | *fotos* | =fotos)
(docs), so it will return more results than before because there are three clauses joined by OR
.
Now I wonder: why is fotos
not equal to *fotos*
? The only possible reason is because fotos
means "exact match" and *fotos*
means "partial match"; but then, why is fotos
not equal to =fotos
?
Mysteries of life.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but in a practical sense... if we keep star: true
and remove the index's setting expand_keywords: 1
- how does this affect the search results for our examples?
What this PR does / why we need it:
Some queries to manticore, for the
proxy_rule
index, are inconsistent and don't return all the matching results. In particular, the Jira issue includes an example with the keywordfotos
.I tried locally and reproduced the bug, for both sphinx and manticore.
I've been able to solve this by removing some allowed characters from the index. I'm not sure how this affects manticore internals but the truth is having the character
/
as allowed causes the bug.As I understood it, charset_table means the list of characters that are considered valid characters inside a search query. Characters not in
charset_table
will be considered "delimiters" and are ignored in queries.Since mapping rules are URLs, I think it's correct to accept only URL characters, and only those which are not reserved delimiters according to RFC 3986
According to this SO answer. Those characters are:
But from those, some are reserved delimiters, so I allowed these ones in our manticore index:
This solves the issue, though I don't entirely understand why it failed before. I mean,
/
was allowed as a character to be considered in searches, so a query like/fotos
should work if it actually matches the pattern, which was the problem with that? I looks correct to me.Which issue(s) this PR fixes
https://issues.redhat.com/browse/THREESCALE-10930
Verification steps
fotos
master
, it will return 10 results, on this branch, it will return all 16 results.