-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expression boundary analysis framework #3912
Conversation
f4050c9
to
d95ce05
Compare
Another interesting point that this PR might need is probably a specification / design document, @alamb what would work best for you (I can try to add a couple more example analysis, like |
I would personally recommend starting with a google doc if you can do that (as I have found it is easiest to quickly iterate with them) and then we convert that to a document in https://arrow.apache.org/datafusion/contributor-guide/specification/index.html |
We are also considering a range optimization on hash join, on queries like SELECT ...
FROM left, right
WHERE left.x > right.y AND
left.x + left.y < right.x + 10 AND
left.z = left.x and it paves the way for a streaming execution of the |
@metesynnada I've started a draft of the whole CBOs and statistics (including the expression analysis framework) on a google doc, would love to hear more on what sort of stuff range optimization would need and how we could reshape this API to be able to easily used for that. |
A couple of questions:
|
Also CC: @Dandandan @andygrove |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! @isidentical -- I think this is looking quite good. I left some comments about the interface -- let us know what you think
How about
I agree |
Like the example in here. |
Why this SQL can run |
Same question. For this non-equal join, how can we apply hash join? |
Can we merge this PR first ? Then we can continue and do some experimentations with those APIs. |
@isidentical One quick question here, a complex binary expressions mixed of AND/OR expressions might generate multiple discrete intervals, make it looks like histograms actually. How can our API models this ? Column a boundaries [0, 100] Expressions: |
Another example is : |
Marking the PR draft until we reach a consensus on the API design. @alamb @Dandandan how do we feel about:
I know there are still some reservations on the context side, but I am out of ideas on how to make it simpler while preserving the same set of features (allowing dynamic column boundary updates from sub-expressions; being able to fork it at certain points (like ORs) and still able to delegate all this information back to the call side [so not only to sub-expressions but also the expression itself who is calling us]). I'd be happy to try any suggestions though. @mingmwang on the point of discrete intervals, yes it is a matter of histograms. Currently we would have a non-uniform distribution but would represent it as a uniform range ( |
I feel like we are somewhat stuck because the usecase (at least to me) of intermediate column analysis isn't 100% clear. To truly understand I think I need some example code showing how this would be used. Until then I can't seem to quite follow the proposals (maybe because I am too stuck in previous patterns I have seen) Here is what I propose:
|
@alamb I think that sounds like a good plan! The primary place where this would be essential (or at least, I'd interpret it as essential; but maybe we can find a simpler solution) is For this PR, I've dropped support for 'intermediate column updates': so |
Can we get this PR merged this week? |
Based my understanding, I think the selectivity and row count estimation is more important, the column boundaries estimation and derivation is error prone, especially after couple of complex expressions, or some advanced SQL operators, we might have to give up the column boundaries and just propagate the estimation of the row counts.
|
And for a simple expressions like a + 100 < 20, can we safely derive the boundary of a [min, -80) ? Probably not, considering a + 100 might overflow the type's boundary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @isidentical -- this looks like a good foundation to build on to me 👍
Perhaps as @mingmwang works with the code and analysis further improvements will become apparent
Benchmark runs are scheduled for baseline = 0f18e76 and contender = 01941aa. 01941aa is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Thanks everyone for working with me on this while we iterate the on the design for a few times! Hopefully the upcoming PRs would justify the APIs even more and opens up more use cases. If anyone is interested, I plan to add a few tickets for the parts that I am planning to work on (esp on the composite expressions) and if there is a special topic that interests you please let me know on #3898 so we can sync up. |
This PR implements the initial revision of the proposed (by #3898) expression boundary analysis framework. The changes can be summarized as:
Removed
expr_stats()
method fromPhysicalExpr
.Removed
PhysicalExprStats
.Introduced a new
boundaries()
method toPhysicalExpr
, which allows any expression to provide its own boundaries.Migrated the existing implementations (comparisons, literals, columns) from ~temporary expression statistics APIs to proposed boundary APIs.
Changed the implementation of comparisons to now return a boundary of
[true]
(when it is always true),[false]
(when it is always false), and[false, true]
(when there is a partial overlap). So the framework is consistent with all expressions.A shared context is now passed instead of a list of column boundaries. We intend to leverage this existing layout later in the process for composite expression chains.